Open Access•Journal Article•10.1109/TASLP.2021.3067161

Overlapping Speaker Segmentation Using Multiple Hypothesis Tracking of Fundamental Frequency

- 18 Mar 2021

- IEEE Transactions on Audio, Speech, and ...

- Vol. 29, pp 1479-1490

TL;DR: This paper exploits the harmonic structure of voiced speech to segment multiple overlapping speakers in a speaker diarization task, inferring a change of speaker from a change in pitch.

Abstract: This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker's utterance in the presence of an additional active speaker. This system is benchmarked against a segmentation system from the literature that employs a bidirectional long short-term memory network (BLSTM) approach and requires training. Experimental results highlight that the proposed approach outperforms the BLSTM baseline approach by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as features to the BLSTM to achieve further improvements of 1.21% in terms of coverage and 2.45% in terms of purity.
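As a rough illustration of inferring a speaker change from a change in pitch, the following Python sketch flags frames where the median F0 shifts by several semitones between the window before and the window after. It is not the multiple hypothesis tracker proposed in the paper and it cannot handle overlapping speakers; the function name `pitch_change_points`, the window length `win`, and the threshold `min_semitones` are all assumptions made for this example.

```python
import math
from statistics import median


def pitch_change_points(f0, win=5, min_semitones=3.0):
    """Return frame indices where the median F0 jumps by >= min_semitones.

    f0 is a list of per-frame F0 values in Hz, with 0.0 marking unvoiced
    frames. Both `win` and `min_semitones` are illustrative assumptions,
    not values taken from the paper.
    """
    changes, run = [], []
    for t in range(win, len(f0) - win + 1):
        # Voiced frames in the windows just before and just after frame t.
        before = [v for v in f0[t - win:t] if v > 0]
        after = [v for v in f0[t:t + win] if v > 0]
        above = False
        if before and after:
            # Pitch difference between the two windows, in semitones.
            jump = abs(12.0 * math.log2(median(after) / median(before)))
            above = jump >= min_semitones
        if above:
            run.append(t)
        elif run:
            # Report the centre of each contiguous run above the threshold.
            changes.append(run[len(run) // 2])
            run = []
    if run:
        changes.append(run[len(run) // 2])
    return changes


# Synthetic track: ten frames near 120 Hz followed by ten frames near 220 Hz.
track = [120.0] * 10 + [220.0] * 10
print(pitch_change_points(track))
```

Working in semitones rather than Hz keeps the threshold comparable across low-pitched and high-pitched voices, and the median makes the comparison less sensitive to isolated octave errors in the pitch estimate.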

Fig. 3. F0 tracks of a male and female speaker from the TIMIT corpus. The two speakers overlap between the two white dashed vertical lines.

TABLE II. PARAMETER SETTING FOR THE PROPOSED SYSTEM AND BASELINE-1

Fig. 6. Baseline-1 system architecture presented in [49] with st: input signal, Φ̂t: peak detections, Ψ̂t: detection reliabilities, Zt: generated observations, Ti: selected track hypotheses, ot: overlapping speech onsets, Bt: strongest candidate track and ct: speaker change onsets.

Fig. 7. An illustrative example of overlapping speech detection using the baseline system architecture

Fig. 2. The individual pitch tracks generated from PEFAC by using the Kalman filter on the individual headset microphones separately. The dashed horizontal lines represent the mean of each speaker. The AMI speaker labels are also given in brackets where the first letter relates to the gender of the speaker, i.e. M: male and F: female.
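To make the idea of Kalman filtering a pitch track concrete, here is a minimal Python sketch of a scalar Kalman filter applied to per-frame F0 estimates. It assumes a random-walk state model and hand-picked noise variances `q` and `r`; these choices, and the function name `kalman_smooth_f0`, are assumptions for illustration and are not taken from the paper or from PEFAC.

```python
def kalman_smooth_f0(f0_obs, q=4.0, r=100.0):
    """Filter per-frame F0 observations (Hz) with a scalar Kalman filter.

    Assumed model (not from the paper): the true F0 follows a random walk
    with process variance q, and each observation has noise variance r.
    None marks a frame with no pitch estimate; such frames only run the
    prediction step, so the previous estimate is carried forward.
    """
    estimates = []
    x = None  # current state estimate (Hz)
    p = None  # current estimate variance
    for z in f0_obs:
        if x is None:
            if z is None:
                estimates.append(None)
                continue
            # Initialise the state from the first available observation.
            x, p = float(z), r
            estimates.append(x)
            continue
        # Predict: the state stays the same, the uncertainty grows.
        p = p + q
        if z is not None:
            # Update: blend prediction and observation via the Kalman gain.
            k = p / (p + r)
            x = x + k * (z - x)
            p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# A steady 200 Hz track with one octave error at the third frame.
print(kalman_smooth_f0([200.0, 200.0, 400.0, 200.0]))
```

A larger `r` relative to `q` makes the filter trust its prediction more, so isolated outliers such as octave errors pull the estimate less.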

Fig. 8. Illustrative example of the evaluation framework used in Exp-1, where the blue dashed lines represent the oracle speaker change boundaries and the grey regions correspond to the given collar. A ‘HIT’ is where a speaker change has been detected exactly once within its collar. A ‘MISS’ is where a speaker change has not been detected, and a multi-hit, ‘MH’, is where a speaker change has been detected multiple times within its collar. An ‘FA’ is where a detection falls outside every speaker change collar.
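The four outcomes described in the Fig. 8 caption can be sketched as a small counting routine. This is only one reading of the caption: it assumes a symmetric collar of half-width `collar` seconds around each oracle boundary and assigns each detection to the nearest boundary whose collar contains it. The function name `score_detections` and these conventions are assumptions, not the authors' evaluation code.

```python
def score_detections(boundaries, detections, collar):
    """Count HIT, MISS, MH and FA outcomes for speaker-change detections.

    boundaries: oracle speaker-change times in seconds
    detections: detected speaker-change times in seconds
    collar:     half-width in seconds of the tolerance region around each
                boundary (a symmetric collar is an assumption here)
    """
    counts = {"HIT": 0, "MISS": 0, "MH": 0, "FA": 0}
    per_boundary = [0] * len(boundaries)
    for d in detections:
        # Find the nearest boundary whose collar contains this detection.
        best = None
        for i, b in enumerate(boundaries):
            dist = abs(d - b)
            if dist <= collar and (best is None or dist < abs(d - boundaries[best])):
                best = i
        if best is None:
            counts["FA"] += 1  # outside every collar
        else:
            per_boundary[best] += 1
    for n in per_boundary:
        if n == 0:
            counts["MISS"] += 1  # boundary never detected
        elif n == 1:
            counts["HIT"] += 1   # boundary detected exactly once
        else:
            counts["MH"] += 1    # boundary detected multiple times
    return counts


print(score_detections([1.0, 3.0, 5.0], [1.1, 2.9, 3.05, 4.0], collar=0.25))
```

In this example the first boundary is a HIT, the second is an MH because two detections fall inside its collar, the third is a MISS, and the detection at 4.0 s is an FA.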

Citations

Journal Article•10.48550/arXiv.2306.05812

HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection

Aidan O. T. Hogg, +4 more

- 09 Jun 2023

- arXiv.org

TL;DR: In this article, a convolutional super-resolution generative adversarial network (SRGAN) was used to transform the HRTF data for convenient use with a CNN, and the proposed method outperforms both baselines in terms of log-spectral distortion and localisation performance.

Journal Article•10.1109/taslp.2024.3375635

HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection

Aidan O. T. Hogg, +5 more

- 01 Jan 2024

- IEEE/ACM transactions on audio, speech, ...

Proceedings Article•10.1109/iwaenc53105.2022.9914796

Polynomial Eigenvalue Decomposition-Based Target Speaker Voice Activity Detection in the Presence of Competing Talkers

- 05 Sep 2022

TL;DR: In this article, a polynomial eigenvalue decomposition-based target-speaker VAD algorithm was proposed to detect unseen target speakers in the presence of competing talkers.

Proceedings Article•10.1109/IWAENC53105.2022.9914796

Polynomial Eigenvalue Decomposition-Based Target Speaker Voice Activity Detection in the Presence of Competing Talkers

Vincent W. Neo, +4 more

- 05 Sep 2022

TL;DR: A polynomial eigenvalue decomposition-based target-speaker VAD algorithm to detect unseen target speakers in the presence of competing talkers and is consistently among the best in F1 and balanced accuracy scores over the investigated range of signal to interference ratio (SIR).

Proceedings Article•10.1109/icassp43922.2022.9746116

A Multitask Learning Framework for Speaker Change Detection with Content Information from Unsupervised Speech Decomposition

Hang Su, +6 more

- 23 May 2022

TL;DR: This work proposes a novel framework for the SCD task, which utilizes a multitask learning architecture to leverage speaker information during the training stage, and adds the content information extracted from an unsupervised speech decomposition model to help detect the speaker change points.

References

Journal Article•10.1115/1.3662552

A New Approach to Linear Filtering and Prediction Problems

R. E. Kalman

- 01 Mar 1960

- Journal of Basic Engineering

28.2K

Book Chapter•10.1109/9780470544334.CH9

A New Approach to Linear Filtering and Prediction Problems

Tamer Basar

- 01 Jan 2001

TL;DR: In this paper, the classical filtering and prediction problem is re-examined using the Bode-Shannon representation of random processes and the "state-transition" method of analysis of dynamic systems.

22.7K

Book

Design and Analysis of Modern Tracking Systems

Samuel S. Blackman, +1 more

- 01 Aug 1999

TL;DR: The Basics of Target Tracking and Multi Target Tracking with an Agile Beam Radar, and Multiple Hypothesis Tracking System Design and Application.

3.5K

Journal Article•10.1109/TAC.1979.1102177

An algorithm for tracking multiple targets

Donald Reid

- 01 Jan 1978

TL;DR: An algorithm for tracking multiple targets in a cluttered environment is developed, capable of initiating tracks, accounting for false or missing reports, and processing sets of dependent reports.

2.9K

Journal Article•10.1145/362342.362367

Algorithm 457: finding all cliques of an undirected graph

Coen Bron, +1 more

- 01 Sep 1973

- Communications of The ACM

TL;DR: Two backtracking algorithms are presented, using a branch-and-bound technique [4] to cut off branches that cannot lead to a clique; one of them generates cliques in a rather unpredictable order in an attempt to minimize the number of branches to be traversed.

2.6K

Related Papers (5)

An Adaptive Threshold Computation for Unsupervised Speaker Segmentation

Speaker Change Detection Using Fundamental Frequency with Application to Multi-talker Segmentation

End-to-End Neural Speaker Diarization with Self-attention

End-to-end neural diarization: Reformulating speaker diarization as simple multi-label classification

End-To-End Speaker Segmentation for Overlap-Aware Resegmentation