Fully Supervised Speaker Diarization
Aonan Zhang,Quan Wang,Zhenyao Zhu,John Paisley,Chong Wang +4 more
- 08 Jan 2019
- pp 6301-6305
304
TL;DR: Proposes a fully supervised speaker diarization approach, named unbounded interleaved-state recurrent neural networks (UIS-RNN), that operates on extracted speaker-discriminative embeddings and decodes in an online fashion, whereas most state-of-the-art systems rely on offline clustering.
Abstract: In this paper, we propose a fully supervised speaker diarization approach, named unbounded interleaved-state recurrent neural networks (UIS-RNN). Given extracted speaker-discriminative embeddings (a.k.a. d-vectors) from input utterances, each individual speaker is modeled by a parameter-sharing RNN, while the RNN states for different speakers interleave in the time domain. This RNN is naturally integrated with a distance-dependent Chinese restaurant process (ddCRP) to accommodate an unknown number of speakers. Our system is fully supervised and is able to learn from examples where time-stamped speaker labels are annotated. We achieved a 7.6% diarization error rate on NIST SRE 2000 CALLHOME, which is better than the state-of-the-art method using spectral clustering. Moreover, our method decodes in an online fashion while most state-of-the-art systems rely on offline clustering.
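As a rough illustration of how a Chinese-restaurant-process-style prior can accommodate an unknown number of speakers, the sketch below computes a prior over who speaks next once a speaker change has occurred. It is a simplified stand-in, not the paper's ddCRP formulation: the function name `next_speaker_prior`, the concentration parameter `alpha`, and the use of per-speaker block counts as weights are all assumptions made for this example.

```python
from typing import Dict


def next_speaker_prior(
    block_counts: Dict[int, int], current: int, alpha: float
) -> Dict[int, float]:
    """CRP-style prior over the next speaker, given that a change occurred.

    block_counts[k] is the number of contiguous blocks speaker k has spoken
    so far (hypothetical bookkeeping for this sketch); alpha is a hypothetical
    concentration parameter that controls how likely a new speaker is.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # The current speaker is excluded because a change is assumed.
    weights = {k: float(n) for k, n in block_counts.items() if k != current}
    # One extra "table" with weight alpha stands for a not-yet-seen speaker.
    new_id = max(block_counts, default=-1) + 1
    weights[new_id] = alpha
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Speaker 0 has spoken 3 blocks, speaker 1 one block; speaker 0 just stopped.
prior = next_speaker_prior({0: 3, 1: 1}, current=0, alpha=1.0)
print(prior)
```

Because the new-speaker weight never drops to zero, the number of speakers does not need to be fixed in advance; in a full system this prior would be combined with the RNN's likelihood of the observed embeddings.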
Citations
End-to-end neural speaker diarization with permutation-free objectives
Yusuke Fujita,Naoyuki Kanda,Shota Horiguchi,Kenji Nagamatsu,Shinji Watanabe +5 more
- 15 Sep 2019
TL;DR: Besides its end-to-end simplicity, the proposed method can explicitly handle overlapping speech during training and inference, and can be easily trained or adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels.
309
•Posted Content
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.
Quan Wang,Hannah Muckenhirn,Kevin W. Wilson,Prashant Sridhar,Zelin Wu,John R. Hershey,Rif A. Saurous,Ron Weiss,Ye Jia,Ignacio Lopez Moreno +9 more
TL;DR: The system uses two networks: a speaker recognition network that produces speaker-discriminative embeddings, and a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input and produces a mask.
293
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Hannah Muckenhirn,Ignacio Lopez Moreno,John R. Hershey,Kevin W. Wilson,Prashant Sridhar,Quan Wang,Rif A. Saurous,Ron Weiss,Ye Jia,Zelin Wu +9 more
- 15 Sep 2019
TL;DR: A novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker; it is built by training two separate neural networks.
289
End-to-End Neural Speaker Diarization with Self-Attention
Yusuke Fujita,Naoyuki Kanda,Shota Horiguchi,Yawen Xue,Kenji Nagamatsu,Shinji Watanabe +5 more
- 13 Sep 2019
TL;DR: In this paper, self-attention blocks are used in EEND in place of bidirectional long short-term memory (BLSTM) blocks to improve speaker diarization performance.
289
Speaker recognition based on deep learning: An overview
Zhongxin Bai,Xiao-Lei Zhang +1 more
TL;DR: In this article, the authors review several major subtasks of speaker recognition, including speaker verification, identification, diarization, and robust speaker recognition with a focus on deep learning-based methods.
275
References
Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation
Kyunghyun Cho,Bart van Merriënboer,Caglar Gulcehre,Dzmitry Bahdanau,Fethi Bougares,Holger Schwenk,Yoshua Bengio +8 more
- 01 Jan 2014
TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
•Proceedings Article
Rectified Linear Units Improve Restricted Boltzmann Machines
Vinod Nair,Geoffrey E. Hinton +1 more
- 21 Jun 2010
TL;DR: Restricted Boltzmann machines were developed using binary stochastic hidden units; replacing these with rectified linear units yields features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.
Librispeech: An ASR corpus based on public domain audio books
Vassil Panayotov,Guoguo Chen,Daniel Povey,Sanjeev Khudanpur +3 more
- 19 Apr 2015
TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Front-End Factor Analysis for Speaker Verification
TL;DR: An extension of the authors' previous work that proposes a new speaker representation for speaker verification: a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis, named the total variability space because it models both speaker and channel variabilities.
VoxCeleb2: Deep Speaker Recognition.
Joon Son Chung,Arsha Nagrani,Andrew Zisserman +2 more
- 14 Jun 2018
TL;DR: In this article, a large-scale audio-visual speaker recognition dataset, VoxCeleb2, is presented, which contains over a million utterances from over 6,000 speakers.
2K