Fully Supervised Speaker Diarization

doi:10.1109/ICASSP.2019.8683892

Open Access•Proceedings Article•10.1109/ICASSP.2019.8683892

Aonan Zhang, +4 more

- 08 Jan 2019

- pp 6301-6305

304

TL;DR: This paper proposes a fully supervised speaker diarization approach, named unbounded interleaved-state recurrent neural networks (UIS-RNN). Given extracted speaker-discriminative embeddings, the method decodes in an online fashion, whereas most state-of-the-art systems rely on offline clustering.

Citations

•Proceedings Article•10.21437/INTERSPEECH.2019-2899

End-to-end neural speaker diarization with permutation-free objectives

Yusuke Fujita, +5 more

- 15 Sep 2019

TL;DR: Besides its end-to-end simplicity, the proposed method can explicitly handle overlapping speech during training and inference, and can be easily trained or adapted with real recorded multi-speaker conversations just by feeding it the corresponding multi-speaker segment labels.

309

•Posted Content

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.

Quan Wang, +9 more

- 11 Oct 2018

- arXiv: Audio and Speech Processing

TL;DR: This paper trains two networks: a speaker recognition network that produces speaker-discriminative embeddings, and a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input and produces a mask.

293

•Proceedings Article•10.21437/INTERSPEECH.2019-1101

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Hannah Muckenhirn, +9 more

- 15 Sep 2019

TL;DR: A novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. This is achieved by training two separate neural networks.

289

•Proceedings Article•10.1109/ASRU46091.2019.9003959

End-to-End Neural Speaker Diarization with Self-Attention

Yusuke Fujita, +5 more

- 13 Sep 2019

TL;DR: In this paper, self-attention blocks are introduced into EEND in place of bidirectional long short-term memory (BLSTM) blocks to improve speaker diarization performance.

289

•Journal Article•10.1016/J.NEUNET.2021.03.004

Speaker recognition based on deep learning: An overview

Zhongxin Bai, +1 more

- 17 Mar 2021

- Neural Networks

TL;DR: In this article, the authors review several major subtasks of speaker recognition, including speaker verification, identification, diarization, and robust speaker recognition, with a focus on deep learning-based methods.

275

References

•Proceedings Article•10.3115/V1/D14-1179

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation

Kyunghyun Cho, +8 more

- 01 Jan 2014

TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.

28.6K

•Proceedings Article

Rectified Linear Units Improve Restricted Boltzmann Machines

Vinod Nair, +1 more

- 21 Jun 2010

TL;DR: Restricted Boltzmann machines were developed using binary stochastic hidden units. Replacing these with rectified linear units yields features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.

18.4K

•Proceedings Article•10.1109/ICASSP.2015.7178964

Librispeech: An ASR corpus based on public domain audio books

Vassil Panayotov, +3 more

- 19 Apr 2015

TL;DR: It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

7.7K

•Journal Article•10.1109/TASL.2010.2064307

Front-End Factor Analysis for Speaker Verification

Najim Dehak, +4 more

- 01 May 2011

- IEEE Transactions on Audio, Speech, and ...

TL;DR: An extension of previous work that proposes a new speaker representation for speaker verification. A new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis; it is named the total variability space because it models both speaker and channel variabilities.

4.4K

•Proceedings Article•10.21437/INTERSPEECH.2018-1929

VoxCeleb2: Deep Speaker Recognition.

Joon Son Chung, +2 more

- 14 Jun 2018

TL;DR: In this article, a large-scale audio-visual speaker recognition dataset, VoxCeleb2, is presented, which contains over a million utterances from over 6,000 speakers.

2K