SUPERB: Speech processing Universal PERformance Benchmark

doi:10.21437/INTERSPEECH.2021-1775

Open Access•Proceedings Article•10.21437/INTERSPEECH.2021-1775

- 03 May 2021

459

TL;DR: The Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard that benchmarks the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.

Citations

•Posted Content

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Sanyuan Chen, +16 more

- 26 Oct 2021

- arXiv: Computation and Language

TL;DR: WavLM is a pre-trained model for full-stack downstream speech tasks that achieves state-of-the-art performance on the SUPERB benchmark.

715

•Proceedings Article•10.21437/interspeech.2022-143

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

18 Sep 2022

TL;DR: XLS-R is a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0, with up to 2B parameters, trained on nearly half a million hours of publicly available speech audio in 128 languages.

269

•Journal Article•10.1109/jstsp.2022.3207050

Self-Supervised Speech Representation Learning: A Review

01 Oct 2022

- IEEE Journal of Selected Topics in Signa...

TL;DR: This review presents approaches for self-supervised speech representation learning and their connection to other research areas.

185

•Journal Article•10.1109/tpami.2023.3263585

Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap

01 Jan 2023

- IEEE Transactions on Pattern Analysis an...

TL;DR: The authors evaluate the influence of pre-training data on downstream performance and show that transformer-based architectures are more robust than a CNN-based baseline and are fair with respect to gender groups, but not with respect to individual speakers.

159

•Journal Article•10.1016/j.inffus.2023.101869

A review of deep learning techniques for speech processing

Ambuj Mehrish, +3 more

- 30 Apr 2023

- Information Fusion

TL;DR: This paper gives a comprehensive overview of the key deep learning models and their applications in speech-processing tasks, and discusses the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing.

146

References

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

•Proceedings Article•10.18653/V1/N18-1202

Deep contextualized word representations

Matthew E. Peters, +6 more

- 15 Feb 2018

TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).

11.7K

•Proceedings Article•10.1109/ICASSP.2015.7178964

Librispeech: An ASR corpus based on public domain audio books

Vassil Panayotov, +3 more

- 19 Apr 2015

TL;DR: It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

7.7K

•Proceedings Article•10.18653/V1/W18-5446

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, +5 more

- 01 Nov 2018

TL;DR: GLUE is a benchmark of nine diverse NLU tasks, together with an auxiliary dataset for probing models for understanding of specific linguistic phenomena and an online platform for evaluating and comparing models.

7.3K

•Journal Article•10.1007/S10579-008-9076-6

IEMOCAP: interactive emotional dyadic motion capture database

Carlos Busso, +8 more

- 05 Nov 2008

TL;DR: This paper presents a new corpus, the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory at the University of Southern California (USC), which provides detailed information about the actors' facial expressions and hand movements during scripted and spontaneous spoken communication scenarios.

3.8K

Related Papers (5)

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Librispeech: An ASR corpus based on public domain audio books

Representation Learning with Contrastive Predictive Coding

Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders

An Unsupervised Autoregressive Model for Speech Representation Learning