SUPERB: Speech processing Universal PERformance Benchmark
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee
- 03 May 2021
TL;DR: The Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard that benchmarks the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.
Abstract: Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL due to its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.
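The framework described in the abstract (a frozen shared model with a lightweight, task-specific prediction head trained on top) can be illustrated with a minimal sketch. This is not the SUPERB toolkit or the authors' implementation. The "upstream" here is a hypothetical stand-in (a fixed random projection) for a pretrained SSL model, the data is synthetic, and all names (`upstream`, `LinearHead`, `make_utterance`) are assumptions made for illustration. Only the head's parameters are updated.

```python
import math
import random

random.seed(0)

FEAT_DIM, HID_DIM, N_CLASSES = 4, 8, 2

# Hypothetical stand-in for a pretrained SSL upstream model: a fixed random
# projection followed by tanh. A real setup would load pretrained weights.
UPSTREAM_W = [[random.gauss(0, 1) for _ in range(FEAT_DIM)] for _ in range(HID_DIM)]
frozen_snapshot = [row[:] for row in UPSTREAM_W]  # kept to check nothing changes


def upstream(frames):
    """Map each input frame to a hidden vector. Frozen: never updated."""
    return [[math.tanh(sum(w * x for w, x in zip(row, frame))) for row in UPSTREAM_W]
            for frame in frames]


def mean_pool(hidden):
    """Average frame-level vectors into one utterance-level vector."""
    n = len(hidden)
    return [sum(h[d] for h in hidden) / n for d in range(len(hidden[0]))]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """Lightweight task-specific head: one linear layer plus softmax."""

    def __init__(self, in_dim, n_classes):
        self.w = [[0.0] * in_dim for _ in range(n_classes)]
        self.b = [0.0] * n_classes

    def forward(self, x):
        logits = [sum(wi * xi for wi, xi in zip(row, x)) + b
                  for row, b in zip(self.w, self.b)]
        return softmax(logits)

    def sgd_step(self, x, label, lr=0.5):
        """One cross-entropy gradient step on the head parameters only."""
        probs = self.forward(x)
        for c in range(len(self.w)):
            grad = probs[c] - (1.0 if c == label else 0.0)
            for d in range(len(x)):
                self.w[c][d] -= lr * grad * x[d]
            self.b[c] -= lr * grad


def make_utterance(label, n_frames=10):
    """Synthetic toy data: class 0 frames centred at -1, class 1 at +1."""
    centre = 1.0 if label == 1 else -1.0
    return [[random.gauss(centre, 0.5) for _ in range(FEAT_DIM)] for _ in range(n_frames)]


train = [(make_utterance(y), y) for y in [0, 1] * 20]

# Because the upstream is frozen, its features can be extracted once and reused.
features = [(mean_pool(upstream(u)), y) for u, y in train]

head = LinearHead(HID_DIM, N_CLASSES)
for _ in range(20):  # epochs
    for x, y in features:
        head.sgd_step(x, y)


def predict(frames):
    probs = head.forward(mean_pool(upstream(frames)))
    return probs.index(max(probs))


accuracy = sum(predict(u) == y for u, y in train) / len(train)
print(f"training accuracy: {accuracy:.2f}")
```

Under this design, switching to a different task would mean training a new head (for example with a different number of classes) while reusing the same frozen upstream features.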
Citations
•Posted Content
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Michael Zeng, Furu Wei
TL;DR: WavLM is a pre-trained model designed to solve full-stack downstream speech tasks; it achieves state-of-the-art performance on the SUPERB speech recognition task.
715
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
- 18 Sep 2022
TL;DR: XLS-R is a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0, with up to 2B parameters, trained on nearly half a million hours of publicly available speech audio in 128 languages.
269
Self-Supervised Speech Representation Learning: A Review
TL;DR: This review presents approaches to self-supervised speech representation learning and their connections to other research areas.
Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap
TL;DR: The authors evaluate the influence of pre-training data on downstream performance and show that transformer-based architectures are more robust than a CNN-based baseline and fair with respect to gender groups, but not with respect to individual speakers.
A review of deep learning techniques for speech processing
TL;DR: This paper gives a comprehensive overview of the key deep learning models and their applications in speech-processing tasks, and discusses challenges and future directions, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing.
References
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
81.7K
Deep contextualized word representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
- 15 Feb 2018
TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).
Librispeech: An ASR corpus based on public domain audio books
Vassil Panayotov, Guoguo Chen, Daniel Povey, Sanjeev Khudanpur
- 19 Apr 2015
TL;DR: It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
- 01 Nov 2018
TL;DR: GLUE is a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models.
IEMOCAP: interactive emotional dyadic motion capture database
Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, Shrikanth S. Narayanan
- 05 Nov 2008
TL;DR: The “interactive emotional dyadic motion capture database” (IEMOCAP) is a corpus collected by the Speech Analysis and Interpretation Laboratory at the University of Southern California (USC). It provides detailed information about speakers' facial expressions and hand movements during scripted and spontaneous spoken communication scenarios.