Conference

Spoken Language Technology Workshop

About: Spoken Language Technology Workshop is an academic conference. It publishes mainly in the areas of computer science and language modeling. Over its lifetime, the conference has published 968 papers, which have received 18364 citations.

Topics: Computer science, Language model, Engineering, Word error rate, Speech processing

Papers

Proceedings Article•10.1109/SLT.2018.8639585•

Speaker Recognition from Raw Waveform with SincNet

Mirco Ravanelli¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

29 Jul 2018

TL;DR: This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions, which implement band-pass filters.

Abstract: Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, only low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.

912 citations
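
The SincNet abstract above describes filters defined only by a low and a high cutoff frequency. As a rough illustration, the sketch below builds one such band-pass kernel as the difference of two windowed sinc low-pass filters. The function names, the kernel length, the 16 kHz sample rate, and the Hamming window are assumptions made for this example, not details taken from the listing, and no training loop is shown.

```python
import math


def sinc(x):
    """Unnormalized sinc: sin(x)/x, with sinc(0) defined as 1."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_bandpass(f_low, f_high, kernel_size=101, sample_rate=16000):
    """Return a Hamming-windowed band-pass FIR kernel as a list of floats.

    f_low and f_high are cutoff frequencies in Hz. In a SincNet-style layer
    these two values would be the only learned parameters of the filter.
    """
    f1 = f_low / sample_rate   # normalized cutoffs, in cycles per sample
    f2 = f_high / sample_rate
    half = (kernel_size - 1) // 2
    kernel = []
    for i in range(kernel_size):
        n = i - half
        # Difference of two ideal low-pass filters gives a band-pass filter.
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # Hamming window smooths the truncation of the ideal response.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * window)
    return kernel


def gain(kernel, freq, sample_rate=16000):
    """Magnitude of the kernel's frequency response at freq (Hz)."""
    w = 2 * math.pi * freq / sample_rate
    re = sum(h * math.cos(w * k) for k, h in enumerate(kernel))
    im = sum(h * math.sin(w * k) for k, h in enumerate(kernel))
    return math.hypot(re, im)


if __name__ == "__main__":
    k = sinc_bandpass(300.0, 3000.0)
    print("gain inside the band (1500 Hz):", round(gain(k, 1500.0), 3))
    print("gain outside the band (6000 Hz):", round(gain(k, 6000.0), 3))
```

Each filter here is described by two numbers regardless of kernel length, which is the compactness the abstract refers to; a standard convolutional filter of the same length would have one free parameter per tap.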

Proceedings Article•10.1109/SLT.2012.6424228•

Context dependent recurrent neural network language model

Tomas Mikolov¹, Geoffrey Zweig²•Institutions (2)

Brno University of Technology¹, Microsoft²

1 Dec 2012

TL;DR: This paper improves the performance of recurrent neural network language models by pairing each word with a real-valued context vector that conveys information about the sentence being modeled. The vector is obtained by performing Latent Dirichlet Allocation on a block of preceding text.

Abstract: Recurrent neural network language models (RNNLMs) have recently demonstrated state-of-the-art performance across a variety of tasks. In this paper, we improve their performance by providing a contextual real-valued input vector in association with each word. This vector is used to convey contextual information about the sentence being modeled. By performing Latent Dirichlet Allocation using a block of preceding text, we achieve a topic-conditioned RNNLM. This approach has the key advantage of avoiding the data fragmentation associated with building multiple topic models on different data subsets. We report perplexity results on the Penn Treebank data, where we achieve a new state-of-the-art. We further apply the model to the Wall Street Journal speech recognition task, where we observe improvements in word-error-rate.

794 citations
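
The abstract above describes feeding a real-valued context vector into the RNN alongside each word. A minimal sketch of one forward step follows, assuming the context vector is connected to both the hidden layer and the output layer; the abstract does not spell out this wiring, and the function name, the parameter dictionary, and the matrix names (U, W, F, V, G) are hypothetical. The context vector is taken as given here (for example, a topic distribution from LDA over preceding text).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def context_rnn_step(word_id, prev_state, context, p):
    """One step of a simple RNN language model with an extra context input.

    p holds hypothetical weight matrices as nested lists:
      U (hidden x vocab), W (hidden x hidden), F (hidden x topics),
      V (vocab x hidden), G (vocab x topics).
    Returns the new hidden state and a distribution over the next word.
    """
    recurrent = matvec(p["W"], prev_state)
    context_hidden = matvec(p["F"], context)
    # A one-hot word input simply selects column word_id of U.
    state = [
        sigmoid(p["U"][i][word_id] + recurrent[i] + context_hidden[i])
        for i in range(len(prev_state))
    ]
    logits = matvec(p["V"], state)
    context_out = matvec(p["G"], context)
    probs = softmax([a + b for a, b in zip(logits, context_out)])
    return state, probs
```

Because the topic information enters as an input vector, a single model is trained on all the data, which is how the approach avoids the data fragmentation the abstract mentions.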

Proceedings Article•10.1109/SLT.2016.7846260•

Deep neural network-based speaker embeddings for end-to-end speaker verification

David Snyder¹, Pegah Ghahremani¹, Daniel Povey¹, Daniel Garcia-Romero¹, Yishay Carmiel¹, Sanjeev Khudanpur¹•Institutions (1)

Johns Hopkins University¹

1 Dec 2016

TL;DR: It is shown that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error-rate (EER) and at low miss rates.

Abstract: In this study, we investigate an end-to-end text-independent speaker verification system. The architecture consists of a deep neural network that takes a variable length speech segment and maps it to a speaker embedding. The objective function separates same-speaker and different-speaker pairs, and is reused during verification. Similar systems have recently shown promise for text-dependent verification, but we believe that this is unexplored for the text-independent task. We show that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error-rate (EER) and at low miss rates. Relative to the baseline, the end-to-end system reduces EER by 13% average and 29% pooled across test conditions. The fused system achieves a reduction of 32% average and 38% pooled.

480 citations
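
The speaker verification results above are reported as equal error rate (EER), the operating point where the miss rate equals the false-alarm rate. The sketch below shows one simple way to estimate it from two lists of trial scores. The function name and the threshold sweep are choices made for this illustration, not the evaluation procedure used in the paper, and the example scores are invented.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER from same-speaker and different-speaker trial scores.

    A trial is accepted when its score is >= the threshold. The function
    sweeps every observed score as a threshold and returns the average of
    the miss and false-alarm rates at the point where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # Miss: a same-speaker trial scored below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a different-speaker trial scored at or above it.
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2
    return best_eer


if __name__ == "__main__":
    # Invented scores: one same-speaker trial and one different-speaker
    # trial overlap, so one trial in four is misclassified on each side.
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(targets, nontargets))
```

With only a handful of trials the estimate is coarse; evaluation toolkits typically interpolate between operating points rather than picking the nearest one.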

Proceedings Article•10.1109/SLT.2018.8639535•

StarGAN-VC: non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks

Hirokazu Kameoka¹, Takuhiro Kaneko¹, Kou Tanaka¹, Nobukatsu Hojo¹•Institutions (1)

Nippon Telegraph and Telephone¹

6 Jun 2018

TL;DR: StarGAN-VC performs non-parallel many-to-many voice conversion using a variant of a generative adversarial network (GAN) called StarGAN, which learns many-to-many mappings across different attribute domains with a single generator network.

Abstract: This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across different attribute domains using a single generator network, (3) is able to generate converted speech signals quickly enough to allow real-time implementations and (4) requires only several minutes of training examples to generate reasonably realistic sounding speech. Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.

473 citations

Proceedings Article•10.1109/SLT.2014.7078572•

Spoken language understanding using long short-term memory neural networks

Kaisheng Yao¹, Baolin Peng¹, Yu Zhang¹, Dong Yu¹, Geoffrey Zweig¹, Yangyang Shi¹•Institutions (1)

Microsoft¹

1 Dec 2014

TL;DR: This paper investigates using long short-term memory (LSTM) neural networks, which contain input, output and forgetting gates and are more advanced than simple RNN, for the word labeling task and proposes a regression model on top of the LSTM un-normalized scores to explicitly model output-label dependence.

Abstract: Neural network based approaches have recently produced record-setting performances in natural language understanding tasks such as word labeling. In the word labeling task, a tagger is used to assign a label to each word in an input sequence. Specifically, simple recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been shown to significantly outperform the previous state of the art, conditional random fields (CRFs). This paper investigates using long short-term memory (LSTM) neural networks, which contain input, output and forgetting gates and are more advanced than simple RNNs, for the word labeling task. To explicitly model output-label dependence, we propose a regression model on top of the LSTM un-normalized scores. We also propose to apply deep LSTM to the task. We investigated the relative importance of each gate in the LSTM by setting other gates to a constant and only learning particular gates. Experiments on the ATIS dataset validated the effectiveness of the proposed models.

414 citations
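
The LSTM abstract above refers to input, output and forgetting gates, and to an experiment that holds some gates constant. The sketch below shows one step of a textbook LSTM cell (without peephole connections) plus an optional way to pin a gate to a constant. This is a generic formulation assumed for illustration; the paper's exact cell variant, and the names lstm_step, params, and fixed_gates, are not taken from the listing.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, vector, bias):
    """Compute weights @ vector + bias using nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params, fixed_gates=None):
    """One step of a standard LSTM cell.

    params maps "input", "forget", "output" and "cell" to (W, b) pairs,
    where W has shape hidden x (len(x) + hidden). fixed_gates optionally
    maps a gate name to a constant that replaces the learned gate value.
    """
    fixed_gates = fixed_gates or {}
    z = list(x) + list(h_prev)  # concatenated input and previous output

    def gate(name):
        if name in fixed_gates:
            return [fixed_gates[name]] * len(h_prev)
        weights, bias = params[name]
        return [sigmoid(a) for a in affine(weights, z, bias)]

    i_gate = gate("input")    # how much new content to write
    f_gate = gate("forget")   # how much old cell state to keep
    o_gate = gate("output")   # how much of the cell to expose
    w_cell, b_cell = params["cell"]
    candidate = [math.tanh(a) for a in affine(w_cell, z, b_cell)]

    c = [f * cp + i * g
         for f, cp, i, g in zip(f_gate, c_prev, i_gate, candidate)]
    h = [o * math.tanh(cj) for o, cj in zip(o_gate, c)]
    return h, c
```

Passing, for example, fixed_gates={"forget": 1.0} keeps the whole previous cell state at every step, which is one way to probe how much a learned gate contributes.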

Performance Metrics

Papers: 968

Citations: 5,040

No. of papers from the Conference in previous years
Year	Papers
2023	55
2022	91
2021	151
2018	149
2016	101
2014	103