Top 495 papers published in the topic of Speech processing in 2019

Journal Article•10.1109/ACCESS.2019.2896880•

Speech Recognition Using Deep Neural Networks: A Systematic Review

Ali Bou Nassif¹, Ismail Shahin¹, Imtinan Basem Attili¹, Mohammad Azzeh², Khaled Shaalan³•Institutions (3)

University of Sharjah¹, Applied Science Private University², British University in Dubai³

01 Feb 2019-IEEE Access

TL;DR: This review provides a thorough examination of the studies on deep learning for speech applications conducted since 2006, when deep learning first arose as a new area of machine learning.

Abstract: Over the past decades, a tremendous amount of research has been done on the use of machine learning for speech processing applications, especially speech recognition. However, in the past few years, research has focused on utilizing deep learning for speech-related applications. This new area of machine learning has yielded far better results when compared to others in a variety of applications including speech, and thus became a very attractive area of research. This paper provides a thorough examination of the different studies that have been conducted since 2006, when deep learning first arose as a new area of machine learning, for speech applications. A thorough statistical analysis is provided in this review which was conducted by extracting specific information from 174 papers published between the years 2006 and 2018. The results provided in this paper shed light on the trends of research in this area as well as bring focus to new research topics.

1,139 citations

Journal Article•10.1109/TASLP.2019.2915167•

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation

Yi Luo¹, Nima Mesgarani¹•Institutions (1)

Columbia University¹

01 Aug 2019-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers; separation is achieved by applying a set of weighting functions (masks) to the encoder output.

Abstract: Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time–frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time–frequency representation for speech separation, and the long latency in calculating the spectrograms. To address these shortcomings, we propose a fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network consisting of stacked one-dimensional dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time–frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study, therefore, represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
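Illustrative note: the abstract attributes the network's ability to model long-term dependencies at small model size to stacked one-dimensional dilated convolutions. The sketch below is not the authors' code; it only computes how many input frames one output frame can see for a stack of stride-1 dilated convolutions. The kernel size, the number of blocks, and the doubling dilation schedule are assumptions chosen for illustration, not values taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated 1-D convolutions."""
    field = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation frames.
        field += (kernel_size - 1) * d
    return field


# Hypothetical schedule: dilations 1, 2, 4, 8, with the whole stack repeated twice.
dilated = [2 ** i for i in range(4)] * 2
# Same depth (8 layers) without dilation, for comparison.
undilated = [1] * 8

print(receptive_field(3, dilated))    # 61
print(receptive_field(3, undilated))  # 17
```

With the same number of layers, and therefore the same number of weights, the dilated stack in this toy configuration covers 61 frames instead of 17.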

537 citations

Proceedings Article•10.1109/CVPR.2019.01034•

Capture, Learning, and Synthesis of 3D Speaking Styles

Daniel Cudeiro¹, Timo Bolkart¹, Cassidy Laidlaw¹, Anurag Ranjan¹, Michael J. Black¹•Institutions (1)

Max Planck Society¹

1 Jun 2019

TL;DR: In this article, the authors introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers and train a neural network on their dataset that factors identity from facial motion.

Abstract: Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation) takes any speech signal as input—even speech in languages other than English—and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de.

402 citations

Journal Article•10.1109/TAFFC.2017.2736999•

Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings

Reza Lotfian¹, Carlos Busso¹•Institutions (1)

University of Texas at Dallas¹

01 Oct 2019-IEEE Transactions on Affective Computing

TL;DR: The proposed approach combines machine learning algorithms that retrieve recordings conveying balanced emotional content with a cost-effective crowdsourced annotation process, which makes it possible to build a large-scale emotional speech database.

Abstract: The lack of a large, natural emotional database is one of the key barriers to translate results on speech emotion recognition in controlled conditions into real-life applications. Collecting emotional databases is expensive and time demanding, which limits the size of existing corpora. Current approaches used to collect spontaneous databases tend to provide unbalanced emotional content, which is dictated by the given recording protocol (e.g., positive for colloquial conversations, negative for discussion or debates). The size and speaker diversity are also limited. This paper proposes a novel approach to effectively build a large, naturalistic emotional database with balanced emotional content, reduced cost and reduced manual labor. It relies on existing spontaneous recordings obtained from audio-sharing websites. The proposed approach combines machine learning algorithms to retrieve recordings conveying balanced emotional content with a cost effective annotation process using crowdsourcing, which make it possible to build a large scale speech emotional database. This approach provides natural emotional renditions from multiple speakers, with different channel conditions and conveying balanced emotional content that are difficult to obtain with alternative data collection protocols.

370 citations

Journal Article•10.1109/TASLP.2019.2913512•

A New Framework for CNN-Based Speech Enhancement in the Time Domain

Ashutosh Pandey¹, DeLiang Wang¹•Institutions (1)

Ohio State University¹

01 Jul 2019-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper proposes a new learning mechanism for a fully convolutional neural network (CNN) that performs speech enhancement in the time domain; the CNN is trained with a mean absolute error loss between the enhanced and clean short-time Fourier transform (STFT) magnitudes.

Abstract: This paper proposes a new learning mechanism for a fully convolutional neural network (CNN) to address speech enhancement in the time domain. The CNN takes as input the time frames of noisy utterance and outputs the time frames of the enhanced utterance. At the training time, we add an extra operation that converts the time domain to the frequency domain. This conversion corresponds to simple matrix multiplication, and is hence differentiable implying that a frequency domain loss can be used for training in the time domain. We use mean absolute error loss between the enhanced short-time Fourier transform (STFT) magnitude and the clean STFT magnitude to train the CNN. This way, the model can exploit the domain knowledge of converting a signal to the frequency domain for analysis. Moreover, this approach avoids the well-known invalid STFT problem since the proposed CNN operates in the time domain. Experimental results demonstrate that the proposed method substantially outperforms the other methods of speech enhancement. The proposed method is easy to implement and applicable to related speech processing tasks that require time-frequency masking or spectral mapping.
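Illustrative note: the abstract states that the time-to-frequency conversion is a simple matrix multiplication, so a frequency-domain loss can be applied to time-domain output. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a rectangular window, and the frame length, hop size, and function names are hypothetical. It uses plain Python lists, so it shows the arithmetic only; in practice the same product would be written in an autodiff framework so that gradients flow back to the network.

```python
import cmath
import math


def dft_matrix(n):
    """n x n DFT matrix; multiplying a frame by it yields the frame's spectrum."""
    return [[cmath.exp(-2j * math.pi * k * t / n) for t in range(n)]
            for k in range(n)]


def frame_magnitudes(signal, frame_len, hop):
    """Cut the signal into frames and return |W @ frame| for every frame."""
    w = dft_matrix(frame_len)
    mags = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = [sum(w[k][t] * frame[t] for t in range(frame_len))
                    for k in range(frame_len)]
        mags.append([abs(c) for c in spectrum])
    return mags


def stft_magnitude_mae(enhanced, clean, frame_len=8, hop=4):
    """Mean absolute error between the STFT magnitudes of two equal-length signals.

    Assumes both signals are at least frame_len samples long.
    """
    e = frame_magnitudes(enhanced, frame_len, hop)
    c = frame_magnitudes(clean, frame_len, hop)
    total = sum(abs(a - b) for fe, fc in zip(e, c) for a, b in zip(fe, fc))
    count = sum(len(f) for f in c)
    return total / count


clean = [math.sin(0.3 * n) for n in range(16)]
attenuated = [0.5 * x for x in clean]
print(stft_magnitude_mae(clean, clean))        # 0.0
print(stft_magnitude_mae(attenuated, clean))   # positive: magnitudes differ
```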

300 citations

Journal Article•10.1109/JSTSP.2019.2922820•

SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

Katerina Zmolikova¹, Marc Delcroix², Keisuke Kinoshita², Tsubasa Ochiai², Tomohiro Nakatani², Lukas Burget¹, Jan Cernocky¹•Institutions (2)

Brno University of Technology¹, NTT Communications Corp²

13 Jun 2019-IEEE Journal of Selected Topics in Signal Processing

TL;DR: This paper introduces SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker and shows the benefit of including speaker information in the processing and the effectiveness of the proposed method.

Abstract: The processing of speech corrupted by interfering overlapping speakers is one of the challenging problems with regards to today's automatic speech recognition systems. Recently, approaches based on deep learning have made great progress toward solving this problem. Most of these approaches tackle the problem as speech separation, i.e., they blindly recover all the speakers from the mixture. In some scenarios, such as smart personal devices, we may however be interested in recovering one target speaker from a mixture. In this paper, we introduce SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker. Formulating the problem as speaker extraction avoids certain issues such as label permutation and the need to determine the number of speakers in the mixture. With SpeakerBeam, we jointly learn to extract a representation from the adaptation utterance characterizing the target speaker and to use this representation to extract the speaker. We explore several ways to do this, mostly inspired by speaker adaptation in acoustic models for automatic speech recognition. We evaluate the performance on the widely used WSJ0-2mix and WSJ0-3mix datasets, and these datasets modified with more noise or more realistic overlapping patterns. We further analyze the learned behavior by exploring the speaker representations and assessing the effect of the length of the adaptation data. The results show the benefit of including speaker information in the processing and the effectiveness of the proposed method.

292 citations

Journal Article•10.1109/TASLP.2019.2925934•

Speech Emotion Classification Using Attention-Based LSTM

Yue Xie¹, Ruiyu Liang¹, Zhenlin Liang¹, Chengwei Huang², Cairong Zou¹, Björn Schuller³•Institutions (3)

Southeast University¹, Chinese Academy of Sciences², University of Augsburg³

01 Nov 2019-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A novel method is proposed for speech emotion recognition using frame-level speech features combined with attention-based long short-term memory (LSTM) recurrent neural networks; it outperforms the state-of-the-art algorithms reported to date on the CASIA, eNTERFACE, and GEMEP corpora.

Abstract: Automatic speech emotion recognition has been a research hotspot in the field of human–computer interaction over the past decade. However, due to the lack of research on the inherent temporal relationship of the speech waveform, the current recognition accuracy needs improvement. To make full use of the difference of emotional saturation between time frames, a novel method is proposed for speech recognition using frame-level speech features combined with attention-based long short-term memory (LSTM) recurrent neural networks. Frame-level speech features were extracted from waveform to replace traditional statistical features, which could preserve the timing relations in the original speech through the sequence of frames. To distinguish emotional saturation in different frames, two improvement strategies are proposed for LSTM based on the attention mechanism: first, the algorithm reduces the computational complexity by modifying the forgetting gate of traditional LSTM without sacrificing performance and second, in the final output of the LSTM, an attention mechanism is applied to both the time and the feature dimension to obtain the information related to the task, rather than using the output from the last iteration of the traditional algorithm. Extensive experiments on the CASIA, eNTERFACE, and GEMEP emotion corpora demonstrate that the performance of the proposed approach is able to outperform the state-of-the-art algorithms reported to date.
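Illustrative note: the abstract contrasts attention over the LSTM outputs with using only the output of the last iteration. The sketch below shows a generic attention pooling over the time dimension only. It is an assumption-laden illustration, not the paper's mechanism, which also attends over the feature dimension and modifies the forgetting gate. The dot-product scoring and the `query` vector are hypothetical choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Weighted sum of per-frame output vectors.

    frames: list of T vectors (e.g. recurrent outputs, one per time frame).
    query:  hypothetical learned vector; a frame's score is its dot product with it.
    """
    scores = [sum(h * q for h, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[i] for w, frame in zip(weights, frames))
              for i in range(dim)]
    return pooled, weights


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frames, query=[10.0, 10.0])
# The third frame scores highest, so it dominates the pooled vector,
# regardless of where it sits in the sequence.
print(weights)
print(pooled)
```

With an all-zero query the weights become uniform and the pooling reduces to a plain average over frames.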

265 citations

Journal Article•10.1016/J.PATREC.2019.04.005•

Detecting Parkinson's Disease with Sustained Phonation and Speech Signals using Machine Learning Techniques

Jefferson S. Almeida, Pedro Pedrosa Rebouças Filho¹, Tiago Carneiro², Wei Wei, Robertas Damasevicius³, Rytis Maskeliūnas³, Victor Hugo C. de Albuquerque¹•Institutions (3)

University of Fortaleza¹, French Institute for Research in Computer Science and Automation², Kaunas University of Technology³

01 Jul 2019-Pattern Recognition Letters

TL;DR: Phonation tasks are shown to be more efficient than speech tasks for detecting the disease, and the results are compared with other approaches that use the same data set.

237 citations

Proceedings Article•10.1109/ASRU46091.2019.9003750•

A Comparative Study on Transformer vs RNN in Speech Applications

Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Yalta, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

13 Sep 2019-arXiv: Computation and Language

TL;DR: Transformer, an emergent sequence-to-sequence model, is experimentally compared with conventional recurrent neural networks (RNN) on ASR, ST, and TTS benchmarks, revealing significant performance benefits, including the surprising superiority of Transformer over RNN in 13 of 15 ASR benchmarks.

Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks for the community to succeed our exciting outcomes.

232 citations

Journal Article•10.1016/J.NEURON.2019.04.023•

The Encoding of Speech Sounds in the Superior Temporal Gyrus

Han Gyol Yi¹, Matthew K. Leonard¹, Edward F. Chang¹•Institutions (1)

University of California, San Francisco¹

19 Jun 2019-Neuron

TL;DR: A theory that temporally recurrent connections within STG generate context-dependent phonological representations, spanning longer temporal sequences relevant for coherent percepts of syllables, words, and phrases is presented.

220 citations

Journal Article•10.1109/MSP.2019.2918706•

Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques

Reinhold Haeb-Umbach¹, Shinji Watanabe², Tomohiro Nakatani, Michiel Bacchiani³, Bjorn Hoffmeister⁴, Michael L. Seltzer⁵, Heiga Zen³, Mehrez Souden⁴•Institutions (5)

University of Paderborn¹, Johns Hopkins University², Google³, Apple Inc.⁴, Facebook⁵

30 Oct 2019-IEEE Signal Processing Magazine

TL;DR: The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants.

Abstract: Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital home assistants with a spoken language interface have become a ubiquitous commodity today. This success has been made possible by major advancements in signal processing and machine learning for so-called far-field speech recognition, where the commands are spoken at a distance from the sound-capturing device. The challenges encountered are quite unique and different from many other use cases of automatic speech recognition (ASR). The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants. These technologies include multichannel acoustic echo cancellation (MAEC), microphone array processing and dereverberation techniques for signal enhancement, reliable wake-up word and end-of-interaction detection, and high-quality speech synthesis as well as sophisticated statistical models for speech and language, learned from large amounts of heterogeneous training data. In all of these fields, deep learning (DL) has played a critical role.

Journal Article•10.1523/JNEUROSCI.1828-18.2019•

Neural Speech Tracking in the Theta and in the Delta Frequency Band Differentially Encode Clarity and Comprehension of Speech in Noise.

Octave Etard¹, Tobias Reichenbach¹•Institutions (1)

Imperial College London¹

17 Jul 2019-The Journal of Neuroscience

TL;DR: The roles of cortical entrainment in different frequency bands and at different temporal lags for speech clarity, reflecting the acoustics of the signal, and speech comprehension, related to linguistic processing are disentangled.

Abstract: Humans excel at understanding speech even in adverse conditions such as background noise. Speech processing may be aided by cortical activity in the delta and theta frequency bands, which have been found to track the speech envelope. However, the rhythm of non-speech sounds is tracked by cortical activity as well. It therefore remains unclear which aspects of neural speech tracking represent the processing of acoustic features, related to the clarity of speech, and which aspects reflect higher-level linguistic processing related to speech comprehension. Here we disambiguate the roles of cortical tracking for speech clarity and comprehension through recording EEG responses to native and foreign language in different levels of background noise, for which clarity and comprehension vary independently. We then use both a decoding and an encoding approach to relate clarity and comprehension to the neural responses. We find that cortical tracking in the theta frequency band is mainly correlated to clarity, whereas the delta band contributes most to speech comprehension. Moreover, we uncover an early neural component in the delta band that informs on comprehension and that may reflect a predictive mechanism for language processing. Our results disentangle the functional contributions of cortical speech tracking in the delta and theta bands to speech processing. They also show that both speech clarity and comprehension can be accurately decoded from relatively short segments of EEG recordings, which may have applications in future mind-controlled auditory prosthesis.

Significance Statement: Speech is a highly complex signal whose processing requires analysis from lower-level acoustic features to higher-level linguistic information. Recent work has shown that neural activity in the delta and theta frequency bands tracks the rhythm of speech, but the role of this tracking for speech processing remains unclear. Here we disentangle the roles of cortical entrainment in different frequency bands and at different temporal lags for speech clarity, reflecting the acoustics of the signal, and speech comprehension, related to linguistic processing. We show that cortical speech tracking in the theta frequency band encodes mostly speech clarity, and thus acoustic aspects of the signal, whereas speech tracking in the delta band encodes the higher-level speech comprehension.

Journal Article•10.1038/S41593-019-0353-Z•

Spontaneous synchronization to speech reveals neural mechanisms facilitating language learning

M. Florencia Assaneo¹, Pablo Ripollés¹, Joan Orpella², Wy Ming Lin¹, Ruth de Diego-Balaguer, David Poeppel•Institutions (2)

New York University¹, University of Barcelona²

04 Mar 2019-Nature Neuroscience

TL;DR: A deceptively simple behavioral task robustly identifies two qualitatively different groups within the general population according to their speech-to-speech synchronization abilities; group membership predicts brain function and anatomy, as well as word-learning performance.

Abstract: We introduce a deceptively simple behavioral task that robustly identifies two qualitatively different groups within the general population. When presented with an isochronous train of random syllables, some listeners are compelled to align their own concurrent syllable production with the perceived rate, whereas others remain impervious to the external rhythm. Using both neurophysiological and structural imaging approaches, we show group differences with clear consequences for speech processing and language learning. When listening passively to speech, high synchronizers show increased brain-to-stimulus synchronization over frontal areas, and this localized pattern correlates with precise microstructural differences in the white matter pathways connecting frontal to auditory regions. Finally, the data expose a mechanism that underpins performance on an ecologically relevant word-learning task. We suggest that this task will help to better understand and characterize individual performance in speech processing and language learning. A simple behavioral task identifies two qualitatively different groups within the general population, according to their speech-to-speech synchronization abilities. Group pertinence predicts brain function and anatomy, as well as word-learning performance.

Journal Article•10.1016/J.SPECOM.2019.01.004•

End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition

Dimitri Palaz¹,², Mathew Magimai-Doss², Ronan Collobert²,³•Institutions (3)

École Polytechnique Fédérale de Lausanne¹, Idiap Research Institute², Facebook³

01 Apr 2019-Speech Communication

TL;DR: This paper investigates an end-to-end acoustic modeling approach using convolutional neural networks (CNNs), where the CNN takes as input raw speech signal and estimates the HMM states class conditional probabilities at the output.

Journal Article•10.1523/JNEUROSCI.0584-19.2019•

Semantic Context Enhances the Early Auditory Encoding of Natural Speech

Michael Broderick¹, Andrew J. Anderson², Edmund C. Lalor¹,²•Institutions (2)

Trinity College, Dublin¹, University of Rochester²

01 Aug 2019-The Journal of Neuroscience

TL;DR: This study relates a recently introduced measure of the semantic context of speech to a commonly used index of low-level auditory encoding; the results suggest a mechanism that links top-down prior information with bottom-up sensory processing in the context of natural, narrative speech listening.

Abstract: Speech perception involves the integration of sensory input with expectations based on the context of that speech. Much debate surrounds the issue of whether or not prior knowledge feeds back to affect early auditory encoding in the lower levels of the speech processing hierarchy, or whether perception can be best explained as a purely feedforward process. Although there has been compelling evidence on both sides of this debate, experiments involving naturalistic speech stimuli to address these questions have been lacking. Here, we use a recently introduced method for quantifying the semantic context of speech and relate it to a commonly used method for indexing low-level auditory encoding of speech. The relationship between these measures is taken to be an indication of how semantic context leading up to a word influences how its low-level acoustic and phonetic features are processed. We record EEG from human participants (both male and female) listening to continuous natural speech and find that the early cortical tracking of a word's speech envelope is enhanced by its semantic similarity to its sentential context. Using a forward modeling approach, we find that prediction accuracy of the EEG signal also shows the same effect. Furthermore, this effect shows distinct temporal patterns of correlation depending on the type of speech input representation (acoustic or phonological) used for the model, implicating a top-down propagation of information through the processing hierarchy. These results suggest a mechanism that links top-down prior information with the early cortical entrainment of words in natural, continuous speech.

Significance Statement: During natural speech comprehension, we use semantic context when processing information about new incoming words. However, precisely how the neural processing of bottom-up sensory information is affected by top-down context-based predictions remains controversial. We address this discussion using a novel approach that indexes a word's similarity to context and how well a word's acoustic and phonetic features are processed by the brain at the time of its utterance. We relate these two measures and show that lower-level auditory tracking of speech improves for words that are more related to their preceding context. These results suggest a mechanism that links top-down prior information with bottom-up sensory processing in the context of natural, narrative speech listening.

Journal Article•10.1109/TASLP.2019.2892235•

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

Jing-Xuan Zhang¹, Zhen-Hua Ling¹, Li-Juan Liu, Yuan Jiang¹, Li-Rong Dai¹•Institutions (1)

University of Science and Technology of China¹

01 Mar 2019-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this paper, a neural network named sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion, which can achieve better objective and subjective performance than the baseline methods.

Abstract: In this paper, a neural network named sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion. At training stage, a SCENT model is estimated by aligning the feature sequences of source and target speakers implicitly using attention mechanism. At the conversion stage, acoustic features and durations of source utterances are converted simultaneously using the unified acoustic model. Mel-scale spectrograms are adopted as acoustic features, which contain both excitation and vocal tract descriptions of speech signals. The bottleneck features extracted from source speech using an automatic speech recognition model are appended as an auxiliary input. A WaveNet vocoder conditioned on Mel-spectrograms is built to reconstruct waveforms from the outputs of the SCENT model. It is worth noting that our proposed method can achieve appropriate duration conversion, which is difficult in conventional methods. Experimental results show that our proposed method obtained better objective and subjective performance than the baseline methods using Gaussian mixture models and deep neural networks as acoustic models. This proposed method also outperformed our previous work, which achieved the top rank in Voice Conversion Challenge 2018. Ablation tests further confirmed the effectiveness of several components in our proposed method.

Posted Content•

Almost Unsupervised Text to Speech and Automatic Speech Recognition

Yi Ren¹, Xu Tan², Tao Qin², Sheng Zhao², Zhou Zhao¹, Tie-Yan Liu²•Institutions (2)

Zhejiang University¹, Microsoft²

13 May 2019-arXiv: Audio and Speech Processing

TL;DR: In this paper, a denoising auto-encoder is used to reconstruct the speech and text sequences respectively to develop the capability of language modeling both in the speech domain and the text domain.

Abstract: Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing and both achieve impressive performance thanks to the recent advance in deep learning and large amount of aligned speech and text data. However, the lack of aligned data poses a major practical problem for TTS and ASR on low-resource languages. In this paper, by leveraging the dual nature of the two tasks, we propose an almost unsupervised learning method that only leverages few hundreds of paired data and extra unpaired data for TTS and ASR. Our method consists of the following components: (1) a denoising auto-encoder, which reconstructs speech and text sequences respectively to develop the capability of language modeling both in speech and text domain; (2) dual transformation, where the TTS model transforms the text $y$ into speech $\hat{x}$, and the ASR model leverages the transformed pair $(\hat{x},y)$ for training, and vice versa, to boost the accuracy of the two tasks; (3) bidirectional sequence modeling, which addresses error propagation especially in the long speech and text sequence when training with few paired data; (4) a unified model structure, which combines all the above components for TTS and ASR based on Transformer model. Our method achieves 99.84% in terms of word level intelligible rate and 2.68 MOS for TTS, and 11.7% PER for ASR on LJSpeech dataset, by leveraging only 200 paired speech and text data (about 20 minutes audio), together with extra unpaired speech and text data.
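Illustrative note: the abstract reports ASR quality as a phoneme error rate (PER). The sketch below shows how such a rate is commonly computed, as the Levenshtein edit distance between the reference and hypothesis phoneme sequences divided by the reference length. This is an assumption about the metric's usual definition, not the authors' evaluation script, and the phoneme strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by reference length; ref must be non-empty."""
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: one phoneme of four is dropped by the recognizer.
reference = ["HH", "AH", "L", "OW"]
hypothesis = ["HH", "AH", "L"]
print(phoneme_error_rate(reference, hypothesis))  # 0.25
```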

Journal Article•10.1111/NYAS.14137•

Speech rhythm and language acquisition: an amplitude modulation phase hierarchy perspective

Usha Goswami¹•Institutions (1)

University of Cambridge¹

01 Oct 2019-Annals of the New York Academy of Sciences

TL;DR: The “amplitude modulation phase hierarchy” theoretical perspective on language acquisition is applicable across languages, and cross‐language investigations adopting this novel perspective would be valuable for the field.

Abstract: Language lies at the heart of our experience as humans and disorders of language acquisition carry severe developmental costs. Rhythmic processing lies at the heart of language acquisition. Here, I review our understanding of the perceptual and neural mechanisms that support language acquisition, from a novel amplitude modulation perspective. Amplitude modulation patterns in infant- and child-directed speech support the perceptual experience of rhythm, and the brain encodes these rhythm patterns in part via neuroelectric oscillations. When brain rhythms align themselves with (entrain to) acoustic rhythms, speech intelligibility improves. Recent advances in the auditory neuroscience of speech processing enable studies of neuronal oscillatory entrainment in children and infants. The "amplitude modulation phase hierarchy" theoretical perspective on language acquisition is applicable across languages, and cross-language investigations adopting this novel perspective would be valuable for the field.

Journal Article•10.1109/TASLP.2018.2870742•

Phase-Aware Speech Enhancement Based on Deep Neural Networks

Naijun Zheng¹, Xiao-Lei Zhang²•Institutions (2)

Xidian University¹, Northwestern Polytechnical University²

01 Jan 2019-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper proposes a phase-aware speech enhancement algorithm based on DNN to transform an unstructured phase spectrogram to its derivative along the time axis, i.e., instantaneous frequency deviation (IFD), which has a similar structure with its corresponding magnitude spectrogram.

Abstract: Short-time Fourier transform (STFT) is fundamental in speech processing. Because of the difficulty of processing highly unstructured STFT phase, most speech-processing algorithms only operate with STFT magnitude, leaving the STFT phase far from explored. However, with the recent development of deep neural network (DNN) based speech processing, e.g., speech enhancement and recognition, phase processing is becoming more important than ever before as a new growing point of DNN-based methods. In this paper, we propose a phase-aware speech enhancement algorithm based on DNN. Specifically, in the training stage, when incorporating phase as a target, our core idea is to transform an unstructured phase spectrogram to its derivative along the time axis, i.e., instantaneous frequency deviation (IFD), which has a similar structure with its corresponding magnitude spectrogram. We further propose to optimize both IFD and magnitude jointly in a multiobjective learning framework. In the test stage, we propose a postprocessing method to recover the phase spectrogram from the estimated IFD. Experimental results demonstrate the effectiveness of the proposed method.
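
As a rough illustration of the phase-to-IFD transform mentioned in this abstract, the following is a minimal sketch and not the authors' implementation. It assumes the common definition of IFD: the frame-to-frame phase difference, minus the phase advance expected at each bin's center frequency, wrapped to its principal value. The function name and the `hop` and `n_fft` parameters are hypothetical.

```python
import numpy as np

def instantaneous_frequency_deviation(phase, hop, n_fft):
    """Approximate IFD from an STFT phase array of shape (n_bins, n_frames).

    Assumed definition (not taken from the paper): first-order time
    difference of the phase, minus the per-hop phase advance of a sinusoid
    at each bin's center frequency, wrapped to [-pi, pi).
    """
    n_bins = phase.shape[0]
    # Phase advance per hop for a sinusoid exactly at bin k: 2*pi*k*hop/n_fft.
    expected = 2.0 * np.pi * hop * np.arange(n_bins) / n_fft
    # Derivative along the time axis, approximated by a frame difference.
    delta = np.diff(phase, axis=1) - expected[:, None]
    # Wrap so the deviation is a principal value.
    return (delta + np.pi) % (2.0 * np.pi) - np.pi
```

Under this definition, a stationary sinusoid sitting exactly on a bin center yields an IFD of zero in that bin, which is one way to see why the IFD spectrogram is more structured than the raw phase.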

Proceedings Article•10.18653/V1/D19-1566•

Context-aware Interactive Attention for Multi-modal Sentiment and Emotion Analysis

Dushyant Singh Chauhan¹, Shad Akhtar¹, Asif Ekbal¹, Pushpak Bhattacharyya¹•Institutions (1)

Indian Institute of Technology Patna¹

01 Nov 2019

TL;DR: A recurrent neural network based approach for the multi-modal sentiment and emotion analysis that learns the inter-modal interaction among the participating modalities through an auto-encoder mechanism.

Abstract: In recent times, multi-modal analysis has been an emerging and highly sought-after field at the intersection of natural language processing, computer vision, and speech processing. The prime objective of such studies is to leverage the diversified information, (e.g., textual, acoustic and visual), for learning a model. The effective interaction among these modalities often leads to a better system in terms of performance. In this paper, we introduce a recurrent neural network based approach for the multi-modal sentiment and emotion analysis. The proposed model learns the inter-modal interaction among the participating modalities through an auto-encoder mechanism. We employ a context-aware attention module to exploit the correspondence among the neighboring utterances. We evaluate our proposed approach for five standard multi-modal affect analysis datasets. Experimental results suggest the efficacy of the proposed model for both sentiment and emotion analysis over various existing state-of-the-art systems.

Proceedings Article•10.1109/ICASSP.2019.8682245•

Single-channel Speech Extraction Using Speaker Inventory and Attention Network

Xiong Xiao¹, Zhuo Chen¹, Takuya Yoshioka¹, Hakan Erdogan¹, Changliang Liu¹, Dimitrios Dimitriadis¹, Jasha Droppo¹, Yifan Gong¹•Institutions (1)

Microsoft¹

12 May 2019

TL;DR: A novel speech extraction method that utilizes an inventory of voice snippets of possible interfering speakers, or speaker enrollment data, in addition to that of the target speaker is proposed, and an attention-based network architecture is proposed to form time-varying masks for both the target and other speakers during the separation process.

Abstract: Neural network-based speech separation has received a surge of interest in recent years. Previously proposed methods either are speaker independent or extract a target speaker’s voice by using his or her voice snippet. In applications such as home devices or office meeting transcriptions, a possible speaker list is available, which can be leveraged for speech separation. This paper proposes a novel speech extraction method that utilizes an inventory of voice snippets of possible interfering speakers, or speaker enrollment data, in addition to that of the target speaker. Furthermore, an attention-based network architecture is proposed to form time-varying masks for both the target and other speakers during the separation process. This architecture does not reduce the enrollment audio of each speaker into a single vector, thereby allowing each short time frame of the input mixture signal to be aligned and accurately compared with the enrollment signals. We evaluate the proposed system on a speaker extraction task derived from the Libri corpus and show the effectiveness of the method.

Journal Article•10.1109/TASLP.2019.2898816•

Curriculum Learning for Speech Emotion Recognition From Crowdsourced Labels

Reza Lotfian¹, Carlos Busso¹•Institutions (1)

University of Texas at Dallas¹

01 Apr 2019-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This study introduces a method to design a curriculum for machine-learning to maximize the efficiency during the training process of deep neural networks (DNNs) for speech emotion recognition and proposes metrics that quantify the inter-evaluation agreement to define the curriculum for regression problems and binary and multi-class classification problems.

Abstract: This study introduces a method to design a curriculum for machine-learning to maximize the efficiency during the training process of deep neural networks (DNNs) for speech emotion recognition. Previous studies in other machine-learning problems have shown the benefits of training a classifier following a curriculum where samples are gradually presented in increasing order of difficulty. For speech emotion recognition, the challenge is to establish a natural order of difficulty in the training set to create the curriculum. We address this problem by assuming that samples that are ambiguous for humans are also ambiguous for computers. Speech samples are often annotated by multiple evaluators to account for differences in emotion perception across individuals. While some sentences with clear emotional content are consistently annotated, sentences with more ambiguous emotional content present important disagreement between individual evaluations. We propose to use the disagreement between evaluators as a measure of difficulty for the classification task. We propose metrics that quantify the inter-evaluation agreement to define the curriculum for regression problems and binary and multi-class classification problems. The experimental results consistently show that relying on a curriculum based on agreement between human judgments leads to statistically significant improvements over baselines trained without a curriculum.
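
The idea of ordering training samples by annotator disagreement can be sketched as follows. This is only an illustration: the abstract does not specify the agreement metrics, so Shannon entropy of the label distribution is used here as a stand-in, and the function names and the `annotations` mapping are hypothetical.

```python
from collections import Counter
import math

def disagreement(labels):
    """Shannon entropy (bits) of one sample's annotator labels.

    Stand-in difficulty measure (an assumption, not the paper's metric):
    0.0 when all annotators agree, larger when labels are spread out.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def curriculum_order(annotations):
    """Return sample ids sorted from most agreed-upon to most ambiguous."""
    return sorted(annotations, key=lambda sid: disagreement(annotations[sid]))
```

A training loop following this curriculum would present the samples early in the returned order first and add the more ambiguous ones in later stages.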

Posted Content•

A Comparative Study of Glottal Source Estimation Techniques

Thomas Drugman¹, Baris Bozkurt², Thierry Dutoit¹•Institutions (2)

University of Mons¹, İzmir Institute of Technology²

28 Dec 2019-arXiv: Sound

TL;DR: In this paper, the authors compared three state-of-the-art methods of glottal flow estimation: closed-phase inverse filtering, iterative and adaptive inverse filtering and mixed-phase decomposition.

Abstract: Source-tract decomposition (or glottal flow estimation) is one of the basic problems of speech processing. For this, several techniques have been proposed in the literature. However studies comparing different approaches are almost nonexistent. Besides, experiments have been systematically performed either on synthetic speech or on sustained vowels. In this study we compare three of the main representative state-of-the-art methods of glottal flow estimation: closed-phase inverse filtering, iterative and adaptive inverse filtering, and mixed-phase decomposition. These techniques are first submitted to an objective assessment test on synthetic speech signals. Their sensitivity to various factors affecting the estimation quality, as well as their robustness to noise are studied. In a second experiment, their ability to label voice quality (tensed, modal, soft) is studied on a large corpus of real connected speech. It is shown that changes of voice quality are reflected by significant modifications in glottal feature distributions. Techniques based on the mixed-phase decomposition and on a closed-phase inverse filtering process turn out to give the best results on both clean synthetic and real speech signals. On the other hand, iterative and adaptive inverse filtering is recommended in noisy environments for its high robustness.

Journal Article•10.1109/MWC.2019.1800419•

An Audio-Visual Emotion Recognition System Using Deep Learning Fusion for a Cognitive Wireless Framework

M. Shamim Hossain¹, Ghulam Muhammad¹•Institutions (1)

King Saud University¹

01 Jul 2019-IEEE Wireless Communications

TL;DR: This article proposes an automatic audio-visual emotion recognition system in a connected healthcare framework that uses a 2D CNN model for the speech modality and a 3D network for the visual modality, and uses edge caching prior to intensive-processing cloud computing.

Abstract: Automatically recognizing emotions of patients can be a good facilitator of a connected healthcare framework. It can give automatic feedback to the stakeholders of the healthcare industry about patients' states and satisfaction levels. In this article, we propose an automatic audio-visual emotion recognition system in a connected healthcare framework. The system uses a 2D CNN model for the speech modality and a 3D CNN model for the visual modality. For the speech signal, preprocessing is done to extract the PS-PA feature vector. The features from the two CNN models are blended by two ELM networks. The first ELM is trained with gender-specific data, while the other one is trained with emotion-specific data. The proposed system is evaluated using three databases, and the experiments prove the success of the system. In the healthcare framework, we use edge computing prior to intensive-processing cloud computing. In the edge computing, we realize edge caching, which can store the CNN model parameters and thereby perform the testing fast.

Posted Content•

The Deterministic plus Stochastic Model of the Residual Signal and its Applications

Thomas Drugman¹, Thierry Dutoit¹•Institutions (1)

University of Mons¹

29 Dec 2019-arXiv: Sound

TL;DR: In this paper, a Deterministic plus Stochastic model (DSM) of the residual signal is proposed, which consists of two contributions acting in two distinct spectral bands delimited by a maximum voiced frequency.

Abstract: The modeling of speech production often relies on a source-filter approach. Although methods parameterizing the filter have nowadays reached a certain maturity, there is still a lot to be gained for several speech processing applications in finding an appropriate excitation model. This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal. The DSM consists of two contributions acting in two distinct spectral bands delimited by a maximum voiced frequency. Both components are extracted from an analysis performed on a speaker-dependent dataset of pitch-synchronous residual frames. The deterministic part models the low-frequency contents and arises from an orthonormal decomposition of these frames. As for the stochastic component, it is a high-frequency noise modulated both in time and frequency. Some interesting phonetic and computational properties of the DSM are also highlighted. The applicability of the DSM in two fields of speech processing is then studied. First, it is shown that incorporating the DSM vocoder in HMM-based speech synthesis enhances the delivered quality. The proposed approach turns out to significantly outperform the traditional pulse excitation and provides a quality equivalent to STRAIGHT. In a second application, the potential of glottal signatures derived from the proposed DSM is investigated for speaker identification purpose. Interestingly, these signatures are shown to lead to better recognition rates than other glottal-based methods.

Journal Article•10.1016/J.NEUROBIOLAGING.2019.05.015•

Musical training improves the ability to understand speech-in-noise in older adults.

Benjamin Rich Zendel, Greg L. West¹, Sylvie Belleville¹, Isabelle Peretz¹•Institutions (1)

Université de Montréal¹

29 May 2019-Neurobiology of Aging

TL;DR: The findings support the idea that musical training provides a causal benefit to hearing abilities and suggest that musical training could be used as a foundation to develop auditory rehabilitation programs for older adults.

Posted Content•

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining.

Wen-Chin Huang¹, Tomoki Hayashi¹, Yi-Chiao Wu¹, Hirokazu Kameoka², Tomoki Toda¹•Institutions (2)

Nagoya University¹, Nippon Telegraph and Telephone²

14 Dec 2019-arXiv: Audio and Speech Processing

TL;DR: Experimental results show that a simple yet effective pretraining technique to transfer knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora, can facilitate data-efficient training and outperform an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.

Abstract: We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, their data-hungry property and the mispronunciation of converted speech make seq2seq models far from practical. To this end, we propose a simple yet effective pretraining technique to transfer knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora. VC models initialized with such pretrained model parameters are able to generate effective hidden representations for high-fidelity, highly intelligible converted speech. Experimental results show that such a pretraining scheme can facilitate data-efficient training and outperform an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.

Journal Article•10.1038/S41467-019-11710-Y•

Invariance to background noise as a signature of non-primary auditory cortex.

Alexander J. E. Kell, Josh H. McDermott

02 Sep 2019-Nature Communications

TL;DR: It is shown that areas of the auditory cortex differ in the extent to which their responses to sounds are altered by the presence of background noise, illustrating a representational consequence of hierarchical organization in the auditory system.

Abstract: Despite well-established anatomical differences between primary and non-primary auditory cortex, the associated representational transformations have remained elusive. Here we show that primary and non-primary auditory cortex are differentiated by their invariance to real-world background noise. We measured fMRI responses to natural sounds presented in isolation and in real-world noise, quantifying invariance as the correlation between the two responses for individual voxels. Non-primary areas were substantially more noise-invariant than primary areas. This primary-nonprimary difference occurred both for speech and non-speech sounds and was unaffected by a concurrent demanding visual task, suggesting that the observed invariance is not specific to speech processing and is robust to inattention. The difference was most pronounced for real-world background noise: both primary and non-primary areas were relatively robust to simple types of synthetic noise. Our results suggest a general representational transformation between auditory cortical stages, illustrating a representational consequence of hierarchical organization in the auditory system.
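
The invariance measure described in this abstract, a per-voxel correlation between responses to sounds in isolation and in noise, can be sketched as below. This is a plain Pearson correlation under assumed array shapes. The function name is hypothetical, and any reliability correction or other preprocessing the study may have applied is not represented here.

```python
import numpy as np

def voxel_invariance(clean, noisy):
    """Pearson correlation per voxel between two response matrices.

    clean, noisy: arrays of shape (n_sounds, n_voxels) holding each voxel's
    response to every sound in isolation and in background noise
    (shapes are an assumption made for this sketch).
    """
    clean_c = clean - clean.mean(axis=0)
    noisy_c = noisy - noisy.mean(axis=0)
    # Covariance and variance terms are summed over sounds, per voxel.
    num = (clean_c * noisy_c).sum(axis=0)
    den = np.sqrt((clean_c ** 2).sum(axis=0) * (noisy_c ** 2).sum(axis=0))
    return num / den
```

A voxel whose response pattern across sounds is unchanged by the added noise scores close to 1, while a voxel whose pattern is disrupted scores lower.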

Book Chapter•10.1007/978-3-030-42504-3_16•

Privacy Implications of Voice and Speech Analysis - Information Disclosure by Inference.

Jacob Leon Kröger¹, Otto Hans-Martin Lutz¹,², Philip Raschke¹•Institutions (2)

Technical University of Berlin¹, Fraunhofer Institute for Open Communication Systems²

19 Aug 2019

TL;DR: An overview of sensitive pieces of information that can be derived from human speech and other acoustic elements in recorded audio is presented, demonstrating that recent advances in voice and speech processing induce a new generation of privacy threats.

Abstract: Internet-connected devices, such as smartphones, smartwatches, and laptops, have become ubiquitous in modern life, reaching ever deeper into our private spheres. Among the sensors most commonly found in such devices are microphones. While various privacy concerns related to microphone-equipped devices have been raised and thoroughly discussed, the threat of unexpected inferences from audio data remains largely overlooked. Drawing from literature of diverse disciplines, this paper presents an overview of sensitive pieces of information that can, with the help of advanced data analysis methods, be derived from human speech and other acoustic elements in recorded audio. In addition to the linguistic content of speech, a speaker’s voice characteristics and manner of expression may implicitly contain a rich array of personal information, including cues to a speaker’s biometric identity, personality, physical traits, geographical origin, emotions, level of intoxication and sleepiness, age, gender, and health condition. Even a person’s socioeconomic status can be reflected in certain speech patterns. The findings compiled in this paper demonstrate that recent advances in voice and speech processing induce a new generation of privacy threats.

Journal Article•10.1111/NYAS.14099•

Phase−amplitude coupling between theta and gamma oscillations adapts to speech rate

Mikel Lizarazu, Marie Lallier, Nicola Molinaro¹•Institutions (1)

Ikerbasque¹

01 Oct 2019-Annals of the New York Academy of Sciences

TL;DR: It is shown that the peak of the gamma response—coupled to the phase of theta—follows the speech rate, which indicates that gamma activity in auditory regions synchronizes with the fine‐grain properties of speech, possibly reflecting detailed acoustic analysis of the input.

Abstract: Low- and high-frequency cortical oscillations play an important role in speech processing. Low-frequency neural oscillations in the delta (<4 Hz) and theta (4-8 Hz) bands entrain to the prosodic and syllabic rates of speech, respectively. Theta band neural oscillations modulate high-frequency neural oscillations in the gamma band (28-40 Hz), which have been hypothesized to be crucial for processing phonemes in natural speech. Since speech rate is known to vary considerably, both between and within talkers, it has yet to be determined whether this nested gamma response reflects an externally induced rhythm sensitive to the rate of the fine-grained structure of the input or a speech rate-independent endogenous response. Here, we recorded magnetoencephalography responses from participants listening to speech delivered at different rates: decelerated, normal, and accelerated. We found that the phase of theta band oscillations in left and right auditory regions adjusts to speech rate variations. Importantly, we showed that the peak of the gamma response, coupled to the phase of theta, follows the speech rate. This indicates that gamma activity in auditory regions synchronizes with the fine-grain properties of speech, possibly reflecting detailed acoustic analysis of the input.
