Top 483 papers published in the topic of Speech processing in 2020

Showing papers on "Speech processing" published in 2020

Automated assessment of psychiatric disorders using speech: A systematic review.

Daniel M. Low¹,², Kate H. Bentley¹,³, Satrajit S. Ghosh¹,³•Institutions (3)

Harvard University¹, Massachusetts Institute of Technology², McGovern Institute for Brain Research³

31 Jan 2020

TL;DR: This is the first systematic review of studies using speech for automated assessment across a broad range of psychiatric disorders; earlier reviews focused only on using acoustic features from speech to detect depression and schizophrenia.

Abstract: Objective: There are many barriers to accessing mental health assessments including cost and stigma. Even when individuals receive professional care, assessments are intermittent and may be limited partly due to the episodic nature of psychiatric symptoms. Therefore, machine-learning technology using speech samples obtained in the clinic or remotely could one day be a biomarker to improve diagnosis and treatment. To date, reviews have only focused on using acoustic features from speech to detect depression and schizophrenia. Here, we present the first systematic review of studies using speech for automated assessments across a broader range of psychiatric disorders. Methods: We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines. We included studies from the last 10 years using speech to identify the presence or severity of disorders within the Diagnostic and Statistical Manual of Mental Disorders (DSM-5). For each study, we describe sample size, clinical evaluation method, speech-eliciting tasks, machine learning methodology, performance, and other relevant findings. Results: 1395 studies were screened, of which 127 met the inclusion criteria. The majority of studies were on depression, schizophrenia, and bipolar disorder, and the remaining on post-traumatic stress disorder, anxiety disorders, and eating disorders. 63% of studies built machine learning predictive models, and the remaining 37% performed null-hypothesis testing only. We provide an online database with our search results and synthesize how acoustic features appear in each disorder. Conclusion: Speech processing technology could aid mental health assessments, but there are many obstacles to overcome, especially the need for comprehensive transdiagnostic and longitudinal studies. Given the diverse types of data sets, feature extraction, computational methodologies, and evaluation criteria, we provide guidelines for both acquiring data and building machine learning models with a focus on testing hypotheses, open science, reproducibility, and generalizability. Level of evidence: 3a.

312 citations

Journal Article•10.1002/ADMA.201904020•

Flexible Piezoelectric Acoustic Sensors and Machine Learning for Speech Processing.

Younghoon Jung¹, Seong Kwang Hong¹, Hee Seung Wang¹, Jae Hyun Han¹, Trung X. Pham¹, Hyunsin Park¹, Junyeong Kim¹, Sunghun Kang¹, Chang D. Yoo¹, Keon Jae Lee¹•Institutions (1)

KAIST¹

01 Sep 2020-Advanced Materials

TL;DR: Significant developments in speech recognition are reviewed in terms of flexible piezoelectric materials, self‐powered sensors, machine learning algorithms, and speaker recognition.

Abstract: Flexible piezoelectric acoustic sensors have been developed to generate multiple sound signals with high sensitivity, shifting the paradigm of future voice technologies. Speech recognition based on advanced acoustic sensors and optimized machine learning software will serve as an innovative interface for artificial intelligence (AI) services. Collaboration and novel approaches between both smart sensors and speech algorithms should be attempted to realize a hyperconnected society, which can offer personalized services such as biometric authentication, AI secretaries, and home appliances. Here, representative developments in speech recognition are reviewed in terms of flexible piezoelectric materials, self-powered sensors, machine learning algorithms, and speaker recognition.

261 citations

Proceedings Article•10.18653/V1/2020.ACL-DEMOS.34•

ESPnet-ST: All-in-One Speech Translation Toolkit

Hirofumi Inaguma¹, Shun Kiyono², Kevin Duh³, Shigeki Karita⁴, Nelson Yalta, Tomoki Hayashi⁵, Shinji Watanabe³•Institutions (5)

Kyoto University¹, Tohoku University², Johns Hopkins University³, Nippon Telegraph and Telephone⁴, Nagoya University⁵

1 Apr 2020

TL;DR: ESPnet-ST is a new project inside the end-to-end speech processing toolkit ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.

Abstract: We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework. ESPnet-ST is a new project inside the end-to-end speech processing toolkit ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation. We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines for a wide range of benchmark datasets. Our reproducible results can match or even outperform the current state-of-the-art performances; these pre-trained models are downloadable. The toolkit is publicly available at https://github.com/espnet/espnet.

211 citations

Posted Content•

Recent Developments on ESPnet Toolkit Boosted by Conformer

Pengcheng Guo¹, Florian Boyer², Xuankai Chang³, Tomoki Hayashi, Yosuke Higuchi⁴, Hirofumi Inaguma⁵, Naoyuki Kamo⁶, Chenda Li⁷, Daniel Garcia-Romero³, Jiatong Shi³, Jing Shi³, Shinji Watanabe³, Kun Wei¹, Wangyou Zhang⁷, Yuekai Zhang³•Institutions (7)

Northwestern Polytechnical University¹, University of Bordeaux², Johns Hopkins University³, Waseda University⁴, Kyoto University⁵, Nippon Telegraph and Telephone⁶, Shanghai Jiao Tong University⁷

26 Oct 2020-arXiv: Audio and Speech Processing

TL;DR: This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS) and text-to-speech (TTS).

Abstract: In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involve a recently proposed architecture called Conformer, the Convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS) and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with or even outperform the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all the above tasks with pre-trained models. Our aim for this work is to contribute to our research community by reducing the burden of preparing state-of-the-art research environments usually requiring high resources.

208 citations

Journal Article•10.1016/J.NEURON.2019.10.019•

Two Distinct Neural Timescales for Predictive Speech Processing

Peter W. Donhauser¹, Sylvain Baillet¹•Institutions (1)

Montreal Neurological Institute and Hospital¹

22 Jan 2020-Neuron

TL;DR: It is shown that speech-related activity is hierarchically organized into two timescales: fast responses (theta: 4-10 Hz), restricted to early auditory regions, and slow responses (delta: 0.5-4 Hz), dominating in downstream auditory regions; and that theta sensory sampling is tuned to maximize expected information gain, while delta encodes only non-redundant information.

200 citations

Journal Article•10.1109/TIFS.2019.2941773•

Fusing MFCC and LPC Features Using 1D Triplet CNN for Speaker Recognition in Severely Degraded Audio Signals

Anurag Chowdhury¹, Arun Ross¹•Institutions (1)

Michigan State University¹

01 Jan 2020-IEEE Transactions on Information Forensics and Security

TL;DR: This work approaches the problem of speaker recognition from severely degraded audio data by judiciously combining two commonly used features, Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC), building on the observation that MFCC and LPC capture two distinct aspects of speech, viz., speech perception and speech production.

Abstract: Speaker recognition algorithms are negatively impacted by the quality of the input speech signal. In this work, we approach the problem of speaker recognition from severely degraded audio data by judiciously combining two commonly used features: Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC). Our hypothesis rests on the observation that MFCC and LPC capture two distinct aspects of speech, viz., speech perception and speech production. A carefully crafted 1D Triplet Convolutional Neural Network (1D-Triplet-CNN) is used to combine these two features in a novel manner, thereby enhancing the performance of speaker recognition in challenging scenarios. Extensive evaluation on multiple datasets, different types of audio degradations, multi-lingual speech, varying length of audio samples, etc. conveys the efficacy of the proposed approach over existing speaker recognition methods, including those based on iVector and xVector.

161 citations

Journal Article•10.1080/23273798.2019.1693050•

Synchronous, but not entrained: Exogenous and endogenous cortical rhythms of speech and language processing

Lars Meyer¹, Yue Sun¹, Andrea E. Martin¹•Institutions (1)

Max Planck Society¹

03 Nov 2020-Language, cognition and neuroscience

TL;DR: This article examines "entrainment", a phenomenon in which the cortex shadows rhythmic acoustic information with oscillatory activity and on which research on speech processing is often focused.

Abstract: Research on speech processing is often focused on a phenomenon termed “entrainment”, whereby the cortex shadows rhythmic acoustic information with oscillatory activity. Entrainment has been observe...

147 citations

Proceedings Article•10.1109/ICASSP40776.2020.9054734•

F0-Consistent Many-To-Many Non-Parallel Voice Conversion Via Conditional Autoencoder

Kaizhi Qian¹, Zeyu Jin², Mark Hasegawa-Johnson¹, Gautham J. Mysore²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Adobe Systems²

15 Apr 2020

TL;DR: This work modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time and can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.

Abstract: Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Many style-transfer-inspired methods such as generative adversarial networks (GANs) and variational autoencoders (VAEs) have been proposed. Recently, AutoVC, a conditional autoencoder (CAE) based method, achieved state-of-the-art results by disentangling the speaker identity and speech content using information-constraining bottlenecks, and it achieves zero-shot conversion by swapping in a different speaker’s identity embedding to synthesize a new voice. However, we found that while speaker identity is disentangled from speech content, a significant amount of prosodic information, such as source F0, leaks through the bottleneck, causing target F0 to fluctuate unnaturally. Furthermore, AutoVC has no control of the converted F0 and is thus unsuitable for many applications. In the paper, we modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time. Therefore, we can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity. We support our improvement through quantitative and qualitative analysis.

139 citations

Journal Article•10.1016/J.COPHYS.2020.07.014•

Continuous speech processing.

Christian Brodbeck¹, Jonathan Z. Simon¹•Institutions (1)

University of Maryland, College Park¹

01 Dec 2020-Current Opinion in Physiology

TL;DR: Two lines of research are closely related, since processing stages throughout auditory cortex contribute to speech comprehension, in addition to subcortical processing and higher order and attentional processes.

126 citations

Journal Article•10.1109/TASLP.2020.2987429•

SpEx: Multi-Scale Time Domain Speaker Extraction Network

Chenglin Xu¹, Wei Rao², Eng Siong Chng¹, Haizhou Li²•Institutions (2)

Nanyang Technological University¹, National University of Singapore²

14 Apr 2020-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper proposes a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.

Abstract: Speaker extraction aims to mimic humans’ selective auditory attention by extracting a target speaker's voice from a multi-talker environment. It is common to perform the extraction in frequency-domain, and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra. However, such an approach is adversely affected by the inherent difficulty of phase estimation. Inspired by Conv-TasNet, we propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra. In this way, we avoid phase estimation. The SpEx network consists of four network components, namely speaker encoder, speech encoder, speaker extractor, and speech decoder. Specifically, the speech encoder converts the mixture speech into multi-scale embedding coefficients, while the speaker encoder learns to represent the target speaker with a speaker embedding. The speaker extractor takes the multi-scale embedding coefficients and target speaker embedding as input and estimates a receptive mask. Finally, the speech decoder reconstructs the target speaker's speech from the masked embedding coefficients. We also propose a multi-task learning framework and a multi-scale embedding implementation. Experimental results show that the proposed SpEx achieves 37.3%, 37.7% and 15.0% relative improvements over the best baseline in terms of signal-to-distortion ratio (SDR), scale-invariant SDR (SI-SDR), and perceptual evaluation of speech quality (PESQ) under an open evaluation condition.
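
Note on the SI-SDR metric named in this abstract: the sketch below illustrates one common definition of scale-invariant SDR, in which both signals are made zero-mean and the estimate is projected onto the target. It is not the authors' evaluation code. The function name `si_sdr` and the plain-list inputs are assumptions made here for illustration.

```python
import math


def si_sdr(estimate, target):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Hypothetical helper for illustration; follows the common
    zero-mean, projection-based definition.
    """
    if len(estimate) != len(target) or not target:
        raise ValueError("signals must be non-empty and of equal length")

    # Remove the mean so a constant offset counts neither as signal nor noise.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    target_energy = sum(x * x for x in t)
    if target_energy == 0.0:
        raise ValueError("target has no energy after mean removal")

    # Project the estimate onto the target: the optimally scaled target.
    scale = sum(a * b for a, b in zip(e, t)) / target_energy
    projection = [scale * x for x in t]

    # Whatever is left after removing the projection is treated as distortion.
    residual = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)

    if residual_energy == 0.0:
        return math.inf
    if signal_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal_energy / residual_energy)


# Equal energy in the projected target and the residual gives 0 dB.
print(si_sdr([1.0, 1.0, -1.0, -1.0], [1.0, 0.0, -1.0, 0.0]))  # 0.0
```

Because the estimate is rescaled by the projection, multiplying the estimate by any positive constant leaves the score unchanged, which is what distinguishes SI-SDR from plain SDR.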

124 citations

Journal Article•10.1109/JSTSP.2020.2980956•

Multi-Modal Multi-Channel Target Speech Separation

Rongzhi Gu¹, Shi-Xiong Zhang¹, Yong Xu¹, Lianwu Chen¹, Yuexian Zou², Dong Yu¹•Institutions (2)

Tencent¹, Peking University²

16 Mar 2020-IEEE Journal of Selected Topics in Signal Processing

TL;DR: A general multi-modal framework for target speech separation is proposed by utilizing all the available information of the target speaker, including his/her spatial location, voice characteristics and lip movements, and a factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multi-modalities at embedding level.

Abstract: Target speech separation refers to extracting a target speaker's voice from an overlapped audio of simultaneous talkers. Previously the use of visual modality for target speech separation has demonstrated great potential. This work proposes a general multi-modal framework for target speech separation by utilizing all the available information of the target speaker, including his/her spatial location, voice characteristics and lip movements. Also, under this framework, we investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multi-modalities at embedding level. This method first factorizes the mixture audio into a set of acoustic subspaces, then leverages the target's information from other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system was evaluated under the condition that one of the modalities is temporarily missing, invalid or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released) that is spatialized by simulated room impulse responses (RIRs). Experimental results illustrate that our proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while still supporting real-time processing.

Posted Content•

Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends.

Siddique Latif¹, Rajib Rana, Sara Khalifa, Raja Jurdak, Junaid Qadir, Björn Schuller•Institutions (1)

University of Southern Queensland¹

02 Jan 2020-arXiv: Sound

TL;DR: This paper presents an up-to-date and comprehensive survey of different techniques of speech representation learning by bringing together the scattered research across three distinct research areas: Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER).

Abstract: Research on speech processing has traditionally considered the task of designing hand-engineered acoustic features (feature engineering) as a separate distinct problem from the task of designing efficient machine learning (ML) models to make prediction and classification decisions. There are two main drawbacks to this approach: firstly, the feature engineering being manual is cumbersome and requires human knowledge; and secondly, the designed features might not be best for the objective at hand. This has motivated a recent trend in the speech community towards utilisation of representation learning techniques, which can learn an intermediate representation of the input signal automatically that better suits the task at hand and hence lead to improved performance. The significance of representation learning has increased with advances in deep learning (DL), where the representations are more useful and less dependent on human knowledge, making it very conducive for tasks like classification, prediction, etc. The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning by bringing together the scattered research across three distinct research areas including Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER). Recent reviews in speech have been conducted for ASR, SR, and SER; however, none of these has focused on representation learning from speech, a gap that our survey aims to bridge.

Journal Article•10.1109/TASLP.2020.2975902•

Deep Learning Based Target Cancellation for Speech Dereverberation

Zhong-Qiu Wang¹, DeLiang Wang¹•Institutions (1)

Ohio State University¹

28 Feb 2020-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: These models show excellent speech dereverberation and recognition performance on the test set of the REVERB challenge, consistently better than single- and multi-channel weighted prediction error (WPE) algorithms.

Abstract: This article investigates deep learning based single- and multi-channel speech dereverberation. For single-channel processing, we extend magnitude-domain masking and mapping based dereverberation to complex-domain mapping, where deep neural networks (DNNs) are trained to predict the real and imaginary (RI) components of the direct-path signal from reverberant (and noisy) ones. For multi-channel processing, we first compute a minimum variance distortionless response (MVDR) beamformer to cancel the direct-path signal, and then feed the RI components of the cancelled signal, which is expected to be a filtered version of non-target signals, as additional features to perform dereverberation. Trained on a large dataset of simulated room impulse responses, our models show excellent speech dereverberation and recognition performance on the test set of the REVERB challenge, consistently better than single- and multi-channel weighted prediction error (WPE) algorithms.

Journal Article•10.1109/TASLP.2020.3023632•

Semi-Supervised Speech Emotion Recognition With Ladder Networks

Srinivas Parthasarathy¹, Carlos Busso¹•Institutions (1)

University of Texas at Dallas¹

14 Sep 2020-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this article, the authors proposed the use of ladder networks for emotion recognition, which utilizes an unsupervised auxiliary task, which is the reconstruction of intermediate feature representations using a denoising autoencoder.

Abstract: Speech emotion recognition (SER) systems find applications in various fields such as healthcare, education, and security and defense. A major drawback of these systems is their lack of generalization across different conditions. For example, systems that show superior performance on certain databases show poor performance when tested on other corpora. This problem can be solved by training models on large amounts of labeled data from the target domain, which is expensive and time-consuming. Another approach is to increase the generalization of the models. An effective way to achieve this goal is by regularizing the models through multitask learning (MTL), where auxiliary tasks are learned along with the primary task. These methods often require the use of labeled data which is computationally expensive to collect for emotion recognition (gender, speaker identity, age or other emotional descriptors). This study proposes the use of ladder networks for emotion recognition, which utilizes an unsupervised auxiliary task. The primary task is a regression problem to predict emotional attributes. The auxiliary task is the reconstruction of intermediate feature representations using a denoising autoencoder. This auxiliary task does not require labels so it is possible to train the framework in a semi-supervised fashion with abundant unlabeled data from the target domain. This study shows that the proposed approach creates a powerful framework for SER, achieving superior performance than fully supervised single-task learning (STL) and MTL baselines. We implement the approach with sentence-level or frame-level features, demonstrating the flexibility of our approach. Additionally, the generalization of the ladder networks is evaluated in cross-corpus settings using sentence-level features, obtaining important improvements. Compared to the STL baselines, the proposed approach achieves relative gains in concordance correlation coefficient (CCC) between 3.0% and 3.5% for within corpus evaluations, and between 16.1% and 74.1% for cross corpus evaluations, highlighting the power of the architecture.
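
Note on the CCC metric named in this abstract: the sketch below shows the standard concordance correlation coefficient formula, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population variances. It is only an illustration of the metric, not the authors' evaluation code. The function name `ccc` and the list inputs are assumptions made here.

```python
def ccc(predictions, labels):
    """Concordance correlation coefficient between two equal-length sequences.

    Hypothetical helper for illustration; uses population (divide-by-n)
    variance and covariance.
    """
    n = len(predictions)
    if n == 0 or n != len(labels):
        raise ValueError("sequences must be non-empty and of equal length")

    mean_p = sum(predictions) / n
    mean_l = sum(labels) / n

    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_l = sum((l - mean_l) ** 2 for l in labels) / n
    cov = sum((p - mean_p) * (l - mean_l)
              for p, l in zip(predictions, labels)) / n

    # The squared mean difference penalises a systematic bias, which
    # plain Pearson correlation would ignore.
    denominator = var_p + var_l + (mean_p - mean_l) ** 2
    if denominator == 0.0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2.0 * cov / denominator


# Perfectly correlated but shifted by 1: CCC is about 0.571 (4/7), not 1.
print(ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

The shifted example shows why CCC is often preferred for regressing emotional attributes: predictions must match the labels in scale and offset, not only in trend.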

Proceedings Article•10.21437/INTERSPEECH.2020-1066•

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

Wen-Chin Huang¹, Tomoki Hayashi¹, Yi-Chiao Wu¹, Hirokazu Kameoka², Tomoki Toda¹•Institutions (2)

Nagoya University¹, Nippon Telegraph and Telephone²

25 Oct 2020

TL;DR: In this paper, a sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining is proposed.

Abstract: We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, their data-hungry property and the mispronunciation of converted speech make seq2seq models far from practical. To this end, we propose a simple yet effective pretraining technique to transfer knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora. VC models initialized with such pretrained model parameters are able to generate effective hidden representations for high-fidelity, highly intelligible converted speech. Experimental results show that such a pretraining scheme can facilitate data-efficient training and outperform an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.

Journal Article•10.1109/JSTSP.2019.2952087•

A Review of Automated Speech and Language Features for Assessment of Cognitive and Thought Disorders

Rohit Voleti¹, Julie M. Liss¹, Visar Berisha¹•Institutions (1)

Arizona State University¹

01 Feb 2020-IEEE Journal of Selected Topics in Signal Processing

TL;DR: A review of existing speech and language features used in this domain, including language diversity, syntactic complexity, semantic coherence, and timing, and a proposal of new research directions to further advance the field are considered.

Abstract: It is widely accepted that information derived from analyzing speech (the acoustic signal) and language production (words and sentences) serves as a useful window into the health of an individual's cognitive ability. In fact, most neuropsychological testing batteries have a component related to speech and language where clinicians elicit speech from patients for subjective evaluation across a broad set of dimensions. With advances in speech signal processing and natural language processing, there has been recent interest in developing tools to detect more subtle changes in cognitive-linguistic function. This work relies on extracting a set of features from recorded and transcribed speech for objective assessments of speech and language, early diagnosis of neurological disease, and tracking of disease after diagnosis. With an emphasis on cognitive and thought disorders, in this paper we provide a review of existing speech and language features used in this domain, discuss their clinical application, and highlight their advantages and disadvantages. Broadly speaking, the review is split into two categories: language features based on natural language processing and speech features based on speech signal processing. Within each category, we consider features that aim to measure complementary dimensions of cognitive-linguistics, including language diversity, syntactic complexity, semantic coherence, and timing. We conclude the review with a proposal of new research directions to further advance the field.

Journal Article•10.1109/JSTSP.2020.2987209•

Audio-Visual Speech Separation and Dereverberation With a Two-Stage Multimodal Network

Ke Tan¹, Yong Xu², Shi-Xiong Zhang², Meng Yu², Dong Yu²•Institutions (2)

Ohio State University¹, Tencent²

16 Apr 2020-IEEE Journal of Selected Topics in Signal Processing

TL;DR: This study addresses joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation, and proposes a novel multimodal network that exploits both audio and visual signals.

Abstract: Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. In order to tackle this fundamentally difficult problem, we propose a novel multimodal network that exploits both audio and visual signals. The proposed network architecture adopts a two-stage strategy, where a separation module is employed to attenuate background noise and interfering speech in the first stage and a dereverberation module to suppress room reverberation in the second stage. The two modules are first trained separately, and then integrated for joint training, which is based on a new multi-objective loss function. Our experimental results show that the proposed multimodal network yields consistently better objective intelligibility and perceptual quality than several one-stage and two-stage baselines. We find that our network achieves a 21.10% improvement in ESTOI and a 0.79 improvement in PESQ over the unprocessed mixtures. Moreover, our network architecture does not require the knowledge of the number of speakers.

Journal Article•10.1002/WCS.1521•

Phonetic cue weighting in perception and production.

Jessamyn Schertz¹, Emily J. Clare¹•Institutions (1)

University of Toronto¹

01 Mar 2020-Wiley Interdisciplinary Reviews: Cognitive Science

TL;DR: The comparison of cue weighting in perception and production bears on a range of theoretical issues including the processes underlying sound change, the time course of learning, the nature of cues, and the perception-production interface.

Abstract: Speech sound contrasts differ along multiple phonetic dimensions. During speech perception, listeners must decide which cues are relevant, and determine the relative importance of each cue, while also integrating other, signal-external cues. The comparison of cue weighting in perception and production bears on a range of theoretical issues including the processes underlying sound change, the time course of learning, the nature of cues, and the perception-production interface. Research examining the relative alignment of cue weighting across the modalities, on both a community and individual level, has revealed both parallels and asymmetries between the modalities. The extraordinarily wide range of ways that have been used to conceptualize and quantify cue weights reflects the inherent theoretical, methodological, and analytical differences between the two modalities. More consideration of the choices of analytical metrics, explicit discussion of the theoretical assumptions that underlie them, and systematic investigations of different types of cues will lead to more generalizable findings that can be incorporated into computationally implementable models of speech processing. This article is categorized under: Linguistics > Language in Mind and Brain; Psychology > Language.

Journal Article•10.3390/S20082326•

Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data.

Ayesha Pervaiz¹, Fawad Hussain¹, Huma Israr², Muhammad Tahir², Fawad Riasat Raja³, Naveed Khan Baloch¹, Farruh Ishmanov⁴, Yousaf Bin Zikria⁵•Institutions (5)

University of Engineering and Technology¹, University of the Sciences², Griffith University³, Kwangwoon University⁴, Yeungnam University⁵

19 Apr 2020-Sensors

TL;DR: A novel technique is proposed for noise robustness by augmenting noise in training data and achieves much better results than existing state-of-the-art techniques, thus setting a new benchmark.

Abstract: The advent of new devices, technology, and machine learning techniques, together with the availability of free large speech corpora, has resulted in rapid and accurate speech recognition. In the last two decades, researchers and organizations have carried out extensive research on new techniques and their applications in speech processing systems. There are several speech command based applications in the area of robotics, IoT, ubiquitous computing, and different human-computer interfaces. Various researchers have worked on enhancing the efficiency of speech command based systems and used the speech command dataset. However, none of them addressed noise in that dataset. Noise is one of the major challenges in any speech recognition system, as real-time noise is a very versatile and unavoidable factor that affects the performance of speech recognition systems, particularly those that have not learned the noise efficiently. We thoroughly analyse the latest trends in speech recognition and evaluate the speech command dataset on different machine learning based and deep learning based techniques. A novel technique is proposed for noise robustness by augmenting noise in training data. Our proposed technique is tested on clean and noisy data along with locally generated data and achieves much better results than existing state-of-the-art techniques, thus setting a new benchmark.
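
The abstract does not spell out how noise is added to the training data. As a minimal sketch, assuming additive mixing at a chosen signal-to-noise ratio (a common form of noise augmentation, not necessarily the authors' procedure), the hypothetical helper `mix_at_snr` below scales a noise clip so that the mixture reaches a target SNR in decibels. Signals are plain Python lists of samples; the function names and the example SNR value are illustrative assumptions.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR (dB).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Tile the noise if it is shorter than the utterance, then trim to length.
    if len(noise) < len(speech):
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]

    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        return list(speech)  # silent noise clip: nothing to add

    # Choose the scale so that p_speech / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR in dB of a mixture, given the clean speech it was built from."""
    p_speech = sum(s * s for s in speech)
    p_resid = sum((m - s) ** 2 for s, m in zip(speech, mixture))
    return 10 * math.log10(p_speech / p_resid)


# Usage: a synthetic tone as "speech" and Gaussian noise, mixed at 10 dB SNR.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

In a training pipeline one would typically draw the noise clip and the SNR at random for each utterance, so that the model sees the same command under many noise conditions.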

Proceedings Article•10.1109/ICASSP40776.2020.9053770•

Using X-Vectors to Automatically Detect Parkinson’s Disease from Speech

Laureano Moro-Velázquez¹, Jesús Villalba¹, Najim Dehak¹•Institutions (1)

Johns Hopkins University¹

4 May 2020

TL;DR: Results suggest that speaker embeddings obtained using deep neural networks are successful at extracting acoustic information related to patterns in articulation, prosody and/or phonation common in persons with PD.

Abstract: The promise of new neuroprotective treatments to stop or slow the advance of Parkinson’s Disease (PD) calls for new biomarkers or detection schemes that can deliver a faster diagnosis. Given that speech is affected by PD, the combination of deep neural networks and speech processing can provide automatic detection schemes. Accordingly, in this study we analyze for the first time a new state-of-the-art speaker recognition technique, x-Vectors, in a different scenario: the automatic detection of PD from speech. The proposed approach is compared with another speaker recognition technique, i-Vectors, employed in previous works and used as baseline in this study. A corpus with 43 PD patients and 46 control speakers was used to evaluate the performance of these two techniques at two sampling frequencies: 8 and 16 kHz. The x-Vector approach provided the best results in terms of accuracy and AUC, reaching values of 90% and 0.94, respectively. Consequently, results suggest that speaker embeddings obtained using deep neural networks are successful at extracting acoustic information related to patterns in articulation, prosody and/or phonation common in persons with PD.

Journal Article•10.1016/J.COGNITION.2019.104162•

Early lexical influences on sublexical processing in speech perception: Evidence from electrophysiology.

Colin Noe¹, Simon Fischer-Baum¹•Institutions (1)

Rice University¹

01 Apr 2020-Cognition

TL;DR: This study finds that lexical context modulates the amplitude of the N100, an ERP component linked with sublexical processes in speech perception, and demonstrates that these results can be modeled in an interactive speech perception model and are not well fit by any established feed-forward mechanisms of lexical bias.

Posted Content•

An open-source voice type classifier for child-centered daylong recordings.

Marvin Lavechin, Ruben Bousbib, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia

26 May 2020-arXiv: Audio and Speech Processing

TL;DR: The architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA) that has been used in numerous child language studies.

Abstract: Spontaneous conversations in real-world settings such as those found in child-centered recordings have been shown to be amongst the most challenging audio files to process. Nevertheless, building speech processing models handling such a wide variety of conditions would be particularly useful for language acquisition studies in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and measuring effects of remediation. In this paper, we present our approach to designing an open-source neural network to classify audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora that total 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers, or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA) that has been used in numerous child language studies.

Journal Article•10.1109/JSTSP.2019.2949912•

Spectro-Temporal Representation of Speech for Intelligibility Assessment of Dysarthria

H. M. Chandrashekar¹, Veena Karjigi¹, N. Sreedevi•Institutions (1)

Siddaganga Institute of Technology¹

01 Feb 2020-IEEE Journal of Selected Topics in Signal Processing

TL;DR: Use of Time-Frequency CNN configuration proved to capture spectro-temporal variations together resulting in an improved performance compared to either Time-CNN or Frequency-CNN configurations which capture either temporal or spectral variations respectively.

Abstract: Recently, spectro-temporal representation of speech has been used in many fields of speech processing. Owing to this, we explore the use of spectro-temporal representation for speech intelligibility assessment especially for dysarthric speech. In this work, we investigate the use of spectro-temporal representations to evaluate intelligibility levels using artificial neural network (ANN) and convolutional neural network (CNN). Standard American English dysarthric databases namely Universal Access and TORGO are used for evaluation. Performance of CNN classifier is superior to ANN as it is an advanced classifier. Further, use of Time-Frequency CNN configuration proved to capture spectro-temporal variations together resulting in an improved performance compared to either Time-CNN or Frequency-CNN configurations which capture either temporal or spectral variations respectively.

Journal Article•10.1109/ACCESS.2020.2995737•

Detection of Speech Impairments Using Cepstrum, Auditory Spectrogram and Wavelet Time Scattering Domain Features

Andrius Lauraitis¹, Rytis Maskeliunas², Robertas Damaševičius², Tomas Krilavičius²•Institutions (2)

Kaunas University of Technology¹, Vytautas Magnus University²

19 May 2020-IEEE Access

TL;DR: Bidirectional Long Short-Term Memory neural network and Wavelet Scattering Transform with Support Vector Machine classifier for detecting speech impairments of patients at the early stage of central nervous system disorders (CNSD) are adopted.

Abstract: We adopt a Bidirectional Long Short-Term Memory (BiLSTM) neural network and a Wavelet Scattering Transform with Support Vector Machine (WST-SVM) classifier for detecting speech impairments of patients at the early stage of central nervous system disorders (CNSD). The study includes 339 voice samples collected from 15 subjects: 7 patients with early stage CNSD (3 Huntington, 1 Parkinson, 1 cerebral palsy, 1 post stroke, 1 early dementia) and 8 healthy subjects. Speech data is collected using the voice recorder from the Neural Impairment Test Suite (NITS) mobile app. Features are extracted from pitch contours, Mel-frequency cepstral coefficients (MFCC), Gammatone cepstral coefficients (GTCC), Gabor (analytic Morlet) wavelet and auditory spectrograms. Accuracies of 94.50% (BiLSTM) and 96.3% (WST-SVM) are achieved on the healthy vs. impaired classification problem. The developed method can be applied for automated CNSD patient health state monitoring and clinical decision support systems as well as a part of Internet of Medical Things (IoMT).

Journal Article•10.1109/JSTSP.2019.2949419•

Natural Language Processing Methods for Acoustic and Landmark Event-Based Features in Speech-Based Depression Detection

Zhaocheng Huang¹, Julien Epps¹, D. Joachim, Vidhyasaharan Sethu¹•Institutions (1)

University of New South Wales¹

01 Feb 2020-IEEE Journal of Selected Topics in Signal Processing

TL;DR: A framework is proposed for analyzing speech as a sequence of acoustic events, combining acoustic words with speech landmarks (articulation-related speech events), and its application to depression detection is investigated.

Abstract: The processing of speech as an explicit sequence of events is common in automatic speech recognition (linguistic events), but has received relatively little attention in paralinguistic speech classification despite its potential for characterizing broad acoustic event sequences. This paper proposes a framework for analyzing speech as a sequence of acoustic events, and investigates its application to depression detection. In this framework, acoustic space regions are tokenized to ‘words’ representing speech events at fixed or irregular intervals. This tokenization allows the exploitation of acoustic word features using proven natural language processing methods. A key advantage of this framework is its ability to accommodate heterogeneous event types: herein we combine acoustic words and speech landmarks, which are articulation-related speech events. Another advantage is the option to fuse such heterogeneous events at various levels, including the embedding level. Evaluation of the proposed framework on both controlled laboratory-grade supervised audio recordings and unsupervised self-administered smartphone recordings highlights its merits across both datasets, with the proposed landmark-dependent acoustic words achieving improvements in F1(depressed) of up to 15% and 13% for SH2-FS and DAIC-WOZ respectively, relative to acoustic speech baseline approaches.
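
To make the tokenization idea concrete, here is a minimal sketch, assuming a nearest-centroid codebook over frame-level features followed by bigram counting. The codebook, the two-dimensional toy features, and the function names `tokenize_frames` and `ngram_counts` are illustrative assumptions; the paper's actual tokenizer, features, and landmark handling are not described in the abstract.

```python
from collections import Counter


def tokenize_frames(frames, codebook):
    """Map each feature frame to the index of its nearest codebook centroid.

    Each index acts as an 'acoustic word'. The codebook is assumed to have
    been learned beforehand (for example by clustering training frames).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def ngram_counts(tokens, n=2):
    """Count n-grams over a token sequence, as in text-based NLP features."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Usage with toy 2-D features and a two-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.0, 0.2)]
tokens = tokenize_frames(frames, codebook)
bigrams = ngram_counts(tokens, n=2)
```

Once speech is a token sequence, count-based or embedding-based text methods can be applied to it unchanged, which is the property the abstract highlights.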

Journal Article•10.1016/J.NEUROIMAGE.2020.116557•

Transcranial alternating current stimulation in the theta band but not in the delta band modulates the comprehension of naturalistic speech in noise

Mahmoud Keshavarzi¹, Mikolaj Kegler¹, Shabnam Kadir², Tobias Reichenbach¹•Institutions (2)

Imperial College London¹, University of Hertfordshire²

15 Apr 2020-NeuroImage

TL;DR: In this paper, the authors used transcranial alternating current stimulation with waveforms derived from the speech envelope and filtered in the delta and theta frequency bands to alter cortical entrainment in both bands separately.

Journal Article•10.1109/TETCI.2020.2977678•

Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion

Wen-Chin Huang¹, Hao Luo¹, Hsin-Te Hwang¹, Chen-Chou Lo¹, Yu-Huai Peng¹, Yu Tsao², Hsin-Min Wang¹•Institutions (2)

Academia Sinica¹, Information Technology Institute²

6 Apr 2020

TL;DR: This article extends the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech.

Abstract: An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this article, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech. More specifically, we first investigate the effectiveness of incorporating the generative adversarial networks (GANs) with CDVAE-VC. Then, we consider the concept of domain adversarial training and add an explicit constraint to the latent representation, realized by a speaker classifier, to explicitly eliminate the speaker information that resides in the latent code. Experimental results confirm that the degree of disentanglement of the learned latent representation can be enhanced by both GANs and the speaker classifier. Meanwhile, subjective evaluation results in terms of quality and similarity scores demonstrate the effectiveness of our proposed methods.

Journal Article•10.1016/J.DSP.2020.102795•

Optimization of data-driven filterbank for automatic speaker verification

Susanta Kumar Sarangi¹, Sahidullah, Goutam Saha¹•Institutions (1)

Indian Institute of Technology Kharagpur¹

01 Sep 2020-Digital Signal Processing

TL;DR: The proposed filterbank has more speaker-discriminative power than the commonly used mel filterbank as well as the existing data-driven filterbank, and the acoustic features created with the proposed filterbank are shown to be better than existing mel-frequency cepstral coefficients (MFCCs) and speech-signal-based Frequency Warping Scale (SFCC) in most cases.

Journal Article•10.1111/EJN.13855•

Cortical oscillations and entrainment in speech processing during working memory load.

Jens Hjortkjær¹, Jens Hjortkjær², Jonatan Märcher-Rørsted¹, Søren A. Fuglsang¹, Torsten Dau¹•Institutions (2)

Technical University of Denmark¹, Copenhagen University Hospital²

01 Mar 2020-European Journal of Neuroscience

TL;DR: It was found that increases in both types of WM load (background noise and n‐back level) decreased cortical speech envelope entrainment, suggesting a top‐down influence of WM processing on cortical speech entrainment.

Abstract: Neuronal oscillations are thought to play an important role in working memory (WM) and speech processing. Listening to speech in real-life situations is often cognitively demanding but it is unknown whether WM load influences how auditory cortical activity synchronizes to speech features. Here, we developed an auditory n-back paradigm to investigate cortical entrainment to speech envelope fluctuations under different degrees of WM load. We measured the electroencephalogram, pupil dilations and behavioural performance from 22 subjects listening to continuous speech with an embedded n-back task. The speech stimuli consisted of long spoken number sequences created to match natural speech in terms of sentence intonation, syllabic rate and phonetic content. To burden different WM functions during speech processing, listeners performed an n-back task on the speech sequences in different levels of background noise. Increasing WM load at higher n-back levels was associated with a decrease in posterior alpha power as well as increased pupil dilations. Frontal theta power increased at the start of the trial and increased additionally with higher n-back level. The observed alpha-theta power changes are consistent with visual n-back paradigms suggesting general oscillatory correlates of WM processing load. Speech entrainment was measured as a linear mapping between the envelope of the speech signal and low-frequency cortical activity (< 13 Hz). We found that increases in both types of WM load (background noise and n-back level) decreased cortical speech envelope entrainment. Although entrainment persisted under high load, our results suggest a top-down influence of WM processing on cortical speech entrainment.
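
The abstract describes entrainment as a linear mapping between the speech envelope and low-frequency cortical activity. One way such a mapping can be estimated is sketched below, assuming a forward model in which a single EEG channel is predicted from time-lagged copies of the envelope by ridge regression. The mapping direction, lag count, regularization, and all function names here are assumptions for illustration and may differ from what the authors used.

```python
import random


def lagged_design(envelope, n_lags):
    """Build rows [env[t], env[t-1], ..., env[t-n_lags+1]], zero-padded at the start."""
    return [
        [envelope[t - k] if t - k >= 0 else 0.0 for k in range(n_lags)]
        for t in range(len(envelope))
    ]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_linear_mapping(envelope, eeg, n_lags, ridge=0.0):
    """Ridge estimate of lag weights w minimizing ||X w - eeg||^2 + ridge * ||w||^2."""
    X = lagged_design(envelope, n_lags)
    xtx = [
        [sum(r[i] * r[j] for r in X) + (ridge if i == j else 0.0) for j in range(n_lags)]
        for i in range(n_lags)
    ]
    xty = [sum(r[i] * y for r, y in zip(X, eeg)) for i in range(n_lags)]
    return solve(xtx, xty)


# Usage: simulate a noiseless response with known lag weights and recover them.
rng = random.Random(1)
envelope = [rng.random() for _ in range(200)]
true_weights = [0.5, -0.2, 0.1]
eeg = [sum(w * x for w, x in zip(true_weights, row))
       for row in lagged_design(envelope, len(true_weights))]
weights = fit_linear_mapping(envelope, eeg, n_lags=3)
```

In practice the strength of entrainment would then be summarized by how well the fitted mapping predicts held-out data, for example with a correlation between predicted and measured signals.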

Journal Article•10.1109/TASLP.2020.2977776•

Machine Speech Chain

Andros Tjandra¹, Sakriani Sakti¹, Satoshi Nakamura¹•Institutions (1)

Nara Institute of Science and Technology¹

02 Mar 2020-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: To the best of the authors' knowledge, this is the first deep learning framework that integrates human speech perception and production behaviors; it significantly improved performance over separate systems that were trained only with labeled data.

Abstract: Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently without exerting much mutual influence. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop machine speech chain model based on deep learning. The sequence-to-sequence model in closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data. While ASR transcribes the unlabeled speech features, TTS attempts to reconstruct the original speech waveform based on the text from ASR. In the opposite direction, ASR also attempts to reconstruct the original text transcription given the synthesized speech. To the best of our knowledge, this is the first deep learning framework that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improved performance over that from separate systems that were only trained with labeled data.
