TL;DR: A comprehensive overview of the key deep learning models and their applications in speech-processing tasks can be found in this paper , where the authors discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of Deep Learning for multimodal speech processing.
TL;DR: Target speech/speaker extraction (TSE) as mentioned in this paper is an emerging field of research that has received increased attention because it offers a practical approach to the cocktail-party problem and involves such aspects of signal processing as audio, visual, array processing, and deep learning.
Abstract: Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail-party effect. For decades, researchers have focused on approaching the listening ability of humans. One critical issue is handling interfering speakers because the target and non-target speech signals share similar characteristics, complicating their discrimination. Target speech/speaker extraction (TSE) isolates the speech signal of a target speaker from a mixture of several speakers with or without noises and reverberations using clues that identify the speaker in the mixture. Such clues might be a spatial clue indicating the direction of the target speaker, a video of the speaker's lips, or a pre-recorded enrollment utterance from which their voice characteristics can be derived. TSE is an emerging field of research that has received increased attention in recent years because it offers a practical approach to the cocktail-party problem and involves such aspects of signal processing as audio, visual, array processing, and deep learning. This paper focuses on recent neural-based approaches and presents an in-depth overview of TSE. We guide readers through the different major approaches, emphasizing the similarities among frameworks and discussing potential future directions.
TL;DR: In this article , the authors introduce tools and a set of models to estimate the quality and intelligibility of a speech signal using deep neural networks, which are made available in the well-established TorchAudio library.
Abstract: Measuring quality and intelligibility of a speech signal is usually a critical step in development of speech processing systems. To enable this, a variety of metrics to measure quality and intelligibility under different assumptions have been developed. Through this paper, we introduce tools and a set of models to estimate such known metrics using deep neural networks. These models are made available in the well-established TorchAudio library, the core audio and speech processing library within the PyTorch deep learning framework. We refer to it as TorchAudio-Squim, TorchAudio-Speech QUality and Intelligibility Measures. More specifically, in the current version of TorchAudio-squim, we establish and release models for estimating PESQ, STOI and SI-SDR among objective metrics and MOS among subjective metrics. We develop a novel approach for objective metric estimation and use a recently developed approach for subjective metric estimation. These models operate in a ``reference-less"manner, that is they do not require the corresponding clean speech as reference for speech assessment. Given the unavailability of clean speech and the effortful process of subjective evaluation in real-world situations, such easy-to-use tools would greatly benefit speech processing research and development.
TL;DR: In this article , the authors propose a T-F attention (TFA) module, which uses two parallel attention branches, i.e., time-frame attention and frequency-channel attention, to explicitly exploit position information to generate a 2-D attention map to characterise the salient T-frequency (T-F) speech distribution.
Abstract: Speech enhancement plays an essential role in a wide range of speech processing applications. Recent studies on speech enhancement tend to investigate how to effectively capture the long-term contextual dependencies of speech signals to boost performance. However, these studies generally neglect the time-frequency (T-F) distribution information of speech spectral components, which is equally important for speech enhancement. In this paper, we propose a simple yet very effective network module, which we term the T-F attention (TFA) module, that uses two parallel attention branches, i.e., time-frame attention and frequency-channel attention, to explicitly exploit position information to generate a 2-D attention map to characterise the salient T-F speech distribution. We validate our TFA module as part of two widely used backbone networks (residual temporal convolution network and Transformer) and conduct speech enhancement with four most popular training objectives. Our extensive experiments demonstrate that our proposed TFA module consistently leads to substantial enhancement performance improvements in terms of the five most widely used objective metrics, with negligible parameter overheads. In addition, we further evaluate the efficacy of speech enhancement as a front-end for a downstream speech recognition task. Our evaluation results show that the TFA module significantly improves the robustness of the system to noisy conditions.
TL;DR: The authors used a data-driven approach to compare two opposing hypotheses regarding the system's capacity to co-represent competing speech: can the brain represent two speakers equally or is the system fundamentally limited, resulting in tradeoffs between them?
TL;DR: Wang et al. as discussed by the authors proposed a time frequency convolution module (TFCM) to increase the receptive field with small convolution kernels, which significantly outperforms TEA-PSE in both speech enhancement performance and computation complexity.
Abstract: Personalized speech enhancement (PSE) utilizes additional cues like speaker embeddings to remove background noise and interfering speech and extract the speech from target speaker. Previous work, the Tencent-Ethereal-Audio-Lab personalized speech enhancement (TEA-PSE) system, ranked 1st in the ICASSP 2022 deep noise suppression (DNS2022) challenge. In this paper, we expand TEA-PSE to its sub-band version - TEA-PSE 2.0, to reduce computational complexity as well as further improve performance. Specifically, we adopt finite impulse response filter banks and spectrum splitting to reduce computational complexity. We introduce a time frequency convolution module (TFCM) to the system for increasing the receptive field with small convolution kernels. Besides, we explore several training strategies to optimize the two-stage network and investigate various loss functions in the PSE task. TEA-PSE 2.0 significantly outperforms TEA-PSE in both speech enhancement performance and computation complexity. Experimental results on the DNS2022 blind test set show that TEA-PSE 2.0 brings 0.102 OVRL personalized DNSMOS improvement with only 21.9% multiply-accumulate operations compared with the previous TEA-PSE.
TL;DR: In this article , a simple ensembling method is shown to considerably improve upon the baseline decoder performance by jointly decoding the speech-evoked frequency-following response and responses to the temporal envelope of speech, as well as by fine-tuning the decoders to individual subjects.
Abstract: During speech perception, a listener's electroencephalogram (EEG) reflects acoustic-level processing as well as higher-level cognitive factors such as speech comprehension and attention. However, decoding speech from EEG recordings is challenging due to the low signal-to-noise ratios of EEG signals. We report on an approach developed for the ICASSP 2023 'Auditory EEG Decoding' Signal Processing Grand Challenge. A simple ensembling method is shown to considerably improve upon the baseline decoder performance. Even higher classification rates are achieved by jointly decoding the speech-evoked frequency-following response and responses to the temporal envelope of speech, as well as by fine-tuning the decoders to individual subjects. Our results could have applications in the diagnosis of hearing disorders or in cognitively steered hearing aids.
TL;DR: In this article , the authors compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks, with the strongest benefit for non-additive distortion models, like in dereverberation and bandwidth extension.
Abstract: Diffusion-based generative models have had a high impact on the computer vision and speech processing communities these past years. Besides data generation tasks, they have also been employed for data restoration tasks like speech enhancement and dereverberation. While discriminative models have traditionally been argued to be more powerful e.g. for speech enhancement, generative diffusion approaches have recently been shown to narrow this performance gap considerably. In this paper, we systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks. For this, we extend our prior contributions on diffusion-based speech enhancement in the complex time-frequency domain to the task of bandwith extension. We then compare it to a discriminatively trained neural network with the same network architecture on three restoration tasks, namely speech denoising, dereverberation and bandwidth extension. We observe that the generative approach performs globally better than its discriminative counterpart on all tasks, with the strongest benefit for non-additive distortion models, like in dereverberation and bandwidth extension. Code and audio examples can be found online 1 .
TL;DR: The Multi-modal Information based Speech Processing (MISP)2022 challenge as discussed by the authors aims to extend the application of signal processing technology in specific scenarios by promoting the research into wake-up words, speaker diarization, speech recognition, and other technologies.
Abstract: The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting the research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), aiming to solve "who spoken when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that focuses on addressing "who spoken what when" with audio-visual speaker diarization results. Both tracks focus on the Chinese language, and use far-field audio and video in real home-tv scenarios: 2-6 people communicating each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of AVDR baseline system, and the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and the indistinguishable speakers.
TL;DR: In this article , a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning is proposed, which consists of a STU encoder, which generates hidden speech units common to both whispered and normal speech, and a UTS decoder which reconstructs speech from the encoded speech units.
Abstract: Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, it can be used as a semi-silent speech interaction in public places without being audible to others. Converting whispers to normal speech also improves the speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike the existing methods, this conversion is user-independent and does not require a paired dataset for whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker’s voice from speech units, and it requires only an unlabeled target speaker’s speech data. We confirmed that the quality of the speech converted from a whisper was improved while preserving its natural prosody. Additionally, we confirmed the effectiveness of the proposed approach to perform speech reconstruction for people with speech or hearing disabilities.
TL;DR: NeuroTalk as discussed by the authors converts non-invasive brain signals of imagined speech into the user's own voice, allowing natural correspondence between the imagined speech and the voice as a ground truth.
Abstract: Translating imagined speech from human brain activity into voice is a challenging and absorbing research issue that can provide new means of human communication via brain signals. Efforts to reconstruct speech from brain activity have shown their potential using invasive measures of spoken speech data, but have faced challenges in reconstructing imagined speech. In this paper, we propose NeuroTalk, which converts non-invasive brain signals of imagined speech into the user's own voice. Our model was trained with spoken speech EEG which was generalized to adapt to the domain of imagined speech, thus allowing natural correspondence between the imagined speech and the voice as a ground truth. In our framework, an automatic speech recognition decoder contributed to decomposing the phonemes of the generated speech, demonstrating the potential of voice reconstruction from unseen words. Our results imply the potential of speech synthesis from human EEG signals, not only from spoken speech but also from the brain signals of imagined speech.
TL;DR: In this article , the authors propose a framework bridging speech and text through images to enhance speech models without transcriptions by aligning them via paired images and spoken captions with minimal fine-tuning.
Abstract: Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose Speech-CLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior state-of-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
TL;DR: Inspired by the divide and conquer principle, this article proposed an ensemble of neural network experts to divide the huge decision space regarding speech recognition so that each expert takes care only of a delimited area of this decision space.
Abstract: The automatic speech recognition based on detection of phonemes provides advantages for online recognition of a speech represented by a sound signal. The development of a system for automatic speech recognition is multidisciplinary. It covers several areas of research, such as linguistics, signal processing and computational intelligence. In this work, the process starts with a speech signal pre-processing to extract the main features of the speech signal at a given instant of time. Inspired by the “divide and conquer” principle, we bridge the complexity gap of automatic speech recognition by devising models based on an ensemble of neural network experts, allowing to divide the huge decision space regarding speech recognition so that each expert takes care only of a delimited area of this decision space. This novel application of this strategy improves the precision, sensitivity and accuracy of the recognition process. Each included expert decides regarding each one of the pre-processed input samples. The decision set thus obtained is weighted. So, the expert with the highest weight for the output will determine the sample final classification. After that, a dynamic post-processing step, implemented as a recurrent neural network, is executed. It aims at mitigating the oscillatory effect that occurs during the recognition of classes with similar characteristics. In this work, two ensembles are investigated. The first is based on the clustering of similar phonetics classes while the second takes care of the imbalanced distribution of samples in the training set. The proposed model achieves 7.63% improvement in terms of accuracy with respect to the best so far related model for automatic speech recognition.
TL;DR: Acoustic and phonemic processing are impaired in individuals with aphasia, highlighting the need for interventions targeting these mechanisms.
Abstract: Abstract Acoustic and phonemic processing are understudied in aphasia, a language disorder that can affect different levels and modalities of language processing. For successful speech comprehension, processing of the speech envelope is necessary, which relates to amplitude changes over time (e.g., the rise times). Moreover, to identify speech sounds (i.e., phonemes), efficient processing of spectro-temporal changes as reflected in formant transitions is essential. Given the underrepresentation of aphasia studies on these aspects, we tested rise time processing and phoneme identification in 29 individuals with post-stroke aphasia and 23 healthy age-matched controls. We found significantly lower performance in the aphasia group than in the control group on both tasks, even when controlling for individual differences in hearing levels and cognitive functioning. Further, by conducting an individual deviance analysis, we found a low-level acoustic or phonemic processing impairment in 76% of individuals with aphasia. Additionally, we investigated whether this impairment would propagate to higher-level language processing and found that rise time processing predicts phonological processing performance in individuals with aphasia. These findings show that it is important to develop diagnostic and treatment tools that target low-level language processing mechanisms.
TL;DR: The integration of continuous audio and visual speech in a cocktail-party environment depends on attention. Multisensory integration and attention interact to shape our understanding of speech in noisy environments.
Abstract: In noisy environments, our ability to understand speech benefits greatly from seeing the speaker's face. This is attributed to the brain's ability to integrate audio and visual information, a process known as multisensory integration. In addition, selective attention plays an enormous role in what we understand, the so-called cocktail-party phenomenon. But how attention and multisensory integration interact remains incompletely understood, particularly in the case of natural, continuous speech. Here, we addressed this issue by analyzing EEG data recorded from participants who undertook a multisensory cocktail-party task using natural speech. To assess multisensory integration, we modeled the EEG responses to the speech in two ways. The first assumed that audiovisual speech processing is simply a linear combination of audio speech processing and visual speech processing (i.e., an A + V model), while the second allows for the possibility of audiovisual interactions (i.e., an AV model). Applying these models to the data revealed that EEG responses to attended audiovisual speech were better explained by an AV model, providing evidence for multisensory integration. In contrast, unattended audiovisual speech responses were best captured using an A + V model, suggesting that multisensory integration is suppressed for unattended speech. Follow up analyses revealed some limited evidence for early multisensory integration of unattended AV speech, with no integration occurring at later levels of processing. We take these findings as evidence that the integration of natural audio and visual speech occurs at multiple levels of processing in the brain, each of which can be differentially affected by attention.
Abstract: Most automatic speech processing systems register degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in which noisy and reverberant audio segments are synthesized. We first evaluate its performance and demonstrate that the proposed multi-task regime is beneficial. We then present two scenarios illustrating how Brouhaha can be used on naturally noisy and reverberant data: 1) to investigate the errors made by a speaker diarization model (pyannote.audio); and 2) to assess the reliability of an automatic speech recognition model (Whisper from OpenAI). Both our pipeline and a pretrained model are open source and shared with the speech community.
TL;DR: In this article , the authors present a comprehensive survey that aims to bridge research studies from diverse subfields within speech technology and identify the challenges encountered by transformers in speech processing while also offering insights into potential solutions to address these issues.
Abstract: The remarkable success of transformers in the field of natural language processing has sparked the interest of the speech-processing community, leading to an exploration of their potential for modeling long-range dependencies within speech sequences. Recently, transformers have gained prominence across various speech-related domains, including automatic speech recognition, speech synthesis, speech translation, speech para-linguistics, speech enhancement, spoken dialogue systems, and numerous multimodal applications. In this paper, we present a comprehensive survey that aims to bridge research studies from diverse subfields within speech technology. By consolidating findings from across the speech technology landscape, we provide a valuable resource for researchers interested in harnessing the power of transformers to advance the field. We identify the challenges encountered by transformers in speech processing while also offering insights into potential solutions to address these issues.
TL;DR: In this article , an acoustic-to-articulatory inversion (AAI) model was proposed to generalize to unseen speakers, which leverages autoregression, adversarial training, and self supervision.
Abstract: To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space. The articulatory space is a promising inversion target, since this space captures the mechanics of speech production. To this end, we build an acoustic-to-articulatory inversion (AAI) model that leverages autoregression, adversarial training, and self supervision to generalize to unseen speakers. Our approach obtains 0.784 correlation on an electromagnetic articulography (EMA) dataset, improving the state-of-the-art by 12.5%. Additionally, we show the interpretability of these representations through directly com-paring the behavior of estimated representations with speech production behavior. Finally, we propose a resynthesis-based AAI evaluation metric that does not rely on articulatory labels, demonstrating its efficacy with an 18-speaker dataset.
TL;DR: In this article , the authors investigated whether individual differences in the intrinsic propensity to track the speech envelope when listening to speech-in-quiet is predictive of individual differences for speech recognition in noise in an independent task.
Abstract: Abstract Speech elicits brain activity time-locked to its amplitude envelope. The resulting speech-brain synchrony (SBS) is thought to be crucial to speech parsing and comprehension. It has been shown that higher speech-brain coherence is associated with increased speech intelligibility. However, studies depending on the experimental manipulation of speech stimuli do not allow conclusion about the causality of the observed tracking. Here, we investigate whether individual differences in the intrinsic propensity to track the speech envelope when listening to speech-in-quiet is predictive of individual differences in speech-recognition-in-noise, in an independent task. We evaluated the cerebral tracking of speech in source-localized magnetoencephalography, at timescales corresponding to the phrases, words, syllables and phonemes. We found that individual differences in syllabic tracking in right superior temporal gyrus and in left middle temporal gyrus (MTG) were positively associated with recognition accuracy in an independent words-in-noise task. Furthermore, directed connectivity analysis showed that this relationship is partially mediated by top-down connectivity from premotor cortex—associated with speech processing and active sensing in the auditory domain—to left MTG. Thus, the extent of SBS—even during clear speech—reflects an active mechanism of the speech processing system that may confer resilience to noise.
TL;DR: In this paper , a Deformable Speech Transformer (DST) is proposed to adaptively adjust the positions of the attention windows, allowing DST to discover and attend to the valuable in-formation embedded in the speech.
Abstract: Enabled by multi-head self-attention, Transformer has exhibited remarkable results in speech emotion recognition (SER). Compared to the original full attention mechanism, window-based attention is more effective in learning fine-grained features while greatly reducing model redundancy. However, emotional cues are present in a multi-granularity manner such that the pre-defined fixed window can severely degrade the model flexibility. In addition, it is difficult to obtain the optimal window settings manually. In this paper, we propose a Deformable Speech Transformer, named DST, for SER task. DST determines the usage of window sizes conditioned on in-put speech via a light-weight decision network. Meanwhile, data-dependent offsets derived from acoustic features are utilized to adjust the positions of the attention windows, allowing DST to adaptively discover and attend to the valuable in-formation embedded in the speech. Extensive experiments on IEMOCAP and MELD demonstrate the superiority of DST.
TL;DR: In this paper , the authors investigated 30 speech representation models grouped into four categories: traditional feature engineering, generative, predictive, and contrastive, to study how these models encode the speech stimuli and align with human brain activity for the Moth Radio Hour fMRI (functional magnetic resonance imaging) dataset.
Abstract: A vast literature on brain encoding has effectively harnessed deep neural network models for accurately predicting brain activations from visual or text stimuli. Unfortunately, there is not much work on brain encoding for speech stimuli. The few existing studies on brain encoding for speech stimuli transcribe speech to text and then leverage text-only models for encoding, thereby ignoring audio signals completely. However, recently several speech representation learning models have revolutionized the field of speech processing. Inspired by the recent progress on deep learning models for speech, we present a first systematic study on understanding human speech processing by probing neural speech models to predict both language and auditory brain region activations. In particular, we investigate 30 speech representation models grouped into four categories: (i) traditional feature engineering, (ii) generative, (iii) predictive, and (iv) contrastive, to study how these models encode the speech stimuli and align with human brain activity for the Moth Radio Hour fMRI (functional magnetic resonance imaging) dataset. We find that both contrastive (Wav2Vec2.0) and predictive models (HuBERT, Data2Vec) are very accurate. Specifically, Data2Vec aligns the best with both language and auditory brain regions among all investigated models. We make our code publicly available 1 .
TL;DR: In this paper , a two-stage system based on DPRNN and UNet was proposed for the speech enhancement task and a Conformer-based system for the sound localization and detection task.
Abstract: The L3DAS23 of ICASSP Signal Processing Grand Challenge encourages research on 3D audio signal processing, such as 3D speech enhancement (SE) and 3D sound localization and detection (SELD). In this paper, we propose a two-stage system based on DPRNN and UNet for the SE task and a Conformer-based system for the SELD task. The proposed SE and SELD systems are evaluated on the L3DAS23 blind test sets. Results show that the proposed methods achieve state-of-the-art performance for 3D SE and SELD.
TL;DR: In this article , the EEG of participants listening to different speakers recounting an emotional life event was analyzed linguistically and modeled the EEG response at word offset using a GLM approach.
Abstract: Spontaneous speech is produced in chunks called Intonation Units (IUs). IUs are defined by a set of prosodic cues and occur in all human languages. Linguistic theory suggests that IUs pace the flow of information and serve as a window onto the dynamic focus of attention in speech processing. IUs provide a promising and hitherto unexplored theoretical framework for studying the neural mechanisms of communication, thanks to their universality and their consistent temporal structure across different grammatical and socio-cultural conditions. In this article, we identify a neural response unique to the boundary defined by the IU. We measured the EEG of participants who listened to different speakers recounting an emotional life event. We analyzed the speech stimuli linguistically, and modeled the EEG response at word offset using a GLM approach. We find that the EEG response to IU-final words differs from the response to IU-nonfinal words when acoustic boundary strength is held constant. To the best of our knowledge, this is the first time this is demonstrated in spontaneous speech under naturalistic listening conditions, and under a theoretical framework that connects the prosodic chunking of speech, on the one hand, with the flow of information during communication, on the other. Finally, we relate our findings to the body of research on rhythmic brain mechanism in speech processing by comparing the topographical distributions of neural speech tracking in model-predicted and empirical EEG. This qualitative comparison suggests that IU-related neural activity contributes to the previously characterized delta-band neural speech tracking.
TL;DR: In this paper , the authors discuss the potential of multivariate analysis as a tool to understand the neural mechanisms underlying predictive processes in speech perception and suggest that using multivariate techniques to measure informational content across the hierarchically organized speech-sensitive brain areas might enable us to specify the mechanisms by which prior knowledge and sensory speech signals are combined.
Abstract: Speech perception is heavily influenced by our expectations about what will be said. In this review, we discuss the potential of multivariate analysis as a tool to understand the neural mechanisms underlying predictive processes in speech perception. First, we discuss the advantages of multivariate approaches and what they have added to the understanding of speech processing from the acoustic-phonetic form of speech, over syllable identity and syntax, to its semantic content. Second, we suggest that using multivariate techniques to measure informational content across the hierarchically organised speech-sensitive brain areas might enable us to specify the mechanisms by which prior knowledge and sensory speech signals are combined. Specifically, this approach might allow us to decode how different priors, e.g. about a speaker's voice or about the topic of the current conversation, are represented at different processing stages and how incoming speech is as a result differently represented.
TL;DR: In this article , a low-latency sequence-to-sequence speech enhancement technique for electrolaryngeal (EL) speech is proposed, which utilizes self-supervised speech representation (SSL) and randomly sampling the predicted and ground-truth features to alleviate prediction errors in variance adaptor.
Abstract: In this paper, we propose a low-latency sequence-to-sequence speech enhancement technique for electrolaryngeal (EL) speech. A low-latency EL speech enhancement technique based on CLDNN was previously proposed to enable laryngectomees to produce relatively naturally sounding speech compared to the original EL speech. However, the naturalness of the enhanced speech is still far from that of the natural speech due to insufficient modeling accuracy of an acoustic feature sequence and speech waveform caused by frame-wise conversion processing. To solve this problem, in this paper, we propose a low-latency sequence-to-sequence EL speech enhancement technique based on CLDNN-FastSpeech2-based VC. Moreover, to improve various factors such as naturalness, intelligibility, and robustness, we also propose the following techniques: 1) utilizing a self-supervised speech representation (SSL) and 2) randomly sampling the predicted and ground-truth features to alleviate prediction errors in variance adaptor. The experimental results demonstrate that the proposed method yields better performance in both objective and subjective evaluations compared to the conventional frame-based EL speech enhancement technique.
TL;DR: In this paper , a differentiable estimator for low-level acoustic descriptors involving frequency related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features is proposed.
Abstract: Speech enhancement models have greatly progressed in recent years, but still show limits in perceptual quality of their speech outputs. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable estimator for four categories of low-level acoustic descriptors involving: frequency-related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features. Un-like prior work that looks at aggregated acoustic parameters or a few categories of acoustic parameters, our temporal acoustic parameter (TAP) loss enables auxiliary optimization and improvement of many fine-grained speech characteristics in enhancement workflows. We show that adding TAPLoss as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility. We use data from the Deep Noise Suppression 2020 Challenge to demonstrate that both time-domain models and time-frequency domain models can benefit from our method.
TL;DR: In this paper , an unsupervised metric called SpeechLMScore is proposed to evaluate generated speech using a speech language model, which computes the average log-probability of a speech signal by mapping it into discrete tokens and measures the average probability of generating the sequence of tokens.
Abstract: While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming. Previous studies on automatic speech quality assessment address the problem by predicting human evaluation scores with machine learning models. However, they rely on supervised learning and thus suffer from high annotation costs and domain-shift problems. We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech language model. SpeechLMScore computes the average log-probability of a speech signal by mapping it into discrete tokens and measures the average probability of generating the sequence of tokens. Therefore, it does not require human annotation and is a highly scalable framework. Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks including voice conversion, text-to-speech, and speech enhancement.
TL;DR: Wu et al. as mentioned in this paper used self-supervised learning for speaker change detection, overlapped speech detection, and voice activity detection in the AMI corpus, and achieved state-of-the-art performance.
Abstract: Self-supervised learning approaches have lately achieved great success on a broad spectrum of machine learning problems. In the field of speech processing, one of the most successful recent self-supervised models is wav2vec 2.0. In this paper, we explore the effectiveness of this model on three basic speech classification tasks: speaker change detection, overlapped speech detection, and voice activity detection. First, we concentrate on only one task -- speaker change detection -- where our proposed system surpasses the previously reported results on four different corpora, and achieves comparable performance even when trained on out-of-domain data from an artificially designed dataset. Then we expand our approach to tackle all three tasks in a single multitask system with state-of-the-art performance on the AMI corpus. The implementation of the algorithms in this paper is publicly available at https://github.com/mkunes/w2v2_audioFrameClassification.
TL;DR: High-gamma band cortical responses to speech depend on selective attention. Speech tracking in the high-gamma band is robust to the speaker's fundamental frequency, but only for attended speech.
Abstract: Auditory cortical responses to speech obtained by magnetoencephalography (MEG) show robust speech tracking to the speaker's fundamental frequency in the high-gamma band (70-200 Hz), but little is currently known about whether such responses depend on the focus of selective attention. In this study 22 human subjects listened to concurrent, fixed-rate, speech from male and female speakers, and were asked to selectively attend to one speaker at a time, while their neural responses were recorded with MEG. The male speaker's pitch range coincided with the lower range of the high-gamma band, whereas the female speaker's higher pitch range had much less overlap, and only at the upper end of the high-gamma band. Neural responses were analyzed using the temporal response function (TRF) framework. As expected, the responses demonstrate robust speech tracking of the fundamental frequency in the high-gamma band, but only to the male's speech, with a peak latency of ~40 ms. Critically, the response magnitude depends on selective attention: the response to the male speech is significantly greater when male speech is attended than when it is not attended, under acoustically identical conditions. This is a clear demonstration that even very early cortical auditory responses are influenced by top-down, cognitive, neural processing mechanisms.
TL;DR: In this article , a survey of time and frequency analysis techniques used to analyze speech signals for various applications such as automatic recognition of associated text and speaker identity, and coding is presented.