TL;DR: This work shows that substituting rectified linear units for logistic units improves generalization and makes training of deep networks faster and simpler.
Abstract: Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology. Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data.
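As an illustration of the computational unit described in the abstract above, the following minimal sketch contrasts a logistic unit with a rectified linear unit. The function names (`logistic`, `relu`, `unit`) and the toy weights are assumptions made for this example and are not taken from the paper.

```python
import math


def logistic(x: float) -> float:
    """Logistic (sigmoid) non-linearity: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    """Rectified linear non-linearity: linear for positive input, zero otherwise."""
    return x if x > 0.0 else 0.0


def unit(weights, bias, inputs, nonlinearity):
    """One unit: a linear projection followed by a point-wise non-linearity."""
    projection = sum(w * v for w, v in zip(weights, inputs)) + bias
    return nonlinearity(projection)


if __name__ == "__main__":
    # Hypothetical toy weights and inputs, chosen only for illustration.
    w, b, x = [1.0, -1.0], 0.5, [2.0, 1.0]
    print("logistic unit:", unit(w, b, x, logistic))
    print("rectified linear unit:", unit(w, b, x, relu))
```

Swapping the non-linearity leaves the linear projection unchanged, which is why the abstract can compare the two kinds of network at the same topology.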
TL;DR: A neuroimaging study reveals how coupled brain oscillations at different frequencies align with quasi-rhythmic features of continuous speech such as prosody, syllables, and phonemes.
Abstract: Cortical oscillations are likely candidates for segmentation and coding of continuous speech. Here, we monitored continuous speech processing with magnetoencephalography (MEG) to unravel the principles of speech segmentation and coding. We demonstrate that speech entrains the phase of low-frequency (delta, theta) and the amplitude of high-frequency (gamma) oscillations in the auditory cortex. Phase entrainment is stronger in the right and amplitude entrainment is stronger in the left auditory cortex. Furthermore, edges in the speech envelope phase reset auditory cortex oscillations thereby enhancing their entrainment to speech. This mechanism adapts to the changing physical features of the speech envelope and enables efficient, stimulus-specific speech sampling. Finally, we show that within the auditory cortex, coupling between delta, theta, and gamma oscillations increases following speech edges. Importantly, all couplings (i.e., brain-speech and also within the cortex) attenuate for backward-presented speech, suggesting top-down control. We conclude that segmentation and coding of speech relies on a nested hierarchy of entrained cortical oscillations.
TL;DR: An investigation of the Parkinson dataset using well-known machine learning tools finds that sustained vowels carry more PD-discriminative information and that representing the samples of a subject with central tendency and dispersion metrics improves generalization of the predictive model.
Abstract: There has been an increased interest in speech pattern analysis applications of Parkinsonism for building predictive telediagnosis and telemonitoring models. For this purpose, we have collected a wide variety of voice samples, including sustained vowels, words, and sentences compiled from a set of speaking exercises for people with Parkinson's disease. There are two main issues in learning from such a dataset that consists of multiple speech recordings per subject: 1) How predictive are these various types of voice samples, e.g., sustained vowels versus words, in Parkinson's disease (PD) diagnosis? 2) How well do the central tendency and dispersion metrics serve as representatives of all sample recordings of a subject? In this paper, we investigate our Parkinson dataset using well-known machine learning tools and find, as reported in the literature, that sustained vowels carry more PD-discriminative information. We have also found that rather than using each voice recording of each subject as an independent data sample, representing the samples of a subject with central tendency and dispersion metrics improves generalization of the predictive model.
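The per-subject representation mentioned in the abstract above can be sketched as follows. The abstract does not say which central tendency and dispersion metrics were used, so the mean and sample standard deviation here are assumptions, as is the hypothetical function name `summarize_subject`.

```python
from statistics import mean, stdev


def summarize_subject(recordings):
    """Collapse all recordings of one subject into a single feature vector.

    `recordings` is a list of equal-length feature vectors, one per voice
    recording (at least two are needed for the sample standard deviation).
    The result is the per-feature means followed by the per-feature standard
    deviations. Both metrics are assumed choices, not the paper's.
    """
    columns = list(zip(*recordings))           # one tuple per feature
    central = [mean(col) for col in columns]   # central tendency
    spread = [stdev(col) for col in columns]   # dispersion
    return central + spread


if __name__ == "__main__":
    # Two hypothetical recordings with two features each.
    print(summarize_subject([[1.0, 10.0], [3.0, 14.0]]))
```

Each subject then contributes one sample to the classifier instead of one sample per recording.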
TL;DR: This work proposes to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs.
Abstract: Formulating speech separation as a binary classification problem has been shown to be effective. While good separation performance is achieved in matched test conditions using kernel support vector machines (SVMs), separation in unmatched conditions involving new speakers and environments remains a big challenge. A simple yet effective method to cope with the mismatch is to include many different acoustic conditions into the training set. However, large-scale training is almost intractable for kernel machines due to computational complexity. To enable training on relatively large datasets, we propose to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs. For feature learning, we employ standard pre-trained deep neural networks (DNNs). The proposed DNN-SVM system is trained on a variety of acoustic conditions within a reasonable amount of time. Experiments on various test mixtures demonstrate good generalization to unseen speakers and background noises.
TL;DR: This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech.
Abstract: This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described.
TL;DR: A common evaluation framework including datasets, tasks, and evaluation metrics for both speech enhancement and ASR techniques is proposed, which will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge.
Abstract: Recently, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques, and automatic speech recognition (ASR) techniques robust to reverberation. To evaluate state-of-the-art algorithms and obtain new insights regarding potential future research directions, we propose a common evaluation framework including datasets, tasks, and evaluation metrics for both speech enhancement and ASR techniques. The proposed framework will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. This paper describes the rationale behind the challenge, and provides a detailed description of the evaluation framework and benchmark results.
TL;DR: The results suggest that, in a complex listening environment, auditory cortex can selectively encode a speech stream in a background insensitive manner, and this stable neural representation of speech provides a plausible basis for background-invariant recognition of speech.
Abstract: Speech recognition is remarkably robust to the listening background, even when the energy of background sounds strongly overlaps with that of speech. How the brain transforms the corrupted acoustic signal into a reliable neural representation suitable for speech recognition, however, remains elusive. Here, we hypothesize that this transformation is performed at the level of auditory cortex through adaptive neural encoding, and we test the hypothesis by recording, using MEG, the neural responses of human subjects listening to a narrated story. Spectrally matched stationary noise, which has maximal acoustic overlap with the speech, is mixed in at various intensity levels. Despite the severe acoustic interference caused by this noise, it is here demonstrated that low-frequency auditory cortical activity is reliably synchronized to the slow temporal modulations of speech, even when the noise is twice as strong as the speech. Such a reliable neural representation is maintained by intensity contrast gain control and by adaptive processing of temporal modulations at different time scales, corresponding to the neural δ and θ bands. Critically, the precision of this neural synchronization predicts how well a listener can recognize speech in noise, indicating that the precision of the auditory cortical representation limits the performance of speech recognition in noise. Together, these results suggest that, in a complex listening environment, auditory cortex can selectively encode a speech stream in a background insensitive manner, and this stable neural representation of speech provides a plausible basis for background-invariant recognition of speech.
TL;DR: Behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion are illustrated.
Abstract: The expression and experience of human behavior are complex and multimodal and characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer an important means for measuring and modeling human behavior. Observational research and practice across a variety of domains from commerce to healthcare rely on speech- and language-based informatics for crucial assessment and diagnostic information and for planning and tracking response to an intervention. In this paper, we describe some of the opportunities as well as emerging methodologies and applications of human behavioral signal processing (BSP) technology and algorithms for quantitatively understanding and modeling typical, atypical, and distressed human behavior with a specific focus on speech- and language-based communicative, affective, and social behavior. We describe the three important BSP components of acquiring behavioral data in an ecologically valid manner across laboratory to real-world settings, extracting and analyzing behavioral cues from measured data, and developing models offering predictive and decision-making support. We highlight both the foundational speech and language processing building blocks as well as the novel processing and modeling opportunities. Using examples drawn from specific real-world applications ranging from literacy assessment and autism diagnostics to psychotherapy for addiction and marital well being, we illustrate behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion.
TL;DR: In this article, a microprocessor or other application specific integrated circuit provides a mechanism for comparing the relative transit times between a user's voice, a primary speech microphone, and a secondary compliance microphone to determine if the speech microphone is placed in an appropriate proximity to the user's mouth.
Abstract: Apparatus and method that improves speech recognition accuracy, by monitoring the position of a user's headset-mounted speech microphone, and prompting the user to reconfigure the speech microphone's orientation if required. A microprocessor or other application specific integrated circuit provides a mechanism for comparing the relative transit times between a user's voice, a primary speech microphone, and a secondary compliance microphone. The difference in transit times may be used to determine if the speech microphone is placed in an appropriate proximity to the user's mouth. If required, the user is automatically prompted to reposition the speech microphone.
TL;DR: A novel, data-driven approach to voice activity detection, based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features, clearly outperforms three state-of-the-art reference algorithms under the same conditions.
Abstract: A novel, data-driven approach to voice activity detection is presented. The approach is based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features. To approximate real-life scenarios, large amounts of noisy speech instances are mixed by using both read and spontaneous speech from the TIMIT and Buckeye corpora, and adding real long term recordings of diverse noise types. The approach is evaluated on unseen synthetically mixed test data as well as a real-life test set consisting of four full-length Hollywood movies. A frame-wise Equal Error Rate (EER) of 33.2% is obtained for the four movies and an EER of 9.6% is obtained for the synthetic test data at a peak SNR of 0 dB, clearly outperforming three state-of-the-art reference algorithms under the same conditions.
TL;DR: These findings suggest that generalization of foreign-accent adaptation is the result of exposure to systematic variability in accented speech that is similar across talkers from multiple language backgrounds.
Abstract: Foreign-accented speech can be difficult to understand but listeners can adapt to novel talkers and accents with appropriate experience. Previous studies have demonstrated talker-independent but accent-dependent learning after training on multiple talkers from a single language background. Here, listeners instead were exposed to talkers from five language backgrounds during training. After training, listeners generalized their learning to novel talkers from language backgrounds both included and not included in the training set. These findings suggest that generalization of foreign-accent adaptation is the result of exposure to systematic variability in accented speech that is similar across talkers from multiple language backgrounds.
TL;DR: In this article, a system may be used to drive an array of loudspeakers to produce a sound field that includes a source component, whose energy is concentrated along a first direction relative to the array, and a masking component that is based on an estimated intensity of the source component in a second direction that is different from the first direction.
Abstract: Arrangements are described that may be used to reduce the intelligibility of speech using masker signals which are obfuscated yet correlated versions of the speech. Other applications of pitch analysis and demodulation are also described. A system may be used to drive an array of loudspeakers to produce a sound field that includes a source component, whose energy is concentrated along a first direction relative to the array, and a masking component that is based on an estimated intensity of the source component in a second direction that is different from the first direction.
TL;DR: In this paper, a wireless system comprises at least one subscriber unit in wireless communication with an infrastructure; each subscriber unit implements a speech recognition client, and the infrastructure comprises a speech recognition server.
Abstract: A wireless system comprises at least one subscriber unit in wireless communication with an infrastructure. Each subscriber unit implements a speech recognition client, and the infrastructure comprises a speech recognition server. A given subscriber unit takes as input an unencoded speech signal that is subsequently parameterized by the speech recognition client. The parameterized speech is then provided to the speech recognition server that, in turn, performs speech recognition analysis on the parameterized speech. Information signals, based in part upon any recognized utterances identified by the speech recognition analysis, are subsequently provided to the subscriber unit. The information signals may be used to control the subscriber unit itself; to control one or more devices coupled to the subscriber unit, or may be operated upon by the subscriber unit or devices coupled thereto.
TL;DR: This survey wishes to demonstrate the significant advances that have been made during the last decade in the field of discrete Fourier transform domain-based single-channel noise reduction for speech enhancement.
Abstract: As speech processing devices like mobile phones, voice controlled devices, and hearing aids have increased in popularity, people expect them to work anywhere and at any time without user intervention. However, the presence of acoustical disturbances limits the use of these applications, degrades their performance, or causes the user difficulties in understanding the conversation or appreciating the device. A common way to reduce the effects of such disturbances is through the use of single-microphone noise reduction algorithms for speech enhancement. The field of single-microphone noise reduction for speech enhancement comprises a history of more than 30 years of research. In this survey, we wish to demonstrate the significant advances that have been made during the last decade in the field of discrete Fourier transform domain-based single-channel noise reduction for speech enhancement. Furthermore, our goal is to provide a concise description of a state-of-the-art speech enhancement system, and demonstrate the relative importance of the various building blocks of such a system. This allows the non-expert DSP practitioner to judge the relevance of each building block and to implement a close-to-optimal enhancement system for the particular application at hand. Table of Contents: Introduction / Single Channel Speech Enhancement: General Principles / DFT-Based Speech Enhancement Methods: Signal Model and Notation / Speech DFT Estimators / Speech Presence Probability Estimation / Noise PSD Estimation / Speech PSD Estimation / Performance Evaluation Methods / Simulation Experiments with Single-Channel Enhancement Systems / Future Directions
TL;DR: It is shown, in the context of a dual-pathway model, that internal simulation shapes perception in a context-dependent manner.
Abstract: The computational role of efference copies is widely appreciated in action and perception research, but their properties for speech processing remain murky. We tested the functional specificity of auditory efference copies using magnetoencephalography recordings in an unconventional pairing: We used a classical cognitive manipulation (mental imagery, to elicit internal simulation and estimation) with a well-established experimental paradigm (one-shot repetition, to assess neuronal specificity). Participants performed tasks that differentially implicated internal prediction of sensory consequences (overt speaking, imagined speaking, and imagined hearing), and their modulatory effects on the perception of an auditory (syllable) probe were assessed. Remarkably, the neural responses to overt syllable probes vary systematically, both in terms of directionality (suppression, enhancement) and temporal dynamics (early, late), as a function of the preceding covert mental imagery adaptor. We show, in the context of a dual-pathway model, that internal simulation shapes perception in a context-dependent manner.
TL;DR: Experimental results show the effectiveness of the proposed FLAF-based architectures in nonlinear AEC scenarios, thus providing an important solution to the modeling of nonlinear acoustic channels.
Abstract: This paper introduces a new class of nonlinear adaptive filters, whose structure is based on the Hammerstein model. Such filters derive from the functional link adaptive filter (FLAF) model, defined by a nonlinear input expansion, which enhances the representation of the input signal through a projection in a higher dimensional space, and a subsequent adaptive filtering. In particular, two robust FLAF-based architectures are proposed and designed ad hoc to tackle nonlinearities in acoustic echo cancellation (AEC). The simplest architecture is the split FLAF, which separates the adaptation of linear and nonlinear elements using two different adaptive filters in parallel. In this way, the architecture can perform the linear and the nonlinear modeling separately, each at its best. Moreover, in order to give robustness against different degrees of nonlinearity, a collaborative FLAF is proposed based on the adaptive combination of filters. Such an architecture achieves the best performance regardless of the nonlinearity degree in the echo path. Experimental results show the effectiveness of the proposed FLAF-based architectures in nonlinear AEC scenarios, thus providing an important solution to the modeling of nonlinear acoustic channels.
TL;DR: This book discusses how word recognition may evolve from infant speech perception capacities, and issues of process and representation in lexical access.
Abstract: Overview, Shillcock, Altmann. Introduction to the Chapters by Werker and Jusczyk, Clifton. How Word Recognition may Evolve from Infant Speech Perception Capacities, Jusczyk. Developmental Changes in Cross-language Speech Perception: Implications for Cognitive Models of Speech Processing, Werker. The Time Course of Prelexical Processing: The Syllabic Hypothesis Revisited, Dupoux. Language-specific Processing: Does the Evidence Converge? Cutler. Representation and Access of Derived Words in English, Tyler, Waksler, Marslen-Wilson. What Determines Morphological Relatedness in the Lexicon? Comments on the Chapter by Tyler, Waksler, and Marslen-Wilson, Burani. Modularity and the Processing of Closed-class Words, Shillcock, Gurman Bard. Issues of Process and Representation in Lexical Access, Marslen-Wilson. Bottom-up Connectionist Models of 'Interaction', Norris. Competitor Effects During Lexical Access: Chasing Zipf's Tail, Bard, Shillcock. Connections, Competitions, and Cohorts: Comments on the Chapters by Marslen-Wilson, Norris, and Bard & Shillcock, Tabossi. More on Combinatory Lexical Information: Thematic Structure in Parsing and Interpretation, Tanenhaus et al. Reconsidering Reactivation, Nicol.
TL;DR: Different methods of separating voiced and unvoiced segments of speech signals are presented, based on short time energy calculation, short time magnitude calculation, zero crossing rate calculation, and autocorrelation of different segments of speech signals, showing that the voiced segment of speech remains periodic after applying the autocorrelation function.
Abstract: This paper presents different methods of separating voiced and unvoiced segments of speech signals. These methods are based on short time energy calculation, short time magnitude calculation, and zero crossing rate (ZCR) calculation and on the basis of autocorrelation of different segments of speech signals. From theoretical studies, it has been observed that energy and magnitude for voiced segments are high, whereas ZCR is low for voiced signals. The autocorrelation function is used here to show that the voiced segment of speech remains periodic after applying the autocorrelation function, while unvoiced signals lose their periodicity. Experimental results have been presented in this paper to verify theoretical studies.
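The four frame-level measures named in the abstract above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names, the 200-sample frame, and the synthetic test signals (a sine wave standing in for a voiced frame, seeded low-amplitude noise standing in for an unvoiced one) are all hypothetical.

```python
import math
import random


def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)


def short_time_magnitude(frame):
    """Sum of absolute sample values in the frame."""
    return sum(abs(s) for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def autocorrelation(frame, lag):
    """Unnormalized autocorrelation of the frame at the given lag."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))


# Synthetic stand-ins (assumptions): a sine with a 40-sample period for a
# voiced frame, and seeded low-amplitude noise for an unvoiced frame.
voiced = [math.sin(2 * math.pi * n / 40) for n in range(200)]
rng = random.Random(0)
unvoiced = [rng.uniform(-0.1, 0.1) for _ in range(200)]

if __name__ == "__main__":
    for name, frame in (("voiced", voiced), ("unvoiced", unvoiced)):
        periodicity = autocorrelation(frame, 40) / autocorrelation(frame, 0)
        print(name, short_time_energy(frame), short_time_magnitude(frame),
              zero_crossing_rate(frame), periodicity)
```

On these synthetic frames the sine shows higher energy and magnitude, a lower zero crossing rate, and a strong autocorrelation peak at its period, which matches the qualitative behavior the abstract describes for voiced speech.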
TL;DR: From the synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features, and the best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%.
Abstract: Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the fact that current analysis-synthesis techniques operate on frame level and make the frame-by-frame independence assumption, we propose to adopt magnitude/phase modulation features to detect synthetic speech from human speech. Modulation features derived from magnitude/phase spectrum carry long-term temporal information of speech, and may be able to detect temporal artifacts caused by the frame-by-frame processing in the synthesis of speech signal. From our synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features. The best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%, which is significantly lower than the 1.25% of phase features and 10.98% of MFCC features.
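For readers unfamiliar with the metric reported above, the equal error rate is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below is one simple way to estimate it from detector scores; the function name, the convention that higher scores mean "human", and the toy scores are assumptions for illustration, not the paper's evaluation code.

```python
def equal_error_rate(human_scores, synthetic_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted as human when its score is >= the threshold.
    Returns the mean of the false acceptance and false rejection rates at
    the threshold where the two are closest.
    """
    thresholds = sorted(set(human_scores) | set(synthetic_scores))
    best_gap, best_eer = None, None
    for th in thresholds:
        # Human speech wrongly rejected.
        frr = sum(s < th for s in human_scores) / len(human_scores)
        # Synthetic speech wrongly accepted.
        far = sum(s >= th for s in synthetic_scores) / len(synthetic_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    # Hypothetical toy scores, for illustration only.
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

With finitely many scores the two error rates may never match exactly, so this sketch reports their average at the closest threshold; other interpolation schemes exist.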
TL;DR: The data provide the first direct electrophysiological evidence that the envelope of speech is robustly tracked in non-primary auditory cortex (belt areas in particular), and suggest that the considered higher-order regions (STG and Broca's region) partake in a more abstract linguistic analysis.
Abstract: Humans are highly adept at processing speech. Recently, it has been shown that slow temporal information in speech (i.e., the envelope of speech) is critical for speech comprehension. Furthermore, it has been found that evoked electric potentials in human cortex are correlated with the speech envelope. However, it has been unclear whether this essential linguistic feature is encoded differentially in specific regions, or whether it is represented throughout the auditory system. To answer this question, we recorded neural data with high temporal resolution directly from the cortex while human subjects listened to a spoken story. We found that the gamma activity in human auditory cortex robustly tracks the speech envelope. The effect is so marked that it is observed during a single presentation of the spoken story to each subject. The effect is stronger in regions situated relatively early in the auditory pathway (belt areas) compared to other regions involved in speech processing, including the superior temporal gyrus (STG) and the posterior inferior frontal gyrus (Broca's region). To further distinguish whether speech envelope is encoded in the auditory system as a phonological (speech-related), or instead as a more general acoustic feature, we also probed the auditory system with a melodic stimulus. We found that belt areas track melody envelope weakly, and are the only region considered to do so. Together, our data provide the first direct electrophysiological evidence that the envelope of speech is robustly tracked in non-primary auditory cortex (belt areas in particular), and suggest that the considered higher-order regions (STG and Broca's region) partake in a more abstract linguistic analysis.
TL;DR: The link demonstrated between visual activity and auditory speech perception indicates that visuoauditory synergy is crucial for cross-modal plasticity and fostering speech-comprehension recovery in adult cochlear-implanted deaf patients.
Abstract: Modern cochlear implantation technologies allow deaf patients to understand auditory speech; however, the implants deliver only a coarse auditory input and patients must use long-term adaptive processes to achieve coherent percepts. In adults with post-lingual deafness, the high progress of speech recovery is observed during the first year after cochlear implantation, but there is a large range of variability in the level of cochlear implant outcomes and the temporal evolution of recovery. It has been proposed that when profoundly deaf subjects receive a cochlear implant, the visual cross-modal reorganization of the brain is deleterious for auditory speech recovery. We tested this hypothesis in post-lingually deaf adults by analysing whether brain activity shortly after implantation correlated with the level of auditory recovery 6 months later. Based on brain activity induced by a speech-processing task, we found strong positive correlations in areas outside the auditory cortex. The highest positive correlations were found in the occipital cortex involved in visual processing, as well as in the posterior-temporal cortex known for audio-visual integration. The other area, which positively correlated with auditory speech recovery, was localized in the left inferior frontal area known for speech processing. Our results demonstrate that the visual modality's functional level is related to the proficiency level of auditory recovery. Based on the positive correlation of visual activity with auditory speech recovery, we suggest that visual modality may facilitate the perception of the word's auditory counterpart in communicative situations. The link demonstrated between visual activity and auditory speech perception indicates that visuoauditory synergy is crucial for cross-modal plasticity and fostering speech-comprehension recovery in adult cochlear-implanted deaf patients.
TL;DR: It is demonstrated that a DNN with input consisting of multiple frames of mel frequency cepstral coefficients (MFCCs) yields drastically lower frame-wise error rates on YouTube videos compared to a conventional GMM based system.
Abstract: Speech activity detection (SAD) is an important first step in speech processing. Commonly used methods (e.g., frame-level classification using Gaussian mixture models (GMMs)) work well under stationary noise conditions, but do not generalize well to domains such as YouTube, where videos may exhibit a diverse range of environmental conditions. One solution is to augment the conventional cepstral features with additional, hand-engineered features (e.g., spectral flux, spectral centroid, multiband spectral entropies) which are robust to changes in environment and recording condition. An alternative approach, explored here, is to learn robust features during the course of training using an appropriate architecture such as deep neural networks (DNNs). In this paper we demonstrate that a DNN with input consisting of multiple frames of mel frequency cepstral coefficients (MFCCs) yields drastically lower frame-wise error rates (19.6%) on YouTube videos compared to a conventional GMM based system (40%).
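The "input consisting of multiple frames" mentioned in the abstract above is commonly built by concatenating each frame with its neighbors. The sketch below shows one way to do this; the function name `stack_frames`, the context sizes, and the edge handling (repeating the first or last frame) are assumptions for illustration, since the abstract does not specify them.

```python
def stack_frames(frames, left, right):
    """Concatenate each frame with `left` past and `right` future frames.

    `frames` is a list of equal-length feature vectors (e.g. MFCCs per frame).
    At the start and end of the sequence, the first or last frame is repeated
    to fill the window (an assumed padding choice).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to valid frame range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


if __name__ == "__main__":
    # Three hypothetical 2-dimensional frames with one frame of context per side.
    for row in stack_frames([[1, 2], [3, 4], [5, 6]], left=1, right=1):
        print(row)
```

Each stacked vector has `(left + 1 + right)` times the original dimensionality, giving a frame-level classifier access to short-term temporal context.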
TL;DR: The advances in the multilingual text and speech database GlobalPhone, a multilingual database of high-quality read speech with corresponding transcriptions and pronunciation dictionaries in 20 languages, are described.
Abstract: This paper describes the advances in the multilingual text and speech database GlobalPhone, a multilingual database of high-quality read speech with corresponding transcriptions and pronunciation dictionaries in 20 languages. GlobalPhone was designed to be uniform across languages with respect to the amount of data, speech quality, the collection scenario, the transcription and phone set conventions. With more than 400 hours of transcribed audio data from more than 2000 native speakers GlobalPhone supplies an excellent basis for research in the areas of multilingual speech recognition, rapid deployment of speech processing systems to yet unsupported languages, language identification tasks, speaker recognition in multiple languages, multilingual speech synthesis, as well as monolingual speech recognition in a large variety of languages.
TL;DR: In this article, a system and method for parallel speech recognition processing of multiple audio signals produced by multiple microphones in a handheld portable electronic device is described, where a primary processor transitions to a power-saving mode while an auxiliary processor remains active.
Abstract: A system and method for parallel speech recognition processing of multiple audio signals produced by multiple microphones in a handheld portable electronic device. In one embodiment, a primary processor transitions to a power-saving mode while an auxiliary processor remains active. The auxiliary processor then monitors the speech of a user of the device to detect a wake-up command by speech recognition processing the audio signals in parallel. When the auxiliary processor detects the command it then signals the primary processor to transition to active mode. The auxiliary processor may also identify to the primary processor which microphone resulted in the command being recognized with the highest confidence. Other embodiments are also described.
TL;DR: Some of the most used methods for reducing the information of each segment in the audio signal into a relatively small number of parameters, or features, are presented.
Abstract: The time domain waveform of a speech signal carries all of the auditory information. From the phonological point of view, little can be said on the basis of the waveform itself. However, past research in mathematics, acoustics, and speech technology has provided many methods for converting data into a form that can be considered information if interpreted correctly. In order to find statistically relevant information in incoming data, it is important to have mechanisms for reducing the information of each segment in the audio signal into a relatively small number of parameters, or features. These features should describe each segment in such a characteristic way that other similar segments can be grouped together by comparing their features. There are a great many interesting and exceptional ways to describe the speech signal in terms of parameters. Though they all have their strengths and weaknesses, we present some of the most used methods and their importance.
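As a rough illustration of reducing each segment of an audio signal to a small number of parameters, the sketch below splits a signal into overlapping frames and computes two classic per-frame features, short-time energy and zero-crossing rate. The abstract does not say which features the paper covers, so this choice is an assumption, as are the function name `frame_features`, the frame length of 400 samples, the hop of 160 samples, and the synthetic 100 Hz test tone.

```python
import math


def frame_features(signal, frame_len=400, hop=160):
    """Return one (energy, zero_crossing_rate) pair per frame.

    frame_len and hop are illustrative defaults (25 ms and 10 ms at an
    assumed 16 kHz sample rate); they are not taken from the paper.
    """
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Short-time energy: mean of the squared samples in the frame.
        energy = sum(x * x for x in frame) / frame_len
        # Zero-crossing rate: fraction of adjacent sample pairs whose sign differs.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        features.append((energy, crossings / (frame_len - 1)))
    return features


# Hypothetical input: one second of a 100 Hz sine tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(sample_rate)]
feats = frame_features(tone)
```

Each frame of 400 samples is summarized by just two numbers, so similar segments can be compared or grouped by their feature pairs instead of their raw waveforms.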
TL;DR: The results indicate that older adults increasingly recruit cognitive control networks, even under optimal listening conditions, at the expense of these systems’ dynamic range.
Abstract: Speech comprehension abilities decline with age and with age-related hearing loss, but it is unclear how this decline expresses in terms of central neural mechanisms. The current study examined neural speech processing in a group of older adults (aged 56–77, n = 16, with varying degrees of sensorineural hearing loss), and compared them to a cohort of young adults (aged 22–31, n = 30, self-reported normal hearing). In a functional MRI experiment, listeners heard and repeated back degraded sentences (4-band vocoded, where the temporal envelope of the acoustic signal is preserved, while the spectral information is substantially degraded). Behaviorally, older adults adapted to degraded speech at the same rate as young listeners, although their overall comprehension of degraded speech was lower. Neurally, both older and young adults relied on the left anterior insula for degraded more than clear speech perception. However, anterior insula engagement in older adults was dependent on hearing acuity. Young adults additionally employed the anterior cingulate cortex (ACC). Interestingly, this age group × degradation interaction was driven by a reduced dynamic range in older adults who displayed elevated levels of ACC activity for both degraded and clear speech, consistent with a persistent upregulation in cognitive control irrespective of task difficulty. For correct speech comprehension, older adults relied on the middle frontal gyrus in addition to a core speech comprehension network recruited by younger adults suggestive of a compensatory mechanism. Taken together, the results indicate that older adults increasingly recruit cognitive control networks, even under optimal listening conditions, at the expense of these systems’ dynamic range.
TL;DR: This article presents a methodical approach by designing adapted time-frequency (T-F) kernels for diagnosis applications with illustrations on three selected medical applications using the electroencephalogram (EEG), heart rate variability (HRV), and pathological speech signals.
Abstract: This article presents a methodical approach for improving quadratic time-frequency distribution (QTFD) methods by designing adapted time-frequency (T-F) kernels for diagnosis applications with illustrations on three selected medical applications using the electroencephalogram (EEG), heart rate variability (HRV), and pathological speech signals. Manual and visual inspection of such nonstationary multicomponent signals is laborious especially for long recordings, requiring skilled interpreters with possible subjective judgments and errors. Automated assessment is therefore preferred for objective diagnosis by using T-F distributions (TFDs) to extract more information. This requires designing advanced high-resolution TFDs for automating classification and interpretation. As QTFD methods are general and their coverage is very broad, this article concentrates on methodologies using only a few selected medical problems studied by the authors.
TL;DR: This introduction to crowdsourcing as a means of rapidly processing speech data offers speech researchers the hope that they can spend much less time dealing with the data gathering/annotation bottleneck, leaving them to focus on the scientific issues.
Abstract: Provides an insightful and practical introduction to crowdsourcing as a means of rapidly processing speech data. Intended for those who want to get started in the domain and learn how to set up a task, what interfaces are available, how to assess the work, etc., as well as for those who have already used crowdsourcing and want to create better tasks and obtain better assessments of the work of the crowd. It will include screenshots to show examples of good and poor interfaces; examples of case studies in speech processing tasks, going through the task creation process, reviewing options in the interface, in the choice of medium (MTurk or other) and explaining choices, etc. Addresses important aspects of this new technique that should be mastered before attempting a crowdsourcing application. Offers speech researchers the hope that they can spend much less time dealing with the data gathering/annotation bottleneck, leaving them to focus on the scientific issues. Readers will directly benefit from the book's successful examples of how crowdsourcing was implemented for speech processing, discussions of interface and processing choices that worked and choices that didn't, and guidelines on how to play and record speech over the internet, how to design tasks, and how to assess workers. Essential reading for researchers and practitioners in speech research groups involved in speech processing.
TL;DR: A significant improvement in speech perception in noise with partial tripolar stimulation is shown and all subjects benefited from the current focused speech processing strategy.
TL;DR: It is found that a network of 8 regions, including the anterior superior temporal gyrus (STG) just anterior to Heschl's gyrus and the right midposterior STG, respond more strongly to speech perceived as song than to mere speech.
Abstract: It is normally obvious to listeners whether a human vocalization is intended to be heard as speech or song. However, the 2 signals are remarkably similar acoustically. A naturally occurring boundary case between speech and song has been discovered where a spoken phrase sounds as if it were sung when isolated and repeated. In the present study, an extensive search of audiobooks uncovered additional similar examples, which were contrasted with samples from the same corpus that do not sound like song, despite containing clear prosodic pitch contours. Using functional magnetic resonance imaging, we show that hearing these 2 closely matched stimuli is not associated with differences in response of early auditory areas. Rather, we find that a network of 8 regions, including the anterior superior temporal gyrus (STG) just anterior to Heschl’s gyrus and the right midposterior STG, respond more strongly to speech perceived as song than to mere speech. This network overlaps a number of areas previously associated with pitch extraction and song production, confirming that phrases originally intended to be heard as speech can, under certain circumstances, be heard as song. Our results suggest that song processing compared with speech processing makes increased demands on pitch processing and auditory-motor integration.