Top 1021 papers published in the topic of Speech processing in 2013

Showing papers on "Speech processing" published in 2013

Proceedings Article•10.1109/ICASSP.2013.6638312•

On rectified linear units for speech processing

Matthew D. Zeiler¹, Marc'Aurelio Ranzato², Rajat Monga², Mark Z. Mao², Ke Yang², Quoc V. Le², Patrick Nguyen², Andrew W. Senior², Vincent Vanhoucke², Jeffrey Dean², Geoffrey E. Hinton³•Institutions (3)

New York University¹, Google², University of Toronto³

26 May 2013

TL;DR: This work shows that substituting rectified linear units for logistic units improves generalization and makes training of deep networks faster and simpler.

Abstract: Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task, achieving lower word error rates than using a logistic network with the same topology. Similarly, in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data.
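
The computational unit described in this abstract (a linear projection followed by a point-wise non-linearity) can be sketched in a few lines. This is only an illustration, not the authors' distributed implementation: the layer sizes, the random weights, and the helper names `layer`, `relu`, and `logistic` are assumptions chosen for the example.

```python
import numpy as np

def logistic(x):
    # Logistic (sigmoid) non-linearity: output lies strictly in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: linear for positive input, zero otherwise.
    return np.maximum(x, 0.0)

def layer(x, W, b, nonlinearity):
    # Linear projection followed by a point-wise non-linearity.
    return nonlinearity(x @ W + b)

# Hypothetical toy dimensions: 4 input frames, 8 features, 16 hidden units.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 16)) * 0.1
b = np.zeros(16)

h_relu = layer(x, W, b, relu)
h_logistic = layer(x, W, b, logistic)

# Fraction of exactly-zero activations: ReLU outputs are sparse,
# logistic outputs never reach zero.
sparsity = float(np.mean(h_relu == 0.0))
```

Swapping the non-linearity is a one-argument change here, which matches the abstract's point that the network topology stays the same while only the unit type differs.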

664 citations

Journal Article•10.1371/JOURNAL.PBIO.1001752•

Speech rhythms and multiplexed oscillatory sensory coding in the human brain.

Joachim Gross¹, Nienke Hoogenboom², Gregor Thut¹, Philippe G. Schyns¹, Stefano Panzeri³, Pascal Belin¹, Simon Garrod¹•Institutions (3)

University of Glasgow¹, University of Düsseldorf², Istituto Italiano di Tecnologia³

31 Dec 2013-PLOS Biology

TL;DR: A neuroimaging study reveals how coupled brain oscillations at different frequencies align with quasi-rhythmic features of continuous speech such as prosody, syllables, and phonemes.

Abstract: Cortical oscillations are likely candidates for segmentation and coding of continuous speech. Here, we monitored continuous speech processing with magnetoencephalography (MEG) to unravel the principles of speech segmentation and coding. We demonstrate that speech entrains the phase of low-frequency (delta, theta) and the amplitude of high-frequency (gamma) oscillations in the auditory cortex. Phase entrainment is stronger in the right and amplitude entrainment is stronger in the left auditory cortex. Furthermore, edges in the speech envelope phase reset auditory cortex oscillations thereby enhancing their entrainment to speech. This mechanism adapts to the changing physical features of the speech envelope and enables efficient, stimulus-specific speech sampling. Finally, we show that within the auditory cortex, coupling between delta, theta, and gamma oscillations increases following speech edges. Importantly, all couplings (i.e., brain-speech and also within the cortex) attenuate for backward-presented speech, suggesting top-down control. We conclude that segmentation and coding of speech relies on a nested hierarchy of entrained cortical oscillations.

656 citations

Journal Article•10.1109/JBHI.2013.2245674•

Collection and Analysis of a Parkinson Speech Dataset With Multiple Types of Sound Recordings

Betul Erdogdu Sakar¹, E. M. Isenkul², Cemal Okan Sakar¹, Ahmet Sertbas², Fikret Gürgen³, Sakir Delil², Hulya Apaydin², Olcay Kursun²•Institutions (3)

Bahçeşehir University¹, Istanbul University², Boğaziçi University³

06 Feb 2013-IEEE Journal of Biomedical and Health Informatics

TL;DR: Investigating the Parkinson dataset using well-known machine learning tools, sustained vowels are found to carry more PD-discriminative information and representing the samples of a subject with central tendency and dispersion metrics improves generalization of the predictive model.

Abstract: There has been an increased interest in speech pattern analysis applications of Parkinsonism for building predictive telediagnosis and telemonitoring models. For this purpose, we have collected a wide variety of voice samples, including sustained vowels, words, and sentences compiled from a set of speaking exercises for people with Parkinson's disease. There are two main issues in learning from such a dataset that consists of multiple speech recordings per subject: 1) How predictive these various types, e.g., sustained vowels versus words, of voice samples are in Parkinson's disease (PD) diagnosis? 2) How well the central tendency and dispersion metrics serve as representatives of all sample recordings of a subject? In this paper, investigating our Parkinson dataset using well-known machine learning tools, as reported in the literature, sustained vowels are found to carry more PD-discriminative information. We have also found that rather than using each voice recording of each subject as an independent data sample, representing the samples of a subject with central tendency and dispersion metrics improves generalization of the predictive model.
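
The idea of representing each subject by central tendency and dispersion metrics, rather than treating every recording as an independent sample, can be sketched as follows. The abstract does not say which metrics are used, so the mean and standard deviation here are assumptions, as are the function name `summarize_per_subject` and the toy feature values.

```python
import numpy as np

def summarize_per_subject(features, subject_ids):
    # Collapse all recordings of a subject into a single row:
    # [per-feature mean, per-feature standard deviation].
    # Mean and std are assumed stand-ins for "central tendency and dispersion".
    features = np.asarray(features, dtype=float)
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    rows = []
    for s in subjects:
        block = features[subject_ids == s]
        rows.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return subjects, np.vstack(rows)

# Hypothetical data: two subjects, two recordings each, two features per recording.
features = [[1.0, 10.0], [3.0, 14.0], [2.0, 20.0], [4.0, 20.0]]
subject_ids = [0, 0, 1, 1]
subjects, summary = summarize_per_subject(features, subject_ids)
# summary is [[2, 12, 1, 2], [3, 20, 1, 0]]: one row per subject.
```

With one row per subject, a train/test split can no longer place recordings of the same person on both sides, which is one plausible reason such a representation generalizes better.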

645 citations

Journal Article•10.1109/TASL.2013.2250961•

Towards Scaling Up Classification-Based Speech Separation

Yuxuan Wang¹, DeLiang Wang¹•Institutions (1)

Ohio State University¹

01 Jul 2013-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This work proposes to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs.

Abstract: Formulating speech separation as a binary classification problem has been shown to be effective. While good separation performance is achieved in matched test conditions using kernel support vector machines (SVMs), separation in unmatched conditions involving new speakers and environments remains a big challenge. A simple yet effective method to cope with the mismatch is to include many different acoustic conditions into the training set. However, large-scale training is almost intractable for kernel machines due to computational complexity. To enable training on relatively large datasets, we propose to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs. For feature learning, we employ standard pre-trained deep neural networks (DNNs). The proposed DNN-SVM system is trained on a variety of acoustic conditions within a reasonable amount of time. Experiments on various test mixtures demonstrate good generalization to unseen speakers and background noises.

538 citations

Journal Article•10.1109/JPROC.2013.2251852•

Speech Synthesis Based on Hidden Markov Models

Keiichi Tokuda¹, Yoshihiko Nankaku¹, Tomoki Toda², Heiga Zen¹, Junichi Yamagishi³, Keiichiro Oura¹•Institutions (3)

Nagoya Institute of Technology¹, Nara Institute of Science and Technology², University of Edinburgh³

9 Apr 2013

TL;DR: This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech.

Abstract: This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described.

512 citations

Proceedings Article•10.1109/WASPAA.2013.6701894•

The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech

Keisuke Kinoshita¹, Marc Delcroix¹, Takuya Yoshioka¹, Tomohiro Nakatani¹, Armin Sehr², Walter Kellermann³, Roland Maas³•Institutions (3)

University of Paderborn¹, Beuth University of Applied Sciences Berlin², University of Erlangen-Nuremberg³

1 Oct 2013

TL;DR: A common evaluation framework including datasets, tasks, and evaluation metrics for both speech enhancement and ASR techniques is proposed, which will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge.

Abstract: Recently, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques, and automatic speech recognition (ASR) techniques robust to reverberation. To evaluate state-of-the-art algorithms and obtain new insights regarding potential future research directions, we propose a common evaluation framework including datasets, tasks, and evaluation metrics for both speech enhancement and ASR techniques. The proposed framework will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. This paper describes the rationale behind the challenge, and provides a detailed description of the evaluation framework and benchmark results.

474 citations

Journal Article•10.1523/JNEUROSCI.5297-12.2013•

Adaptive Temporal Encoding Leads to a Background-Insensitive Cortical Representation of Speech

Nai Ding¹, Jonathan Z. Simon¹•Institutions (1)

University of Maryland, College Park¹

27 Mar 2013-The Journal of Neuroscience

TL;DR: The results suggest that, in a complex listening environment, auditory cortex can selectively encode a speech stream in a background insensitive manner, and this stable neural representation of speech provides a plausible basis for background-invariant recognition of speech.

Abstract: Speech recognition is remarkably robust to the listening background, even when the energy of background sounds strongly overlaps with that of speech. How the brain transforms the corrupted acoustic signal into a reliable neural representation suitable for speech recognition, however, remains elusive. Here, we hypothesize that this transformation is performed at the level of auditory cortex through adaptive neural encoding, and we test the hypothesis by recording, using MEG, the neural responses of human subjects listening to a narrated story. Spectrally matched stationary noise, which has maximal acoustic overlap with the speech, is mixed in at various intensity levels. Despite the severe acoustic interference caused by this noise, it is here demonstrated that low-frequency auditory cortical activity is reliably synchronized to the slow temporal modulations of speech, even when the noise is twice as strong as the speech. Such a reliable neural representation is maintained by intensity contrast gain control and by adaptive processing of temporal modulations at different time scales, corresponding to the neural δ and θ bands. Critically, the precision of this neural synchronization predicts how well a listener can recognize speech in noise, indicating that the precision of the auditory cortical representation limits the performance of speech recognition in noise. Together, these results suggest that, in a complex listening environment, auditory cortex can selectively encode a speech stream in a background insensitive manner, and this stable neural representation of speech provides a plausible basis for background-invariant recognition of speech.

381 citations

Journal Article•10.1109/JPROC.2012.2236291•

Behavioral Signal Processing: Deriving Human Behavioral Informatics From Speech and Language

Shrikanth S. Narayanan¹, Panayiotis G. Georgiou¹•Institutions (1)

University of Southern California¹

7 Feb 2013

TL;DR: Behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion are illustrated.

Abstract: The expression and experience of human behavior are complex and multimodal and characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer an important means for measuring and modeling human behavior. Observational research and practice across a variety of domains from commerce to healthcare rely on speech- and language-based informatics for crucial assessment and diagnostic information and for planning and tracking response to an intervention. In this paper, we describe some of the opportunities as well as emerging methodologies and applications of human behavioral signal processing (BSP) technology and algorithms for quantitatively understanding and modeling typical, atypical, and distressed human behavior with a specific focus on speech- and language-based communicative, affective, and social behavior. We describe the three important BSP components of acquiring behavioral data in an ecologically valid manner across laboratory to real-world settings, extracting and analyzing behavioral cues from measured data, and developing models offering predictive and decision-making support. We highlight both the foundational speech and language processing building blocks as well as the novel processing and modeling opportunities. Using examples drawn from specific real-world applications ranging from literacy assessment and autism diagnostics to psychotherapy for addiction and marital well being, we illustrate behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion.

324 citations

Patent•

System and method for improving speech recognition accuracy in a work environment

David R. DiGregorio

14 Mar 2013

TL;DR: In this article, a microprocessor or other application specific integrated circuit provides a mechanism for comparing the relative transit times between a user's voice, a primary speech microphone, and a secondary compliance microphone to determine if the speech microphone is placed in an appropriate proximity to the user's mouth.

Abstract: Apparatus and method that improves speech recognition accuracy, by monitoring the position of a user's headset-mounted speech microphone, and prompting the user to reconfigure the speech microphone's orientation if required. A microprocessor or other application specific integrated circuit provides a mechanism for comparing the relative transit times between a user's voice, a primary speech microphone, and a secondary compliance microphone. The difference in transit times may be used to determine if the speech microphone is placed in an appropriate proximity to the user's mouth. If required, the user is automatically prompted to reposition the speech microphone.

308 citations

Proceedings Article•10.1109/ICASSP.2013.6637694•

Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies

Florian Eyben, Felix Weninger, Stefano Squartini, Björn Schuller

26 May 2013

TL;DR: A novel, data-driven approach to voice activity detection, based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features, clearly outperforms three state-of-the-art reference algorithms under the same conditions.

Abstract: A novel, data-driven approach to voice activity detection is presented. The approach is based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features. To approximate real-life scenarios, large amounts of noisy speech instances are mixed by using both read and spontaneous speech from the TIMIT and Buckeye corpora, and adding real long term recordings of diverse noise types. The approach is evaluated on unseen synthetically mixed test data as well as a real-life test set consisting of four full-length Hollywood movies. A frame-wise Equal Error Rate (EER) of 33.2% is obtained for the four movies and an EER of 9.6% is obtained for the synthetic test data at a peak SNR of 0 dB, clearly outperforming three state-of-the-art reference algorithms under the same conditions.
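
The frame-wise Equal Error Rate (EER) reported above is the operating point at which the false alarm rate equals the miss rate. A minimal sketch of how such a figure could be computed from per-frame detector scores is given below. It is not the evaluation code used in the paper: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    # scores: per-frame speech scores (higher means "more likely speech").
    # labels: per-frame ground truth (True/1 = speech, False/0 = non-speech).
    # Assumes both classes are present in `labels`.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, None
    for t in np.unique(scores):
        decisions = scores >= t
        false_alarm = np.mean(decisions[~labels])  # non-speech marked as speech
        miss = np.mean(~decisions[labels])         # speech marked as non-speech
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            # Take the midpoint at the threshold where the two rates are closest.
            best_gap, eer = gap, (false_alarm + miss) / 2.0
    return eer

# Hypothetical toy frames.
eer_separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
eer_overlapping = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
```

Because only the observed scores are tried as thresholds, the result is an approximation on small test sets; interpolating between neighbouring thresholds is a common refinement.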

274 citations

Journal Article•10.1121/1.4789864•

Accent-independent adaptation to foreign accented speech

Melissa M. Baese-Berk¹, Ann R. Bradlow, Beverly A. Wright•Institutions (1)

Michigan State University¹

04 Feb 2013-Journal of the Acoustical Society of America

TL;DR: These findings suggest that generalization of foreign-accent adaptation is the result of exposure to systematic variability in accented speech that is similar across talkers from multiple language backgrounds.

Abstract: Foreign-accented speech can be difficult to understand but listeners can adapt to novel talkers and accents with appropriate experience. Previous studies have demonstrated talker-independent but accent-dependent learning after training on multiple talkers from a single language background. Here, listeners instead were exposed to talkers from five language backgrounds during training. After training, listeners generalized their learning to novel talkers from language backgrounds both included and not included in the training set. These findings suggest that generalization of foreign-accent adaptation is the result of exposure to systematic variability in accented speech that is similar across talkers from multiple language backgrounds.

Patent•

Systems, methods, apparatus, and computer-readable media for generating obfuscated speech signal

Dipanjan Sen¹•Institutions (1)

Qualcomm¹

28 Feb 2013

TL;DR: In this article, a system may be used to drive an array of loudspeakers to produce a sound field that includes a source component, whose energy is concentrated along a first direction relative to the array, and a masking component that is based on an estimated intensity of the source component in a second direction that is different from the first direction.

Abstract: Arrangements are described that may be used to reduce the intelligibility of speech using masker signals which are obfuscated yet correlated versions of the speech. Other applications of pitch analysis and demodulation are also described. A system may be used to drive an array of loudspeakers to produce a sound field that includes a source component, whose energy is concentrated along a first direction relative to the array, and a masking component that is based on an estimated intensity of the source component in a second direction that is different from the first direction.

Patent•10.1121/1.2040293•

Method and apparatus for the provision of information signals based upon speech recognition

Ira A. Gerson¹•Institutions (1)

BlackBerry Limited¹

10 May 2013-Journal of the Acoustical Society of America

TL;DR: In this paper, a wireless system comprises at least one subscriber unit in wireless communication with an infrastructure, and each subscriber unit implements a speech recognition client, and the infrastructure comprises a Speech Recognition Server.

Abstract: A wireless system comprises at least one subscriber unit in wireless communication with an infrastructure. Each subscriber unit implements a speech recognition client, and the infrastructure comprises a speech recognition server. A given subscriber unit takes as input an unencoded speech signal that is subsequently parameterized by the speech recognition client. The parameterized speech is then provided to the speech recognition server that, in turn, performs speech recognition analysis on the parameterized speech. Information signals, based in part upon any recognized utterances identified by the speech recognition analysis, are subsequently provided to the subscriber unit. The information signals may be used to control the subscriber unit itself; to control one or more devices coupled to the subscriber unit, or may be operated upon by the subscriber unit or devices coupled thereto.

Book•

DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State of the Art

Richard C. Hendriks¹, Timo Gerkmann², Jesper Jensen³•Institutions (3)

Delft University of Technology¹, University of Oldenburg², Aalborg University³

19 Feb 2013

TL;DR: This survey wishes to demonstrate the significant advances that have been made during the last decade in the field of discrete Fourier transform domain-based single-channel noise reduction for speech enhancement.

Abstract: As speech processing devices like mobile phones, voice controlled devices, and hearing aids have increased in popularity, people expect them to work anywhere and at any time without user intervention. However, the presence of acoustical disturbances limits the use of these applications, degrades their performance, or causes the user difficulties in understanding the conversation or appreciating the device. A common way to reduce the effects of such disturbances is through the use of single-microphone noise reduction algorithms for speech enhancement. The field of single-microphone noise reduction for speech enhancement comprises a history of more than 30 years of research. In this survey, we wish to demonstrate the significant advances that have been made during the last decade in the field of discrete Fourier transform domain-based single-channel noise reduction for speech enhancement. Furthermore, our goal is to provide a concise description of a state-of-the-art speech enhancement system, and demonstrate the relative importance of the various building blocks of such a system. This allows the non-expert DSP practitioner to judge the relevance of each building block and to implement a close-to-optimal enhancement system for the particular application at hand. Table of Contents: Introduction / Single Channel Speech Enhancement: General Principles / DFT-Based Speech Enhancement Methods: Signal Model and Notation / Speech DFT Estimators / Speech Presence Probability Estimation / Noise PSD Estimation / Speech PSD Estimation / Performance Evaluation Methods / Simulation Experiments with Single-Channel Enhancement Systems / Future Directions

Journal Article•10.1162/JOCN_A_00381•

The effect of imagination on stimulation: The functional specificity of efference copies in speech processing

Xing Tian¹, David Poeppel¹•Institutions (1)

New York University¹

01 Jul 2013-Journal of Cognitive Neuroscience

TL;DR: It is shown, in the context of a dual-pathway model, that internal simulation shapes perception in a context-dependent manner.

Abstract: The computational role of efference copies is widely appreciated in action and perception research, but their properties for speech processing remain murky. We tested the functional specificity of auditory efference copies using magnetoencephalography recordings in an unconventional pairing: We used a classical cognitive manipulation (mental imagery, to elicit internal simulation and estimation) with a well-established experimental paradigm (one shot repetition, to assess neuronal specificity). Participants performed tasks that differentially implicated internal prediction of sensory consequences (overt speaking, imagined speaking, and imagined hearing), and their modulatory effects on the perception of an auditory (syllable) probe were assessed. Remarkably, the neural responses to overt syllable probes vary systematically, both in terms of directionality (suppression, enhancement) and temporal dynamics (early, late), as a function of the preceding covert mental imagery adaptor. We show, in the context of a dual-pathway model, that internal simulation shapes perception in a context-dependent manner.

Journal Article•10.1109/TASL.2013.2255276•

Functional Link Adaptive Filters for Nonlinear Acoustic Echo Cancellation

Danilo Comminiello¹, Michele Scarpiniti¹, Luis A. Azpicueta-Ruiz², Jeronimo Arenas-Garcia², Aurelio Uncini¹•Institutions (2)

Sapienza University of Rome¹, Carlos III Health Institute²

01 Jul 2013-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: Experimental results show the effectiveness of the proposed FLAF-based architectures in nonlinear AEC scenarios, thus offering an important solution to the modeling of nonlinear acoustic channels.

Abstract: This paper introduces a new class of nonlinear adaptive filters, whose structure is based on the Hammerstein model. Such filters derive from the functional link adaptive filter (FLAF) model, defined by a nonlinear input expansion, which enhances the representation of the input signal through a projection in a higher dimensional space, and a subsequent adaptive filtering. In particular, two robust FLAF-based architectures are proposed and designed ad hoc to tackle nonlinearities in acoustic echo cancellation (AEC). The simplest architecture is the split FLAF, which separates the adaptation of linear and nonlinear elements using two different adaptive filters in parallel. In this way, the architecture can handle the linear and the nonlinear modeling separately, each as well as possible. Moreover, in order to give robustness against different degrees of nonlinearity, a collaborative FLAF is proposed based on the adaptive combination of filters. Such an architecture achieves the best performance regardless of the degree of nonlinearity in the echo path. Experimental results show the effectiveness of the proposed FLAF-based architectures in nonlinear AEC scenarios, thus offering an important solution to the modeling of nonlinear acoustic channels.

Book•10.4324/9780203775745•

Cognitive models of speech processing : the Second Sperlonga Meeting

Gerry T. M. Altmann, Richard Shillcock

24 May 2013

TL;DR: This book discusses how word recognition may evolve from infant speech perception capacities, and issues of process and representation in lexical access.

Abstract: Overview, Shillcock, Altmann. Introduction to the Chapters by Werker and Jusczyk, Clifton. How Word Recognition may Evolve from Infant Speech Perception Capacities, Jusczyk. Developmental Changes in Cross-language Speech Perception: Implications for Cognitive Models of Speech Processing, Werker. The Time Course of Prelexical Processing: The Syllabic Hypothesis Revisited, Dupoux. Language-specific Processing: Does the Evidence Converge? Cutler. Representation and Access of Derived Words in English, Tyler, Waksler, Marslen-Wilson. What Determines Morphological Relatedness in the Lexicon? Comments on the Chapter by Tyler, Waksler, and Marslen-Wilson, Burani. Modularity and the Processing of Closed-class Words, Shillcock, Gurman Bard. Issues of Process and Representation in Lexical Access, Marslen-Wilson. Bottom-up Connectionist Models of 'Interaction', Norris. Competitor Effects During Lexical Access: Chasing Zipf's Tail, Bard, Shillcock. Connections, Competitions, and Cohorts: Comments on the Chapters by Marslen-Wilson, Norris, and Bard & Shillcock, Tabossi. More on Combinatory Lexical Information: Thematic Structure in Parsing and Interpretation, Tanenhaus et al. Reconsidering Reactivation, Nicol.

Proceedings Article•10.1109/TAEECE.2013.6557272•

Short-time energy, magnitude, zero crossing rate and autocorrelation measurement for discriminating voiced and unvoiced segments of speech signals

Madiha Jalil¹, Faran Awais Butt¹, A Malik¹•Institutions (1)

University of Engineering and Technology, Lahore¹

9 May 2013

TL;DR: Different methods of separating voiced and unvoiced segments of speech signals are presented, based on short-time energy, short-time magnitude, and zero crossing rate calculations, and on the autocorrelation of different segments, which shows that voiced segments of speech remain periodic after the autocorrelation function is applied.

Abstract: This paper presents different methods of separating voiced and unvoiced segments of speech signals. These methods are based on short-time energy, short-time magnitude, and zero crossing rate (ZCR) calculations, and on the autocorrelation of different segments of speech signals. From theoretical studies, it has been observed that energy and magnitude are high for voiced segments, whereas the ZCR is low. The autocorrelation function is used here to show that voiced segments of speech remain periodic after autocorrelation is applied, while unvoiced segments lose their periodicity. Experimental results are presented to verify the theoretical studies.
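
The four measurements named in this abstract are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the paper's implementation: the frame length, hop size, 8 kHz sampling rate, and the synthetic stand-ins (a 100 Hz sine for a voiced segment, white noise for an unvoiced one) are all chosen for the example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    # Split a 1-D signal into overlapping frames (trailing samples are dropped).
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_features(x, frame_len=400, hop=200):
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    energy = np.sum(frames ** 2, axis=1)        # short-time energy
    magnitude = np.sum(np.abs(frames), axis=1)  # short-time magnitude
    signs = np.sign(frames)
    signs[signs == 0] = 1                       # treat exact zeros as positive
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)  # zero crossing rate
    return energy, magnitude, zcr

def autocorrelation(frame):
    # Normalized autocorrelation for lags 0..len(frame)-1.
    # Assumes the frame is not constant (otherwise r[0] would be zero).
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    return r / r[0]

# Synthetic stand-ins, assuming an 8 kHz sampling rate.
fs = 8000
t = np.arange(800) / fs
voiced = np.sin(2 * np.pi * 100 * t)           # periodic, pitch period of 80 samples
rng = np.random.default_rng(0)
unvoiced = 0.1 * rng.standard_normal(800)      # noise-like, no periodicity

e_v, m_v, z_v = short_time_features(voiced)
e_u, m_u, z_u = short_time_features(unvoiced)
ac_v = autocorrelation(voiced[:400])
ac_u = autocorrelation(unvoiced[:400])
```

On these signals the voiced frames show higher energy and magnitude, a lower zero crossing rate, and a clear autocorrelation peak at the pitch period (lag 80), while the noise frames show no comparable peak. Real speech is less clear-cut, so any decision thresholds would need to be tuned on data.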

Proceedings Article•10.1109/ICASSP.2013.6639067•

Synthetic speech detection using temporal modulation feature

Zhizheng Wu¹, Xiong Xiao¹, Eng Siong Chng¹, Haizhou Li¹•Institutions (1)

Nanyang Technological University¹

26 May 2013

TL;DR: From the synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features, and the best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%.

Abstract: Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the fact that current analysis-synthesis techniques operate at the frame level and make the frame-by-frame independence assumption, we propose to adopt magnitude/phase modulation features to distinguish synthetic speech from human speech. Modulation features derived from the magnitude/phase spectrum carry long-term temporal information of speech, and may be able to detect temporal artifacts caused by the frame-by-frame processing in the synthesis of the speech signal. From our synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features. The best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%, which is significantly lower than the 1.25% of phase features and 10.98% of MFCC features.

Journal Article•10.1371/JOURNAL.PONE.0053398•

The tracking of speech envelope in the human cortex.

Jan Kubanek¹, Peter Brunner²,³, Aysegul Gunduz²,³, David Poeppel⁴, Gerwin Schalk²,³•Institutions (4)

Washington University in St. Louis¹, New York State Department of Health², Albany Medical College³, New York University⁴

10 Jan 2013-PLOS ONE

TL;DR: The data provide the first direct electrophysiological evidence that the envelope of speech is robustly tracked in non-primary auditory cortex (belt areas in particular), and suggest that the considered higher-order regions (STG and Broca's region) partake in a more abstract linguistic analysis.

Abstract: Humans are highly adept at processing speech. Recently, it has been shown that slow temporal information in speech (i.e., the envelope of speech) is critical for speech comprehension. Furthermore, it has been found that evoked electric potentials in human cortex are correlated with the speech envelope. However, it has been unclear whether this essential linguistic feature is encoded differentially in specific regions, or whether it is represented throughout the auditory system. To answer this question, we recorded neural data with high temporal resolution directly from the cortex while human subjects listened to a spoken story. We found that the gamma activity in human auditory cortex robustly tracks the speech envelope. The effect is so marked that it is observed during a single presentation of the spoken story to each subject. The effect is stronger in regions situated relatively early in the auditory pathway (belt areas) compared to other regions involved in speech processing, including the superior temporal gyrus (STG) and the posterior inferior frontal gyrus (Broca's region). To further distinguish whether the speech envelope is encoded in the auditory system as a phonological (speech-related) feature, or instead as a more general acoustic feature, we also probed the auditory system with a melodic stimulus. We found that belt areas track the melody envelope weakly, and that they are the only region considered that does so. Together, our data provide the first direct electrophysiological evidence that the envelope of speech is robustly tracked in non-primary auditory cortex (belt areas in particular), and suggest that the considered higher-order regions (STG and Broca's region) partake in a more abstract linguistic analysis.
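
The notion of a signal "tracking" the speech envelope can be made concrete with a small sketch: compute a slow amplitude envelope by rectifying and smoothing a waveform, then correlate it with a second time series. This is a generic illustration under stated assumptions, not the analysis pipeline of the study; the helper names `envelope` and `pearson`, the moving-average smoother, and the window length are all hypothetical choices.

```python
import math


def envelope(signal, win):
    """Rectify the signal and smooth it with a trailing moving average.

    `win` is the smoothing window in samples (an assumed parameter); the
    first few outputs average over however many samples are available.
    """
    rect = [abs(x) for x in signal]
    out, acc = [], 0.0
    for i, x in enumerate(rect):
        acc += x
        if i >= win:
            acc -= rect[i - win]  # drop the sample that left the window
        out.append(acc / min(i + 1, win))
    return out


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy example: a second series that is a scaled copy of the envelope
# correlates perfectly with it.
speech = [0.0, 0.5, -1.0, 2.0, -0.5, 0.1]
env = envelope(speech, win=2)
print(pearson(env, [2 * v for v in env]))
```

In practice the second series would be a measure of neural activity sampled at the same rate as the envelope, and `pearson` is undefined when either series is constant.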

Journal Article•10.1093/BRAIN/AWT274•

Visual activity predicts auditory recovery from deafness after adult cochlear implantation

Kuzma Strelnikov¹,², Julien Rouger¹,², Jean-François Démonet³, Sebastien Lagleyre, Bernard Fraysse, Olivier Deguine¹,², Pascal Barone¹,²•Institutions (3)

Centre national de la recherche scientifique¹, Paul Sabatier University², University Hospital of Lausanne³

01 Dec 2013-Brain

TL;DR: The link demonstrated between visual activity and auditory speech perception indicates that visuoauditory synergy is crucial for cross-modal plasticity and fostering speech-comprehension recovery in adult cochlear-implanted deaf patients.

Abstract: Modern cochlear implantation technologies allow deaf patients to understand auditory speech; however, the implants deliver only a coarse auditory input and patients must use long-term adaptive processes to achieve coherent percepts. In adults with post-lingual deafness, substantial progress in speech recovery is observed during the first year after cochlear implantation, but there is a large range of variability in the level of cochlear implant outcomes and the temporal evolution of recovery. It has been proposed that when profoundly deaf subjects receive a cochlear implant, the visual cross-modal reorganization of the brain is deleterious for auditory speech recovery. We tested this hypothesis in post-lingually deaf adults by analysing whether brain activity shortly after implantation correlated with the level of auditory recovery 6 months later. Based on brain activity induced by a speech-processing task, we found strong positive correlations in areas outside the auditory cortex. The highest positive correlations were found in the occipital cortex involved in visual processing, as well as in the posterior-temporal cortex known for audio-visual integration. The other area that correlated positively with auditory speech recovery was localized in the left inferior frontal area known for speech processing. Our results demonstrate that the visual modality's functional level is related to the proficiency level of auditory recovery. Based on the positive correlation of visual activity with auditory speech recovery, we suggest that the visual modality may facilitate the perception of the word's auditory counterpart in communicative situations. The link demonstrated between visual activity and auditory speech perception indicates that visuoauditory synergy is crucial for cross-modal plasticity and fostering speech-comprehension recovery in adult cochlear-implanted deaf patients.

Proceedings Article•10.21437/INTERSPEECH.2013-203•

Speech activity detection on YouTube using deep neural networks.

Neville Ryant¹, Mark Liberman¹, Jiahong Yuan¹•Institutions (1)

University of Pennsylvania¹

25 Aug 2013

TL;DR: It is demonstrated that a DNN with input consisting of multiple frames of mel frequency cepstral coefficients (MFCCs) yields drastically lower frame-wise error rates on YouTube videos compared to a conventional GMM-based system.

Abstract: Speech activity detection (SAD) is an important first step in speech processing. Commonly used methods (e.g., frame-level classification using Gaussian mixture models (GMMs)) work well under stationary noise conditions, but do not generalize well to domains such as YouTube, where videos may exhibit a diverse range of environmental conditions. One solution is to augment the conventional cepstral features with additional, hand-engineered features (e.g., spectral flux, spectral centroid, multiband spectral entropies) which are robust to changes in environment and recording condition. An alternative approach, explored here, is to learn robust features during the course of training using an appropriate architecture such as deep neural networks (DNNs). In this paper we demonstrate that a DNN with input consisting of multiple frames of mel frequency cepstral coefficients (MFCCs) yields drastically lower frame-wise error rates (19.6%) on YouTube videos compared to a conventional GMM-based system (40%).
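
A DNN input "consisting of multiple frames" of MFCCs is commonly built by concatenating each frame with its neighbours. The sketch below shows that stacking step in isolation. It is a minimal illustration, not the paper's feature pipeline: the function name `stack_frames`, the context sizes, and the choice to repeat the first and last frame at the edges are assumptions, since the abstract does not specify them.

```python
def stack_frames(frames, left=2, right=2):
    """Concatenate each feature frame with its neighbouring frames.

    `frames` is a list of equal-length feature vectors (e.g. MFCCs per
    frame). `left` and `right` are assumed context sizes. At the edges the
    first or last frame is repeated so every output has the same length.
    """
    n = len(frames)
    stacked = []
    for i in range(n):
        window = []
        for j in range(i - left, i + right + 1):
            k = min(max(j, 0), n - 1)  # clamp the index to the valid range
            window.extend(frames[k])
        stacked.append(window)
    return stacked


# Toy one-dimensional "features", with one frame of context on each side.
print(stack_frames([[1], [2], [3]], left=1, right=1))
# -> [[1, 1, 2], [1, 2, 3], [2, 3, 3]]
```

Each stacked vector has `(left + 1 + right)` times the original feature dimension, which is what a frame-level classifier would then consume.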

Proceedings Article•10.1109/ICASSP.2013.6639248•

GlobalPhone: A multilingual text & speech database in 20 languages

Tanja Schultz¹, Ngoc Thang Vu¹, Tim Schlippe¹•Institutions (1)

Karlsruhe Institute of Technology¹

26 May 2013

TL;DR: The advances in the multilingual text and speech database GlobalPhone, a multilingual database of high-quality read speech with corresponding transcriptions and pronunciation dictionaries in 20 languages, are described.

Abstract: This paper describes the advances in the multilingual text and speech database GlobalPhone, a multilingual database of high-quality read speech with corresponding transcriptions and pronunciation dictionaries in 20 languages. GlobalPhone was designed to be uniform across languages with respect to the amount of data, speech quality, the collection scenario, the transcription and phone set conventions. With more than 400 hours of transcribed audio data from more than 2000 native speakers GlobalPhone supplies an excellent basis for research in the areas of multilingual speech recognition, rapid deployment of speech processing systems to yet unsupported languages, language identification tasks, speaker recognition in multiple languages, multilingual speech synthesis, as well as monolingual speech recognition in a large variety of languages.

Patent•

Speech recognition wake-up of a handheld portable electronic device

Aram Lindahl¹•Institutions (1)

Apple Inc.¹

11 Oct 2013

TL;DR: A system and method for parallel speech recognition processing of multiple audio signals produced by multiple microphones in a handheld portable electronic device are described, in which a primary processor transitions to a power-saving mode while an auxiliary processor remains active.

Abstract: A system and method for parallel speech recognition processing of multiple audio signals produced by multiple microphones in a handheld portable electronic device. In one embodiment, a primary processor transitions to a power-saving mode while an auxiliary processor remains active. The auxiliary processor then monitors the speech of a user of the device to detect a wake-up command by speech recognition processing the audio signals in parallel. When the auxiliary processor detects the command it then signals the primary processor to transition to active mode. The auxiliary processor may also identify to the primary processor which microphone resulted in the command being recognized with the highest confidence. Other embodiments are also described.

Posted Content•

Techniques for Feature Extraction In Speech Recognition System : A Comparative Study

Urmila Shrawankar, Vilas M. Thakare

06 May 2013-arXiv: Sound

TL;DR: Some of the most used methods for reducing the information of each segment in the audio signal into a relatively small number of parameters, or features, are presented.

Abstract: The time domain waveform of a speech signal carries all of the auditory information. From the phonological point of view, little can be said on the basis of the waveform itself. However, past research in mathematics, acoustics, and speech technology has provided many methods for converting the data into a form that can be considered information if interpreted correctly. In order to find some statistically relevant information from incoming data, it is important to have mechanisms for reducing the information of each segment in the audio signal into a relatively small number of parameters, or features. These features should describe each segment in such a characteristic way that other similar segments can be grouped together by comparing their features. There are a great many interesting and exceptional ways to describe the speech signal in terms of parameters. Although they all have their strengths and weaknesses, we present some of the most used methods and their importance.

Journal Article•10.3389/FNSYS.2013.00116•

Upregulation of cognitive control networks in older adults' speech comprehension.

Julia Erb¹, Jonas Obleser¹•Institutions (1)

Max Planck Society¹

24 Dec 2013-Frontiers in Systems Neuroscience

TL;DR: The results indicate that older adults increasingly recruit cognitive control networks, even under optimal listening conditions, at the expense of these systems’ dynamic range.

Abstract: Speech comprehension abilities decline with age and with age-related hearing loss, but it is unclear how this decline is expressed in terms of central neural mechanisms. The current study examined neural speech processing in a group of older adults (aged 56–77, n = 16, with varying degrees of sensorineural hearing loss), and compared them to a cohort of young adults (aged 22–31, n = 30, self-reported normal hearing). In a functional MRI experiment, listeners heard and repeated back degraded sentences (4-band vocoded, where the temporal envelope of the acoustic signal is preserved, while the spectral information is substantially degraded). Behaviorally, older adults adapted to degraded speech at the same rate as young listeners, although their overall comprehension of degraded speech was lower. Neurally, both older and young adults relied on the left anterior insula for degraded more than clear speech perception. However, anterior insula engagement in older adults was dependent on hearing acuity. Young adults additionally employed the anterior cingulate cortex (ACC). Interestingly, this age group × degradation interaction was driven by a reduced dynamic range in older adults, who displayed elevated levels of ACC activity for both degraded and clear speech, consistent with a persistent upregulation in cognitive control irrespective of task difficulty. For correct speech comprehension, older adults relied on the middle frontal gyrus in addition to a core speech comprehension network recruited by younger adults, suggestive of a compensatory mechanism. Taken together, the results indicate that older adults increasingly recruit cognitive control networks, even under optimal listening conditions, at the expense of these systems’ dynamic range.

Journal Article•10.1109/MSP.2013.2265914•

Time-Frequency Processing of Nonstationary Signals: Advanced TFD Design to Aid Diagnosis with Highlights from Medical Applications

Boualem Boashash¹, Ghasem Azemi², John M. O'Toole¹•Institutions (2)

University of Queensland¹, Queensland University of Technology²

16 Oct 2013-IEEE Signal Processing Magazine

TL;DR: This article presents a methodical approach for improving quadratic time-frequency distribution (QTFD) methods by designing adapted time-frequency (T-F) kernels for diagnosis applications, with illustrations on three selected medical applications using the electroencephalogram (EEG), heart rate variability (HRV), and pathological speech signals.

Abstract: This article presents a methodical approach for improving quadratic time-frequency distribution (QTFD) methods by designing adapted time-frequency (T-F) kernels for diagnosis applications with illustrations on three selected medical applications using the electroencephalogram (EEG), heart rate variability (HRV), and pathological speech signals. Manual and visual inspection of such nonstationary multicomponent signals is laborious especially for long recordings, requiring skilled interpreters with possible subjective judgments and errors. Automated assessment is therefore preferred for objective diagnosis by using T-F distributions (TFDs) to extract more information. This requires designing advanced high-resolution TFDs for automating classification and interpretation. As QTFD methods are general and their coverage is very broad, this article concentrates on methodologies using only a few selected medical problems studied by the authors.

Book•

Crowdsourcing for Speech Processing: Applications to Data Collection, Transcription and Assessment

Maxine Eskenazi, Gina-Anne Levow, Helen Meng, Gabriel Parent, David Suendermann

15 Feb 2013

TL;DR: This introduction to crowdsourcing as a means of rapidly processing speech data offers speech researchers the hope that they can spend much less time dealing with the data gathering/annotation bottleneck, leaving them to focus on the scientific issues.

Abstract: Provides an insightful and practical introduction to crowdsourcing as a means of rapidly processing speech data. Intended for those who want to get started in the domain and learn how to set up a task, what interfaces are available, how to assess the work, etc., as well as for those who have already used crowdsourcing and want to create better tasks and obtain better assessments of the work of the crowd. It will include screenshots to show examples of good and poor interfaces; examples of case studies in speech processing tasks, going through the task creation process, reviewing options in the interface, in the choice of medium (MTurk or other) and explaining choices, etc. Addresses important aspects of this new technique that should be mastered before attempting a crowdsourcing application. Offers speech researchers the hope that they can spend much less time dealing with the data gathering/annotation bottleneck, leaving them to focus on the scientific issues. Readers will directly benefit from the book's successful examples of how crowdsourcing was implemented for speech processing, discussions of interface and processing choices that worked and choices that didn't, and guidelines on how to play and record speech over the internet, how to design tasks, and how to assess workers. Essential reading for researchers and practitioners in speech research groups involved in speech processing.

Journal Article•10.1016/J.HEARES.2013.02.004•

Improving speech perception in noise with current focusing in cochlear implant users

Arthi G. Srinivasan¹, Monica Padilla¹, Robert V. Shannon¹,², David M. Landsberger¹•Institutions (2)

House Ear Institute¹, University of Southern California²

01 May 2013-Hearing Research

TL;DR: A significant improvement in speech perception in noise with partial tripolar stimulation is shown, and all subjects benefited from the current-focused speech processing strategy.

Journal Article•10.1093/CERCOR/BHS003•

Speech versus Song: Multiple Pitch-Sensitive Areas Revealed by a Naturally Occurring Musical Illusion

Adam Tierney¹, Frederic Dick², Diana Deutsch³, Martin I. Sereno²•Institutions (3)

Northwestern University¹, Birkbeck, University of London², University of California, San Diego³

01 Feb 2013-Cerebral Cortex

TL;DR: It is found that a network of 8 regions, including the anterior superior temporal gyrus (STG) just anterior to Heschl's gyrus and the right midposterior STG, responds more strongly to speech perceived as song than to mere speech.

Abstract: It is normally obvious to listeners whether a human vocalization is intended to be heard as speech or song. However, the 2 signals are remarkably similar acoustically. A naturally occurring boundary case between speech and song has been discovered where a spoken phrase sounds as if it were sung when isolated and repeated. In the present study, an extensive search of audiobooks uncovered additional similar examples, which were contrasted with samples from the same corpus that do not sound like song, despite containing clear prosodic pitch contours. Using functional magnetic resonance imaging, we show that hearing these 2 closely matched stimuli is not associated with differences in response of early auditory areas. Rather, we find that a network of 8 regions, including the anterior superior temporal gyrus (STG) just anterior to Heschl’s gyrus and the right midposterior STG, responds more strongly to speech perceived as song than to mere speech. This network overlaps a number of areas previously associated with pitch extraction and song production, confirming that phrases originally intended to be heard as speech can, under certain circumstances, be heard as song. Our results suggest that song processing compared with speech processing makes increased demands on pitch processing and auditory-motor integration.
