Top 1069 papers published in the topic of Speech processing in 2010

Showing papers on "Speech processing" published in 2010

Book•

Handbook of Blind Source Separation: Independent Component Analysis and Applications

8 Mar 2010

TL;DR: This handbook provides the definitive reference on Blind Source Separation, giving a broad and comprehensive description of all the core principles and methods, numerical algorithms and major applications in the fields of telecommunications, biomedical engineering and audio, acoustic and speech processing.

Abstract: Edited by the people who were forerunners in creating the field, together with contributions from 34 leading international experts, this handbook provides the definitive reference on Blind Source Separation, giving a broad and comprehensive description of all the core principles and methods, numerical algorithms and major applications in the fields of telecommunications, biomedical engineering and audio, acoustic and speech processing. Going beyond a machine learning perspective, the book reflects recent results in signal processing and numerical analysis, and includes topics such as optimization criteria, mathematical tools, the design of numerical algorithms, convolutive mixtures, and time-frequency approaches. This Handbook is an ideal reference for university researchers, R&D [...] algebraic identification of under-determined mixtures, time-frequency methods, Bayesian approaches, blind identification under non-negativity approaches, semi-blind methods for communications. Shows the applications of the methods to key application areas such as telecommunications, biomedical engineering, speech, acoustic, audio and music processing, while also giving a general method for developing applications.

1,925 citations

Proceedings Article•10.1109/ICASSP.2010.5495701•

A short-time objective intelligibility measure for time-frequency weighted noisy speech

Cees H. Taal¹, Richard C. Hendriks¹, Richard Heusdens¹, Jesper Jensen•Institutions (1)

Delft University of Technology¹

14 Mar 2010

TL;DR: An objective intelligibility measure is presented, which shows high correlation (rho=0.95) with the intelligibility of both noisy, and TF-weighted noisy speech, and shows significantly better performance than three other, more sophisticated, objective measures.

Abstract: Existing objective speech-intelligibility measures are suitable for several types of degradation; however, they turn out to be less appropriate for methods where noisy speech is processed by a time-frequency (TF) weighting, e.g., noise reduction and speech separation. In this paper, we present an objective intelligibility measure that shows high correlation (rho=0.95) with the intelligibility of both noisy and TF-weighted noisy speech. The proposed method shows significantly better performance than three other, more sophisticated, objective measures. Furthermore, it is based on an intermediate intelligibility measure for short-time (approximately 400 ms) TF-regions, and uses a simple DFT-based TF-decomposition. In addition, a free Matlab implementation is provided.
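
The idea of an intermediate measure computed over short-time TF-regions can be illustrated with a small sketch. It assumes the intermediate measure is a plain correlation between clean and processed band envelopes over sliding segments; this is a simplification and not the authors' Matlab implementation. The function names, the per-band input layout, and the default segment length of 30 frames are all assumptions made for illustration.

```python
from math import sqrt


def correlation(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    denom = sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    # A constant segment has no defined correlation; treat it as 0 here.
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom > 0 else 0.0


def short_time_score(clean_env, proc_env, seg_len=30):
    """Average correlation over all bands and all sliding segments.

    clean_env, proc_env: one list of frame envelopes per frequency band
    (hypothetical inputs; a DFT-based TF-decomposition would produce them).
    seg_len: segment length in frames (assumed value, not from the paper).
    """
    scores = []
    for band_clean, band_proc in zip(clean_env, proc_env):
        for start in range(len(band_clean) - seg_len + 1):
            scores.append(correlation(band_clean[start:start + seg_len],
                                      band_proc[start:start + seg_len]))
    # No full segment available: report 0 rather than dividing by zero.
    return sum(scores) / len(scores) if scores else 0.0
```

Because the score is a correlation, it is insensitive to an overall gain applied to the processed signal, which is one reason a correlation-style comparison is a natural choice for TF-weighted speech.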

1,224 citations

Journal Article•10.1016/J.SPECOM.2009.08.002•

Silent speech interfaces

Bruce Denby¹, Tanja Schultz², K. Honda, Thomas Hueber³, James M. Gilbert⁴, Jonathan S. Brumberg⁵•Institutions (5)

Pierre-and-Marie-Curie University¹, Karlsruhe Institute of Technology², ESPCI ParisTech³, University of Hull⁴, Boston University⁵

01 Apr 2010-Speech Communication

TL;DR: The article first outlines the emergence of the silent speech interface from the fields of speech production, automatic speech processing, speech pathology research, and telecommunications privacy issues, and then follows with a presentation of demonstrator systems based on seven different types of technologies.

544 citations

Book•

Theory and Applications of Digital Speech Processing

Lawrence R. Rabiner, Ronald W. Schafer

13 Mar 2010

TL;DR: This new text presents the basic concepts and theories of speech processing with clarity and currency, while providing hands-on computer-based laboratory experiences for students.

Abstract: Theory and Applications of Digital Speech Processing is ideal for graduate students in digital signal processing, and undergraduate students in Electrical and Computer Engineering. With its clear, up-to-date, hands-on coverage of digital speech processing, this text is also suitable for practicing engineers in speech processing. This new text presents the basic concepts and theories of speech processing with clarity and currency, while providing hands-on computer-based laboratory experiences for students. The material is organized in a manner that builds a strong foundation of basics first, and then concentrates on a range of signal processing methods for representing and processing the speech signal.

495 citations

Journal Article•10.3109/14992027.2010.506889•

Development and analysis of an International Speech Test Signal (ISTS)

Inga Holube, Stefan Fredelake¹, Marcel S. M. G. Vlaming², Birger Kollmeier¹•Institutions (2)

University of Oldenburg¹, VU University Amsterdam²

11 Nov 2010-International Journal of Audiology

TL;DR: The International Speech Test Signal (ISTS) is based on natural recordings but is largely non-intelligible because of segmentation and remixing; the primary intention is to include it, together with a new measurement method, in a new hearing aid standard (IEC 60118-15).

Abstract: For analysing the processing of speech by a hearing instrument, a standard test signal is necessary which allows for reproducible measurement conditions, and which features as many of the m...

397 citations

Journal Article•10.1016/J.TICS.2010.06.005•

Cortical speech processing unplugged: a timely subcortico-cortical framework.

Sonja A. Kotz¹, Michael Schwartze¹•Institutions (1)

Max Planck Society¹

01 Sep 2010-Trends in Cognitive Sciences

TL;DR: An integrative speech processing framework is developed by synthesizing evolutionary, anatomical and neurofunctional concepts of auditory, temporal and speech processing into a network that extends cortical speech processing systems with cortical and subcortical systems associated with motor control.

381 citations

Journal Article•10.1109/TASL.2010.2041110•

A Tandem Algorithm for Pitch Estimation and Voiced Speech Segregation

Guoning Hu¹, DeLiang Wang¹•Institutions (1)

Ohio State University¹

01 Nov 2010-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A tandem algorithm is proposed that performs pitch estimation of a target utterance and segregation of voiced portions of target speech jointly and iteratively and performs substantially better than previous systems for either pitch extraction or voiced speech segregation.

Abstract: A lot of effort has been made in computational auditory scene analysis (CASA) to segregate speech from monaural mixtures. The performance of current CASA systems on voiced speech segregation is limited by lacking a robust algorithm for pitch estimation. We propose a tandem algorithm that performs pitch estimation of a target utterance and segregation of voiced portions of target speech jointly and iteratively. This algorithm first obtains a rough estimate of target pitch, and then uses this estimate to segregate target speech using harmonicity and temporal continuity. It then improves both pitch estimation and voiced speech segregation iteratively. Novel methods are proposed for performing segregation with a given pitch estimate and pitch determination with given segregation. Systematic evaluation shows that the tandem algorithm extracts a majority of target speech without including much interference, and it performs substantially better than previous systems for either pitch extraction or voiced speech segregation.

333 citations

Journal Article•10.1016/J.SPECOM.2010.08.014•

Non-native speech perception in adverse conditions: A review

Maria Luisa Garcia Lecumberri¹, Martin Cooke¹, Anne Cutler²•Institutions (2)

University of the Basque Country¹, Max Planck Society²

01 Nov 2010-Speech Communication

TL;DR: In this article, the authors reviewed experimental studies on non-native listening in adverse conditions, organized around three principal contributory factors: the task facing listeners, the effect of adverse conditions on speech, and the differences among listener populations.

312 citations

Journal Article•10.1109/TSP.2009.2034935•

DOA Estimation of Quasi-Stationary Signals With Less Sensors Than Sources and Unknown Spatial Noise Covariance: A Khatri–Rao Subspace Approach

Wing-Kin Ma¹, Tsung-Han Hsieh², Chong-Yung Chi³•Institutions (3)

The Chinese University of Hong Kong¹, Realtek², National Tsing Hua University³

01 Apr 2010-IEEE Transactions on Signal Processing

TL;DR: This paper considers the problem of direction-of-arrival (DOA) estimation of quasi-stationary signals and develops a Khatri-Rao (KR) subspace approach that provides a simple yet effective way of eliminating the unknown spatial noise covariance from the signal SOSs.

Abstract: In real-world applications such as those for speech and audio, there are signals that are nonstationary but can be modeled as being stationary within local time frames. Such signals are generally called quasi-stationary or locally stationary signals. This paper considers the problem of direction-of-arrival (DOA) estimation of quasi-stationary signals. Specifically, in our problem formulation we assume: i) sensor array of uniform linear structure; ii) mutually uncorrelated wide-sense quasi-stationary source signals; and iii) wide-sense stationary noise process with unknown, possibly nonwhite, spatial covariance. Under the assumptions above and by judiciously examining the structures of local second-order statistics (SOSs), we develop a Khatri-Rao (KR) subspace approach that has two notable advantages. First, through an identifiability analysis, it is proven that this KR subspace approach can operate even when the number of sensors is about half of the number of sources. The idea behind is to make use of a "virtual" array structure provided inherently in the local SOS model, of which the degree of freedom is about twice of that of the physical array. Second, the KR formulation naturally provides a simple yet effective way of eliminating the unknown spatial noise covariance from the signal SOSs. Extensive simulation results are provided to demonstrate the effectiveness of the KR subspace approach under various situations.
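
The "virtual" array with roughly twice the degrees of freedom can be made concrete with a small sketch. For a uniform linear array with N sensors, one column of the Khatri-Rao product is the Kronecker product of the conjugated steering vector with itself, and its entries depend only on the sensor-index difference, which takes 2N - 1 distinct values. The function names and the phase parameterisation below are assumptions for illustration; the sketch does not implement the paper's subspace estimator.

```python
import cmath


def steering_vector(num_sensors, phase):
    """Uniform linear array response; `phase` is the inter-sensor phase shift."""
    return [cmath.exp(1j * n * phase) for n in range(num_sensors)]


def kr_column(a):
    """Kronecker product conj(a) (x) a, i.e. one column of the Khatri-Rao product."""
    return [am.conjugate() * an for am in a for an in a]


def distinct_entries(column, digits=6):
    """Set of distinct complex values in the column, up to rounding."""
    return {(round(z.real, digits), round(z.imag, digits)) for z in column}


physical = steering_vector(4, 0.7)
virtual = distinct_entries(kr_column(physical))
# 4 physical sensors give 16 products but only 2 * 4 - 1 distinct responses.
print(len(physical), len(virtual))
```

The N squared products collapse to 2N - 1 distinct values, which matches the abstract's statement that the virtual array has about twice the degrees of freedom of the physical one.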

301 citations

Journal Article•10.1016/J.CSL.2009.02.006•

Monaural speech separation and recognition challenge

Martin Cooke¹, John R. Hershey², Steven J. Rennie²•Institutions (2)

Ikerbasque¹, IBM²

01 Jan 2010-Computer Speech & Language

TL;DR: The purpose of the monaural speech separation and recognition challenge was to permit a large-scale comparison of techniques for the competing talker problem; several systems achieved near-human performance in some conditions, and one out-performed listeners overall.

248 citations

Journal Article•10.1016/J.BANDL.2009.09.003•

Language or music, mother or Mozart? Structural and environmental influences on infants' language networks.

Ghislaine Dehaene-Lambertz, A. Montavont¹,², A. Jobert¹,², L. Allirol³, Jessica Dubois³, Lucie Hertz-Pannier, Stanislas Dehaene•Institutions (3)

French Institute of Health and Medical Research¹, University of Paris², IBM³

01 Aug 2010-Brain and Language

TL;DR: In this paper, the authors used fMRI to study the organization of brain activity in two-month-old infants when listening to speech or to music, and explored how infants react to their mother's voice relative to an unknown voice, finding that the well-known structural asymmetry already present in infants' posterior temporal areas has a functional counterpart: there is a left-hemisphere advantage for speech relative to music at the level of the planum temporale.

Book•

Phonetic Analysis of Speech Corpora

Jonathan Harrington

12 Apr 2010

TL;DR: This workbook-style text provides an extensive set of exercises to help readers develop the necessary skills to design and carry out experiments in speech research and offers the first step-by-step treatment of advanced techniques in experimental phonetics using speech corpora and downloadable software.

Abstract: An accessible introduction to the phonetic analysis of speech corpora, this workbook-style text provides an extensive set of exercises to help readers develop the necessary skills to design and carry out experiments in speech research. Offers the first step-by-step treatment of advanced techniques in experimental phonetics using speech corpora and downloadable software, including the R programming language. Introduces methods of analyzing phonetically-labelled speech corpora, with the goal of testing hypotheses that often arise in experimental phonetics and laboratory phonology. Incorporates an extensive set of exercises and answers to reinforce the techniques introduced. Accessibly written with easy-to-follow computer commands and spectrograms of speech. Companion website at www.wiley.com/go/harrington, which includes illustrations, video tutorials, appendices, and downloadable speech corpora for testing purposes. Discusses techniques in digital speech processing and in structuring and querying annotations from speech corpora. Includes substantial coverage of analysis, including measuring gestural synchronization using EMA, the acoustics of vowels, consonant overlap using EPG, spectral analysis of fricatives and obstruents, and the probabilistic classification of acoustic speech data.

Journal Article•10.1016/J.CSL.2008.11.001•

Super-human multi-talker speech recognition: A graphical modeling approach

John R. Hershey¹, Steven J. Rennie¹, Peder A. Olsen¹, Trausti Kristjansson²•Institutions (2)

IBM¹, Google²

01 Jan 2010-Computer Speech & Language

TL;DR: A system that can separate and recognize the simultaneous speech of two people recorded in a single channel is presented, and it is shown how belief propagation reduces the complexity of temporal inference from exponential to linear in the number of sources and the size of the language model.

Journal Article•10.1016/J.SPECOM.2010.02.004•

Single-channel speech enhancement using spectral subtraction in the short-time modulation domain

Kuldip K. Paliwal¹, Kamil Wojcicki¹, Belinda Schwerin¹•Institutions (1)

Griffith University¹

01 May 2010-Speech Communication

TL;DR: The results indicate that a suitable choice of modulation frame duration provides a good compromise between different types of spectral distortions, namely musical noise and temporal slurring, and that with such a choice the proposed modulation spectral subtraction does not suffer from the musical noise artifacts typically associated with acoustic spectral subtraction.

Journal Article•10.1016/J.SPECOM.2009.12.002•

Modeling coarticulation in EMG-based continuous speech recognition

Tanja Schultz¹, Michael Wand¹•Institutions (1)

Karlsruhe Institute of Technology¹

01 Apr 2010-Speech Communication

TL;DR: The new approach of phonetic feature bundling for modeling coarticulation in EMG-based speech recognition is described, and results are reported on the recently collected EMG-PIT corpus, a multiple speaker large vocabulary database of silent and audible EMG speech recordings.

Journal Article•10.1109/TMM.2010.2052239•

A 3-D Audio-Visual Corpus of Affective Communication

Gabriele Fanelli¹, Jürgen Gall¹, Harald Romsdorfer², Thibaut Weise³, L. Van Gool¹•Institutions (3)

ETH Zurich¹, Graz University of Technology², École Polytechnique Fédérale de Lausanne³

01 Oct 2010-IEEE Transactions on Multimedia

TL;DR: This work presents a new audio-visual corpus for possibly the two most important modalities used by humans to communicate their emotional states, namely speech and facial expression in the form of dense dynamic 3-D face geometries.

Abstract: Communication between humans deeply relies on the capability of expressing and recognizing feelings. For this reason, research on human-machine interaction needs to focus on the recognition and simulation of emotional states, prerequisite of which is the collection of affective corpora. Currently available datasets still represent a bottleneck for the difficulties arising during the acquisition and labeling of affective data. In this work, we present a new audio-visual corpus for possibly the two most important modalities used by humans to communicate their emotional states, namely speech and facial expression in the form of dense dynamic 3-D face geometries. We acquire high-quality data by working in a controlled environment and resort to video clips to induce affective states. The annotation of the speech signal includes: transcription of the corpus text into the phonological representation, accurate phone segmentation, fundamental frequency extraction, and signal intensity estimation of the speech signals. We employ a real-time 3-D scanner to acquire dense dynamic facial geometries and track the faces throughout the sequences, achieving full spatial and temporal correspondences. The corpus is a valuable tool for applications like affective visual speech synthesis or view-independent facial expression recognition.

Journal Article•10.1109/TASL.2009.2024731•

New Insights Into the MVDR Beamformer in Room Acoustics

Emanuel A. P. Habets¹, Jacob Benesty², Israel Cohen³, Sharon Gannot⁴, Jacek P. Dmochowski²•Institutions (4)

Imperial College London¹, Université du Québec², Technion – Israel Institute of Technology³, Bar-Ilan University⁴

01 Jan 2010-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction, and shows that maximum noise reduction is achieved when the MVDR beamformer is used for noise reduction only.

Abstract: The minimum variance distortionless response (MVDR) beamformer, also known as Capon's beamformer, is widely studied in the area of speech enhancement. The MVDR beamformer can be used for both speech dereverberation and noise reduction. This paper provides new insights into the MVDR beamformer. Specifically, the local and global behavior of the MVDR beamformer is analyzed and novel forms of the MVDR filter are derived and discussed. In earlier works it was observed that there is a tradeoff between the amount of speech dereverberation and noise reduction when the MVDR beamformer is used. Here, the tradeoff between speech dereverberation and noise reduction is analyzed thoroughly. The local and global behavior, as well as the tradeoff, is analyzed for different noise fields such as, for example, a mixture of coherent and non-coherent noise fields, entirely non-coherent noise fields and diffuse noise fields. It is shown that maximum noise reduction is achieved when the MVDR beamformer is used for noise reduction only. The amount of noise reduction that is sacrificed when complete dereverberation is required depends on the direct-to-reverberation ratio of the acoustic impulse response between the source and the reference microphone. The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction. When desiring both speech dereverberation and noise reduction, the results also demonstrate that the amount of noise reduction that is sacrificed decreases when the number of microphones increases.
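
For readers unfamiliar with the MVDR beamformer, its textbook form is w = R^{-1} d / (d^H R^{-1} d), where R is the noise covariance matrix and d the steering vector towards the desired source. The sketch below evaluates it for a two-microphone array so that the matrix inverse can be written in closed form. The symbols R and d, the function names, and the example values are assumptions for illustration; the paper's own local and global filter forms are not reproduced here.

```python
def mvdr_weights_2mic(R, d):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d) for two microphones.

    R: 2x2 Hermitian noise covariance matrix (nested lists of complex numbers).
    d: steering vector towards the desired source (two complex numbers).
    """
    (a, b), (c, e) = R
    det = a * e - b * c
    # Closed-form inverse of a 2x2 matrix.
    R_inv = [[e / det, -b / det], [-c / det, a / det]]
    R_inv_d = [R_inv[0][0] * d[0] + R_inv[0][1] * d[1],
               R_inv[1][0] * d[0] + R_inv[1][1] * d[1]]
    denom = d[0].conjugate() * R_inv_d[0] + d[1].conjugate() * R_inv_d[1]
    return [v / denom for v in R_inv_d]


def response(w, d):
    """Beamformer response w^H d in the direction described by d."""
    return sum(wi.conjugate() * di for wi, di in zip(w, d))


# Illustrative values only: correlated noise and a unit-modulus steering vector.
R = [[2.0 + 0j, 0.5 + 0.2j], [0.5 - 0.2j, 1.0 + 0j]]
d = [1.0 + 0j, 0.6 - 0.8j]
w = mvdr_weights_2mic(R, d)
print(abs(response(w, d)))  # distortionless constraint: magnitude 1
```

The normalisation by d^H R^{-1} d is what enforces the distortionless constraint: the desired direction passes with unit gain while the output noise power is minimised.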

Proceedings Article•10.1109/BTAS.2010.5634515•

Subject identification from electroencephalogram (EEG) signals during imagined speech

Katharine Brigham¹, B. V. K. Vijaya Kumar¹•Institutions (1)

Carnegie Mellon University¹

11 Nov 2010

TL;DR: The proposed approach was tested on a publicly available database consisting of EEG signals corresponding to Visual Evoked Potentials to test the applicability of the proposed method on a larger number of subjects, and it was able to classify 120 subjects with 98.96% accuracy.

Abstract: We investigate the potential of using electrical brainwave signals during imagined speech to identify which subject the signals originated from. Electroencephalogram (EEG) signals were recorded at the University of California, Irvine (UCI) from 6 volunteer subjects imagining speaking one of two syllables, /ba/ and /ku/, at different rhythms without performing any overt actions. In this work, we assess the degree of subject-to-subject variation and the feasibility of using imagined speech for subject identification. The EEG data are first preprocessed to reduce the effects of artifacts and noise, and autoregressive (AR) coefficients are extracted from each electrode's signal and concatenated for subject identification using a linear SVM classifier. The subjects were identifiable to a 99.76% accuracy, which indicates a clear potential for using imagined speech EEG data for biometric identification due to its strong inter-subject variation. Furthermore, the subject identification appears to be tolerant to differing conditions such as different imagined syllables and rhythms (as it is expected that the subjects will not imagine speaking the syllables at exactly the same rhythms from trial to trial). The proposed approach was also tested on a publicly available database consisting of EEG signals corresponding to Visual Evoked Potentials (VEPs) to test the applicability of the proposed method on a larger number of subjects, and it was able to classify 120 subjects with 98.96% accuracy.
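
The feature-extraction step described above (AR coefficients per electrode, concatenated into one vector) can be sketched as follows. The abstract does not say how the AR coefficients were estimated or which model order was used, so the Yule-Walker equations solved by the Levinson-Durbin recursion and the default order of 6 are assumptions, as are all function names. The classifier itself is not shown; the resulting vector would be passed to a linear SVM.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of a real signal."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for an AR model.

    Returns [a1, ..., a_order] such that x[t] is predicted by
    a1 * x[t-1] + ... + a_order * x[t-order].
    """
    a = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - 1 - k] for k in range(m - 1))
        k_m = acc / err  # reflection coefficient of stage m
        a = [a[k] - k_m * a[m - 2 - k] for k in range(m - 1)] + [k_m]
        err *= 1.0 - k_m * k_m
    return a


def ar_feature_vector(channels, order=6):
    """Concatenate the AR coefficients of every electrode signal.

    channels: one list of samples per electrode (hypothetical layout).
    order: AR model order (assumed value, not taken from the paper).
    """
    features = []
    for signal in channels:
        features.extend(levinson_durbin(autocorrelation(signal, order), order))
    return features
```

With E electrodes and order p, the feature vector has E times p entries, so its length grows linearly with the number of electrodes.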

Journal Article•10.1109/TMM.2010.2051872•

Feature Analysis and Evaluation for Automatic Emotion Identification in Speech

Iker Luengo¹, Eva Navas¹, Inmaculada Hernáez¹•Institutions (1)

University of the Basque Country¹

01 Oct 2010-IEEE Transactions on Multimedia

TL;DR: Analysis of the characteristics of features derived from prosody, spectral envelope, and voice quality, as well as of their capability to discriminate emotions, suggests that spectral envelope features outperform the prosodic ones.

Abstract: The definition of parameters is a crucial step in the development of a system for identifying emotions in speech. Although there is no agreement on which are the best features for this task, it is generally accepted that prosody carries most of the emotional information. Most works in the field use some kind of prosodic features, often in combination with spectral and voice quality parametrizations. Nevertheless, no systematic study has been done comparing these features. This paper presents the analysis of the characteristics of features derived from prosody, spectral envelope, and voice quality as well as their capability to discriminate emotions. In addition, early fusion and late fusion techniques for combining different information sources are evaluated. The results of this analysis are validated with experimental automatic emotion identification tests. Results suggest that spectral envelope features outperform the prosodic ones. Even when different parametrizations are combined, the late fusion of long-term spectral statistics with short-term spectral envelope parameters provides an accuracy comparable to that obtained when all parametrizations are combined.

Journal Article•10.1109/TASL.2009.2028374•

Theoretical Analysis of Binaural Multimicrophone Noise Reduction Techniques

Bram Cornelis¹, Simon Doclo¹, T. Van den Bogaert¹, Marc Moonen¹, Jan Wouters¹•Institutions (1)

Katholieke Universiteit Leuven¹

01 Feb 2010-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: Two extensions of the binaural SDW-MWF that improve binaural cue preservation are analyzed; both are able to preserve binaural cues for the speech and noise sources, while still achieving significant noise reduction performance.

Abstract: Binaural hearing aids use microphone signals from both left and right hearing aid to generate an output signal for each ear. The microphone signals can be processed by a procedure based on speech distortion weighted multichannel Wiener filtering (SDW-MWF) to achieve significant noise reduction in a speech + noise scenario. In binaural procedures, it is also desirable to preserve binaural cues, in particular the interaural time difference (ITD) and interaural level difference (ILD), which are used to localize sounds. It has been shown in previous work that the binaural SDW-MWF procedure only preserves these binaural cues for the desired speech source, but distorts the noise binaural cues. Two extensions of the binaural SDW-MWF have therefore been proposed to improve the binaural cue preservation, namely the MWF with partial noise estimation (MWF-eta) and MWF with interaural transfer function extension (MWF-ITF). In this paper, the binaural cue preservation of these extensions is analyzed theoretically and tested based on objective performance measures. Both extensions are able to preserve binaural cues for the speech and noise sources, while still achieving significant noise reduction performance.

Journal Article•10.1016/J.CSL.2009.03.004•

Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization criterion

Dong Yu¹, Balakrishnan Varadarajan², Li Deng¹, Alex Acero¹•Institutions (2)

Microsoft¹, Johns Hopkins University²

01 Jul 2010-Computer Speech & Language

TL;DR: It is shown that both the traditional confidence-based active learning and semi-supervised learning approaches can be improved by maximizing the lattice entropy reduction over the whole dataset.

Patent•

System and method for open speech recognition

Mazin Gilbert¹, Srinivas Bangalore¹, Patrick Haffner¹, Robert M. Bell¹•Institutions (1)

AT&T¹

30 Sep 2010

TL;DR: The disclosure presents systems, methods and non-transitory computer-readable media for performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech.

Abstract: Disclosed herein are systems, methods and non-transitory computer-readable media for performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech. The disclosure includes recognizing received speech with a collection of domain-specific speech recognizers, determining a speech recognition confidence for each of the speech recognition outputs, selecting speech recognition candidates based on a respective speech recognition confidence for each speech recognition output, and combining selected speech recognition candidates to generate text based on the combination.
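
The four steps in the abstract (recognize with several domain-specific recognizers, score each output, select by confidence, combine) can be sketched as below. The abstract does not specify how candidates are combined, so the confidence-weighted vote over whole hypotheses is an assumption, as are the recognizer interface (a callable returning a text and a confidence), the threshold value, and the function name.

```python
from collections import defaultdict


def combine_recognizers(audio, recognizers, min_confidence=0.5):
    """Run every domain-specific recognizer and merge the confident outputs.

    recognizers: callables mapping audio to (text, confidence); this
    interface is hypothetical and stands in for real recognizers.
    """
    candidates = [recognize(audio) for recognize in recognizers]
    selected = [(text, conf) for text, conf in candidates
                if conf >= min_confidence]
    if not selected:
        # Nothing passed the threshold: keep the single most confident output.
        selected = [max(candidates, key=lambda c: c[1])]
    # Assumed combination rule: sum the confidences of identical hypotheses.
    votes = defaultdict(float)
    for text, conf in selected:
        votes[text] += conf
    return max(votes, key=votes.get)
```

Two moderately confident recognizers that agree can outvote one highly confident recognizer that disagrees, which is the practical benefit of combining rather than simply taking the top-scoring output.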

Proceedings Article•

The CHiME corpus: a resource and a challenge for computational hearing in multisource environments.

Heidi Christensen¹, Jon Barker¹, Ning Ma¹, Phil D. Green¹•Institutions (1)

University of Sheffield¹

1 Jan 2010

TL;DR: A new corpus designed for noise-robust speech processing research, CHiME, is presented; it includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment.

Abstract: We present a new corpus designed for noise-robust speech processing research, CHiME. Our goal was to produce material which is both natural (derived from reverberant domestic environments with many simultaneous and unpredictable sound sources) and controlled (providing an enumerated range of SNRs spanning 20 dB). The corpus includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment. These have been used to add target utterances from the Grid speech recognition corpus into the CHiME domestic setting. Data has been mixed in a manner that produces a controlled and yet natural range of SNRs over which speech separation, enhancement and recognition algorithms can be evaluated. The paper motivates the design of the corpus, and describes the collection and post-processing of the data. We also present a set of baseline recognition results. Index Terms: Data collection, Binaural, Spatialisation
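
The abstract does not spell out the mixing procedure. One way to obtain controlled SNRs while keeping the background natural is to leave the recordings unscaled and instead search for the position in the background where the utterance lands closest to the target SNR. The sketch below illustrates that idea under this assumption; the function names are hypothetical, and the convolution with binaural impulse responses is not shown.

```python
from math import log10


def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * log10(power(speech) / power(noise))


def best_offset(utterance, background, target_snr_db, hop=1):
    """Find the background segment whose natural SNR is closest to the target.

    Returns (offset, snr) or None if no usable segment exists.
    """
    best = None
    for offset in range(0, len(background) - len(utterance) + 1, hop):
        segment = background[offset:offset + len(utterance)]
        if power(segment) == 0.0:
            continue  # SNR is undefined against a silent background
        snr = snr_db(utterance, segment)
        if best is None or abs(snr - target_snr_db) < abs(best[1] - target_snr_db):
            best = (offset, snr)
    return best
```

A longer `hop` trades placement accuracy for search speed, which matters when the background runs to many hours.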

Journal Article•10.1097/AUD.0B013E3181EDFBD2•

Comparison of bimodal and bilateral cochlear implant users on speech recognition with competing talker, music perception, affective prosody discrimination, and talker identification.

Helen Cullington¹, Fan-Gang Zeng•Institutions (1)

University of Southampton¹

01 Aug 2010-Ear and Hearing

TL;DR: Although the bimodal cochlear implant group performed better than the bilateral group on most parts of the four pitch-related tests, the differences were not statistically significant and the lack of correlation between test results shows that the tasks used are not simply providing a measure of pitch ability.

Abstract: Objectives: Despite excellent performance in speech recognition in quiet, most cochlear implant users have great difficulty with speech recognition in noise, music perception, identifying tone of voice, and discriminating different talkers. This may be partly due to the pitch coding in cochlear implant speech processing. Most current speech processing strategies use only the envelope information; the temporal fine structure is discarded. One way to improve electric pitch perception is to use residual acoustic hearing via a hearing aid on the nonimplanted ear (bimodal hearing). This study aimed to test the hypothesis that bimodal users would perform better than bilateral cochlear implant users on tasks requiring good pitch perception. Design: Four pitch-related tasks were used. 1. Hearing in Noise Test (HINT) sentences spoken by a male talker with a competing female, male, or child talker. 2. Montreal Battery of Evaluation of Amusia. This is a music test with six subtests examining pitch, rhythm and timing perception, and musical memory. 3. Aprosodia Battery. This has five subtests evaluating aspects of affective prosody and recognition of sarcasm. 4. Talker identification using vowels spoken by 10 different talkers (three men, three women, two boys, and two girls). Bilateral cochlear implant users were chosen as the comparison group. Thirteen bimodal and 13 bilateral adult cochlear implant users were recruited; all had good speech perception in quiet. Results: There were no significant differences between the mean scores of the bimodal and bilateral groups on any of the tests, although the bimodal group did perform better than the bilateral group on almost all tests. Performance on the different pitch-related tasks was not correlated, meaning that if a subject performed one task well they would not necessarily perform well on another. The correlation between the bimodal users' hearing threshold levels in the aided ear and their performance on these tasks was weak. Conclusions: Although the bimodal cochlear implant group performed better than the bilateral group on most parts of the four pitch-related tests, the differences were not statistically significant. The lack of correlation between test results shows that the tasks used are not simply providing a measure of pitch ability. Even if the bimodal users have better pitch perception, the real-world tasks used are reflecting more diverse skills than pitch. This research adds to the existing speech perception, language, and localization studies that show no significant difference between bimodal and bilateral cochlear implant users.

Journal Article•10.1016/J.LANGSCI.2009.06.001•

Rich memory and distributed phonology

[...]

Robert F. Port¹•Institutions (1)

Indiana University¹

01 Jan 2010-Language Sciences

TL;DR: In this paper, it is claimed that experimental evidence about human speech processing and the richness of memory for linguistic material supports a distributed view of language where every speaker creates an idiosyncratic perspective on the linguistic conventions of the community, and that people actually employ high-dimensional, spectro-temporal, auditory patterns to support speech production, speech perception and linguistic memory in real time.

Journal Article•10.1109/TASL.2009.2035150•

Gaussian Model-Based Multichannel Speech Presence Probability

[...]

Mehrez Souden¹, Jingdong Chen², Jacob Benesty¹, Sofiene Affes¹•Institutions (2)

Institut national de la recherche scientifique¹, Bell Labs²

01 Jul 2010-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This correspondence establishes a new expression for speech presence probability when an array of microphones with an arbitrary geometry is used, and proposes a multichannel approach that can significantly increase detection accuracy compared with the single-channel case.

Abstract: The knowledge of the target speech presence probability in a mixture of signals captured by a speech communication system is of paramount importance in several applications including reliable noise reduction algorithms. In this correspondence, we establish a new expression for speech presence probability when an array of microphones with an arbitrary geometry is used. Our study is based on the assumption of the Gaussian statistical model for all signals and involves the noise and noisy data statistics only. In comparison with the single-channel case, the new proposed multichannel approach can significantly increase the detection accuracy. In particular, when the additive noise is spatially coherent, perfect speech presence detection is theoretically possible, while when the noise is spatially white, a coherent summation of speech components is performed to allow for enhanced speech presence probability estimation.

Journal Article•10.1093/CERCOR/BHP113•

Absolute Pitch—Functional Evidence of Speech-Relevant Auditory Acuity

[...]

Mathias S. Oechslin¹, Martin Meyer¹, Lutz Jäncke¹•Institutions (1)

University of Zurich¹

01 Feb 2010-Cerebral Cortex

TL;DR: By systematically varying the lexical and/or prosodic information of speech stimuli, the study shows consistent activation differences in AP musicians compared with RP musicians and nonmusicians; the results suggest that the neural underpinnings of pitch processing expertise exercise a strong influence on propositional speech perception (sentence meaning).

Abstract: Absolute pitch (AP) has been shown to be associated with morphological changes and neurophysiological adaptations in the planum temporale, a cortical area involved in higher-order auditory and speech perception processes. The direct link between speech processing and AP has hitherto not been addressed. We provide first evidence that AP compared with relative pitch (RP) ability is associated with significantly different hemodynamic responses to complex speech sounds. By systematically varying the lexical and/or prosodic information of speech stimuli, we demonstrated consistent activation differences in AP musicians compared with RP musicians and nonmusicians. These differences relate to stronger activations in the posterior part of the middle temporal gyrus and weaker activations in the anterior mid-part of the superior temporal gyrus. Furthermore, this pattern is considerably modulated by the auditory acuity of AP. Our results suggest that the neural underpinnings of pitch processing expertise exercise a strong influence on propositional speech perception (sentence meaning).

Journal Article•10.1109/JSTSP.2009.2039171•

Compressive Sensing for Missing Data Imputation in Noise Robust Speech Recognition

[...]

Jort F. Gemmeke¹, H. Van Hamme², Bert Cranen¹, Lou Boves¹•Institutions (2)

Radboud University Nijmegen¹, Katholieke Universiteit Leuven²

22 Feb 2010-IEEE Journal of Selected Topics in Signal Processing

TL;DR: This paper introduces a novel non-parametric, exemplar-based method for reconstructing clean speech from noisy observations, based on techniques from the field of Compressive Sensing, which can impute missing features using larger time windows such as entire words.

Abstract: An effective way to increase the noise robustness of automatic speech recognition is to label noisy speech features as either reliable or unreliable (missing), and to replace (impute) the missing ones by clean speech estimates. Conventional imputation techniques employ parametric models and impute the missing features on a frame-by-frame basis. At low signal-to-noise ratios (SNRs), these techniques fail, because too many time frames may contain few, if any, reliable features. In this paper, we introduce a novel non-parametric, exemplar-based method for reconstructing clean speech from noisy observations, based on techniques from the field of Compressive Sensing. The method, dubbed sparse imputation, can impute missing features using larger time windows such as entire words. Using an overcomplete dictionary of clean speech exemplars, the method finds the sparsest combination of exemplars that jointly approximate the reliable features of a noisy utterance. That linear combination of clean speech exemplars is used to replace the missing features. Recognition experiments on noisy isolated digits show that sparse imputation outperforms conventional imputation techniques at SNR = -5 dB when using an ideal `oracle' mask. With error-prone estimated masks sparse imputation performs slightly worse than the best conventional technique.
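
The imputation idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the paper finds the sparsest combination of exemplars, whereas the code below swaps in a plain greedy matching pursuit as a stand-in, and the function name `sparse_impute`, the `n_atoms` parameter and the toy dictionary are all hypothetical. Exemplars are matched against the reliable feature dimensions only, and the resulting linear combination then fills in the missing dimensions.

```python
def sparse_impute(features, reliable, dictionary, n_atoms=2):
    """Impute unreliable features from a sparse combination of exemplars.

    Assumption: greedy matching pursuit replaces the sparse solver
    used in the paper; this is an illustrative stand-in only.
    """
    idx = [i for i, ok in enumerate(reliable) if ok]
    residual = [features[i] for i in idx]
    weights = [0.0] * len(dictionary)

    for _ in range(n_atoms):
        best, best_coef, best_gain = None, 0.0, 0.0
        for k, atom in enumerate(dictionary):
            sub = [atom[i] for i in idx]          # reliable dims only
            norm2 = sum(v * v for v in sub)
            if norm2 == 0.0:
                continue
            coef = sum(r * v for r, v in zip(residual, sub)) / norm2
            gain = coef * coef * norm2            # drop in squared residual
            if gain > best_gain:
                best, best_coef, best_gain = k, coef, gain
        if best is None or best_gain < 1e-12:
            break                                 # nothing left to explain
        weights[best] += best_coef
        sub = [dictionary[best][i] for i in idx]
        residual = [r - best_coef * v for r, v in zip(residual, sub)]

    # Reconstruct every dimension from the selected exemplars.
    estimate = [
        sum(w * atom[d] for w, atom in zip(weights, dictionary))
        for d in range(len(features))
    ]
    # Keep reliable observations; replace only the missing ones.
    imputed = [f if ok else e for f, ok, e in zip(features, reliable, estimate)]
    return imputed, weights


if __name__ == "__main__":
    # Hypothetical toy data: three exemplars with four feature dimensions.
    exemplars = [
        [1.0, 0.0, 2.0, 0.0],
        [0.0, 1.0, 0.0, 3.0],
        [1.0, -1.0, 5.0, 5.0],
    ]
    noisy = [2.0, 1.0, 99.0, 99.0]        # last two dims are noise-dominated
    mask = [True, True, False, False]     # reliability ("oracle") mask
    print(sparse_impute(noisy, mask, exemplars))
```

In a real recogniser the feature vectors would span many time frames (the abstract mentions windows as long as entire words), and the quality of the reliability mask matters: the abstract reports a clear gain with an oracle mask but slightly worse results with estimated masks.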

Patent•

Speech data retrieving and presenting device

[...]

Chen Ming-Fu, Chen Cheng-Hsiung, Daow-Ming Jiang, Chan-Fa Chiu, Cheng-Jen Lin, Liu Po-Yiu

8 Nov 2010

TL;DR: A speech data retrieving and presenting device, used with an electronic device through a network, includes a data receiving unit, a processing unit and a speech presenting unit. The device can assist a user in obtaining network information and can be operated independently by a simple motion.

Abstract: A speech data retrieving and presenting device applied with an electronic device through a network includes a data receiving unit, a processing unit and a speech presenting unit. The data receiving unit connected to the network receives data of the electronic device through the network. The processing unit coupled to the data receiving unit receives speech data and retrieves a speech presenting signal from the speech data. The speech presenting unit coupled to the processing unit receives the speech presenting signal and outputs a speech according to the speech data. This device can assist a user to obtain network information, and provide the user a more flexible application according to the property that the device can be operated independently by a simple motion.

Journal Article•10.1109/TMM.2010.2058095•

Speech Emotion Analysis: Exploring the Role of Context

[...]

Ashish Tawari¹, Mohan M. Trivedi¹•Institutions (1)

University of California, San Diego¹

01 Oct 2010-IEEE Transactions on Multimedia

TL;DR: A novel set of features based on cepstrum analysis of pitch and intensity contours is introduced and the effects of different contexts on two different databases are systematically analyzed.

Abstract: Automated analysis of human affective behavior has attracted increasing attention in recent years. With the research shift toward spontaneous behavior, many challenges have surfaced, ranging from database collection strategies to the use of new feature sets (e.g., lexical cues apart from prosodic features). Use of contextual information, however, is rarely addressed in the field of affect expression recognition, yet it is evident that affect recognition by humans is largely influenced by context information. Our contribution in this paper is threefold. First, we introduce a novel set of features based on cepstrum analysis of pitch and intensity contours. We evaluate the usefulness of these features on two different databases: the Berlin Database of emotional speech (EMO-DB) and a locally collected audiovisual database in car settings (CVRRCar-AVDB). The overall recognition accuracy achieved for seven emotions in the EMO-DB database is over 84% and over 87% for three emotion classes in CVRRCar-AVDB. This is based on tenfold stratified cross validation. Second, we introduce the collection of a new audiovisual database in an automobile setting (CVRRCar-AVDB). In the current study, we only use the audio channel of the database. Third, we systematically analyze the effects of different contexts on two different databases. We present context analysis of subject and text based on speaker/text-dependent/-independent analysis on EMO-DB. Furthermore, we perform context analysis based on gender information on EMO-DB and CVRRCar-AVDB. The results based on these analyses are promising.
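
The accuracies quoted above are reported under tenfold stratified cross validation. As a rough illustration of what that evaluation protocol means, the sketch below splits sample indices into folds so that each fold keeps approximately the same class proportions as the whole set. The function name `stratified_folds` is hypothetical, the split is deterministic (no shuffling), and nothing here is taken from the authors' code.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Split sample indices into k folds with similar class proportions.

    Assumption: samples are dealt round-robin per class without
    shuffling; real experiments usually shuffle within each class first.
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)      # deal each class across the folds
    return folds


if __name__ == "__main__":
    # Hypothetical labels: 20 samples of one emotion, 10 of another.
    labels = ["anger"] * 20 + ["neutral"] * 10
    for fold_id, test_idx in enumerate(stratified_folds(labels, k=10)):
        train_idx = [i for i in range(len(labels)) if i not in test_idx]
        print(fold_id, len(train_idx), len(test_idx))
```

Each fold serves once as the held-out test set while the remaining nine are used for training, and the reported accuracy is typically the average over the ten runs.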
