TL;DR: This handbook provides the definitive reference on Blind Source Separation, giving a broad and comprehensive description of all the core principles and methods, numerical algorithms and major applications in the fields of telecommunications, biomedical engineering and audio, acoustic and speech processing.
Abstract: Edited by the people who were forerunners in creating the field, together with contributions from 34 leading international experts, this handbook provides the definitive reference on Blind Source Separation, giving a broad and comprehensive description of all the core principles and methods, numerical algorithms and major applications in the fields of telecommunications, biomedical engineering and audio, acoustic and speech processing. Going beyond a machine learning perspective, the book reflects recent results in signal processing and numerical analysis, and includes topics such as optimization criteria, mathematical tools, the design of numerical algorithms, convolutive mixtures, and time-frequency approaches. This Handbook is an ideal reference for university researchers and R&D engineers. It covers algebraic identification of under-determined mixtures, time-frequency methods, Bayesian approaches, blind identification under non-negativity approaches, and semi-blind methods for communications. It also shows the applications of the methods to key application areas such as telecommunications, biomedical engineering, speech, acoustic, audio and music processing, while also giving a general method for developing applications.
TL;DR: An objective intelligibility measure is presented that shows high correlation (rho=0.95) with the intelligibility of both noisy and TF-weighted noisy speech, and that performs significantly better than three other, more sophisticated, objective measures.
Abstract: Existing objective speech-intelligibility measures are suitable for several types of degradation; however, they are less appropriate for methods where noisy speech is processed by a time-frequency (TF) weighting, e.g., noise reduction and speech separation. In this paper, we present an objective intelligibility measure that shows high correlation (rho=0.95) with the intelligibility of both noisy and TF-weighted noisy speech. The proposed method shows significantly better performance than three other, more sophisticated, objective measures. Furthermore, it is based on an intermediate intelligibility measure for short-time (approximately 400 ms) TF-regions, and uses a simple DFT-based TF-decomposition. In addition, a free Matlab implementation is provided.
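The short-time, DFT-based idea in the abstract above can be illustrated with a rough sketch. This is not the published measure: it simply correlates clean and processed magnitude envelopes over short segments, and the 16 kHz sampling rate, frame length, hop size and 48-frame segment (roughly 400 ms) are assumptions chosen for illustration. White noise stands in for speech.

```python
import numpy as np

def stft_mag(x, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed DFT; shape (bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def short_time_correlation(clean, processed, seg_frames=48):
    """Mean correlation between clean and processed TF envelopes over short segments."""
    X, Y = stft_mag(clean), stft_mag(processed)
    scores = []
    for start in range(X.shape[1] - seg_frames + 1):
        xs = X[:, start:start + seg_frames]
        ys = Y[:, start:start + seg_frames]
        # Remove the per-band mean so the score is a correlation coefficient.
        xs = xs - xs.mean(axis=1, keepdims=True)
        ys = ys - ys.mean(axis=1, keepdims=True)
        num = np.sum(xs * ys, axis=1)
        den = np.linalg.norm(xs, axis=1) * np.linalg.norm(ys, axis=1) + 1e-12
        scores.append(np.mean(num / den))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                  # stand-in for 1 s of speech
noisy = clean + 2.0 * rng.standard_normal(16000)    # heavily degraded copy
score_clean = short_time_correlation(clean, clean)
score_noisy = short_time_correlation(clean, noisy)
```

An unprocessed signal scores close to 1 against itself, and the score drops as the processed signal diverges from the clean reference.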
TL;DR: The article first outlines the emergence of the silent speech interface from the fields of speech production, automatic speech processing, speech pathology research, and telecommunications privacy issues, and then follows with a presentation of demonstrator systems based on seven different types of technologies.
TL;DR: This new text presents the basic concepts and theories of speech processing with clarity and currency, while providing hands-on computer-based laboratory experiences for students.
Abstract: Theory and Applications of Digital Speech Processing is ideal for graduate students in digital signal processing, and undergraduate students in Electrical and Computer Engineering. With its clear, up-to-date, hands-on coverage of digital speech processing, this text is also suitable for practicing engineers in speech processing. This new text presents the basic concepts and theories of speech processing with clarity and currency, while providing hands-on computer-based laboratory experiences for students. The material is organized in a manner that builds a strong foundation of basics first, and then concentrates on a range of signal processing methods for representing and processing the speech signal.
TL;DR: The primary intention is to include this test signal, which is based on natural recordings but is largely non-intelligible because of segmentation and remixing, with a new measurement method for a new hearing aid standard (IEC 60118-15).
Abstract: For analysing the processing of speech by a hearing instrument, a standard test signal is necessary which allows for reproducible measurement conditions, and which features as many of the m...
TL;DR: An integrative speech processing framework is developed by synthesizing evolutionary, anatomical and neurofunctional concepts of auditory, temporal and speech processing into a network that extends cortical speech processing systems with cortical and subcortical systems associated with motor control.
TL;DR: A tandem algorithm is proposed that performs pitch estimation of a target utterance and segregation of voiced portions of target speech jointly and iteratively and performs substantially better than previous systems for either pitch extraction or voiced speech segregation.
Abstract: A lot of effort has been made in computational auditory scene analysis (CASA) to segregate speech from monaural mixtures. The performance of current CASA systems on voiced speech segregation is limited by lacking a robust algorithm for pitch estimation. We propose a tandem algorithm that performs pitch estimation of a target utterance and segregation of voiced portions of target speech jointly and iteratively. This algorithm first obtains a rough estimate of target pitch, and then uses this estimate to segregate target speech using harmonicity and temporal continuity. It then improves both pitch estimation and voiced speech segregation iteratively. Novel methods are proposed for performing segregation with a given pitch estimate and pitch determination with given segregation. Systematic evaluation shows that the tandem algorithm extracts a majority of target speech without including much interference, and it performs substantially better than previous systems for either pitch extraction or voiced speech segregation.
TL;DR: In this article, the authors reviewed experimental studies on non-native listening in adverse conditions, organized around three principal contributory factors: the task facing listeners, the effect of adverse conditions on speech, and the differences among listener populations.
TL;DR: This paper considers the problem of direction-of-arrival (DOA) estimation of quasi-stationary signals and develops a Khatri-Rao (KR) subspace approach that provides a simple yet effective way of eliminating the unknown spatial noise covariance from the signal SOSs.
Abstract: In real-world applications such as those for speech and audio, there are signals that are nonstationary but can be modeled as being stationary within local time frames. Such signals are generally called quasi-stationary or locally stationary signals. This paper considers the problem of direction-of-arrival (DOA) estimation of quasi-stationary signals. Specifically, in our problem formulation we assume: i) a sensor array of uniform linear structure; ii) mutually uncorrelated wide-sense quasi-stationary source signals; and iii) a wide-sense stationary noise process with unknown, possibly nonwhite, spatial covariance. Under the assumptions above and by judiciously examining the structures of local second-order statistics (SOSs), we develop a Khatri-Rao (KR) subspace approach that has two notable advantages. First, through an identifiability analysis, it is proven that this KR subspace approach can operate even when the number of sensors is about half of the number of sources. The underlying idea is to make use of a "virtual" array structure provided inherently in the local SOS model, whose degree of freedom is about twice that of the physical array. Second, the KR formulation naturally provides a simple yet effective way of eliminating the unknown spatial noise covariance from the signal SOSs. Extensive simulation results are provided to demonstrate the effectiveness of the KR subspace approach under various situations.
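The "virtual" array mentioned in the abstract above comes from vectorizing a local covariance matrix: for uncorrelated sources, vec(A D A^H) equals the Khatri-Rao product of conj(A) and A times the vector of source powers. The sketch below only checks that identity numerically for a noise-free frame; the half-wavelength spacing, the DOAs and the source powers are assumed values, and the full DOA estimator from the paper is not reproduced.

```python
import numpy as np

def steering_matrix(n_sensors, doas_deg, spacing=0.5):
    """Uniform linear array steering vectors; spacing is in wavelengths."""
    n = np.arange(n_sensors)[:, None]
    theta = np.deg2rad(np.asarray(doas_deg))[None, :]
    return np.exp(-2j * np.pi * spacing * n * np.sin(theta))

def khatri_rao(A, B):
    """Column-wise Kronecker product of two matrices with equal column counts."""
    return np.stack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])], axis=1)

# Five sources observed by only four sensors (assumed scenario).
A = steering_matrix(n_sensors=4, doas_deg=[-40, -10, 20, 35, 60])
d = np.array([1.0, 0.5, 2.0, 1.5, 0.8])     # source powers in one local frame
R = A @ np.diag(d) @ A.conj().T             # noise-free local covariance

virtual = khatri_rao(A.conj(), A)           # 16 x 5 virtual array response
vec_R = R.reshape(-1, order="F")            # column-major vec(R)
virtual_rank = np.linalg.matrix_rank(virtual)
```

The virtual response has full column rank 5 even though the physical array has 4 sensors, which is why more sources than sensors can be handled.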
TL;DR: The purpose of the monaural speech separation and recognition challenge was to permit a large-scale comparison of techniques for the competing talker problem; several systems achieved near-human performance in some conditions, and one outperformed listeners overall.
TL;DR: In this paper, the authors used fMRI to study the organization of brain activity in two-month-old infants when listening to speech or to music, and explored how infants react to their mother's voice relative to an unknown voice, finding that the well-known structural asymmetry already present in infants' posterior temporal areas has a functional counterpart: there is a left-hemisphere advantage for speech relative to music at the level of the planum temporale.
TL;DR: This workbook-style text provides an extensive set of exercises to help readers develop the necessary skills to design and carry out experiments in speech research and offers the first step-by-step treatment of advanced techniques in experimental phonetics using speech corpora and downloadable software.
Abstract: An accessible introduction to the phonetic analysis of speech corpora, this workbook-style text provides an extensive set of exercises to help readers develop the necessary skills to design and carry out experiments in speech research. Offers the first step-by-step treatment of advanced techniques in experimental phonetics using speech corpora and downloadable software, including the R programming language. Introduces methods of analyzing phonetically-labelled speech corpora, with the goal of testing hypotheses that often arise in experimental phonetics and laboratory phonology. Incorporates an extensive set of exercises and answers to reinforce the techniques introduced. Accessibly written with easy-to-follow computer commands and spectrograms of speech. Companion website at www.wiley.com/go/harrington, which includes illustrations, video tutorials, appendices, and downloadable speech corpora for testing purposes. Discusses techniques in digital speech processing and in structuring and querying annotations from speech corpora. Includes substantial coverage of analysis, including measuring gestural synchronization using EMA, the acoustics of vowels, consonant overlap using EPG, spectral analysis of fricatives and obstruents, and the probabilistic classification of acoustic speech data.
TL;DR: A system is presented that can separate and recognize the simultaneous speech of two people recorded in a single channel, and belief propagation is shown to reduce the complexity of temporal inference from exponential to linear in the number of sources and the size of the language model.
TL;DR: The results indicate that modulation frame durations, provide a good compromise between different types of spectral distortions, namely musical noise and temporal slurring, and given a proper selection of modulation frame duration, the proposed modulation spectral subtraction does not suffer from musical noise artifacts typically associated with acoustic spectral subtraction.
TL;DR: The new approach of phonetic feature bundling for modeling coarticulation in EMG-based speech recognition is described, and results are reported on the EMG-PIT corpus, a recently collected multiple-speaker, large-vocabulary database of silent and audible EMG speech recordings.
TL;DR: This work presents a new audio-visual corpus for possibly the two most important modalities used by humans to communicate their emotional states, namely speech and facial expression in the form of dense dynamic 3-D face geometries.
Abstract: Communication between humans deeply relies on the capability of expressing and recognizing feelings. For this reason, research on human-machine interaction needs to focus on the recognition and simulation of emotional states, a prerequisite of which is the collection of affective corpora. Currently available datasets still represent a bottleneck because of the difficulties arising during the acquisition and labeling of affective data. In this work, we present a new audio-visual corpus for possibly the two most important modalities used by humans to communicate their emotional states, namely speech and facial expression in the form of dense dynamic 3-D face geometries. We acquire high-quality data by working in a controlled environment and resort to video clips to induce affective states. The annotation of the speech signal includes: transcription of the corpus text into the phonological representation, accurate phone segmentation, fundamental frequency extraction, and signal intensity estimation of the speech signals. We employ a real-time 3-D scanner to acquire dense dynamic facial geometries and track the faces throughout the sequences, achieving full spatial and temporal correspondences. The corpus is a valuable tool for applications like affective visual speech synthesis or view-independent facial expression recognition.
TL;DR: The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction, and shows that maximum noise reduction is achieved when the MVDR beamformer is used for noise reduction only.
Abstract: The minimum variance distortionless response (MVDR) beamformer, also known as Capon's beamformer, is widely studied in the area of speech enhancement. The MVDR beamformer can be used for both speech dereverberation and noise reduction. This paper provides new insights into the MVDR beamformer. Specifically, the local and global behavior of the MVDR beamformer is analyzed and novel forms of the MVDR filter are derived and discussed. In earlier works it was observed that there is a tradeoff between the amount of speech dereverberation and noise reduction when the MVDR beamformer is used. Here, the tradeoff between speech dereverberation and noise reduction is analyzed thoroughly. The local and global behavior, as well as the tradeoff, is analyzed for different noise fields such as, for example, a mixture of coherent and non-coherent noise fields, entirely non-coherent noise fields and diffuse noise fields. It is shown that maximum noise reduction is achieved when the MVDR beamformer is used for noise reduction only. The amount of noise reduction that is sacrificed when complete dereverberation is required depends on the direct-to-reverberation ratio of the acoustic impulse response between the source and the reference microphone. The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction. When desiring both speech dereverberation and noise reduction, the results also demonstrate that the amount of noise reduction that is sacrificed decreases when the number of microphones increases.
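For reference, the classical narrowband MVDR filter minimizes output noise power subject to a distortionless constraint, giving w = R^{-1} d / (d^H R^{-1} d). The sketch below applies it to a single frequency bin with an assumed free-field steering vector and an assumed noise field made of one coherent interferer plus non-coherent sensor noise. It does not model reverberation or the dereverberation tradeoff analyzed in the paper.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """w = R^{-1} d / (d^H R^{-1} d): minimum noise power subject to w^H d = 1."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

n_mics = 4
mic_idx = np.arange(n_mics)
# Half-wavelength spaced linear array (assumed geometry).
d = np.exp(-1j * np.pi * mic_idx * np.sin(np.deg2rad(20.0)))    # target direction
v = np.exp(-1j * np.pi * mic_idx * np.sin(np.deg2rad(-50.0)))   # interferer direction
# Coherent interferer plus non-coherent (spatially white) noise.
R = 10.0 * np.outer(v, v.conj()) + 0.1 * np.eye(n_mics)

w = mvdr_weights(R, d)
target_gain = w.conj() @ d                  # distortionless response toward the target
noise_out = np.real(w.conj() @ R @ w)       # residual noise power at the output
noise_ref = np.real(R[0, 0])                # noise power at the reference microphone
```

Comparing `noise_out` with `noise_ref` gives the noise reduction relative to the reference microphone for this particular noise field.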
TL;DR: The proposed approach was tested on a publicly available database consisting of EEG signals corresponding to Visual Evoked Potentials to test the applicability of the proposed method on a larger number of subjects, and it was able to classify 120 subjects with 98.96% accuracy.
Abstract: We investigate the potential of using electrical brainwave signals during imagined speech to identify which subject the signals originated from. Electroencephalogram (EEG) signals were recorded at the University of California, Irvine (UCI) from 6 volunteer subjects imagining speaking one of two syllables, /ba/ and /ku/, at different rhythms without performing any overt actions. In this work, we assess the degree of subject-to-subject variation and the feasibility of using imagined speech for subject identification. The EEG data are first preprocessed to reduce the effects of artifacts and noise, and autoregressive (AR) coefficients are extracted from each electrode's signal and concatenated for subject identification using a linear SVM classifier. The subjects were identifiable with 99.76% accuracy, which indicates a clear potential for using imagined speech EEG data for biometric identification due to its strong inter-subject variation. Furthermore, the subject identification appears to be tolerant to differing conditions such as different imagined syllables and rhythms (as it is expected that the subjects will not imagine speaking the syllables at exactly the same rhythms from trial to trial). The proposed approach was also tested on a publicly available database consisting of EEG signals corresponding to Visual Evoked Potentials (VEPs) to test the applicability of the proposed method on a larger number of subjects, and it was able to classify 120 subjects with 98.96% accuracy.
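The feature extraction step described above (per-electrode AR coefficients, concatenated into one vector) might look roughly like the following. The least-squares fitting method, the model order of 6 and the helper names are assumptions, since the abstract does not specify them; the classifier stage is omitted, and a synthetic AR(2) process stands in for EEG data.

```python
import numpy as np

def ar_coefficients(signal, order):
    """Least-squares fit of x[n] = sum_k a[k] * x[n-1-k] + e[n]."""
    rows = [signal[n - order:n][::-1] for n in range(order, len(signal))]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), signal[order:], rcond=None)
    return coeffs

def feature_vector(trial, order=6):
    """Concatenate per-electrode AR coefficients; trial has shape (electrodes, samples)."""
    return np.concatenate([ar_coefficients(channel, order) for channel in trial])

# Synthetic, stable AR(2) process used only to check the estimator.
rng = np.random.default_rng(0)
true_a = np.array([0.6, -0.3])
noise = rng.standard_normal(5000)
x = np.zeros(5000)
for n in range(2, len(x)):
    x[n] = true_a[0] * x[n - 1] + true_a[1] * x[n - 2] + noise[n]

estimated = ar_coefficients(x, order=2)
features = feature_vector(np.stack([x, x[::-1]]), order=6)   # two fake "electrodes"
```

A vector like `features` would then be passed, one per trial, to a linear classifier such as the SVM named in the abstract.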
TL;DR: Analysis of the characteristics of features derived from prosody, spectral envelope, and voice quality as well as their capability to discriminate emotions suggest that spectral envelope features outperform the prosodic ones.
Abstract: The definition of parameters is a crucial step in the development of a system for identifying emotions in speech. Although there is no agreement on which are the best features for this task, it is generally accepted that prosody carries most of the emotional information. Most works in the field use some kind of prosodic features, often in combination with spectral and voice quality parametrizations. Nevertheless, no systematic study has been done comparing these features. This paper presents the analysis of the characteristics of features derived from prosody, spectral envelope, and voice quality as well as their capability to discriminate emotions. In addition, early fusion and late fusion techniques for combining different information sources are evaluated. The results of this analysis are validated with experimental automatic emotion identification tests. Results suggest that spectral envelope features outperform the prosodic ones. Even when different parametrizations are combined, the late fusion of long-term spectral statistics with short-term spectral envelope parameters provides an accuracy comparable to that obtained when all parametrizations are combined.
TL;DR: Two extensions of the binaural SDW-MWF are proposed to improve binaural cue preservation; both are able to preserve binaural cues for the speech and noise sources while still achieving significant noise reduction performance.
Abstract: Binaural hearing aids use microphone signals from both left and right hearing aid to generate an output signal for each ear. The microphone signals can be processed by a procedure based on speech distortion weighted multichannel Wiener filtering (SDW-MWF) to achieve significant noise reduction in a speech + noise scenario. In binaural procedures, it is also desirable to preserve binaural cues, in particular the interaural time difference (ITD) and interaural level difference (ILD), which are used to localize sounds. It has been shown in previous work that the binaural SDW-MWF procedure only preserves these binaural cues for the desired speech source, but distorts the noise binaural cues. Two extensions of the binaural SDW-MWF have therefore been proposed to improve the binaural cue preservation, namely the MWF with partial noise estimation (MWF-eta) and MWF with interaural transfer function extension (MWF-ITF). In this paper, the binaural cue preservation of these extensions is analyzed theoretically and tested based on objective performance measures. Both extensions are able to preserve binaural cues for the speech and noise sources, while still achieving significant noise reduction performance.
TL;DR: It is shown that both the traditional confidence-based active learning and semi-supervised learning approaches can be improved by maximizing the lattice entropy reduction over the whole dataset.
TL;DR: In this article, the authors present systems, methods and non-transitory computer-readable media for performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech.
Abstract: Disclosed herein are systems, methods and non-transitory computer-readable media for performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech. The disclosure includes recognizing received speech with a collection of domain-specific speech recognizers, determining a speech recognition confidence for each of the speech recognition outputs, selecting speech recognition candidates based on a respective speech recognition confidence for each speech recognition output, and combining selected speech recognition candidates to generate text based on the combination.
TL;DR: This paper presents CHiME, a new corpus designed for noise-robust speech processing research, which includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment.
Abstract: We present a new corpus designed for noise-robust speech processing research, CHiME. Our goal was to produce material which is both natural (derived from reverberant domestic environments with many simultaneous and unpredictable sound sources) and controlled (providing an enumerated range of SNRs spanning 20 dB). The corpus includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment. These have been used to add target utterances from the Grid speech recognition corpus into the CHiME domestic setting. Data has been mixed in a manner that produces a controlled and yet natural range of SNRs over which speech separation, enhancement and recognition algorithms can be evaluated. The paper motivates the design of the corpus, and describes the collection and post-processing of the data. We also present a set of baseline recognition results. Index Terms: Data collection, Binaural, Spatialisation
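One way to obtain a controlled range of SNRs without rescaling the background is to search the background recording for a segment whose level gives the wanted SNR. The abstract does not say that the corpus was built this way, so the sketch below is only an illustration of the idea; the function names, the step size and the synthetic signals are assumptions.

```python
import numpy as np

def snr_db(target, background):
    """Ratio of target power to background power in decibels."""
    return 10.0 * np.log10(np.mean(target ** 2) / np.mean(background ** 2))

def best_offset(target, background, wanted_snr_db, step=1000):
    """Return the background offset whose segment SNR is closest to wanted_snr_db."""
    offsets = range(0, len(background) - len(target) + 1, step)
    snrs = [snr_db(target, background[o:o + len(target)]) for o in offsets]
    best = int(np.argmin(np.abs(np.array(snrs) - wanted_snr_db)))
    return offsets[best], snrs[best]

rng = np.random.default_rng(0)
target = rng.standard_normal(8000)                  # stand-in for a target utterance
# Background whose level drifts over time, so different offsets give different SNRs.
level = np.linspace(0.2, 3.0, 80000)
background = level * rng.standard_normal(80000)

offset, achieved = best_offset(target, background, wanted_snr_db=0.0)
mixture = target + background[offset:offset + len(target)]
```

Because neither signal is rescaled, the mixture keeps the natural level relationship between the utterance and that stretch of background.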
TL;DR: Although the bimodal cochlear implant group performed better than the bilateral group on most parts of the four pitch-related tests, the differences were not statistically significant and the lack of correlation between test results shows that the tasks used are not simply providing a measure of pitch ability.
Abstract: Objectives: Despite excellent performance in speech recognition in quiet, most cochlear implant users have great difficulty with speech recognition in noise, music perception, identifying tone of voice, and discriminating different talkers. This may be partly due to the pitch coding in cochlear implant speech processing. Most current speech processing strategies use only the envelope information; the temporal fine structure is discarded. One way to improve electric pitch perception is to use residual acoustic hearing via a hearing aid on the nonimplanted ear (bimodal hearing). This study aimed to test the hypothesis that bimodal users would perform better than bilateral cochlear implant users on tasks requiring good pitch perception. Design: Four pitch-related tasks were used. 1. Hearing in Noise Test (HINT) sentences spoken by a male talker with a competing female, male, or child talker. 2. Montreal Battery of Evaluation of Amusia. This is a music test with six subtests examining pitch, rhythm and timing perception, and musical memory. 3. Aprosodia Battery. This has five subtests evaluating aspects of affective prosody and recognition of sarcasm. 4. Talker identification using vowels spoken by 10 different talkers (three men, three women, two boys, and two girls). Bilateral cochlear implant users were chosen as the comparison group. Thirteen bimodal and 13 bilateral adult cochlear implant users were recruited; all had good speech perception in quiet. Results: There were no significant differences between the mean scores of the bimodal and bilateral groups on any of the tests, although the bimodal group did perform better than the bilateral group on almost all tests. Performance on the different pitch-related tasks was not correlated, meaning that if a subject performed one task well they would not necessarily perform well on another. The correlation between the bimodal users' hearing threshold levels in the aided ear and their performance on these tasks was weak. Conclusions: Although the bimodal cochlear implant group performed better than the bilateral group on most parts of the four pitch-related tests, the differences were not statistically significant. The lack of correlation between test results shows that the tasks used are not simply providing a measure of pitch ability. Even if the bimodal users have better pitch perception, the real-world tasks used are reflecting more diverse skills than pitch. This research adds to the existing speech perception, language, and localization studies that show no significant difference between bimodal and bilateral cochlear implant users.
TL;DR: In this paper, it is claimed that experimental evidence about human speech processing and the richness of memory for linguistic material supports a distributed view of language where every speaker creates an idiosyncratic perspective on the linguistic conventions of the community, and that people actually employ high-dimensional, spectro-temporal, auditory patterns to support speech production, speech perception and linguistic memory in real time.
TL;DR: This correspondence establishes a new expression for speech presence probability when an array of microphones with an arbitrary geometry is used, and shows that the proposed multichannel approach can significantly increase the detection accuracy compared with the single-channel case.
Abstract: The knowledge of the target speech presence probability in a mixture of signals captured by a speech communication system is of paramount importance in several applications including reliable noise reduction algorithms. In this correspondence, we establish a new expression for speech presence probability when an array of microphones with an arbitrary geometry is used. Our study is based on the assumption of the Gaussian statistical model for all signals and involves the noise and noisy data statistics only. In comparison with the single-channel case, the new proposed multichannel approach can significantly increase the detection accuracy. In particular, when the additive noise is spatially coherent, perfect speech presence detection is theoretically possible, while when the noise is spatially white, a coherent summation of speech components is performed to allow for enhanced speech presence probability estimation.
TL;DR: By systematically varying the lexical and/or prosodic information of speech stimuli, this study obtains results suggesting that the neural underpinnings of pitch processing expertise exercise a strong influence on propositional speech perception (sentence meaning).
Abstract: Absolute pitch (AP) has been shown to be associated with morphological changes and neurophysiological adaptations in the planum temporale, a cortical area involved in higher-order auditory and speech perception processes. The direct link between speech processing and AP has hitherto not been addressed. We provide first evidence that AP compared with relative pitch (RP) ability is associated with significantly different hemodynamic responses to complex speech sounds. By systematically varying the lexical and/or prosodic information of speech stimuli, we demonstrated consistent activation differences in AP musicians compared with RP musicians and nonmusicians. These differences relate to stronger activations in the posterior part of the middle temporal gyrus and weaker activations in the anterior mid-part of the superior temporal gyrus. Furthermore, this pattern is considerably modulated by the auditory acuity of AP. Our results suggest that the neural underpinnings of pitch processing expertise exercise a strong influence on propositional speech perception (sentence meaning).
TL;DR: This paper introduces a novel non-parametric, exemplar-based method for reconstructing clean speech from noisy observations, based on techniques from the field of Compressive Sensing, which can impute missing features using larger time windows such as entire words.
Abstract: An effective way to increase the noise robustness of automatic speech recognition is to label noisy speech features as either reliable or unreliable (missing), and to replace (impute) the missing ones by clean speech estimates. Conventional imputation techniques employ parametric models and impute the missing features on a frame-by-frame basis. At low signal-to-noise ratios (SNRs), these techniques fail, because too many time frames may contain few, if any, reliable features. In this paper, we introduce a novel non-parametric, exemplar-based method for reconstructing clean speech from noisy observations, based on techniques from the field of Compressive Sensing. The method, dubbed sparse imputation, can impute missing features using larger time windows such as entire words. Using an overcomplete dictionary of clean speech exemplars, the method finds the sparsest combination of exemplars that jointly approximate the reliable features of a noisy utterance. That linear combination of clean speech exemplars is used to replace the missing features. Recognition experiments on noisy isolated digits show that sparse imputation outperforms conventional imputation techniques at SNR = -5 dB when using an ideal `oracle' mask. With error-prone estimated masks sparse imputation performs slightly worse than the best conventional technique.
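The imputation step described above can be sketched as follows: fit a sparse combination of clean exemplars using only the reliable features, then use the same combination to fill in the unreliable ones. The sketch uses greedy orthogonal matching pursuit as a stand-in for the paper's sparse solver, which the abstract does not specify, and it runs on a random dictionary with an oracle mask; all sizes and names are assumptions.

```python
import numpy as np

def sparse_impute(noisy, reliable, dictionary, n_atoms):
    """Fit a sparse exemplar combination to the reliable features (greedy OMP),
    then replace the unreliable features with the resulting clean estimate."""
    D_rel = dictionary[reliable]                  # exemplars restricted to reliable rows
    norms = np.linalg.norm(D_rel, axis=0)
    residual = noisy[reliable].copy()
    support = []
    for _ in range(n_atoms):
        scores = np.abs(D_rel.T @ residual) / norms
        if support:
            scores[support] = 0.0                 # never pick an exemplar twice
        support.append(int(np.argmax(scores)))
        coeffs, *_ = np.linalg.lstsq(D_rel[:, support], noisy[reliable], rcond=None)
        residual = noisy[reliable] - D_rel[:, support] @ coeffs
    estimate = dictionary[:, support] @ coeffs    # clean estimate for all features
    return np.where(reliable, noisy, estimate)

rng = np.random.default_rng(0)
dictionary = rng.standard_normal((200, 40))       # 40 exemplars, 200 features each
clean = 1.0 * dictionary[:, 3] + 0.8 * dictionary[:, 17]
reliable = np.zeros(200, dtype=bool)
reliable[rng.choice(200, size=120, replace=False)] = True    # oracle mask
noisy = np.where(reliable, clean, clean + 5.0 * rng.standard_normal(200))

imputed = sparse_impute(noisy, reliable, dictionary, n_atoms=2)
```

In this idealized setting the clean vector is an exact combination of two exemplars, so the reliable features are enough to recover the missing ones; real speech features would only be approximated.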
TL;DR: A speech data retrieving and presenting device, applied with an electronic device through a network, includes a data receiving unit, a processing unit and a speech presenting unit; it can assist a user to obtain network information and, because it can be operated independently by a simple motion, offers the user a more flexible application.
Abstract: A speech data retrieving and presenting device applied with an electronic device through a network includes a data receiving unit, a processing unit and a speech presenting unit. The data receiving unit connected to the network receives data of the electronic device through the network. The processing unit coupled to the data receiving unit receives speech data and retrieves a speech presenting signal from the speech data. The speech presenting unit coupled to the processing unit receives the speech presenting signal and outputs a speech according to the speech data. This device can assist a user to obtain network information, and provide the user a more flexible application according to the property that the device can be operated independently by a simple motion.
TL;DR: A novel set of features based on cepstrum analysis of pitch and intensity contours is introduced and the effects of different contexts on two different databases are systematically analyzed.
Abstract: Automated analysis of human affective behavior has attracted increasing attention in recent years. With the research shift toward spontaneous behavior, many challenges have come to the surface, ranging from database collection strategies to the use of new feature sets (e.g., lexical cues apart from prosodic features). Use of contextual information, however, is rarely addressed in the field of affect expression recognition, yet it is evident that affect recognition by humans is largely influenced by the context information. Our contribution in this paper is threefold. First, we introduce a novel set of features based on cepstrum analysis of pitch and intensity contours. We evaluate the usefulness of these features on two different databases: the Berlin Database of emotional speech (EMO-DB) and a locally collected audiovisual database in car settings (CVRRCar-AVDB). The overall recognition accuracy achieved for seven emotions in the EMO-DB database is over 84%, and over 87% for three emotion classes in CVRRCar-AVDB. This is based on tenfold stratified cross validation. Second, we introduce the collection of a new audiovisual database in an automobile setting (CVRRCar-AVDB). In this current study, we only use the audio channel of the database. Third, we systematically analyze the effects of different contexts on two different databases. We present context analysis of subject and text based on speaker/text-dependent/-independent analysis on EMO-DB. Furthermore, we perform context analysis based on gender information on EMO-DB and CVRRCar-AVDB. The results based on these analyses are promising.
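As a rough illustration of cepstrum analysis applied to a prosodic contour, the sketch below computes the real cepstrum (inverse DFT of the log magnitude spectrum) of a synthetic pitch contour and keeps the first few coefficients. The exact feature definition used in the paper is not given in the abstract, so the number of coefficients, the small floor added before the logarithm and the synthetic contour are assumptions.

```python
import numpy as np

def contour_cepstrum(contour, n_coeffs=8):
    """Real cepstrum of a 1-D contour: inverse DFT of the log magnitude spectrum."""
    spectrum = np.abs(np.fft.fft(contour))
    cepstrum = np.real(np.fft.ifft(np.log(spectrum + 1e-10)))
    return cepstrum[:n_coeffs]

# Synthetic pitch contour in Hz: 200 frames with a slow modulation around 120 Hz.
frames = np.arange(200)
pitch = 120.0 + 20.0 * np.sin(2.0 * np.pi * 3.0 * frames / 200.0)

features = contour_cepstrum(pitch)
log_spectrum_mean = np.mean(np.log(np.abs(np.fft.fft(pitch)) + 1e-10))
```

The same function could be applied to an intensity contour, and the two coefficient vectors concatenated into one feature vector per utterance.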