TL;DR: This book presents a meta-modelling framework for speech recognition that automates the labor-intensive, time-consuming, and therefore expensive process of manually modeling speech.
Abstract: 1. Fundamentals of Speech Recognition. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization. 3. Signal Processing and Analysis Methods for Speech Recognition. 4. Pattern Comparison Techniques. 5. Speech Recognition System Design and Implementation Issues. 6. Theory and Implementation of Hidden Markov Models. 7. Speech Recognition Based on Connected Word Models. 8. Large Vocabulary Continuous Speech Recognition. 9. Task-Oriented Applications of Automatic Speech Recognition.
TL;DR: The preface to the IEEE Edition explains the background to speech production, coding, and quality assessment and introduces the Hidden Markov Model, the Artificial Neural Network, and Speech Enhancement.
Abstract: Preface to the IEEE Edition. Preface. Acronyms and Abbreviations. SIGNAL PROCESSING BACKGROUND. Propaedeutic. SPEECH PRODUCTION AND MODELLING. Fundamentals of Speech Science. Modeling Speech Production. ANALYSIS TECHNIQUES. Short-Term Processing of Speech. Linear Prediction Analysis. Cepstral Analysis. CODING, ENHANCEMENT AND QUALITY ASSESSMENT. Speech Coding and Synthesis. Speech Enhancement. Speech Quality Assessment. RECOGNITION. The Speech Recognition Problem. Dynamic Time Warping. The Hidden Markov Model. Language Modeling. The Artificial Neural Network. Index.
TL;DR: NOISEX-92 specifies a carefully controlled experiment on artificially noisy speech data, examining performance for a limited digit recognition task but with a relatively wide range of noises and signal-to-noise ratios.
TL;DR: A tutorial on signal processing in state-of-the-art speech recognition systems is presented, reviewing those techniques most commonly used, and three important trends that have developed in the last five years in speech recognition are examined.
Abstract: A tutorial on signal processing in state-of-the-art speech recognition systems is presented, reviewing those techniques most commonly used. The four basic operations of signal modeling, i.e. spectral shaping, spectral analysis, parametric transformation, and statistical modeling, are discussed. Three important trends that have developed in the last five years in speech recognition are examined. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information, have become common. Second, similarity transform techniques, often used to normalize and decorrelate parameters in some computationally inexpensive way, have become popular. Third, the signal parameter estimation problem has merged with the speech recognition process so that more sophisticated statistical models of the signal's spectrum can be estimated in a closed-loop manner. The signal processing components of these algorithms are reviewed.
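Illustrative sketch (not taken from the tutorial): one common way to build a heterogeneous parameter set of the kind described above is to append a regression-based time derivative ("delta") to each static feature vector. The function names `delta` and `add_deltas`, the two-frame window, and the edge handling (repeating the first and last frames) are assumptions made for this example.

```python
def delta(frames, window=2):
    """Regression-based time derivative of a feature sequence.

    frames: list of equal-length feature vectors, one per analysis frame.
    window: frames used on each side of the current frame (assumed value).
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        vec = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                # Clamp indices so the first and last frames are repeated at the edges.
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            vec.append(acc / denom)
        out.append(vec)
    return out


def add_deltas(frames, window=2):
    """Concatenate static features with their deltas, frame by frame."""
    return [static + dyn for static, dyn in zip(frames, delta(frames, window))]
```

For a feature that rises by one unit per frame, the interior deltas come out as 1.0, which is the slope the regression is meant to recover.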
TL;DR: The present data demonstrate that linguistic errors of different categories evoke different ERP patterns, and indicate that, by using connected speech as input, different aspects of language comprehension processes can be described not only with respect to their temporal structure but eventually also with respect to the brain systems that may subserve these processes.
TL;DR: Speech Reference EPFL-CONF-82487 describes the “politics of language” in the developing world and some of the challenges faced by speech interpreters in the rapidly changing environment.
Abstract: Keywords: speech. Reference: EPFL-CONF-82487. Record created on 2006-03-10, modified on 2017-05-10.
TL;DR: The resulting WSOLA (waveform-similarity-based synchronized overlap-add) algorithm produces high-quality speech output, is algorithmically and computationally efficient and robust, and allows for online processing with arbitrary time-scaling factors.
Abstract: A concept of waveform similarity for tackling the problem of time-scale modification of speech is proposed. It is worked out in the context of short-time Fourier transform representations. The resulting WSOLA (waveform-similarity-based synchronized overlap-add) algorithm produces high-quality speech output, is algorithmically and computationally efficient and robust, and allows for online processing with arbitrary time-scaling factors that may be specified in a time-varying fashion and can be chosen over a wide continuous range of values.
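Illustrative sketch (a simplified reading of the idea, not the authors' implementation): the overlap-add step with a waveform-similarity search can be written in a few lines. The assumptions here are a constant scaling factor `alpha`, a Hann window with 50% overlap, unnormalized cross-correlation as the similarity measure, and the hypothetical parameter names `frame_len` and `tolerance`. `frame_len` is assumed to be even, and `alpha` is assumed not to be extremely small.

```python
import math


def hann(n):
    """Periodic Hann window; copies shifted by n // 2 sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def wsola(x, alpha, frame_len=256, tolerance=64):
    """Time-scale x by alpha (alpha > 1 lengthens, alpha < 1 shortens)."""
    hop = frame_len // 2
    win = hann(frame_len)
    # Zero-pad so every candidate segment can be sliced without bounds checks.
    xp = [0.0] * tolerance + list(x) + [0.0] * (2 * frame_len + tolerance)
    out_len = int(round(len(x) * alpha))
    n_frames = out_len // hop + 1
    y = [0.0] * (n_frames * hop + frame_len)
    prev = tolerance  # start (in xp) of the segment copied for the previous frame
    for m in range(n_frames):
        # Where the input would be read if no similarity search were done.
        nominal = tolerance + int(round(m * hop / alpha))
        best = nominal
        if m > 0:
            # Natural continuation of the previously copied segment.
            target = xp[prev + hop: prev + hop + frame_len]
            best_score = -math.inf
            for d in range(-tolerance, tolerance + 1):
                cand = xp[nominal + d: nominal + d + frame_len]
                score = sum(a * b for a, b in zip(target, cand))
                if score > best_score:
                    best, best_score = nominal + d, score
        # Overlap-add the windowed best-matching segment at a fixed output hop.
        for i in range(frame_len):
            y[m * hop + i] += win[i] * xp[best + i]
        prev = best
    return y[:out_len]
```

Because the search keeps successive segments in phase, a steady periodic input keeps its period while its duration changes. The pure-Python loops are meant for readability rather than speed.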
TL;DR: The overall conclusion is that age-related factors other than peripheral hearing loss contribute to diminished speech recognition performance of elderly listeners.
Abstract: This study investigated factors that contribute to deficits of elderly listeners in recognizing speech that is degraded by temporal waveform distortion. Young and elderly listeners with normal hear...
TL;DR: A new method to calculate a spectral harmonics-to-noise ratio (HNR) in speech signals is presented; it discriminates between harmonic and noise energy in the magnitude spectrum.
Abstract: A new method to calculate a spectral harmonics-to-noise ratio (HNR) in speech signals is presented. The method involves discrimination between harmonic and noise energy in the magnitude spectrum by...
TL;DR: Experiments with a recognizer trained on clean speech and test data degraded by both convolutional and additive noise show that doing RASTA processing in the new domain yields results comparable with those obtained by training the recognizer on known noise.
Abstract: RASTA (relative spectral) processing is studied in a spectral domain which is linear-like for small spectral values and logarithmic-like for large spectral values. Experiments with a recognizer trained on clean speech and test data degraded by both convolutional and additive noise show that doing RASTA processing in the new domain yields results comparable with those obtained by training the recognizer on known noise.
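Illustrative sketch (an assumption, since the abstract only describes the qualitative behaviour): one mapping that is linear-like for small values and logarithmic-like for large values is y = ln(1 + J·x). The constant `j` and the function names are hypothetical. The band-pass RASTA filtering that would be applied to `y` over time is not shown.

```python
import math


def lin_log(x, j):
    """Map a non-negative spectral value x to the lin-log domain.

    For j * x << 1 the result is close to j * x (linear-like);
    for j * x >> 1 it is close to ln(j) + ln(x) (logarithmic-like).
    """
    return math.log1p(j * x)


def lin_log_inverse(y, j):
    """Exact inverse of lin_log for the same constant j."""
    return math.expm1(y) / j
```

The constant `j` sets where the transition between the two regimes falls relative to the spectral values, so in practice it would have to be chosen with the noise level in mind.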
TL;DR: A series of recent experiments on corpora of recorded (read) speech and spontaneous (elicited) speech suggest that it is indeed possible to model human accent strategies with fair success for unrestricted text—with only the tools for automatic text analysis currently available.
TL;DR: A speech recognition interface system capable of handling a plurality of application programs simultaneously, and realizing convenient speech input and output modes which are suitable for the applications in the window systems and the speech mail systems, is presented in this article.
Abstract: A speech recognition interface system capable of handling a plurality of application programs simultaneously, and realizing convenient speech input and output modes which are suitable for the applications in the window systems and the speech mail systems. The system includes a speech recognition unit for carrying out a speech recognition processing for a speech input made by a user to obtain a recognition result; a program management table for managing program management data indicating a speech recognition interface function required by each application program; and a message processing unit for exchanging messages with the plurality of application programs in order to specify an appropriate recognition vocabulary to be used in the speech recognition processing of the speech input to the speech recognition unit, and to transmit the recognition result for the speech input obtained by the speech recognition unit by using the appropriate recognition vocabulary to appropriate ones of the plurality of application programs, according to the program management data managed by the program management table.
TL;DR: In this article, a wideband speech signal (8 kHz) of high quality is reconstructed from a narrowband speech signal (300 Hz to 3.4 kHz), which is LPC-analyzed to obtain spectrum information parameters.
Abstract: A wideband speech signal (8 kHz, for example) of high quality is reconstructed from a narrowband speech signal (300 Hz to 3.4 kHz). The input narrowband speech signal is LPC-analyzed to obtain spectrum information parameters, and the parameters are vector-quantized using a narrowband speech signal codebook. For each code number of the narrowband speech signal codebook, the wideband speech waveform corresponding to the codevector concerned is extracted by one pitch for voiced speech and by one frame for unvoiced speech and prestored in a representative waveform codebook. Representative waveform segments corresponding to the respective output codevector numbers of the quantizer are extracted from the representative waveform codebook. Voiced speech is synthesized by pitch-synchronous overlapping of the extracted representative waveform segments and unvoiced speech is synthesized by randomly using waveforms of one frame length. By this, a wideband speech signal is produced. Then, frequency components below 300 Hz and above 3.4 kHz are extracted from the wideband speech signal and are added to an up-sampled version of the input narrowband speech signal to thereby reconstruct the wideband speech signal.
TL;DR: In parallel model combination (PMC), the parameters of corresponding pairs of speech and noise states are combined to yield a set of compensated parameters. The technique improves on earlier cepstral mean compensation methods in that it also adapts the variances and, as a result, can deal with much lower SNRs.
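Illustrative sketch (a simplified version under stated assumptions, not the paper's full procedure): the combination of one speech state with one noise state can be shown with the log-normal approximation. This example assumes diagonal Gaussians that are already in the log-spectral domain, so the cepstral-to-log-spectral transform is omitted, and it assumes that speech and noise add in the linear spectral domain. The function names and the `gain` parameter are hypothetical.

```python
import math


def log_to_linear(mu, var):
    """Moments of exp(X) for X ~ N(mu, var), per dimension."""
    lin_mu = [math.exp(m + 0.5 * v) for m, v in zip(mu, var)]
    lin_var = [lm * lm * (math.exp(v) - 1.0) for lm, v in zip(lin_mu, var)]
    return lin_mu, lin_var


def linear_to_log(lin_mu, lin_var):
    """Inverse of log_to_linear under the log-normal assumption."""
    var = [math.log(v / (m * m) + 1.0) for m, v in zip(lin_mu, lin_var)]
    mu = [math.log(m) - 0.5 * v for m, v in zip(lin_mu, var)]
    return mu, var


def pmc_combine(speech_mu, speech_var, noise_mu, noise_var, gain=1.0):
    """Combine one speech state and one noise state into a compensated state."""
    s_mu, s_var = log_to_linear(speech_mu, speech_var)
    n_mu, n_var = log_to_linear(noise_mu, noise_var)
    # Independent speech and noise add in the linear spectral domain.
    c_mu = [gain * s + n for s, n in zip(s_mu, n_mu)]
    c_var = [gain * gain * s + n for s, n in zip(s_var, n_var)]
    return linear_to_log(c_mu, c_var)
```

Note that both the mean and the variance of the compensated state change, which is the property the summary contrasts with mean-only compensation.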
TL;DR: In this article, the authors present a method of, and apparatus for, operating an automatic message recognition system, in which the following steps are executed: a user's speech is converted to a first signal; the user's handwriting is converted to a second signal; and the first signal and the second signal are processed to decode a consistent message, conveyed separately by the first signal and by the second signal.
Abstract: A method of, and apparatus for, operating an automatic message recognition system. In accordance with the method the following steps are executed: a user's speech is converted to a first signal; a user's handwriting is converted to a second signal; and the first signal and the second signal are processed to decode a consistent message, conveyed separately by the first signal and by the second signal, or conveyed jointly by the first signal and the second signal. The step of processing includes the steps of converting the first signal into a plurality of first multi-dimensional vectors and converting the second signal into a plurality of second multi-dimensional vectors. For a system employing a combined use of speech and handwriting the step of processing includes a further step of combining individual ones of the plurality of first multi-dimensional vectors and individual ones of the plurality of second multi-dimensional vectors to form a plurality of third multi-dimensional vectors. The multi-dimensional vectors are employed to train a single set of word models, for joint use of speech and handwriting, or two sets of word models, for sequentially employed or merged speech and handwriting.
TL;DR: VoiceNotes explores the problem of capturing and retrieving spontaneous ideas, the use of speech as data, and the use of speech input and output in the user interface for a hand-held computer without a visual display.
Abstract: VoiceNotes is an application for a voice-controlled hand-held computer that allows the creation, management, and retrieval of user-authored voice notes—small segments of digitized speech containing thoughts, ideas, reminders, or things to do. Iterative design and user testing helped to refine the initial user interface design. VoiceNotes explores the problem of capturing and retrieving spontaneous ideas, the use of speech as data, and the use of speech input and output in the user interface for a hand-held computer without a visual display. In addition, VoiceNotes serves as a step toward new uses of voice technology and interfaces for future portable devices.
TL;DR: A text-to-speech synthesis system that synthesizes speech from unrestricted text is discussed, and the text analysis system, which includes text preprocessing, phrasing and intonation, and letter-to-phoneme conversion, is described.
Abstract: A text-to-speech synthesis system that synthesizes speech from unrestricted text is discussed. The text analysis system, which includes text preprocessing, phrasing and intonation, and letter-to-phoneme conversion, is described. The analyzed text is represented by phonetic characters, stress values, minor- and major-phrase markers, and intonational descriptors. The synthesizer uses this information to compute a speech signal in several stages. The duration of the different speech events is computed, and the intonational descriptors are converted to a fundamental frequency contour. Loudness control is also generated. After these prosodic parameters have been computed, the synthesis parameters that describe the different sounds or phonemes are generated. These parameters are converted to speech by a waveform synthesizer.
TL;DR: The coding method is easily combined with existing LP-based speech coders, such as CELP, for unvoiced signals and excellent voiced speech quality is obtained at rates between 3.0 and 4.0 kb/s.
Abstract: Voiced speech is interpreted as a concatenation of slowly evolving pitch-cycle waveforms. This signal can be reconstructed by interpolation from a downsampled sequence of pitch-cycle waveforms with a rate of one prototype waveform per 20-30 ms interval. The prototype waveform is described by a set of linear-prediction (LP) filter coefficients describing the formant structure and a prototype excitation waveform, quantized with analysis-by-synthesis procedures. The speech signal is reconstructed by filtering an excitation signal consisting of the concatenation of (infinitesimal) sections of the instantaneous excitation waveforms. To obtain the correct level of periodicity, the short-term and the long-term correlations between the instantaneous excitation waveforms can be controlled explicitly. Thus, distortions such as noise, reverberation, and buzziness can be prevented. The coding method is easily combined with existing LP-based speech coders, such as CELP, for unvoiced signals. Excellent voiced speech quality is obtained at rates between 3.0 and 4.0 kb/s.
TL;DR: In drawing parallels between speech and music, the chapter focuses on two principal issues: the input provided by caregivers for their infants and the processing of such input by infant listeners.
Abstract: This chapter focuses on potential similarities between speech and music from the perspective of infant listeners. The stimuli of concern are sound sequences rather than single sounds, despite the predominant research focusing on the latter class of stimuli. The exclusion of single sounds can be justified on a number of grounds. First, several comprehensive reviews of infants' ability to perceive single speech and non-speech sounds are available. Second, evidence indicates that global patterns of speech are more salient in the pre-linguistic period than are individual speech segments. In the non-speech domain, evidence also indicates that infants proceed from global processing of auditory patterns to local processing of pattern details. In drawing parallels between speech and music, the chapter focuses on two principal issues: the input provided by caregivers for their infants and the processing of such input by infant listeners. Much of the work to be reported, particularly in the musical domain, is relatively recent. As a result, the exposition is tentative rather than definitive, its purpose being to suggest new avenues for future research and thinking.
TL;DR: This is an introduction and brief overview of the techniques and algorithms used in the design of speech systems and the roles of other disciplines such as electronics, computing science, linguistics and physiology are described.
Abstract: This is an introduction and brief overview of the techniques and algorithms used in the design of speech systems. The author focuses on signal processing, and briefly describes the roles of other disciplines such as electronics, computing science, linguistics and physiology.
TL;DR: The analog signals to the right and left channels are altered according to position data stored with the text string so that the synthesized voice appears to originate at the apparent spatial position when the analog signals are sent to a speaker system.
Abstract: Method, product and system alters audio data for a synthesized voice so that when it is produced on a speaker system, it appears to emanate from a spatial position. First, the voice is synthesized into a speech waveform from a set of stored data representative of a text string using standard techniques. The speech waveform is converted into analog signals for a right and left channel. According to the invention, the analog signals to the right and left channels are altered according to position data stored with the text string so that the synthesized voice appears to originate at the apparent spatial position when the analog signals are sent to a speaker system.
TL;DR: In this paper, a speech coding system employs measurements of robust features of speech frames whose distribution is not strongly affected by noise/levels to make voicing decisions for input speech occurring in a noisy environment.
Abstract: A speech coding system employs measurements of robust features of speech frames whose distribution is not strongly affected by noise/levels to make voicing decisions for input speech occurring in a noisy environment. Linear programming analysis of the robust features and respective weights is used to determine an optimum linear combination of these features. The input speech vectors are matched to a vocabulary of codewords in order to select the corresponding, optimally matching codeword. Adaptive vector quantization is used in which a vocabulary of words obtained in a quiet environment is updated based upon a noise estimate of a noisy environment in which the input speech occurs, and the "noisy" vocabulary is then searched for the best match with an input speech vector. The corresponding clean codeword index is then selected for transmission and for synthesis at the receiver end. The results are better spectral reproduction and significant intelligibility enhancement over prior coding approaches. Robust features found to allow robust voicing decisions include: low-band energy; zero-crossing counts adapted for noise level; AMDF ratio (speech periodicity) measure; low-pass filtered backward correlation; low-pass filtered forward correlation; inverse-filtered backward correlation; and inverse-filtered pitch prediction gain measure.
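Illustrative sketch (not the patent's definitions): two of the listed features, a zero-crossing count and an AMDF-based periodicity measure, can be computed as below. The patent abstract does not define the AMDF ratio, so the max-to-min ratio used here, the lag range, and all function names are assumptions. The noise adaptation and the linear-programming weighting are not shown.

```python
import math
import random


def amdf(frame, lag):
    """Average magnitude difference between the frame and its lagged copy."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def amdf_ratio(frame, min_lag, max_lag):
    """Max-to-min AMDF over a lag range; large values suggest periodicity."""
    values = [amdf(frame, k) for k in range(min_lag, max_lag + 1)]
    return max(values) / (min(values) + 1e-12)  # small constant avoids division by zero


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


# Synthetic stand-ins for a voiced frame (periodic) and an unvoiced frame (noise).
rng = random.Random(0)
voiced = [math.sin(2.0 * math.pi * i / 40) for i in range(240)]
unvoiced = [rng.gauss(0.0, 1.0) for _ in range(240)]
```

On these synthetic frames the periodic signal gives a very large AMDF ratio and few zero crossings, while the noise frame gives a ratio near one and many crossings.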
TL;DR: In this method, speech samples are selectively weighted based on how well they match the speech production model, and the estimates of the LPC coefficients obtained are more accurate and less sensitive to the values of the fundamental frequency than conventional LPC.
TL;DR: A method and system for reducing perplexity in a speech recognition system based upon determined geographic location.
Abstract: A method and system for reducing perplexity in a speech recognition system based upon determined geographic location. In a mobile speech recognition system which processes input frames of speech against stored templates representing speech, a core library of speech templates is created and stored representing a basic vocabulary of speech. Multiple location-specific libraries of speech templates are also created and stored, each library containing speech templates representing a specialized vocabulary for a specific geographic location. The geographic location of the mobile speech recognition system is then periodically determined utilizing a cellular telephone system, a geopositioning satellite system or other similar systems and a particular one of the location-specific libraries of speech templates is identified for the current location of the system. Input frames of speech are then processed against the combination of the core library and the particular location-specific library to greatly enhance the accuracy and efficiency of speech recognition by the system. Each location-specific library preferably includes speech templates representative of location place names, proper names, and business establishments within a specific geographic location.
TL;DR: This decomposition provides a method of parameter simplification which appears to be useful for detecting fundamental frequencies, and characterizing formants.
Abstract: An algorithm based on the adapted-window Malvar transform is used to decompose digitized speech signals into a local time-frequency representation. The authors present some applications and experimental results for signal compression and automatic voiced-unvoiced segmentation. This decomposition provides a method of parameter simplification which appears to be useful for detecting fundamental frequencies, and characterizing formants.
TL;DR: This paper addresses the question of what perceptual quality can be achieved for unvoiced speech by a linear model with white noise excitation and demonstrates that this linear model results in unvoiced speech of high perceptual quality.
Abstract: Recent interest in nonlinear modeling of speech has brought up the need to re-assess the performance limitations of linear speech models. While nonlinearity is essential in the production mechanism of speech, it need not be reflected in a speech-signal model. This paper addresses the question of what perceptual quality can be achieved for unvoiced speech by a linear model with white noise excitation. Formal MOS test results demonstrate that this linear model results in unvoiced speech of high perceptual quality.
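Illustrative sketch (a generic textbook form, not the paper's experimental setup): a linear model with white noise excitation can be written as an all-pole filter driven by Gaussian noise. The function name, the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, and the `gain` and `seed` parameters are assumptions made for this example.

```python
import random


def synthesize_unvoiced(lpc, n_samples, gain=1.0, seed=0):
    """Drive the all-pole filter 1 / A(z) with white Gaussian noise.

    lpc: coefficients a_1..a_p of A(z) = 1 + sum_k a_k z^-k (assumed convention).
    Returns the synthesized signal and the excitation that produced it.
    """
    rng = random.Random(seed)
    excitation = [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    speech = []
    for n in range(n_samples):
        s = excitation[n]
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                # Feedback from previously synthesized samples.
                s -= a * speech[n - k]
        speech.append(s)
    return speech, excitation
```

With an empty coefficient list the output is the noise itself. A stable set of coefficients shapes the flat noise spectrum into the spectral envelope of the modelled sound.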
TL;DR: Further improvements in speech perception for cochlear implant patients in quiet and in noise should be possible with speech processing strategies using binaural implants; for this reason, a series of initial psychophysical and speech perception studies on the authors' first binaural cochlear implant patient is presented.
Abstract: Further improvements in speech perception for cochlear implant patients in quiet and in noise should be possible with speech processing strategies using binaural implants. For this reason, presented here is a series of initial psychophysical and speech perception studies on the authors' first binaural cochlear implant patient. For an approximate matching of the places of stimulation on the two sides, the patient usually reported a single percept when the two sides were simultaneously stimulated. Lateralization was strongly influenced by amplitude differences between the electrical stimuli on the two sides, but only weakly by interaural time delays. Speech testing, comparing monaural with binaural electrical stimulation, showed a binaural advantage particularly in noise.
TL;DR: Advanced Time-Frequency Representations for Speech Processing. Auditory-Based Wavelet Representation. Distortion Maps for Speech Analysis. Phase Representations of Acoustic Speech Waveforms. Speech Analysis Using Higher Order Statistics. Group Delay Processing of Speech Signals. Contributors.
Abstract: Advanced Time-Frequency Representations for Speech Processing. Auditory-Based Wavelet Representation. Distortion Maps for Speech Analysis. Phase Representations of Acoustic Speech Waveforms. Speech Analysis Using Higher Order Statistics. Group Delay Processing of Speech Signals. Contributors. The Sheffield Signals. Index.
TL;DR: In this paper, an adaptive filter such as a finite impulse response (FIR) filter receives a digital accelerometer input signal, adjusts filter coefficients according to an estimation error signal, and provides an enhanced speech signal as an output.
Abstract: A speech processing system (30) operates in a noisy environment (20) by performing adaptive prediction between inputs from two sensors positioned to transduce speech from a speaker, such as an accelerometer and a microphone. An adaptive filter (37) such as a finite impulse response (FIR) filter receives a digital accelerometer input signal, adjusts filter coefficients according to an estimation error signal, and provides an enhanced speech signal as an output. The estimation error signal is a difference between a digital microphone input signal and the enhanced speech signal. In one embodiment, the adaptive filter (37) selects a maximum one of a first predicted speech signal based on a relatively-large smoothing parameter and a second predicted speech signal based on a relatively-small smoothing parameter, with which to normalize a predicted signal power. The predicted signal power is then used to adapt the filter coefficients.
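Illustrative sketch (one possible reading of the abstract, not the patented design): the adaptive prediction loop can be approximated with a normalized LMS update. The FIR filter takes the accelerometer samples as input, its output is the enhanced signal, and the error is the microphone sample minus that output. Interpreting the two smoothing parameters as a slow and a fast running estimate of input power, and normalizing by the larger one, is an assumption. So are the names `enhance`, `taps`, `step`, `slow`, and `fast`, and the synthetic test signals.

```python
import random


def enhance(accel, mic, taps=8, step=0.1, slow=0.99, fast=0.9, eps=1e-8):
    """Predict the microphone signal from the accelerometer signal with NLMS."""
    w = [0.0] * taps          # FIR coefficients
    buf = [0.0] * taps        # most recent accelerometer samples, newest first
    p_slow = p_fast = 0.0     # two smoothed power estimates
    out = []
    for a, d in zip(accel, mic):
        buf = [a] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))   # enhanced speech sample
        e = d - y                                    # estimation error
        p_slow = slow * p_slow + (1.0 - slow) * a * a
        p_fast = fast * p_fast + (1.0 - fast) * a * a
        power = max(p_slow, p_fast)                  # normalize by the larger estimate
        mu = step / (taps * power + eps)
        w = [wi + mu * e * xi for wi, xi in zip(w, buf)]
        out.append(y)
    return out, w


# Synthetic check: the "microphone" is a known two-tap filtering of the input.
rng = random.Random(1)
accel = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
mic = [0.5 * accel[n] + (0.25 * accel[n - 1] if n > 0 else 0.0)
       for n in range(len(accel))]
enhanced, weights = enhance(accel, mic)
```

Using the larger of the two power estimates keeps the step size small right after a sudden rise in input level, which is one plausible motivation for tracking both.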