TL;DR: This work covers the principal characteristics of speech, speech production models, speech analysis and analysis-synthesis systems, linear predictive coding (LPC) analysis, speech coding, speech synthesis, speech recognition, and future directions of speech processing.
Abstract: Principal characteristics of speech; speech production models; speech analysis and analysis-synthesis systems; linear predictive coding (LPC) analysis; speech coding; speech synthesis; speech recognition; future directions of speech processing. Appendices: convolution and z-transform; vector quantization algorithm; neural nets.
TL;DR: It is demonstrated that neural networks are able to extract speech information from the visual images and that this information can be used to improve automatic vowel recognition.
Abstract: Results from a series of experiments that use neural networks to process the visual speech signals of a male talker are presented. In these preliminary experiments, the results are limited to static images of vowels. It is demonstrated that these networks are able to extract speech information from the visual images and that this information can be used to improve automatic vowel recognition. The structure of speech and its corresponding acoustic and visual signals are reviewed. The specific data that was used in the experiments along with the network architectures and algorithms are described. The results of integrating the visual and auditory signals for vowel recognition in the presence of acoustic noise are presented.
TL;DR: In this article, a description of the voice activity detector (VAD) standardized by CEPT for use in the Pan-European digital cellular mobile telephone service is given, and performance tests carried out to validate the design are described.
Abstract: A description is given of the voice activity detector (VAD) standardized by CEPT for use in the Pan-European digital cellular mobile telephone service. The speech-coding algorithm chosen is a 13-kb/s speech coder, using a technique in which speech is produced at the decoder by passing a substitute for the residual through long-term and short-term predictor filters. The difficulties of detecting speech in a noisy environment are discussed, and the performance tests carried out to validate the design are described. The tests show that clipping levels are very low but that low levels of speech activity are recorded in conversations. The VAD has low complexity (because it uses the results of analysis performed in the speech coder) and is failsafe in difficult conditions.
TL;DR: A method of inputting Chinese characters into a computer directly from Mandarin speech, which recognizes a series of monosyllables by separately recognizing syllables and Mandarin tones and assembling the recognized parts to recognize the monosyllable, using hidden Markov models.
Abstract: A method is described for inputting Chinese characters into a computer directly from Mandarin speech. It recognizes a series of monosyllables by separately recognizing syllables and Mandarin tones and assembling the recognized parts to recognize the monosyllable, using hidden Markov models. The recognized monosyllable is used by a Markov Chinese language model in a linguistic decoder section to determine the corresponding Chinese character. A Mandarin dictation machine that uses this method is also described; it uses a speech input device to receive the Mandarin speech and digitize it so that a personal computer can further process that information. A pitch frequency detector, a voice signal pre-processing unit, a hidden Markov model processor, and a training facility are all attached to the personal computer to perform their associated functions in the method.
TL;DR: The authors use the diagnostic acceptability measure (DAM) to evaluate the speech quality of the latest 2400-b/s linear-predictive coder (LPC) with a spectral-subtraction noise suppressor at the front end.
Abstract: Numerous noise-suppression techniques have been developed for operating at the front end of low-bit-rate digital voice terminals. Some of these techniques have been evaluated by standardized intelligibility tests such as the diagnostic rhyme test (DRT). It is well known that the use of a noise suppressor seldom improves the DRT score even though listeners have had the impression that speech quality was enhanced. Unfortunately, noise suppressors have only occasionally been evaluated by standardized quality tests. The authors supplement quality test data for reference purposes. They use the diagnostic acceptability measure (DAM) to evaluate speech quality of the latest 2400-b/s linear-predictive coder (LPC) with a noise suppressor at the front end. They used a spectral subtraction technique for noise suppression. Ten different sets of noisy speech recorded at actual military platforms (such as a helicopter, tank, turboprop, helicopter carrier, or jeep) were input sources. The magnitude of the DAM improvement is substantial: as much as six points on the average, which is large enough to upgrade speech quality somewhat.
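The spectral subtraction step named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' suppressor: the function names `estimate_noise` and `spectral_subtract`, the choice of magnitude-domain subtraction, and the `floor` parameter are all assumptions made for illustration. The sketch averages the magnitude spectrum over frames assumed to contain only noise, subtracts that estimate from each noisy frame, and clamps the result to a small fraction of the input so that no bin goes negative.

```python
def estimate_noise(noise_frames):
    """Average magnitude per frequency bin over frames assumed to hold noise only."""
    n_frames = len(noise_frames)
    n_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / n_frames
            for k in range(n_bins)]


def spectral_subtract(noisy_mag, noise_mag, floor=0.01):
    """Subtract the noise estimate from one frame of magnitude spectrum.

    `floor` (a hypothetical tuning parameter) keeps each output bin at or
    above floor * input magnitude, which avoids negative magnitudes.
    """
    cleaned = []
    for s, n in zip(noisy_mag, noise_mag):
        residual = s - n
        cleaned.append(max(residual, floor * s))
    return cleaned
```

In a complete front end, the cleaned magnitudes would be recombined with the noisy phase and inverse-transformed before being passed to the coder; that part is omitted here.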
TL;DR: A speech analysis and synthesis system determines a sound source signal for the entire interval of each speech unit to be used for speech synthesis, according to a cepstrum-based spectrum parameter obtained from each speech unit.
Abstract: A speech analysis and synthesis system operates to determine a sound source signal for the entire interval of each speech unit which is to be used for speech synthesis, according to a spectrum parameter obtained from each speech unit based on cepstrum. The sound source signal and the spectrum parameter are stored for each speech unit. Speech is synthesized according to the spectrum parameter while controlling prosody of the sound source signal. The spectrum of the synthesized speech is compensated through filtering based on cepstrum.
TL;DR: A real-time speech processing development system has a control subsystem (CS) and a recognition subsystem (RS) interconnected by a CS/RS interface. An embodiment of a speaker verification system includes template enrollment, template training, recognition by template-concatenation and time alignment, silence and filler template generation, and speaker monitoring modes.
Abstract: A real-time speech processing development system has a control subsystem (CS) and a recognition subsystem (RS) interconnected by a CS/RS interface. The control subsystem includes a control processor, an operator interface, a user interface, and a control program module for loading any one of a plurality of control programs which employ speech recognition processes. The recognition subsystem (RS) includes a master processor, speech signal processor, and template matching processors all interconnected on a common bus which communicates with the control subsystem through the mediation of the CS/RS interface. The two-part configuration allows the control subsystem to be accessed by the operator for non-real-time system functions, and the recognition subsystem to be accessed by the user for real-time speech processing functions. An embodiment of a speaker verification system includes template enrollment, template training, recognition by template-concatenation and time alignment, silence and filler template generation, and speaker monitoring modes.
TL;DR: A phonetically sensitive linear transformation of speech features, designed to discriminate against out-of-class confusion data and dependent on phonetic state, yields a significant improvement in speech-recognition performance.
Abstract: A phonetically sensitive transformation of speech features has yielded significant improvement in speech-recognition performance. This (linear) transformation of the speech feature vector is designed to discriminate against out-of-class confusion data and is a function of phonetic state. Evaluation of the technique on the TI/NBS connected digit database demonstrates word (sentence) error rates of 0.5% (1.5%) for unknown-length strings and 0.2% (0.6%) for known-length strings. These error rates are two to three times lower than the best previously reported results and suggest that significant improvements in speech-recognition system performance can be achieved by better acoustic-phonetic modeling.
TL;DR: The authors describe those parts of the system dealing with acoustic segmentation and phonetic classification and document its current performance.
Abstract: Recently, the authors initiated a project to develop a phonetically-based spoken-language-understanding system called SUMMIT. In contrast to many of the past efforts that make use of heuristic rules whose development requires intense knowledge engineering, their approach attempts to express the speech knowledge within a formal framework using well-defined mathematical tools. In the authors' system, features and decision strategies are discovered and trained automatically, using a large body of speech data. The authors describe those parts of the system dealing with acoustic segmentation and phonetic classification and document its current performance.
TL;DR: A class of very general hidden Markov models is considered which can accommodate sequences of information-bearing acoustic feature vectors lying either in a discrete or in a continuous space.
Abstract: The acoustic modeling problem in automatic speech recognition is examined with the specific goal of unifying discrete and continuous parameter approaches. The authors consider a class of very general hidden Markov models which can accommodate sequences of information-bearing acoustic feature vectors lying either in a discrete or in a continuous space. More generally, the new class allows one to represent the prototypes in an assumption-limited, yet convenient, way, as (tied) mixtures of simple multivariate densities. Speech recognition experiments, reported for a large (5000-word) vocabulary office correspondence task, demonstrate some of the benefits associated with this technique.
TL;DR: It is argued that body movements accompanying speech have specific speech productive functions, primarily the facilitation of lexical selection and the regulation of prosodic features, and that the functional utilisation of body movement is locally optional.
Abstract: Speech is normally accompanied by numerous body movements such as hand gestures, head nods, posture changes, etc. These are known to have communicative and regulatory functions such as clarifying or emphasising messages, regulating speaking turns, etc. In addition and in parallel to these, it is argued, body movements have specific speech productive functions, primarily the facilitation of lexical selection and the regulation of prosodic features. Movements serving the two functions differ in many ways, e.g. in their kinematic properties, complexity, timing in relation to speech, impairment in aphasia, mode of encoding and the stages of speech processing in which they originate. These differences are emergent, rather than prescriptive or rule-governed, originating in cognitive and motor constraints. The functional utilisation of body movement is locally optional.
TL;DR: An electronic system for composing and communicating messages by speech: select words spoken into a microphone are computer analyzed to generate select code signals, which either form the body of the message directly or are used to query a memory of stored messages to reproduce the message desired to be transmitted.
Abstract: An electronic system and method for communicating and composing messages by means of speech spoken into a microphone (11). Speech signals output by a microphone when select words of speech are spoken therein, are computer processed and analyzed by a computer (19) to generate select code signals. The body of a message is formed either of such code signals generated when select speech is spoken and computer analyzed or when the results of such analysis, select code signals, are employed to query a memory in which a plurality of messages are stored, to selectively reproduce a message or messages therefrom to comprise the information or message desired to be transmitted. The identity of the sender or message composer and one or more recipients of the message or messages so formed, are functions also effected by the computer analysis of speech signals generated when select words of speech are spoken into the microphone. Routing instructions or switching codes for the message are also generated as a result of speaking select speech into the microphone, either by computer analysis of each word or group of words spoken to generate codes defining the switching codes or by employing codes so generated to query a memory to generate the switching signals and incorporate same in a series of code signals defining the message to be transmitted, the identity of the sender and recipient(s) and such switching code signals. In a modified form of the invention, all or part of the body of a message is formed of digitized analog speech signals generated when words of speech defining the message are spoken into a microphone in sequence with other words of speech defining the recipient of the message and routing instructions or switching signals. Other functions relating to the message or messages so generated may also be controlled by computer analysis of speech signals defining select control commands.
TL;DR: To improve the recognition of incoming speech signals in noise, the prestored templates of noise-free speech are modified to have the estimated spectral values of noise and the same signal-to-noise ratio as the incoming signal.
Abstract: To improve the recognition of incoming speech signals in noise, the prestored templates of noise-free speech are modified to have the estimated spectral values of noise and the same signal-to-noise ratio as the incoming signal.
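One way to picture the template modification described above is the following Python sketch. The abstract does not say how the templates are modified, so everything here is an assumption: the hypothetical function `adapt_template` treats each template as a list of per-frame power spectra, rescales the clean template so that its average power relative to the estimated noise equals the incoming signal's SNR, and then adds the estimated noise spectrum to every frame.

```python
def adapt_template(template, noise, snr_db):
    """Return a noise-adapted copy of a clean template.

    template: list of frames, each a list of per-bin power values (clean speech)
    noise:    estimated per-bin noise power of the incoming signal
    snr_db:   estimated signal-to-noise ratio of the incoming signal, in dB
    """
    # Average clean speech power per frame, and total noise power per frame.
    speech_power = sum(sum(frame) for frame in template) / len(template)
    noise_power = sum(noise)
    # Gain that makes (scaled speech power / noise power) equal the target SNR.
    target_ratio = 10.0 ** (snr_db / 10.0)
    gain = target_ratio * noise_power / speech_power
    # Scale the clean spectrum, then add the estimated noise spectrum.
    return [[gain * s + n for s, n in zip(frame, noise)] for frame in template]
```

For example, `adapt_template([[2.0, 2.0], [6.0, 2.0]], [1.0, 2.0], 0.0)` halves the template (average speech power 6 against noise power 3 at 0 dB) and adds the noise to each frame. The adapted templates would then be compared with the incoming noisy frames in place of the clean ones.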
TL;DR: A neural network model incorporating radial basis functions is used in a speech-pattern classification problem and is compared with a back-propagation neural network model and with a vector-quantised hidden Markov model.
Abstract: A neural network model incorporating radial basis functions is used in a speech-pattern classification problem. The method is compared with a back-propagation neural network model and with a vector-quantised hidden Markov model of the same problem. Training times are over an order of magnitude faster, with similar classification results.
TL;DR: A series of algorithms for silent and voiced/unvoiced/mixed excitation interval classification, pitch detection, formant estimation and formant tracking was developed, which can surpass the performance of single-channel (acoustic-signal-based) algorithms.
Abstract: The authors describe analysis and synthesis methods for improving the quality of speech produced by D.H. Klatt's (J. Acoust. Soc. Am., vol.67, p.971-95, 1980) software formant synthesizer. Synthetic speech generated using an excitation waveform resembling the glottal volume-velocity was found to be perceptually preferred over speech synthesized using other types of excitation. In addition, listeners ranked speech tokens synthesized with an excitation waveform that simulated the effects of source-tract interaction higher in naturalness than tokens synthesized without such interaction. A series of algorithms for silent and voiced/unvoiced/mixed excitation interval classification, pitch detection, formant estimation and formant tracking was developed. The algorithms can utilize two channels of input data, i.e., speech and electroglottographic signals, and can therefore surpass the performance of single-channel (acoustic-signal-based) algorithms. The formant synthesizer was used to study some aspects of the acoustic correlates of voice quality, e.g., male/female voice conversion and the simulation of breathiness, roughness, and vocal fry.
TL;DR: A character voice communication system organically integrates a high-efficiency voice coding system, which encodes and transmits speech information at high efficiency, with a voice character input/output system, which converts speech information into character information or receives character information and transmits speech or character information.
Abstract: A character voice communication system organically integrates a high-efficiency voice coding system, which encodes and transmits speech information at high efficiency, with a voice character input/output system, which converts speech information into character information or receives character information and transmits speech or character information. A speech analyzer and a speech synthesizer are shared by both the voice coding and the voice character input/output systems. Communication apparatus is also provided which allows mutual conversion between speech signals and character codes.
TL;DR: A method and apparatus for real-time speech recognition, with and without speaker dependency, are presented.
Abstract: A method and apparatus for real-time speech recognition with and without speaker dependency, which includes the following steps: converting the speech signals into a series of primitive sound spectrum parameter frames; detecting the beginning and ending of speech according to the primitive sound spectrum parameter frame, to determine the sound spectrum parameter frame series; performing non-linear time domain normalization on the sound spectrum parameter frame series using sound stimuli, to obtain speech characteristic parameter frame series with predefined lengths on the time domain; performing amplitude quantization normalization on the speech characteristic parameter frames; comparing the speech characteristic parameter frame series with the reference samples, to determine the reference sample which most closely matches the speech characteristic parameter frame series; and determining the recognition result according to the most closely matched reference sample.
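The sequence of steps listed above can be sketched as a small Python pipeline. This is a simplified illustration, not the patented method: endpoint detection here is a plain energy threshold, the time normalization is a linear frame selection swapped in for the non-linear, stimulus-based normalization of the abstract, the amplitude quantization step is omitted, and all function names and parameters are hypothetical.

```python
def detect_endpoints(frames, threshold):
    """Return (start, end) indices of the span whose frame energy exceeds threshold."""
    energies = [sum(x * x for x in frame) for frame in frames]
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1] + 1


def normalize_length(frames, target_len):
    """Pick target_len frames at evenly spaced positions (linear, for illustration)."""
    n = len(frames)
    return [frames[min(n - 1, (i * n) // target_len)] for i in range(target_len)]


def distance(a, b):
    """Sum of squared differences between two equal-length frame series."""
    return sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))


def recognize(frames, references, threshold, target_len):
    """Return the label of the reference sample closest to the utterance."""
    span = detect_endpoints(frames, threshold)
    if span is None:
        return None
    start, end = span
    features = normalize_length(frames[start:end], target_len)
    return min(references, key=lambda label: distance(features, references[label]))
```

Here `references` is assumed to be a dictionary mapping each label to a frame series already normalized to `target_len` frames, so the comparison step reduces to a nearest-neighbour search.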
TL;DR: A speech decoder synthesizes a speech signal from a digitized speech bit stream of the type produced by a speech encoder. It includes an analyzer that generates an angular frequency and magnitude for each of a plurality of sinusoidal components over a sequence of times, a random signal generator that generates a time sequence of random phase components, a phase synthesizer that generates synthesized phases from the angular frequencies and random phase components, and a synthesizer that produces speech from the angular frequencies, magnitudes, and synthesized phases.
Abstract: A speech decoder apparatus for synthesizing a speech signal from a digitized speech bit stream of the type produced by processing speech with a speech encoder. The apparatus includes an analyzer for processing the digitized speech bit stream to generate an angular frequency and magnitude for each of a plurality of sinusoidal components representing the speech processed by the speech encoder, the analyzer generating the angular frequencies and magnitudes over a sequence of times; a random signal generator for generating a time sequence of random phase components; a phase synthesizer for generating a time sequence of synthesized phases for at least some of the sinusoidal components, the synthesized phases being generated from the angular frequencies and random phase components; and a synthesizer for synthesizing speech from the time sequences of angular frequencies, magnitudes, and synthesized phases.
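The decoder structure described above (frequencies and magnitudes per frame, phases synthesized from the frequencies plus a random component, then a sum of sinusoids) can be sketched in Python as follows. This is an illustrative reading, not the patented apparatus: the function name `synthesize`, the per-frame uniform phase jitter, and the sample-by-sample phase accumulation are assumptions, and amplitude interpolation between frames is left out.

```python
import math
import random


def synthesize(frames, frame_len, phase_jitter=0.0, seed=0):
    """Sum-of-sinusoids synthesis with synthesized (not transmitted) phases.

    frames:       list of frames; each frame is a list of (omega, magnitude)
                  pairs, omega in radians per sample, same component count
                  in every frame
    frame_len:    number of output samples per frame
    phase_jitter: half-width of the uniform random phase component added to
                  each sinusoid at every frame boundary (hypothetical parameter)
    """
    rng = random.Random(seed)
    phases = [0.0] * len(frames[0])
    out = []
    for frame in frames:
        # Random phase component, one draw per sinusoid per frame.
        for k in range(len(frame)):
            phases[k] += rng.uniform(-phase_jitter, phase_jitter)
        for _ in range(frame_len):
            out.append(sum(mag * math.cos(phases[k])
                           for k, (_, mag) in enumerate(frame)))
            # Advance each phase by its angular frequency (phase = integrated frequency).
            for k, (omega, _) in enumerate(frame):
                phases[k] += omega
    return out
```

With `phase_jitter=0.0` the output is a deterministic sum of cosines; a non-zero value perturbs the phase of each component at every frame boundary while keeping the phase track continuous within the frame.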
TL;DR: A zero-phase sinusoidal analysis-synthesis system which generates natural-sounding speech without the requirement of vocal tract phase is described, which provides a basis for improving sound quality by providing different levels of phase coherence in speech reconstruction for time-scale modification.
Abstract: It has been shown that an analysis-synthesis system based on a sinusoidal representation leads to synthetic speech that is essentially perceptually indistinguishable from the original. A change in speech quality has been observed, however, when the phase relation of the sine waves is altered. This occurs in practice when sine waves are processed for speech enhancement and for speech coding. A description is given of a zero-phase sinusoidal analysis-synthesis system which generates natural-sounding speech without the requirement of vocal tract phase. The method provides a basis for improving sound quality by providing different levels of phase coherence in speech reconstruction for time-scale modification, for a baseline system for coding, and for reducing the peak-to-RMS ratio by dispersion.
TL;DR: A unified framework is discussed which can be used to accomplish the goal of creating effective basic models of speech and points out the relative advantages of each type of speech unit based on the results of a series of recognition experiments.
Abstract: The problem of how to select and construct a set of fundamental unit statistical models suitable for speech recognition is addressed. A unified framework is discussed which can be used to accomplish the goal of creating effective basic models of speech. The performances of three types of fundamental units, namely whole word, phoneme-like, and acoustic segment units, in a 1109-word vocabulary speech recognition task are compared. The authors point out the relative advantages of each type of speech unit based on the results of a series of recognition experiments.
TL;DR: A sinusoidal model is presented in which the nonstationary nature of speech is captured by a time-varying frequency and amplitude for each sinusoid; the parameters are estimated in the frequency domain by a suboptimal linear estimator.
Abstract: A sinusoidal model is presented where the nonstationary nature of speech is considered by using a time-varying frequency and amplitude for each sinusoid. The proposed model generalizes other sinusoidal models while still having an analytically tractable short-time spectrum. The estimation of the parameters of the sinusoids is done in the frequency domain by a suboptimal linear estimator. The experimental results obtained with the proposed model illustrate its ability to represent nonstationary speech frames.
TL;DR: In this article, an artificial intelligence system is used to decide upon the adjustment of a filter subsystem by distinguishing between noise and speech in the spectrum of the incoming signal of speech plus noise.
Abstract: A system is provided to reduce noise from a signal of speech that is contaminated by noise. The present system employs an artificial intelligence that is capable of deciding upon the adjustment of a filter subsystem by distinguishing between noise and speech in the spectrum of the incoming signal of speech plus noise. The system does this by testing the pattern of a power or envelope function of the frequency spectrum of the incoming signal. The system determines that the fast changing portions of that envelope denote speech whereas the residual is determined to be the frequency distribution of the noise power. This determination is done while examining either the whole spectrum, or frequency bands thereof, regardless of where the maximum of the spectrum lies. In another embodiment of the invention, a feedback loop is incorporated which provides incremental adjustments to the filter by employing a gradient search procedure to attempt to increase certain speech-like features in the system's output. The present system does not require consideration of minima of functions of the incoming signal or pauses in speech. Instead, the present system employs an artificial intelligence system to which is input the envelope pattern of the incoming signal of speech and noise. The present system then filters out of this envelope signal the rapidly changing variations of the envelope over fixed time windows.
TL;DR: A noise reduction system used for transmission and/or recognition of speech includes a speech analyzer for analyzing a noisy speech input signal thereby converting the speech signal into feature vectors such as autocorrelation coefficients, and a neural network for receiving the feature vectors of the noisy speech signal as its input.
Abstract: A noise reduction system used for transmission and/or recognition of speech includes a speech analyzer for analyzing a noisy speech input signal thereby converting the speech signal into feature vectors such as autocorrelation coefficients, and a neural network for receiving the feature vectors of the noisy speech signal as its input. The neural network extracts from a codebook an index of prototype vectors corresponding to a noise-free equivalent to the noisy speech input signal. Feature vectors of speech are read out from the codebook on the basis of the index delivered as an output from the neural network, thereby causing the speech input to be reproduced on the basis of the feature vectors of speech read out from the codebook.
TL;DR: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units that showed results comparable to those of whole-word-based systems.
Abstract: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units. Several strategies for automatic generation of an acoustic lexicon are outlined. Preliminary tests have been performed on a small vocabulary. In these tests, the proposed system showed results comparable to those of whole-word-based systems.
TL;DR: The author introduces the methodological novelties that allowed for progress along three axes: from isolated-word recognition to continuous speech, from speaker-dependent recognition to speaker-independent, and from small vocabularies to large vocabularies.
Abstract: An overview is given of recent advances in the domain of speech recognition. The author focuses on speech recognition, but also mentions some progress in other areas of speech processing (speaker recognition, speech synthesis, speech analysis and coding) using similar methodologies. The problems related to automatic speech processing are identified, and the initial approaches that have been followed in order to address those problems are described. The author then introduces the methodological novelties that allowed for progress along three axes: from isolated-word recognition to continuous speech, from speaker-dependent recognition to speaker-independent, and from small vocabularies to large vocabularies. Special emphasis centers on the improvements made possible by Markov models and, more recently, by connectionist models, resulting in improved performance for difficult vocabularies or in more robust systems. Some specialized hardware is described, as are efforts aimed at assessing speech-recognition systems.
TL;DR: A directory assistance call arriving at an automatic call distributor via a directory assistance trunk is first processed by a speech processing system to compress the initial request for a telephone number; the speech processing system is then connected to an operator position to transmit the processed initial order thereto.
Abstract: Speech compression technology is utilized to reduce the average working time of an operator on directory assistance calls. In particular, a directory assistance call arriving at an automatic call distributor via a directory assistance trunk is first processed by a speech processing system to compress the initial request for a telephone number. The speech processing system is then connected to an operator position to transmit the processed initial order thereto.
TL;DR: The author examines some of the ways that speech signals, subjected to certain degradations, can be processed to increase the likelihood of being correctly understood.
Abstract: The author examines some of the ways that speech signals, subjected to certain degradations (e.g. additive noise, interfering speakers, bandlimiting, single-channel data), can be processed to increase the likelihood of being correctly understood. He concentrates on applications that involve monaural listening. He treats spectral and time-domain subtraction techniques, methods involving fundamental frequency tracking, and enhancement by resynthesis.