TL;DR: Statistical methods for automatic recognition of continuous speech are described, covering modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters, hypothesis search procedures, and likelihood computations of linguistic decoding. Experimental results indicate the power of the methods.
Abstract: Statistical methods useful in automatic recognition of continuous speech are described. They concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding. Experimental results are presented that indicate the power of the methods.
TL;DR: This paper reviews recent developments in speech recognition research, introduces the concept of sources of knowledge, and discusses the use of knowledge to generate and verify hypotheses.
Abstract: This paper provides a review of recent developments in speech recognition research. The concept of sources of knowledge is introduced and the use of knowledge to generate and verify hypotheses is discussed. The difficulties that arise in the construction of different types of speech recognition systems are discussed and the structure and performance of several such systems is presented. Aspects of component subsystems at the acoustic, phonetic, syntactic, and semantic levels are presented. System organizations that are required for effective interaction and use of various component subsystems in the presence of error and ambiguity are discussed.
TL;DR: This paper addresses the principal subproblem of separating two competing voices on a single channel, the separation of vocalic speech, by selecting the harmonics of the desired voice in the Fourier transform of the input.
Abstract: A common type of interference in speech transmission is that caused by the speech of a competing talker. Although the brain is adept at clarifying such speech, it relies heavily on binaural data. When voices interfere over a single channel, separation is much more difficult and intelligibility suffers. Clarifying such speech is a complex and varied problem whose nature changes with the moment‐to‐moment variation in the types of sound which interfere. This paper describes an attack on the principal subproblem, the separation of vocalic speech. Separation is done by selecting the harmonics of the desired voice in the Fourier transform of the input. In implementing this process, techniques have been developed for resolving overlapping spectrum components, for determining pitches of both talkers, and for assuring consistent separation. These techniques are described, their performance on test utterances is summarized, and the possibility of using this process as a basis for the solution of the general two‐tal...
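The harmonic-selection step described above can be sketched as a mask over DFT bins. This is only a minimal illustration, assuming the pitch `f0` of the desired talker is already known and constant over the frame; the abstract's techniques for resolving overlapping components and estimating both pitches are not reproduced. The names `harmonic_bins`, `select_voice`, and `half_width` are hypothetical.

```python
def harmonic_bins(f0, sample_rate, n_fft):
    """DFT bin indices closest to the harmonics of f0, up to the Nyquist frequency."""
    bins = []
    h = 1
    while h * f0 <= sample_rate / 2:
        bins.append(round(h * f0 * n_fft / sample_rate))
        h += 1
    return bins


def select_voice(spectrum, f0, sample_rate, half_width=1):
    """Zero every bin of a full-length DFT except those near harmonics of f0.

    Assumption: half_width is a hypothetical tolerance (in bins) around each
    harmonic; mirrored negative-frequency bins are kept so the result still
    corresponds to a real signal.
    """
    n_fft = len(spectrum)
    keep = set()
    for b in harmonic_bins(f0, sample_rate, n_fft):
        for d in range(-half_width, half_width + 1):
            k = b + d
            if 0 <= k < n_fft:
                keep.add(k)
                keep.add((n_fft - k) % n_fft)
    return [v if i in keep else 0 for i, v in enumerate(spectrum)]
```

Applying an inverse DFT to the masked spectrum would give an estimate of the desired voice; when harmonics of the two talkers fall in the same bins, a plain mask like this cannot separate them.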
TL;DR: A rationale is advanced for digitally coding speech signals in terms of sub-bands of the total spectrum, which provides a means for controlling and reducing quantizing noise in the coding.
Abstract: A rationale is advanced for digitally coding speech signals in terms of sub-bands of the total spectrum. The approach provides a means for controlling and reducing quantizing noise in the coding. Each sub-band is quantized with an accuracy (bit allocation) based upon perceptual criteria. As a result, the quality of the coded signal is improved over that obtained from a single full-band coding of the total spectrum. In one implementation, the individual sub-bands are low-pass translated before coding. In another, "integer-band" sampling is employed to alias the signal in an advantageous way before coding. Other possibilities extend to complex demodulation of the sub-bands, and to representing the subband signals in terms of envelopes and phase-derivatives. In all techniques, adaptive quantization is used for the coding, and a parsimonious allocation of bits is made across the bands. Computer simulations are made to demonstrate the signal qualities obtained for codings at 16 and 9.6 Kbits/sec.
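As a rough illustration of per-band bit allocation, the sketch below uses a fixed uniform quantizer and adds up the bit rate of a set of sub-bands. It assumes each band is sampled at twice its bandwidth, and the band layout in `HYPOTHETICAL_BANDS` is invented for the example; it is not the allocation used in the paper, and the paper's adaptive quantization is replaced here by a plain uniform quantizer.

```python
import math


def quantize(x, bits, peak=1.0):
    """Uniform mid-rise quantizer with 2**bits levels covering [-peak, peak)."""
    levels = 2 ** bits
    step = 2.0 * peak / levels
    index = math.floor(x / step)
    # Clip to the available levels.
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step


def total_bit_rate(bands):
    """Total bits per second for (bandwidth_hz, bits_per_sample) pairs.

    Assumption: each band is sampled at 2 * bandwidth_hz.
    """
    return sum(2 * bandwidth * bits for bandwidth, bits in bands)


# Hypothetical layout: more bits for the low bands, fewer for the high bands.
HYPOTHETICAL_BANDS = [(500, 4), (500, 4), (1000, 3), (1000, 2)]
```

Giving a band more bits shrinks its quantization step and therefore its quantizing noise, which is the lever a perceptually motivated allocation would adjust band by band.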
TL;DR: When trained to the voice of a particular speaker, the decoder recognized seven‐digit telephone numbers correctly 96% of the time, with a better than 99% per‐digit accuracy.
Abstract: Continuous speech was treated as if produced by a finite‐state machine making a transition every centisecond. The observable output from state transitions was considered to be a power spectrum—a probabilistic function of the target state of each transition. Using this model, observed sequences of power spectra from real speech were decoded as sequences of acoustic states by means of the Viterbi trellis algorithm. The finite‐state machine used as a representation of the speech source was composed of machines representing words, combined according to a “language model.” When trained to the voice of a particular speaker, the decoder recognized seven‐digit telephone numbers correctly 96% of the time, with a better than 99% per‐digit accuracy. Results for other tests of the system, including syllable and phoneme recognition, will also be given.
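The decoding step can be illustrated with a small Viterbi implementation over a discrete hidden Markov model. This is a sketch under simplifying assumptions: observations are discrete symbols rather than power spectra, outputs are attached to states, and the two-state model and its probabilities (`STATES`, `START`, `TRANS`, `EMIT`) are invented for the example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence under a discrete HMM."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[s]: best log-probability of any path ending in state s so far.
    scores = {s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            prev = max(states, key=lambda p: scores[p] + log(trans_p[p][s]))
            new_scores[s] = scores[prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: silence vs. voiced speech, observed as low/high energy.
STATES = ["sil", "voiced"]
START = {"sil": 0.8, "voiced": 0.2}
TRANS = {"sil": {"sil": 0.7, "voiced": 0.3}, "voiced": {"sil": 0.3, "voiced": 0.7}}
EMIT = {"sil": {"low": 0.9, "high": 0.1}, "voiced": {"low": 0.2, "high": 0.8}}

decoded = viterbi(["low", "low", "high", "high", "low"], STATES, START, TRANS, EMIT)
```

Working in log-probabilities avoids numerical underflow, which matters when a transition is taken every centisecond over a whole utterance.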
TL;DR: The resulting system serves as a model for the cognitive process of reading aloud, and also as a stable practical means for providing speech output in a broad class of computer-based systems.
Abstract: For many applications, it is desirable to be able to convert arbitrary English text to natural and intelligible sounding speech. This transformation between two surface forms is facilitated by first obtaining the common underlying abstract linguistic representation which relates to both text and speech surface representations. Calculation of these abstract bases then permits proper selection of phonetic segments, lexical stress, juncture, and sentence-level stress and intonation. The resulting system serves as a model for the cognitive process of reading aloud, and also as a stable practical means for providing speech output in a broad class of computer-based systems.
TL;DR: A new digital filter bank design is proposed for the processing of speech waveforms where spectral pattern matching techniques are applicable, and a distance metric is proposed for comparing a spectral frame with previously derived reference patterns.
Abstract: A new digital filter bank design is proposed for the processing of speech waveforms where spectral pattern matching techniques are applicable. Outputs in decibels from the 30 channels of the filter bank are computed every 12 ms. Care has been taken to select a time window and filter center frequency and bandwidth values that take into account the acoustic characteristics of speech. A distance metric is proposed for comparing a spectral frame with previously derived reference patterns. The metric incorporates procedures for crude speaker/microphone normalization, signal level normalization, background noise normalization, and procedures for emphasizing differences in the region of spectral peaks.
TL;DR: The author's project is somewhat controversial, since it attempts to model utterance production statistically, rather than through a grammar that would describe syntactically and semantically the allowable (mini-) universe of discourse.
Abstract: This paper describes statistical methods of automatic recognition (transcription) of continuous speech that have been used successfully by the Speech Processing Group at the IBM Thomas J. Watson Research Center. The sources of these procedures will be referenced where practicable, but the working style of the Group has been deliberately cooperative (as the Acknowledgment Section indicates), so a certain amount of inadequate or unjust crediting is inevitable. The author tried his best to keep it at a minimum. The exposition, appearing as it does in an IEEE publication, is aimed mostly at engineers who are less familiar with speech and language than with information transmission, statistics, or signal processing. At the same time, the author would like to enable speech specialists to read the more mathematical parts of the paper. Inevitably, a compromise between these two audiences has been attempted that resulted in a somewhat lengthened presentation. The author would like to invite his readers to skip rather boldly over material familiar to them. The first six sections contain the essence of the formulation that is centered on the design of an actual speech recognition system. Readers of Section VI that contains experimental results may feel somewhat dissatisfied with the fact that no comparisons are attempted with performance achieved by alternate design philosophies. Unfortunately, such judgments are made difficult by the great variety in utterance corpora to be recognized, in experimental conditions, and in recognizer function goals. However, the accompanying survey paper by Reddy [23] does assess the merits of the various speech recognition projects, and can serve as an excellent introduction to the field for the nonspecialist.
In the speech recognition community, our project is somewhat controversial, since it attempts to model utterance production statistically, rather than through a grammar that would describe syntactically and semantically the allowable (mini-) universe of discourse. It is much too early to tell which emphasis is sounder. There is little doubt that before automatic recognition of speech is accomplished, the statistical
TL;DR: The result shows that such a system might well be based on rules rather than on an extensive dictionary, and a useful tool in speech synthesis work is described (i.e. a programming language).
Abstract: When reading a text a native speaker pronounces most words correctly even if they are unknown to him. During this process he makes use of his knowledge of the language, the semantic content and the syntax. However, if we take away all information except the spelling and some pronunciation rules on the word level, the task would be more difficult. This is basically the case in our text-to-speech synthesis system containing neither semantic and syntactic analysis nor a word or morpheme dictionary. At the conference the function of our present synthesis system will be discussed. The result shows that such a system might well be based on rules rather than on an extensive dictionary. Furthermore a useful tool in speech synthesis work is described (i.e. a programming language).
TL;DR: The details of the implementation of a syntax-controlled acoustic encoder of a speech understanding system (SUS) are presented and it is shown that finite-state automata operating on artificial descriptions of suprasegmentals and global spectral features isolate syllables in continuous speech.
Abstract: The details of the implementation of a syntax-controlled acoustic encoder of a speech understanding system (SUS) are presented. Finite-state automata operating on artificial descriptions of suprasegmentals and global spectral features isolate syllables in continuous speech. Then a combinational algorithm tracks the formants for the voiced intervals of each syllable, and other algorithms provide a complete structural description of spectral and prosodic features for a spoken sentence. Such a description consists of a string of symbols and numerical attributes and is a representation of speech in terms of perceptually significant primitive forms. It contains all the information required to reconstruct the analyzed sentence with a formant synthesizer; it can be used directly either for emitting or verifying hypotheses at the lexical level of an SUS and for automatically learning phonetic features by grammatical inference.
TL;DR: Of those evaluated, a linearly mean-corrected minimum distance measure, on a 40-point spectral representation with a square (or cube) norm was consistently superior to the other methods.
Abstract: An important consideration in speech processing involves classification of speech spectra. Several methods for performing this classification are discussed. A number of these were selected for comparative evaluation. Two measures of performance-accuracy and stability-were derived through the use of an automatic performance evaluation system. Over 3000 hand-labeled spectra were used. Of those evaluated, a linearly mean-corrected minimum distance measure, on a 40-point spectral representation with a square (or cube) norm was consistently superior to the other methods.
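A minimum-distance classifier with mean correction might look like the sketch below. The abstract does not spell out what "linearly mean-corrected" means, so this assumes one plausible reading: each spectrum (in dB) has its own mean level subtracted before comparison, which makes the distance insensitive to overall gain. The reference spectra and labels in the usage note are invented for illustration.

```python
def mean_corrected(spectrum):
    """Subtract the average level so that only the spectral shape remains."""
    mean = sum(spectrum) / len(spectrum)
    return [x - mean for x in spectrum]


def distance(a, b, power=2):
    """Sum of |difference|**power between mean-corrected spectra.

    power=2 gives the square norm, power=3 the cube norm.
    """
    ca, cb = mean_corrected(a), mean_corrected(b)
    return sum(abs(x - y) ** power for x, y in zip(ca, cb))


def classify(spectrum, references, power=2):
    """Return the label of the reference spectrum closest to the input."""
    return min(references, key=lambda label: distance(spectrum, references[label], power))
```

For example, with hypothetical references `{"a": [10, 20, 30, 20], "s": [30, 20, 10, 20]}`, the input `[16, 26, 36, 26]` is the "a" pattern shifted by 6 dB, so its mean-corrected distance to "a" is zero and `classify` returns `"a"`.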
TL;DR: A correct proof of Huang's theorem on the stability of two-dimensional causal recursive digital filters is developed using a maximum modulus theorem for algebraic functions.
Abstract: A correct proof of Huang's theorem on the stability of two-dimensional causal recursive digital filters is developed using a maximum modulus theorem for algebraic functions.
TL;DR: The general language-operated decision implementation system (GLODIS) represents a flexible, operating-system approach to the generation and implementation of complex rules for decision making in pattern recognition.
Abstract: The general language-operated decision implementation system (GLODIS) represents a flexible, operating-system approach to the generation and implementation of complex rules for decision making in pattern recognition. GLODIS is briefly described from a general mathematical and philosophical point of view; a current implementation is described in the context of a phonemic-level segmenter for continuous speech. This segmenter is presented in sufficient detail for duplication by others, not only for speech segmentation but also for alternate applications of a similar nature. Performance data are given for a large amount (8½ min) of continuous speech, for a currently running version of the speech segmentation GLODIS. Recent results from a total continuous speech recognition system, which incorporates the above, are also given.
TL;DR: For automatic recognition, spectral models that contain zeros are found to be particularly effective, and their parameters are shown to be sufficient for the complete separation of /s/ and /ʃ/ samples in CV and VCV utterances.
Abstract: A model is described for the spectral characteristics of voiceless fricative consonants of Japanese, based on an equivalent circuit representation of their generation mechanism. The model, together with its three simplified versions, is then evaluated from the point of view of automatic recognition as well as of synthesis of speech. For automatic recognition, spectral models that contain zeros are found to be particularly effective, and their parameters are shown to be sufficient for the complete separation of /s/ and /ʃ/ samples in CV and VCV utterances. On the other hand, perceptual experiments using synthetic stimuli reveal considerably smaller differences between models with spectral zeros and those without zeros.
TL;DR: A speech processing system named SPAC (SPlicing of AutoCorrelation function) is proposed in order to compress or expand the speech spectrum, to prolong or shorten the duration of utterance, and to reduce the noise level in speech signal.
Abstract: A speech processing system named SPAC (SPlicing of AutoCorrelation function) is proposed in order to compress or expand the speech spectrum, to prolong or shorten the duration of utterance, and to reduce the noise level in speech signal. A period of short-time autocorrelation function is sampled and spliced after change of the time scale. Transformed speech is quite natural and free from distortion. Applications of SPAC are expected in many fields such as improvement of speech quality, narrow band transmission, communication aid for hard of hearing, information service for blind, unscrambling of helium speech, stenography and so on.
TL;DR: This paper describes how a speech synthesizer can be controlled by a small computer in real time, and presents the properties of the synthesizer and the control program along with an example of the speech synthesis.
Abstract: This paper describes how a speech synthesizer can be controlled by a small computer in real time. The synthesizer allows precise control of the speech output that is necessary for experimental purposes. The control information is computed in real time during synthesis in order to reduce data storage. The properties of the synthesizer and the control program are presented along with an example of the speech synthesis.
TL;DR: An apparatus for recognizing the occurrence of a command word within continuous speech, features an improved sequential processing of feature signals derived from the input speech: feature subsets are compared with previously stored subset signals to determine the time interval or boundary of command word candidates.
Abstract: An apparatus for recognizing the occurrence of a command word within continuous speech, features an improved sequential processing of feature signals derived from the input speech: feature subsets are compared with previously stored subset signals to determine the time interval or boundary of command word candidates. The occurrence decision and indication of the command word is made from a comparison of a feature signal matrix versus a previously stored training matrix.
TL;DR: An algorithm to estimate formant frequencies and amplitudes has been developed to provide control signals for a parallel formant speech synthesizer; the final choice of formant parameters, made every 10 ms, depends on both spectral match and speech-like constraints.
Abstract: An algorithm to estimate formant frequencies and amplitudes has been developed to provide control signals for a parallel formant speech synthesizer. During voiced sounds power spectra are derived by Fourier analysis of speech samples in the closed glottis regions but for unvoiced sounds spectra are derived with appropriate time averaging. Each power spectrum is matched trying several different allocations of formants to spectral peaks. An analysis-by-synthesis procedure iteratively updates the formant frequencies and amplitudes for each allocation. The final choice from these formant parameters is made every 10 ms, and depends on both spectral match and speech-like constraints.
TL;DR: Using statistical decision theory, various types of tests for speaker verification and identification using only one phoneme segment or the entire utterance are developed.
Abstract: We are interested in determining whether the given utterance comes from a member of a given speaker group or an imposter. If it is the former, we are interested in determining the identity of the speaker. The only knowledge available is a set of known utterances from the given group of speakers. The given utterance is manually divided into phonemes without necessarily ascertaining the identity of phonemes. Using statistical decision theory, we will develop various types of tests for speaker verification and identification using only one phoneme segment or the entire utterance. We will consider related problems such as the methods of clustering speakers to aid speaker verification and the optimal choice of phonemes for speaker recognition. Next we consider the role of speaker variability in speech recognition and recognize its complementarity to the problem of optimal choice of phonemes for speaker recognition. We illustrate the efficacy of the various methods developed here by considering the speaker and speech identification problems with three speech data bases.
TL;DR: An apparatus for signal-in-noise enhancement selects between two parallel audio channels: a speech recognition subsystem detects speech-like sounds and vocabulary words on the selected channel, and the apparatus switches to the other channel when words of the predetermined vocabulary are not detected within a predetermined interval.
Abstract: An apparatus for signal-in-noise enhancement by useful-channel selection, includes automatic channel switching if words of a predetermined speech vocabulary are not detected within a predetermined interval. In accordance with the invention there are provided first and second parallel channels to which audio signals are applied. A channel selector, such as a voltage controlled switch, is responsive to a control signal for selecting the first or the second channel. A speech word recognition subsystem is provided and is responsive to the audio signals transmitted over the selected channel. The speech recognition subsystem determines the presence of speech-like sounds and generates a speech-indicative signal as a function thereof. Portions of the audio signals which occur during the speech-indicative signal are compared with the predetermined vocabulary to determine the probable occurrence of a word from among the predetermined vocabulary, and an occurrence indication signal is generated when a vocabulary word is detected. A control signal is generated in response to the output of the speech word recognition subsystem, the control signal being a function of the speech-indicative signal and the occurrence indication signal. In operation, the control signal is operative, in response to the presence of extraneous signals on a channel being utilized, to effect switching to the other channel. In the preferred embodiment, the speech recognition subsystem generates reject signals when the speech-indicative signal persists for a prescribed time without an occurrence indication occurring, and the control signal is generated in response to the reject signals.
TL;DR: A system and method are presented for detecting the presence of useful speech information in telephone voice channels that may also contain noise, in order to optimize the telephone transmission of such speech information.
Abstract: A system and method for detecting the presence of useful speech information in telephone voice channels capable of containing noise as well as such useful speech information for optimizing the telephone transmission of such speech information. Two segments of the envelope of a given voice channel are compared against each other over two different time domains in order to determine if a predetermined magnitude of difference exists between these envelopes. The presence of such magnitude of difference is indicative of the presence of such useful speech information in the voice channel thereby enabling transmission thereof by the system, whereas the absence of such magnitude of difference is indicative of the presence of solely noise thereby preventing the transmission thereof by the system.
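The envelope comparison could be sketched as follows. The abstract does not specify how the two envelope segments are formed, so this assumes one simple interpretation: the rectified signal is averaged over a short recent window and over a longer window, and speech is declared when the two averages differ by at least a threshold. The function names, window lengths, and threshold are all hypothetical.

```python
def envelope(samples):
    """Crude envelope: the rectified (absolute-value) signal."""
    return [abs(s) for s in samples]


def mean(values):
    return sum(values) / len(values)


def has_speech(samples, short_len, long_len, threshold):
    """Compare the envelope averaged over two different time spans.

    Assumption: steady noise gives nearly equal short-term and long-term
    averages, while the onset of speech makes them diverge.
    """
    env = envelope(samples)
    short_avg = mean(env[-short_len:])
    long_avg = mean(env[-long_len:])
    return abs(short_avg - long_avg) >= threshold
```

With steady low-level noise the two averages coincide and the channel would not be transmitted; a burst of higher amplitude at the end of the buffer raises the short-term average well above the long-term one.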
TL;DR: Preliminary evaluations suggest that the instrument has considerable potential for training speech production with deaf, and as an aid to the therapist in diagnosis and communication of concepts.
Abstract: A real-time speech spectrograph has been developed which is practical for clinical use. It produces and stores a frequency-time-intensity display on a video monitor while the sound is being spoken. The display closely resembles a conventional, broad-band spectrogram in time, frequency and grey scale resolution. Preliminary evaluations have been made to show its usefulness 1. as an aid to the therapist in diagnosis and communication of concepts, and 2. for student drill relatively independent from the therapist. These results suggest that the instrument has considerable potential for training speech production with deaf.
TL;DR: Relatively little effort has been expended toward designing low data rate speech processing devices which can operate in difficult environments; the problems addressed include good behavior for a wide variety of speakers.
Abstract: Relatively little effort has been expended toward designing low data rate speech processing devices which can operate in difficult environments. The particular problems addressed include that of good behavior for a wide variety of speakers, with tandeming and conferencing configurations, in the presence of jamming and/or background noise and with telephone speech as input. (Author)
TL;DR: This paper describes a connected speech understanding system being implemented in Nancy, made up of an acoustic recognizer which gives a string of phoneme-like segments from a spoken sentence, a syntactic parser which controls the recognition process, a word recognizer working on words predicted by the parser, and a dialog procedure which takes into account semantic constraints in order to avoid some of the errors and ambiguities.
Abstract: This paper describes a connected speech understanding system being implemented in Nancy, thanks to the work done in automatic speech recognition since 1968. This system is made up of four parts: an acoustic recognizer which gives a string of phoneme-like segments from a spoken sentence, a syntactic parser which controls the recognition process, a word recognizer working on words predicted by the parser, and a dialog procedure which takes into account semantic constraints in order to avoid some of the errors and ambiguities. Some original features of the system are pointed out: modularity (e.g. the language used is considered as a parameter), possibility of processing slightly syntactically incorrect sentences, ... The application both in data management and in oral control of a telephone center has given very promising results. Work is in progress for generalizing our model: extension of the vocabulary and of the grammar, multi-speaker operation, etc.
TL;DR: Verification offers an alternative strategy by doing a top-down parametric word match independent of segmentation and labeling, which results in a distance measure between the reference parameterization of a hypothesized word and the computed parameterizations of the real speech.
Abstract: If, in a speech understanding system, word matching is performed at the phonetic level, then the accurate determination of the locations and identities of words present in an unknown utterance is necessarily limited by the phonetic segmentation and labeling. Verification offers an alternative strategy by doing a top-down parametric word match independent of segmentation and labeling. The result is a distance measure between the reference parameterization of a hypothesized word and the computed parameterization of the real speech. This distance is interpreted as the likelihood of that word having actually occurred over a given portion of the utterance.
TL;DR: In this paper, a method and an installation for masked or scrambled speech transmission utilize a time-scrambling unit for dividing the speech band into at least two sub-bands, for delaying the one sub-band with respect to the other, and for forming an aggregate signal, and a frequency-scrambling unit is used to divide the aggregate signal into two second sub-bands of variable bandwidth, for their cyclic interchanging, for forming a transmission signal capable of being transmitted over a transmission channel, in order to mask not only the sound character of the speech signals but
Abstract: A method and an installation for masked or scrambled speech transmission utilize a time-scrambling unit for dividing the speech band into at least two sub-bands, for delaying the one sub-band with respect to the other, and for forming an aggregate signal, and a frequency-scrambling unit for dividing the aggregate signal into at least two second sub-bands of variable band-width, for their cyclic interchanging, and for forming a transmission signal capable of being transmitted over a transmission channel, in order to mask not only the sound character of the speech signals but also the speech rhythm, thus ensuring increased privacy of transmission with high code-changing speed and low sensitivity to distortion.
TL;DR: A speech signal-in-noise enhancement system which separates the voiced-unvoiced portions of speech, detects and extracts the voiced fundamental pitch and uses that data to control the band-pass center frequencies of a bank of filters so that the filters pass the harmonics of the fundamental pitch.
Abstract: A speech signal-in-noise enhancement system which separates the voiced-unvoiced portions of speech, detects and extracts the voiced fundamental pitch and uses that data to control the band-pass center frequencies of a bank of filters so that the filters pass the harmonics of the fundamental pitch. The output of these filters is summed to form a composite signal representative of voiced speech. Unvoiced speech is separately passed to the summer.
TL;DR: This work determines, at each step of a speech, to which possible propositions a speaker S has expressed his commitment both at this step and so far in the speech, and distinguishes different cases where S expresses commitments inconsistent with his previous commitments.
Abstract: A speech is an ordered set of speech-acts normally used to express commitments to propositions, which a speaker S performs in some context. Employing the point of view of the audience and by using the formal language CA for conditional assertion, a concept of pragmatical presuppositions and the order of the speech-acts performed, we determine at each step to which possible propositions S expressed his commitment both at this step and so far in the speech. We distinguish different cases where S expresses commitments inconsistent with his previous commitments. S may do so, to some extent, and change his mind, while in other cases this will cause the logical end of the speech.
TL;DR: In this paper, a CCD implementation of the CZT algorithm for performing DFT and IDFT operations in extracting representations of formants and/or pitch data from sampled speech inputs is described.
Abstract: Homomorphic speech processing apparatus utilizing CCD implementation of the CZT algorithm for performing DFT and IDFT operations in extracting representations of formants and/or pitch data from sampled speech inputs. Embodiments also are described for performing the DFT and IDFT operations (a) by generating n-transforms and averaging the result and (b) for performing a sliding CZT transform. In a further embodiment, a smoothed spectrum of vocal tract data is obtained using a CCD filter with a low pass response. The CCD implementation includes transversal filters employing split-electrode signal amplitude weighting.
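The DFT, logarithm, IDFT chain at the core of homomorphic processing can be sketched in software as a real cepstrum. This is only a minimal illustration: it uses a direct O(N²) DFT in place of the CZT and CCD hardware described above, and it assumes no spectral bin is exactly zero (the logarithm would fail there). The function names are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame):
    """IDFT of the log magnitude spectrum of one frame."""
    spectrum = dft(frame)
    log_magnitude = [math.log(abs(v)) for v in spectrum]  # assumes no zero bins
    return [v.real for v in dft(log_magnitude, inverse=True)]
```

In the cepstrum, slowly varying vocal tract structure concentrates at low indices, so keeping only those terms yields a smoothed spectrum, while periodic excitation appears as a peak at the pitch period. For instance, a unit impulse followed by a half-amplitude copy 8 samples later produces its largest cepstral peak at index 8.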