TL;DR: Measured zero-crossing rates and corresponding calculated values for speech and its first derivative are presented. Some vowels are shown to be differentiable from other vowels, although measurements other than zero-crossing rates of filtered, unfiltered, and differentiated speech are shown to be necessary for complete vowel separation.
Abstract: Existing mathematical relations between the power spectral density and the mean zero-crossing rate of an ergodic random process are used to derive relations between spectral measurements and mean zero-crossing rates of a speech signal and its derivatives. Of particular significance is the equation relating zero-crossing rates to the formant parameters of vowels and vowel-like sounds. Reasonably close agreement between measured zero-crossing rates and those calculated from spectral measurements was observed for virtually all phonemes in a variety of contextual environments. Measured zero-crossing rates and corresponding calculated values for speech and its first derivative are presented for vowels, unvoiced fricatives, and unvoiced stops, all in many different contextual environments. Unvoiced fricatives /s/, /ʃ/, and /f/ are shown to be distinguishable from each other solely on the basis of the zero-crossing rate of the derivative signal. Some vowels are shown to be differentiable from other vowels, although measurements other than zero-crossing rates of filtered, unfiltered, and differentiated speech are shown to be necessary for complete vowel separation. For unvoiced stop consonants, the zero-crossing rate of either the signal or its derivative is shown to be useful for classification, provided some information concerning the contextual environment is available.
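The zero-crossing measurement discussed in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the first difference stands in for the first derivative, and the 8 kHz sample rate and the two-tone test signal are hypothetical choices, not data or methods taken from the paper.

```python
import math


def zero_crossing_rate(samples, sample_rate):
    """Count sign changes between consecutive samples, per second."""
    crossings = 0
    previous = samples[0]
    for current in samples[1:]:
        if (previous < 0 <= current) or (current < 0 <= previous):
            crossings += 1
        previous = current
    duration = (len(samples) - 1) / sample_rate
    return crossings / duration


def first_difference(samples):
    """Discrete stand-in (an assumption) for the first derivative."""
    return [b - a for a, b in zip(samples, samples[1:])]


SAMPLE_RATE = 8000  # hypothetical sampling rate in Hz


def tone(freq_hz, amplitude=1.0, count=SAMPLE_RATE):
    # The half-sample offset keeps samples away from exact zeros.
    return [amplitude * math.sin(2 * math.pi * freq_hz * (k + 0.5) / SAMPLE_RATE)
            for k in range(count)]


# Strong low-frequency component plus a weak high-frequency component.
signal = [lo + hi for lo, hi in zip(tone(200), tone(2100, 0.2))]

rate_signal = zero_crossing_rate(signal, SAMPLE_RATE)
rate_derivative = zero_crossing_rate(first_difference(signal), SAMPLE_RATE)
print(rate_signal, rate_derivative)
```

Differencing weights each component by roughly its frequency, so the weak high-frequency tone dominates the differenced signal and raises its crossing rate. That is the general reason the derivative's rate carries information the signal's own rate does not.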
TL;DR: Discrete forms of the Fourier, Hadamard, and Karhunen-Loeve transforms are examined for their capacity to reduce the bit rate necessary to transmit speech signals and these bit-rate reductions are shown to be somewhat independent of the transmission bit rate.
Abstract: Discrete forms of the Fourier, Hadamard, and Karhunen-Loeve transforms are examined for their capacity to reduce the bit rate necessary to transmit speech signals. To rate their effectiveness in accomplishing this goal the quantizing error (or noise) resulting for each transformation method at various bit rates is computed and compared with that for conventional companded PCM processing. Based on this comparison, it is found that Karhunen-Loeve provides a reduction in bit rate of 13.5 kbits/s, Fourier 10 kbits/s, and Hadamard 7.5 kbits/s as compared with the bit rate required for companded PCM. These bit-rate reductions are shown to be somewhat independent of the transmission bit rate.
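As a concrete instance of one of the transforms named above, here is a minimal sketch of a fast Walsh-Hadamard transform and its inverse. The block length, the unnormalized natural ordering, and the function names are assumptions made for illustration; the quantizer and bit allocation that the paper evaluates are not shown.

```python
def fwht(block):
    """Unnormalized fast Walsh-Hadamard transform (natural ordering)."""
    data = list(block)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("block length must be a power of two")
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = data[i], data[i + half]
                data[i], data[i + half] = a + b, a - b  # butterfly
        half *= 2
    return data


def inverse_fwht(coefficients):
    """The transform is its own inverse up to a factor of 1/n."""
    n = len(coefficients)
    return [c / n for c in fwht(coefficients)]


block = [1, 0, 1, 0, 0, 1, 1, 0]
coefficients = fwht(block)
print(coefficients)                 # [4, 2, 0, -2, 0, 2, 0, 2]
print(inverse_fwht(coefficients))   # recovers the original block
```

In a transform coder the coefficients, rather than the samples, would be quantized and transmitted; the transform uses only additions and subtractions, which is part of its appeal.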
TL;DR: Experiments investigating adaptive pattern recognition in automatic speaker verification are reported, indicating that the utterances used for training purposes should preferably be collected over a relatively long period of time.
Abstract: Experiments investigating adaptive pattern recognition in automatic speaker verification are reported. A binary decision confirming or rejecting a speaker's purported identity is required. The experiments involve 7000 phrase length utterances of 118 speakers. An average misclassification rate of one percent with a "no decision" rate of ten percent is obtained. Other experiments indicate that the utterances used for training purposes should preferably be collected over a relatively long period of time.
TL;DR: A computer technique for synthesizing continuous messages by concatenating formant data for word-length utterances is described and the results show the synthesized numbers to be comparable in communicative effectiveness to naturally spoken digits.
Abstract: Speech signals can be described in terms of the resonances of the vocal tract. These resonances, or formants, change at rates comparable to the motions of the vocal tract. They therefore can be sampled and quantized to low bit-rates, and hence constitute an economical form for digital storage of speech information. Formant coding also permits flexible arrangement of speech elements into various contexts. This report describes a computer technique for synthesizing continuous messages by concatenating formant data for word-length utterances. The stored data for the synthesis corresponds to a bit-rate of 533 b/s. A Honeywell DDP-516 computer is used to experimentally evaluate a voice response system. In an initial application, the system is used to synthesize 7-digit telephone numbers. To assess the synthesis an interactive dialing experiment, also conducted by the computer, is described. The results show the synthesized numbers to be comparable in communicative effectiveness to naturally spoken digits.
TL;DR: This chapter outlines the process of speech production, perception, vocoders, and such, and discusses one major potential application of the man–machine interface using speech, namely automated audiovisual display systems.
Abstract: This chapter begins by describing the nature of a man–machine interface using speech. Just as the communication situation is symmetric, there is symmetry in the processes of speech recognition and speech synthesis. The advantages of a man–machine interface using speech are that it requires no tool, is omnidirectional, requires very little energy, allows physical mobility, gives additional reliability, offers more natural communication, is very suitable for “alert” or “break-in” messages, and much more. Some disadvantages are that it could prove aggravating and unusable, that speech output may not be scanned easily, that it is transitory, and that it requires additional complexity of equipment. The chapter outlines the process of speech production, perception, vocoders, and such. Some of the practical methods of generating speech by machine and approaches to machine perception of speech are examined. One major potential application of the man–machine interface using speech, namely automated audiovisual display systems, is discussed.
TL;DR: The degree of privacy which can be introduced in a speech communication system by deliberate variation of the order of transmission of the samples is investigated, and intelligibility tests are used to assess the effectiveness of the system.
Abstract: Many modern communication systems transmit amplitude samples of the original input signal, the sampling being carried out according to the well-known requirements of the Sampling Theorem. At the receiver the signal is reconstituted by low-pass filtering. This paper investigates the degree of privacy which can be introduced in a speech communication system by deliberate variation of the order of transmission of the samples. The distortion of the signal which results from the low-pass filtering of such 'scrambled' samples is investigated theoretically and practically, and intelligibility tests are used to assess the effectiveness of the system. Finally, some guidance is given as to the selection of the scramble sequence for maximum effectiveness.
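The idea of varying the transmission order of samples can be illustrated with a block permutation. The sketch below is an assumption-laden toy: the block length of four, the particular scramble sequence, and the function names are hypothetical, and the low-pass filtering distortion the paper analyzes is not modeled.

```python
def scramble(samples, order):
    """Reorder each full block of len(order) samples; a short tail is left as is."""
    n = len(order)
    if sorted(order) != list(range(n)):
        raise ValueError("order must be a permutation of 0..n-1")
    out = []
    for start in range(0, len(samples) - n + 1, n):
        block = samples[start:start + n]
        out.extend(block[source] for source in order)
    out.extend(samples[len(samples) - len(samples) % n:])  # unscrambled tail
    return out


def invert(order):
    """Return the permutation that undoes `order`."""
    inverse = [0] * len(order)
    for position, source in enumerate(order):
        inverse[source] = position
    return inverse


ORDER = [2, 0, 3, 1]  # hypothetical scramble sequence
samples = [10, 11, 12, 13, 14, 15, 16, 17, 18]

sent = scramble(samples, ORDER)
received = scramble(sent, invert(ORDER))
print(sent)      # [12, 10, 13, 11, 16, 14, 17, 15, 18]
print(received)  # [10, 11, 12, 13, 14, 15, 16, 17, 18]
```

A receiver that knows the sequence applies the inverse permutation before low-pass filtering; a listener who filters the samples in their transmitted order hears the distorted signal.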
TL;DR: A low-rate pulsive signal is generated to act as a carrier for the parameters of a linear predictor for speech signals. The bandwidth required for transmitting the resulting composite signal is substantially less than that of the original speech signal and somewhat less than that required for the transmission of predictively coded signals.
Abstract: In an adaptive, predictive coder for speech signals, the transmitted signal generally consists of an rms value, a pitch signal, a voice-unvoiced indication, and a number of parameter signals for adjusting the coefficients of a linear predictor. Transmission of these signals is improved in this invention by generating a low rate pulsive signal and by shaping its spectrum in accordance with the parameter signals. The pulsive signals thus act as a carrier for the parameters. The bandwidth required for transmitting the resulting composite signal, i.e., the modulated pulsive signal and the subsidiary signals, is substantially less than that of the original speech signal and somewhat less than that required for the transmission of predictively coded signals.
TL;DR: A speech processing device is proposed that corrects deficiencies in speech programs caused by inadequate energy in the presence band. It determines the relative amount of signal energy in that band and, if it is inadequate, automatically boosts the amplitude of presence band components to obtain a more nearly optimum spectral distribution.
Abstract: A speech processing device operative to correct deficiencies in speech programs caused by inadequate energy in the "presence band." The device determines the relative amount of total signal energy in the presence band, and, if it is inadequate, automatically boosts the amplitude of presence band components to a level to obtain a more optimum spectral distribution. The circuit is designed to operate with an automatic speech-music discriminator which inhibits control action during music programming.
TL;DR: The field of digital signal processing can encompass the analysis or synthesis of time or spatial signals for a wide variety of applications that include measuring the response of servo systems, spectral analysis of such things as mechanical vibrations in structures, noise in electronic equipment, and speech waveforms.
Abstract: Digital Signal Processing involves the manipulation of linear time invariant waveforms utilizing digital techniques. The field of digital signal processing can encompass the analysis or synthesis of time or spatial signals for a wide variety of applications. These include measuring the response of servo systems, spectral analysis of such things as mechanical vibrations in structures, noise in electronic equipment, and speech waveforms. Other applications include analyzing seismic returns, simulating coherent optical systems, and the processing of radar signal returns. Certain aspects of pattern recognition and picture processing are also amenable to the application of digital signal processing techniques.
TL;DR: A new pattern recognition technique is proposed that avoids the exhaustive comparison process associated with pattern matching. Preliminary results show that a performance very similar to that obtained from the exhaustive comparison process is attainable with a significant saving in computational effort.
Abstract: A description is given of an unusual pattern recognition technique which has been used in an experimental speech recognition system. Preliminary results obtained using this technique are reported. The speech analyzer produces a multichannel ternary signal at its output, which is the short term digital autocorrelation function of the input signal. This output is sampled at regular intervals and this sampled information is transferred to a computer. A new pattern recognition technique is proposed that avoids the exhaustive comparison process associated with pattern matching. The technique is similar to a tree-structured process in that decisions are taken that exclude certain master patterns from further processing as it becomes apparent that these are sufficiently dissimilar to the unknown pattern. However, retracing within the structure and the substitution of an alternative path are permitted if the current path appears unlikely to lead to a correct decision. Some preliminary results obtained using this technique are described. These show that a performance very similar to that obtained from the exhaustive comparison process is attainable with a significant saving in computational effort. The effect of varying certain parameters within the recognition process is also considered and some preliminary optimization of parameter values is reported.
TL;DR: A large set of vocoded speech signals has been evaluated in terms of preference and it is shown that, in certain respects, reliable system evaluations pose formidable problems.
Abstract: Starting from an IEEE Recommended Practice for Speech Quality Measurements and from previous work of the authors, a large set of vocoded speech signals has been evaluated in terms of preference. The set of speech samples has been taken from the vocoder survey of the 1967 Conference on Speech Communication and Processing, Boston, Mass. The test samples are evaluated by several methods: direct comparisons, the isopreference method, the relative preference method, the category judgment method, and the absolute preference judgment method. Due to the size of the test material, not all the test samples could be evaluated by all these methods. The test results are discussed and it is shown that, in certain respects, reliable system evaluations pose formidable problems. An effort to rank order the systems, which are described by small sets of test samples of frequently very different quality, for good reasons shows only limited success. The majority of the systems are of about equal preference with only insignificant differences. There are only a few systems that are outside this group and are either significantly better or worse than the rest.
TL;DR: The use of a small digital computer to process speech signals is described with illustrations, covering an increase in speech rate, an increase in intelligibility by converting speech into dichotic signals with an interaural time delay, and the extraction of desired segments of a speech sample.
Abstract: An increase in the rate and the intelligibility of sound is highly desirable in speech communication. Also, it is useful to have an accurate and efficient method of obtaining desired segments of a speech sample. In this paper, the use of a small digital computer in processing the speech signal to achieve the above purposes is described with illustrations. On‐line simulation of the method of Fairbanks et al. [G. Fairbanks et al., IRE Trans. Audio 2, 7–12, (1954)] of increasing the speech rate has been achieved with flexible speed‐up ratios and sampling intervals. Increase of intelligibility in speech signals by converting them into dichotic signals with an interaural time delay is discussed. These dichotic signals have been obtained from the computer for time delays between 0 and 1 sec. To obtain different segments of a speech sample, the computer is programmed to store the speech sample and display its waveform on an oscilloscope, so that various segments of the speech sample can be extracted and also joi...
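Two of the operations mentioned above lend themselves to a short sketch: speeding up speech by periodically discarding segments, and forming a dichotic pair by delaying one channel. The code is a simplified reading of those ideas, not the paper's program; the segment lengths, the zero padding, and all names are assumptions.

```python
def compress(samples, keep, discard):
    """Keep `keep` samples, drop the next `discard`, and abut what remains.

    The resulting speed-up ratio is (keep + discard) / keep.
    """
    if keep <= 0 or discard < 0:
        raise ValueError("keep must be positive and discard non-negative")
    out = []
    period = keep + discard
    for start in range(0, len(samples), period):
        out.extend(samples[start:start + keep])
    return out


def dichotic(samples, delay):
    """Return (left, right) channels, the right one delayed by `delay` samples."""
    left = list(samples) + [0.0] * delay   # pad so both channels match in length
    right = [0.0] * delay + list(samples)
    return left, right


print(compress(list(range(10)), keep=2, discard=1))  # [0, 1, 3, 4, 6, 7, 9]
print(dichotic([1.0, 2.0, 3.0], delay=2))
```

In practice the retained and discarded intervals would be expressed in milliseconds and converted to sample counts at the chosen sampling rate.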
TL;DR: A trial speech sample from a person who may be legitimate or an imposter is compared to a standard speech sample of the legitimate person, and the magnitudes of the resulting difference signals are added together to generate an authenticity signal.
Abstract: A method and system for talker authentication in which a trial speech sample from a person who may be legitimate or who may be an imposter is compared to a standard speech sample of the legitimate person. The trial speech sample forms the input to a plurality of band pass filters. The outputs of each of the filters are integrated over the duration of the speech sample and the integrated signals are normalized. These normalized signals are compared to normalized signals of a standard speech sample to generate a plurality of difference signals. The magnitudes of the difference signals are added together to generate an authenticity signal, the magnitude of which corresponds to the correspondence between the trial speech sample and the standard speech sample.
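The comparison stage described above can be sketched as follows. The inputs are assumed to be the already integrated outputs of the band-pass filters, one value per band. Normalizing by the sum across bands is an assumption, since the abstract does not say how normalization is done, and the function names and sample values are hypothetical.

```python
def normalize(band_values):
    """Scale integrated band outputs so they sum to one (assumed normalization)."""
    total = sum(band_values)
    if total <= 0:
        raise ValueError("band outputs must have positive total energy")
    return [value / total for value in band_values]


def authenticity_signal(trial_bands, standard_bands):
    """Sum of the magnitudes of per-band differences; smaller means closer."""
    if len(trial_bands) != len(standard_bands):
        raise ValueError("both samples need the same number of bands")
    trial = normalize(trial_bands)
    standard = normalize(standard_bands)
    return sum(abs(t - s) for t, s in zip(trial, standard))


standard = [1.0, 2.0, 1.0]     # hypothetical integrated band outputs
same_shape = [2.0, 4.0, 2.0]   # louder, but same spectral distribution
different = [1.0, 1.0, 2.0]

print(authenticity_signal(same_shape, standard))  # 0.0
print(authenticity_signal(different, standard))   # 0.5
```

Normalization makes the score insensitive to overall loudness, so only the distribution of energy across bands affects the result. Any accept or reject threshold applied to the score would have to be chosen empirically.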
TL;DR: A spectrum analyzer having an input, a mixer for feeding the input with a filtered speech signal and a frequency converted signal which is a function of physiological activity associated with the speech is described in this paper.
Abstract: A spectrum analyzer having an input, a mixer for feeding the input with a filtered speech signal and a frequency converted signal which is a function of physiological activity associated with the speech.
TL;DR: In this article, an electronic helium speech processor (unscrambler) incorporating appropriate corrective measures has been developed and evaluated, and its performance culminated in successful retrieval of voice intelligence from divers at 1000-ft pressure depth.
Abstract: The use of helium in breathing gas mixtures for deep diving operations causes marked changes in the speech sounds of the divers. The most prominent effect is an elevation of the formant frequencies which, in conjunction with other phenomena, progressively reduces speech intelligibility as depth increases. Our analyses and investigations have shown that to correct these distortions, the following measures are required: 1) provide a passband extending to at least 10 kHz; 2) equalize high frequency losses in vocal output and in microphone response; 3) reduce voice spectrum envelope frequencies to their expected values for 1.0 atm of air; and 4) preserve the glottal rate of the talker. Recent advances in commercially available electronic signal analyzers have made possible the recording of voice spectra averaged over a wide range of time intervals (50 ms-50 s), over a wide range of frequencies (to 20 kHz) and over a total amplitude range exceeding 100 dB. These capabilities are essential in exploring the total extent of distortions that helium introduces. Analyses of helium speech by means of long term average spectra have revealed phenomena not previously reported. An electronic helium speech processor (unscrambler) incorporating appropriate corrective measures has been developed and evaluated. Its performance culminated in successful retrieval of voice intelligence from divers at 1000-ft pressure depth. Objective word intelligibility test results for corrected speech of one talker were 88 percent at 500-ft and 78 percent at 800-ft depth.
TL;DR: Signal processing has become an important field in communications and in acoustics because signals must be extracted from the noisy backgrounds that usually accompany them. A knowledge of signal processing techniques also helps in understanding how the human ear functions.
Abstract: It is very important that methods be developed to extract signals from the noisy backgrounds that usually accompany them. Signal processing has, therefore, become an important field in communications and in acoustics. The human ear processes the information it receives and does a reasonably good job of filtering. A knowledge of the signal processing techniques that have been developed in the past will also be helpful in understanding the functioning of the human ear.
TL;DR: The capability and flexibility of the Voice Data Processor System (VDPS) was increased to include the following features: a modification was designed to provide a real-time CRT display of selected VDPS parameters; the AFCRL Linc Processor was interfaced with the VDPS.
Abstract: The report contains the results of an investigation of speech pattern-matching using a digital voice data processor to evaluate pattern-matching speech bandwidth compression techniques. The capability and flexibility of the Voice Data Processor System (VDPS) was increased to include the following features: A modification was designed to provide a real-time CRT display of selected VDPS parameters; the AFCRL Linc Processor was interfaced with the VDPS; a modification was made to reinsert silent frames into the output from silence-edited digital tapes as they are run in a tape input mode; the capability of editing out the onset portions of words of speech in real-time has been added to the system. In addition, the results of studies on increasing VDPS flexibility and memory capacity are presented, to accommodate the research on the effects of speaker selection.
TL;DR: An experimental model of a coder for transmission of speech over a 9600-bits/s digital channel was built to demonstrate feasibility of an adaptive prediction-coding technique.
Abstract: An experimental model of a coder for transmission of speech over a 9600-bits/s digital channel was built to demonstrate feasibility of an adaptive prediction-coding technique. After analog-to-digital conversion of the speech input, the coder employs digital processing using a computer type organization. Resonances in the short-term speech spectrum are removed by a nonrecursive digital transmit filter and the resulting uncorrelated signal is coded by an 8000-bits/s direct feedback delta coder. The transmit filter parameters are adapted to the input spectrum by a least squares algorithm involving calculation of short term correlation coefficients of the sequence of input samples. These filter parameters are multiplexed with the delta coder output for transmission to the receiver. A recursive receive filter restores the original speech spectrum. A computer simulation of the voice digitizer was performed to determine the order of the digital filters and to optimize other parameters prior to the design of the experimental model. The results of the simulation and design considerations for the experimental model are described.
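The filter pair described above can be sketched in Python. The predictor is fitted from short-term autocorrelation values using the Levinson-Durbin recursion, which is one common way to solve the least-squares normal equations and is an assumption here, not necessarily the paper's algorithm. The order, frame length, and test signal are hypothetical, and the delta coder between the two filters is omitted.

```python
import math


def autocorrelation(frame, max_lag):
    """Short-term autocorrelation values r[0..max_lag] of one frame."""
    return [sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
            for lag in range(max_lag + 1)]


def solve_predictor(r, order):
    """Levinson-Durbin recursion; returns a[1..order] with x[n] ~ sum a[k] x[n-k]."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0:
            break  # degenerate frame, keep the coefficients found so far
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


def transmit_filter(frame, coeffs):
    """Nonrecursive filter: subtract the prediction formed from past inputs."""
    return [x - sum(a * frame[n - k]
                    for k, a in enumerate(coeffs, start=1) if n - k >= 0)
            for n, x in enumerate(frame)]


def receive_filter(residual, coeffs):
    """Recursive filter: add back the prediction formed from past outputs."""
    out = []
    for n, e in enumerate(residual):
        out.append(e + sum(a * out[n - k]
                           for k, a in enumerate(coeffs, start=1) if n - k >= 0))
    return out


ORDER = 2  # hypothetical predictor order
frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n + 0.4) for n in range(200)]
coeffs = solve_predictor(autocorrelation(frame, ORDER), ORDER)
residual = transmit_filter(frame, coeffs)
restored = receive_filter(residual, coeffs)
```

Without a coder in between, the receive filter inverts the transmit filter up to rounding error. In the real system the residual would be coarsely coded, so the restored speech would carry the coding noise shaped by the receive filter.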
TL;DR: A digital hardware realization of a formant synthesizer is described that multiplexes a single arithmetic unit among several digital filter sections to produce speech in real time.
Abstract: Terminal analog or formant speech synthesizers have found many applications in speech research. These include investigation of computer voice response, speech synthesis-by-rule, and speech perception studies, among others. Many types of formant synthesizers have been designed and realized either in analog circuitry or as a computer program. In this paper we describe a digital hardware realization of a formant synthesizer which utilizes the technique of digital multiplexing of a single arithmetic unit among several digital filter sections. The advantages of this hardware over conventional analog hardware include: precise control over center frequencies and bandwidths of the resonators in the synthesizer, stability and reliability of the hardware, light weight, small size, and low power consumption. The synthesizer is capable of producing speech in real time at sampling rates up to 12.8 kHz, using 24 bits to process the digital signals internal to the synthesizer. A 12-bit digital-to-analog convertor supplies an immediate analog output for monitoring the speech and a provision is included for returning 16 bits of the output signal to the computer for future processing such as waveform display or spectrum analysis.
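A software sketch can make the "digital filter sections" above concrete. It uses a common textbook form of a second-order digital resonator, with coefficients set from a center frequency and a bandwidth, which is assumed here and not taken from the hardware described. The formant values, the impulse-train excitation, and the function names are hypothetical; the 12.8 kHz sampling rate is the maximum quoted in the abstract.

```python
import math


def resonator_coefficients(center_hz, bandwidth_hz, sample_rate):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2], unity gain at DC."""
    period = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * period)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * period)
         * math.cos(2.0 * math.pi * center_hz * period))
    a = 1.0 - b - c
    return a, b, c


def resonate(samples, coefficients):
    """Run one second-order resonator section over the input samples."""
    a, b, c = coefficients
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(samples, formants, sample_rate):
    """Pass the signal through one resonator per (center, bandwidth) pair."""
    for center_hz, bandwidth_hz in formants:
        samples = resonate(
            samples, resonator_coefficients(center_hz, bandwidth_hz, sample_rate))
    return samples


SAMPLE_RATE = 12800
FORMANTS = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]  # hypothetical vowel

# Impulse train with a 128-sample period, i.e. a 100 Hz pitch at this rate.
excitation = [1.0 if n % 128 == 0 else 0.0 for n in range(1280)]
vowel = cascade(excitation, FORMANTS, SAMPLE_RATE)
```

Each section needs the same few multiplies and adds per sample, which is what makes it practical for hardware to time-share one arithmetic unit among the sections.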