TL;DR: The vector quantizing approach is shown to be a mathematically and computationally tractable method which builds upon knowledge obtained in linear prediction analysis studies and is introduced in a nonrigorous form.
Abstract: With rare exception, all presently available narrow-band speech coding systems implement scalar quantization (independent quantization) of the transmission parameters (such as reflection coefficients or transformed reflection coefficients in LPC systems). This paper presents a new approach called vector quantization. For very low data rates, realistic experiments have shown that vector quantization can achieve a given level of average distortion with 15 to 20 fewer bits/frame than that required for the optimized scalar quantizing approaches presently in use. The vector quantizing approach is shown to be a mathematically and computationally tractable method which builds upon knowledge obtained in linear prediction analysis studies. This paper introduces the theory in a nonrigorous form, along with practical results to date and an extensive list of research topics for this new area of speech coding.
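The contrast with scalar quantization can be illustrated with a minimal sketch. It assumes a squared-error distortion and a tiny hand-written codebook; neither comes from the abstract, and the names `nearest_codeword` and `bits_per_frame` are hypothetical. The whole parameter vector for a frame is mapped to a single codebook index, so the cost per frame is the base-2 logarithm of the codebook size.

```python
import math

def nearest_codeword(vector, codebook):
    """Return (index, distortion) of the codeword closest to `vector`.

    Squared error is used here as an assumed distortion measure.
    """
    best_index, best_dist = 0, math.inf
    for index, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist

def bits_per_frame(codebook_size):
    """Bits needed to transmit one codebook index."""
    return math.ceil(math.log2(codebook_size))

# Toy two-dimensional codebook (illustrative values only).
codebook = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
frame = (0.4, -0.7)
index, distortion = nearest_codeword(frame, codebook)
```

Only `index` would be transmitted; the receiver looks up the same codebook entry to reconstruct the frame parameters.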
TL;DR: An improved signal processor (100, 200) is proposed for the spread spectrum (de)multiplexing of speech signals and nonspeech signals over a common communication path.
Abstract: It is known to multiplex speech signals and nonspeech signals over a common communication path. One arrangement uses a portion of the frequency spectrum of the path for speech signals with the remainder for nonspeech signals. Another inserts data signals during gaps in the speech signals. Still another treats a speech signal as a carrier signal and modulates the speech signal with data signals. Unfortunately, users of such known arrangements experience excessive distortion or perceive others as encroaching on the path. These and other problems are mitigated by my improved signal processor (100, 200) for the spread spectrum (de)multiplexing of speech signals and nonspeech signals. In an illustrative embodiment, at a transmitter, a block (110) of speech signals may be converted (140) from a time domain to a frequency domain by a Fourier transformation. A Fourier component may be pseudo-randomly selected (130) from a subset of such components. Responsive to the selected component, a prediction (160) of the component may be substituted therefor, the prediction being thereafter modified (170), e.g., by its amplitude being incremented or decremented to reflect the multiplexing of a logic 1 or a logic 0 nonspeech signal. The modified prediction may be converted (150) back to the time domain for transmission to a receiver. At the receiver, a parallel demultiplexing (200) occurs for extracting (270) speech signals and nonspeech signals from the multiplexed signals.
TL;DR: The results suggest that listeners are more likely to respond to the postlexical phonological code when contextual constraints are present, and the Dual Code hypothesis is presented, which indicates that listeners can gain access to, or identify, entities at both of these levels.
TL;DR: A membrane diffusion device is disclosed which comprises a flat, tubular, semipermeable membrane and a flexible, ribbed membrane support sheet positioned against said membrane, the two typically rolled up together into a coil.
TL;DR: The development of a digital encoding system designed to exploit the limited detection ability of the auditory system is described; by dynamically shaping the encoding error spectrum as a function of the input speech signal, the error is masked by the speech.
Abstract: The development of a digital encoding system designed to exploit the limited detection ability of the auditory system is described. By dynamically shaping the encoding error spectrum as a function of the input speech signal, the error is masked by the speech. Psychoacoustic experiments and results from the literature provide a basis for determining the system parameters that ensure that the error is inaudible. The encoder is a multi-channel system, each channel approximately of critical bandwidth. The input signal is filtered into 17 frequency channels via the quadrature mirror filter technique. Each channel is then coded using block-companding adaptive PCM. For 4.1 kHz bandwidth speech, the differential threshold of the encoding degradation occurs at a bit rate of 34.4 kbps. At 16 kbps, the encoder produces toll quality speech output.
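The per-channel coding step can be sketched as follows. This is a simplified illustration of block companding, assuming a uniform quantizer whose step size is set by the peak magnitude of each block; the block length, the bit count, and the function names are assumptions and do not come from the abstract.

```python
def block_compand_encode(block, bits):
    """Quantize one block with a step size scaled to the block's peak.

    Returns (step, codes). The step is the side information that a
    decoder would need in addition to the integer codes.
    """
    peak = max(abs(x) for x in block)
    levels = 2 ** (bits - 1) - 1  # symmetric signed code range
    if peak == 0.0 or levels == 0:
        return 0.0, [0] * len(block)
    step = peak / levels
    codes = [round(x / step) for x in block]
    return step, codes

def block_compand_decode(step, codes):
    """Reconstruct the block from the step size and integer codes."""
    return [c * step for c in codes]
```

Because the step size follows the block peak, the quantization error scales with the local signal level, which is the property that lets the error stay below the signal in each channel.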
TL;DR: A generalized signal processing device for use as a hearing aid generates and modifies signals representative of acoustic patterns in a physiologically compatible manner which enables persons suffering from various auditory pathologies to recognize sound patterns including human speech.
Abstract: A generalized signal processing device for use as a hearing aid generates and modifies signals representative of acoustic patterns in a physiologically compatible manner which enables persons suffering from various auditory pathologies to recognize sound patterns including human speech.
TL;DR: A signal model based more directly upon the physics of speech generation is proposed and implemented, and parametric control of the synthesis model is implemented by an adaptive procedure that minimizes the spectral difference between a human speech input and the synthetic output of the model.
Abstract: A traditional model of the speech signal has provided the underpinning of vocoder technology since the inception of analysis/synthesis telephony. The model is a first‐order approximation to human speech generation in which the source of vocal sound and the resonant acoustic system are treated as linear, separable elements. This source‐system model cannot properly account for a number of acoustic factors now known to exist in speech generation. We propose and implement here a signal model based more directly upon the physics of speech generation. We also implement parametric control of the synthesis model by an adaptive procedure that minimizes the spectral difference between a human speech input and the synthetic output of the model. The adapted parameters constitute a low bit‐rate representation of the input human speech. We test a preliminary form of the system by computer simulation and demonstrate that in simple initial trials the signal model is able to adapt in a realistic manner.
TL;DR: Some of the frequency domain expressions of statistical distance measures between stationary vector Gaussian processes recently derived by the authors have been empirically verified to be very useful in speech recognition and speech analysis-synthesis.
Abstract: We summarize some new frequency domain expressions of statistical distance measures between stationary vector Gaussian processes recently derived by the authors. Both time-discrete and time-continuous processes are treated. Some of the frequency domain distance measures have been empirically verified to be very useful in speech recognition and speech analysis-synthesis.
TL;DR: An experimental system that enables text-based office services to be merged with a speech storage facility that includes an editor which allows a user to modify speech messages is described.
Abstract: This paper describes an experimental system that enables text-based office services to be merged with a speech storage facility. The principal new component of this system is an editor which allows a user to modify speech messages. The discussion covers the system architecture, the facilities developed to make speech editing tractable, and the speech processing required to implement the system.
TL;DR: It is shown that, in several simple performance evaluations, the local minimum method performed considerably better than the fixed range method.
Abstract: Several variations on algorithms for dynamic time warping have been proposed for speech processing applications. In this paper two general algorithms that have been proposed for word spotting and connected word recognition are studied. These algorithms are called the fixed range method and the local minimum method. The characteristics and properties of these algorithms are discussed. It is shown that, in several simple performance evaluations, the local minimum method performed considerably better than the fixed range method. Explanations of this behavior are given and an optimized method of applying the local minimum algorithm to word spotting and connected word recognition is described.
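For background, the basic dynamic time warping recursion that such variants build on can be sketched as below. This is not the fixed range method or the local minimum method, whose details the abstract does not give; it is plain unconstrained DTW, assuming scalar features, an absolute-difference local cost, and the three standard predecessor moves.

```python
import math

def dtw_distance(reference, test):
    """Accumulated cost of the best warping path between two sequences."""
    n, m = len(reference), len(test)
    # cost[i][j] = best accumulated cost aligning the first i reference
    # frames with the first j test frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - test[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # reference frame advances alone
                cost[i][j - 1],      # test frame advances alone
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

A test sequence that merely repeats a frame of the reference aligns at zero cost, which is the time-normalizing behavior that word spotting relies on.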
TL;DR: Examination of the effects of pause time on the perception of sentences indicated that in sentences containing pauses between clauses, words were categorized more rapidly and propositions were recalled more accurately than in sentences containing pauses within the clause.
Abstract: Pauses can be used to facilitate certain operations involved in the production and in the perception of speech. In the case of speech perception, pauses have been found to improve the accuracy of detection and the recall of lists of digits and letters. The aim of the present experiments was to examine the effects of pause time on the perception of sentences. In experiment I, a semantic categorization task was used and in experiment II a sentence recall task. The results indicated that in sentences containing pauses between clauses, words were categorized more rapidly (experiment I) and propositions were recalled more accurately (experiment II) than in sentences containing pauses within the clause. The results are interpreted in the context of existing models of speech processing, and the significance of pause time for cognitive activity is discussed.
TL;DR: In this paper, a helium-speech unscrambler was proposed to reduce the bandwidth of the helium speech before transmitting the speech signals to a distant location on a carrier wave selected for optimum transmission through the water.
Abstract: The invention relates to a novel helium-speech unscrambler which can be located at a diver's location, and enables the helium-speech voiced by the diver to be subjected to waveform time expansion to reduce the bandwidth of the helium speech (e.g. to 2 to 3 kHz) prior to transmitting the speech signals to a distant location on a carrier wave selected for optimum transmission through the water.
TL;DR: A conversational-mode speech-understanding system which enables its user to make airline reservations and obtain timetable information through a spoken dialog is described; it is structured as a three-level hierarchy consisting of an acoustic word recognizer, a syntax analyzer, and a semantic processor.
Abstract: We describe a conversational mode speech understanding system which enables its user to make airline reservations and obtain timetable information through a spoken dialog. The system is structured as a three level hierarchy consisting of an acoustic word recognizer, a syntax analyzer and a semantic processor. The semantic level controls an audio response system making two way speech communication possible. The system is highly robust and operates on-line in a few times real time on a laboratory minicomputer. The speech communication channel is a standard telephone set connected to the computer by an ordinary dialed-up line.
TL;DR: Speech compaction/replay apparatus is described for monitoring speech in real time and filtering out periods of relative silence from a recording of the speech; the recording also contains synchronization and time code information so that the replayed audio essentially replicates the analog speech input in real time.
Abstract: Speech compaction/replay apparatus for real time monitoring of speech and filtering out periods of relative silence from a recording of the speech. The recording also contains synchronization and time code information for ensuring that, on replay and in terms of real time, the audio output will essentially replicate the analog speech input. The apparatus and technique minimize the amount of storage media required to store the speech.
TL;DR: This paper presents an interpretation of the log likelihood ratio measure within the theoretical framework of a waveform coder distortion model, and discusses the implications of this interpretation and how it can be applied to the formulation of better objective measures of waveform coder performance.
Abstract: The log likelihood measure has been widely used in speech research for comparing speech signals. Recently, it has been proposed as a measure for assessing the quality of coded speech. In this paper we present an interpretation of the log likelihood ratio measure within the theoretical framework of a waveform coder distortion model. We then discuss the implications of this interpretation and show how it can be applied to the formulation of better objective measures of waveform coder performance.
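For reference, one commonly cited form of the log likelihood ratio measure between an original frame and a coded frame is given below. The abstract does not state the formula, so the notation is an assumption: $\mathbf{a}_x$ and $\mathbf{a}_d$ are the linear prediction coefficient row vectors of the original and coded frames, and $\mathbf{R}_x$ is the autocorrelation matrix of the original frame. Conventions differ between authors on which frame supplies the autocorrelation matrix.

```latex
d_{\mathrm{LLR}}(\mathbf{a}_x, \mathbf{a}_d)
  = \log \left(
      \frac{\mathbf{a}_d \, \mathbf{R}_x \, \mathbf{a}_d^{T}}
           {\mathbf{a}_x \, \mathbf{R}_x \, \mathbf{a}_x^{T}}
    \right)
```

The denominator is the minimum prediction residual energy for the original frame, so the ratio is at least one and the measure is non-negative, reaching zero when the two coefficient vectors coincide.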
TL;DR: In this article, a speech detector uses a signal classifier to identify portions of a representation of the average magnitude of a group of signal samples indicative of either speech or noise, and a level estimator uses selectively obtained signal measures from the defined portions of the representation to provide adaptively variable decision levels.
Abstract: A speech detector uses a signal classifier (19) to identify portions of a representation of the average magnitude of a group of signal samples indicative of either speech or noise. A controller (33) in the signal classifier follows a four state sequence using appropriate time constants for signal measures in a variety of signal conditions in defining the speech and noise portions of the representation. A level estimator (21) uses selectively obtained signal measures from the defined portions of the representation to provide adaptively variable decision levels. A speech definer (16) compares the representation to a first decision level and the signal samples to a higher decision level to indicate the occurrence of speech signal activity when either decision level is exceeded. In a two way transmission arrangement, a receive trunk speech detector uses a stretcher (133) to prevent adaptation of the transmit speech detector thresholds when echo signals are present.
TL;DR: This paper summarizes the performance of the objective measures in predicting the subjective results and presents the results of a statistical correlation study between a data base of subjective speech quality measures and a data base of objective speech quality measures.
Abstract: This paper presents the results of a statistical correlation study between a data base of subjective speech quality measures and a data base of objective speech quality measures. Both data bases are derived from approximately 18 hours of coded and distorted speech. The subjective test used was the Diagnostic Acceptability Measure (DAM), a parametric speech quality test developed at the Dynastat Corporation. The objective measures included approximately 1500 parametric variations of many commonly suggested objective measures. This paper summarizes the performance of the objective measures in predicting the subjective results.
TL;DR: This paper presents a new method of voiced/unvoiced/silence discrimination of speech based on the results of counting bit alternations of the bit stream from linear delta modulation of the speech signal and zero crossings of a band-pass filtered output of the decoded LDM signal.
Abstract: This paper presents a new method of voiced/unvoiced/silence discrimination of speech. The decision algorithm is based on the results of counting bit alternations of the bit stream from linear delta modulation (LDM) of the speech signal and zero crossings of a band-pass filtered output of the decoded LDM signal. Computer simulation of the system with real speech has yielded accurate results. Economical realization of the discriminator hardware using standard integrated circuits is also considered.
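The bit-alternation feature can be sketched as follows. This is a minimal illustration assuming a one-bit encoder with a fixed step size and a zero initial estimate; the decision thresholds and the band-pass zero-crossing stage of the paper are not reproduced, and the function names are hypothetical.

```python
def ldm_encode(samples, step):
    """Linear delta modulation: one bit per sample, fixed step size."""
    estimate = 0.0
    bits = []
    for x in samples:
        bit = 1 if x >= estimate else 0
        # The running estimate moves one step toward the input.
        estimate += step if bit else -step
        bits.append(bit)
    return bits

def count_bit_alternations(bits):
    """Number of positions where consecutive bits differ."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)
```

When the input is idle, the encoder hunts around the signal level and the bits alternate almost every sample; when the encoder is tracking a large slope, long runs of equal bits appear and the count drops. That contrast is what makes the count usable as a classification feature.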
TL;DR: An apparatus is described for the acquisition of a raw speech signal and the essentially simultaneous acquisition of a transform of the speech signal, wherein said transform covaries as a function of changes in one or more parameters in the speech signal and is indicative of a predetermined speech characteristic, such as nasalization, pitch or intensity.
Abstract: An apparatus for the acquisition of a raw speech signal and the essentially simultaneous acquisition of a transform of the speech signal, wherein said transform covaries as a function of changes in one or more parameters in the speech signal and is indicative of a predetermined selected speech characteristic, such as nasalization, pitch or intensity. The apparatus includes a microphone for producing first signals representative of raw speech, and a second transducer, such as, for example, an accelerometer for generating second signals essentially simultaneous to the production of the first signals, with the second signals being indicative of a selected parametric characteristic of the human speech, such as, for example, nasalization. The first and second signals are applied to data processing circuits which analyze the first and second signals to produce transform signals based on arithmetic combinations thereof. The apparatus further includes display means for providing videographic and alphanumeric display of the transform signals accompanied by synchronous audio display of the raw speech.
TL;DR: An adaptive bit allocation scheme is introduced here, in order to replace the usual form of a fixed distribution of the bit rate among the sub-bands, and highly intelligible reproduction of speech is possible at bit rates below 7 kb/s.
TL;DR: In this algorithm, the partitioning of the speech band into subbands is performed via a bank of eight bandpass filters whose coefficients are derived from a QMF tree structure to enable a more efficient encoding of the subband waveforms.
Abstract: This paper describes a high quality 16 kb/s subband coder using quadrature mirror filters (QMF). In this algorithm, the partitioning of the speech band into subbands is performed via a bank of eight bandpass filters whose coefficients are derived from a QMF tree structure. Due to the special properties of the QMF, no significant spectral distortions are introduced in the band-splitting and reconstruction processes. To enable a more efficient encoding of the subband waveforms, an adaptive bit assignment scheme is incorporated which allocates the number of quantizer bits according to the distribution of subband energies.
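One way to allocate quantizer bits according to subband energies is a greedy loop, sketched below. The abstract does not give the actual assignment rule, so this is an assumed scheme: each bit goes to the subband with the largest remaining energy, and that energy is then divided by four on the usual rule of thumb that one extra bit lowers quantization noise power by about 6 dB.

```python
def allocate_bits(subband_energies, total_bits):
    """Greedy bit assignment driven by subband energies.

    Returns a list with the number of quantizer bits for each subband.
    """
    remaining = list(subband_energies)
    bits = [0] * len(remaining)
    for _ in range(total_bits):
        # Pick the subband whose remaining (noise-equivalent) energy is largest.
        k = max(range(len(remaining)), key=lambda i: remaining[i])
        bits[k] += 1
        remaining[k] /= 4.0  # one more bit: about 6 dB less noise
    return bits
```

With energies of 16, 4, 1 and 0.25 and six bits to spend, the loop gives the strongest band three bits and the weakest band none, so the bit budget follows the spectrum frame by frame.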
TL;DR: A novel innovations based time-domain pitch detection technique for speech-like signals using one of the authors' recursive least-squares ladder algorithms that has the advantage that all the necessary variables are already computed in the modeling ladder recursions and therefore is suited for fast on-line or even hardware implementations.
Abstract: We present a novel innovations based time-domain pitch detection technique for speech-like signals using one of our recursive least-squares ladder algorithms. The basic assumption is that the speech driving process consists of an approximately Gaussian part (unvoiced) and a jump part (voiced). The pitch pulse positions located by processing the innovations alone are known not to be very accurate due to phase-distortions, effects of zeros and inaccurate model parameter estimates. In our ladder form linear prediction recursions, a log-likelihood function is recursively computed (on-line) for each speech sample. The derivative of this log-likelihood function becomes a sensitive measure of extreme outliers of the speech waveforms, i.e., samples that very likely do not fit the Gaussian statistics. When combined with the innovations of our ladder algorithms a good statistic is obtained for locating the pitch pulses by thresholding. This pitch detection scheme has the advantage that all the necessary variables are already computed in the modeling ladder recursions and therefore is suited for fast on-line or even hardware (e.g., VLSI) implementations.
TL;DR: An integrated circuit speech synthesis system uses complementary metal-insulator-semiconductor technology to achieve low voltage operation, wherein a pulse width modulated digital-to-analog converter is employed to provide for accurate conversion of digital signals into analog signals even though the low voltage operation prohibits the large voltage swings normally required for digital-to-analog converter circuitry.
Abstract: Integrated circuit speech synthesis system utilizing complementary metal-insulator-semiconductor technology to achieve low voltage operation, wherein a pulse width modulated digital-to-analog converter is employed to provide for accurate conversion of digital signals into analog signals even though the low voltage operation prohibits the large voltage swings normally required for digital-to-analog converter circuitry. The speech synthesis system includes a linear predictive filter as a speech synthesizer which utilizes coded reflection coefficients to produce digital signals representative of human speech. A microprocessor controls the access of digitized speech data which is stored in a memory. The speech synthesizer and microprocessor along with the pulse width modulated digital-to-analog converter are implemented in complementary metal-insulator-semiconductor technology. The system also includes a speaker for generating audible sounds in the form of synthesized human speech from the analog signals provided by the digital-to-analog converter.
TL;DR: A robust algorithm for making the voiced-unvoiced-silence decision is described, based on a nonparametric rank-order statistical signal-detection scheme that does not require a training set of data and maintains a constant false alarm rate for a broad class of noise inputs.
Abstract: This paper describes a theoretical and experimental investigation for detecting the presence of speech in wide-band noise. A robust algorithm for making the voiced-unvoiced-silence decision is described. This algorithm is based on a nonparametric rank-order statistical signal-detection scheme that does not require a training set of data and maintains a constant false alarm rate for a broad class of noise inputs. Two rank-order decision procedures are investigated, the Kruskal-Wallis and the multiple use of the two-sample Savage statistic. The performances of these detectors are evaluated and compared to that obtained from manually classifying twenty recorded utterances. In limited testing, the average probability of misclassification of voiced speech for the Savage case was less than 6, 13, 28, and 55 percent, corresponding to signal-to-noise ratios of 30, 20, 10, and 0 dB, respectively.
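The Kruskal-Wallis statistic itself is straightforward to compute, as the sketch below shows. It implements the textbook H statistic with average ranks for ties and no tie correction; how the paper forms the sample groups from speech frames is not stated in the abstract, so the grouping here is left to the caller and the function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [x for group in groups for x in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)
```

Because only ranks enter the statistic, its distribution when all groups come from the same noise does not depend on the noise amplitude distribution, which is consistent with the constant false alarm rate property the abstract describes.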
TL;DR: The results of these experiments indicate that traditional metrics of boundary strength, as well as linguistic formulations of phonological rules, must be elaborated to recognize the special status of clause boundaries and deletion sites.
TL;DR: The present applications of SYNTE 2 are described, including the speaking machine, a talking data terminal for blind computer programmers, a system for automatic production of spoken information for the blind, etc.
Abstract: SYNTE 2 is a low-cost, high-quality, text-to-speech synthesizer designed for Finnish but applicable also to other languages if "phoneme writing" is used. After its first presentation in 1977 it has been adapted to many communication aids for the handicapped. The first application was a portable speaking machine with unlimited vocabulary for the speech impaired. This paper describes the present applications of SYNTE 2, including the speaking machine, a talking data terminal for blind computer programmers, a system for automatic production of spoken information for the blind, etc.
TL;DR: A talking electronic arithmetic learning aid presents mathematical problems in words and phrases audibly voiced in synthesized speech, to enhance the ability of the operator in perceiving mathematical problems in an audibly verbalized form.
Abstract: An electronic handheld arithmetic learning aid which includes a speech synthesis device, a speaker driven by the speech synthesis device, a memory having digital data stored therein from which a plurality of mathematical problems may be derived for presentation to an operator for solution, and a controller for accessing selected portions of the digital data from the memory for input to the speech synthesizer device in presenting the mathematical problems to the operator in an audibly voiced manner via the speaker. In one aspect, at least some of the mathematical problems derivable from the memory involve respective sets of at least two individual numbers from which the operator is expected to determine a particular mathematical relation in providing a solution to the corresponding mathematical problem. In another aspect, the mathematical problems derivable from the digital data of the memory respectively involve the random selection of an unknown number which the operator is expected to identify by proposing a trial number. The memory further includes the solutions to these problems and digital speech data for enabling the speech synthesizer device to provide speech signals from which words posing the mathematical problems, the correct solutions thereto, and comments on operator inputs may be audibly voiced in human speech via the speaker as driven by the speech synthesizer device. The talking electronic arithmetic learning aid presents the mathematical problems in words and phrases as audibly voiced in synthesized speech to enhance the ability of the operator in perceiving mathematical problems in an audibly verbalized form.
TL;DR: In this article, a speech analysis-synthesis system was developed which is capable of independent manipulation of the fundamental frequency and spectral envelope of a speech waveform, which has applications in the areas of voice modification, baseband-excited vocoders, time-scale modification, and frequency compression as an aid to the partially deaf.
Abstract: A new speech analysis-synthesis system has been developed which is capable of independent manipulation of the fundamental frequency and spectral envelope of a speech waveform. The system deconvolves the original speech with the spectral-envelope estimate to obtain a model for the excitation. Hence, explicit pitch extraction is not required. As a consequence, the transformed speech is more natural sounding than would be the case if the excitation were modeled as a sequence of pulses. The system has applications in the areas of voice modification, baseband-excited vocoders, time-scale modification, and frequency compression as an aid to the partially deaf.