TL;DR: In a set of experiments involving 35 pairs of phonetically similar sentences representing seven types of structural contrasts, the perceptual evidence shows that some, but not all, of the pairs can be disambiguated on the basis of prosodic differences.
Abstract: Prosodic structure and syntactic structure are not identical; neither are they unrelated. Knowing when and how the two correspond could yield better quality speech synthesis, could aid in the disambiguation of competing syntactic hypotheses in speech understanding, and could lead to a more comprehensive view of human speech processing. In a set of experiments involving 35 pairs of phonetically similar sentences representing seven types of structural contrasts, the perceptual evidence shows that some, but not all, of the pairs can be disambiguated on the basis of prosodic differences. The phonological evidence relates the disambiguation primarily to boundary phenomena, although prominences sometimes play a role. Finally, phonetic analyses describing the attributes of these phonological markers indicate the importance of both absolute and relative measures.
TL;DR: This paper investigated adults' use of prosodic emphasis to mark focused words in speech to infants and adults and found that, in speech to infants, mothers consistently positioned focused words on exaggerated pitch peaks in utterance-final position, whereas in speech to adults prosodic emphasis was more variable.
Abstract: Two studies investigated adults' use of prosodic emphasis to mark focused words in speech to infants and adults. In Experiment 1, 18 mothers told a story to a 14-month-old infant and to an adult, using a picture book in which 6 target items were the focus of attention. Prosodic emphasis was measured both acoustically and subjectively. In speech to infants, mothers consistently positioned focused words on exaggerated pitch peaks in utterance-final position, whereas in speech to adults prosodic emphasis was more variable. In Experiment 2, 12 women taught another adult an assembly procedure involving familiar and novel terminology. In both studies, stressed words in adult-directed speech rarely coincided with pitch peaks. However, in infant-directed speech, mothers regularly used pitch prominence to convey primary stress. The use of exaggerated pitch peaks at the ends of utterances to mark focused words may facilitate speech processing for the infant. Research on features of early linguistic experiences that influence language acquisition has begun to explore the role of prosody in providing the infant with acoustic cues to linguistic structure in the speech waveform. Faced with the problem of discovering the linguistically relevant units in continuous speech, the preverbal infant may take advantage of prosodic features that are regularly correlated with such units as words, phrases, and clauses. According to some versions of this "prosodic bootstrapping hypothesis" (e.g., Gleitman & Wanner, 1982; Morgan & Newport, 1981), infants can exploit the prosodic cues routinely available in spoken English to infer the correct units of analysis for the language without requiring special prosodic modifications in maternal speech. Other evidence suggests that the characteristic exaggeration of prosodic cues in infant-directed speech may indeed be useful to the infant in partitioning the speech stream.
When speaking to infants, mothers use more exaggerated vowel lengthening to mark both phrase boundaries (Morgan, 1986) and clause boundaries (Bernstein Ratner, 1986) than when speaking to adults. The finding that infants are sensitive to clause boundaries in infant-directed but not in adult-directed speech (Kemler-Nelson, Hirsh-Pasek, Jusczyk, & Cassidy, 1989) suggests that the exaggerated prosody typical of mothers' speech (e.g., Fernald & Simon, 1984) increases the salience of acoustic cues to linguistic structure for the preverbal infant. This recent interest in the usefulness of prosody to infants who are learning language has focused mainly on how prosodic cues reveal phrase and clause structure. The focus on syntactic rather than lexical units is not surprising, given that most current theories of language acquisition take for granted the
TL;DR: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment, including SNR-Dependent Cepstral Normalization (SDCN) and Codeword-Dependent Cepstral Normalization (CDCN).
Abstract: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment. These algorithms attempt to improve the recognition accuracy of speech recognition systems when they are trained and tested in different acoustical environments, and when a desk-top microphone (rather than a close-talking microphone) is used for speech input. Without such processing, mismatches between training and testing conditions produce an unacceptable degradation in recognition accuracy.
Two kinds of environmental variability are introduced by the use of desk-top microphones and different training and testing conditions: additive noise and spectral tilt introduced by linear filtering. An important attribute of the novel compensation algorithms described in this thesis is that they provide joint rather than independent compensation for these two types of degradation.
Acoustical compensation is applied in our algorithms as an additive correction in the cepstral domain. This allows a higher degree of integration within SPHINX, the Carnegie Mellon speech recognition system, which uses the cepstrum as its feature vector. As a result, these algorithms can be implemented very efficiently. Processing in many of these algorithms is based on instantaneous signal-to-noise ratio (SNR), as the appropriate compensation represents a form of noise suppression at low SNRs and spectral equalization at high SNRs.
The compensation vectors for additive noise and spectral transformations are estimated by minimizing the differences between speech feature vectors obtained from a "standard" training corpus of speech and feature vectors that represent the current acoustical environment. In our work this is accomplished by minimizing the distortion of vector-quantized cepstra that are produced by the feature extraction module in SPHINX.
In this dissertation we describe several algorithms, including SNR-Dependent Cepstral Normalization (SDCN) and Codeword-Dependent Cepstral Normalization (CDCN). With CDCN, the accuracy of SPHINX when trained on speech recorded with a close-talking microphone and tested on speech recorded with a desk-top microphone is essentially the same as that obtained when the system is trained and tested on speech from the desk-top microphone.
An algorithm for frequency normalization has also been proposed in which the parameter of the bilinear transformation that is used by the signal-processing stage to produce frequency warping is adjusted for each new speaker and acoustical environment. The optimum value of this parameter is again chosen to minimize the vector-quantization distortion between the standard environment and the current one. In preliminary studies, use of this frequency normalization produced a moderate additional decrease in the observed error rate.
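The additive, SNR-dependent cepstral correction described in this abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the 5 dB bin width, and the correction table values are all hypothetical and chosen only for illustration. The sketch shows one idea only, namely that each frame's cepstral vector receives an additive correction vector selected by that frame's instantaneous SNR.

```python
def frame_snr_db(frame_energy_db, noise_energy_db):
    """Instantaneous SNR of one frame in dB (both inputs already in dB)."""
    return frame_energy_db - noise_energy_db


def sdcn_correct(cepstrum, snr_db, correction_table, snr_step_db=5.0):
    """Add the correction vector whose SNR bin contains snr_db.

    correction_table[k] is a hypothetical correction vector for the SNR bin
    [k * snr_step_db, (k + 1) * snr_step_db). SNRs outside the table are
    clamped to the first or last entry.
    """
    index = int(snr_db // snr_step_db)
    index = max(0, min(index, len(correction_table) - 1))
    correction = correction_table[index]
    return [c + w for c, w in zip(cepstrum, correction)]


# Made-up table for a 2-coefficient cepstrum: larger corrections at low SNR.
example_table = [[0.5, 0.0], [0.3, 0.1], [0.1, 0.2]]
snr = frame_snr_db(frame_energy_db=42.0, noise_energy_db=30.0)  # 12 dB
compensated = sdcn_correct([1.0, 2.0], snr, example_table)
```

In a real system the table entries would be estimated from data, for example by the distortion-minimizing procedure the abstract mentions; here they are placeholders.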
TL;DR: In 25 original chapter-articles, leading authorities address various aspects of speech signal processing, stressing the advances during the past five to ten years.
Abstract: In 25 original chapter-articles, leading authorities address various aspects of speech signal processing, stressing the advances during the past five to ten years. The volume presents a wealth of material, in a variety of styles, and is divided into four sections: analysis and coding (nine chapters)
TL;DR: The results suggest that redundant gender information was imbedded in the fundamental frequency and vocal tract resonance characteristics of speech.
Abstract: The purpose of this research was to investigate the potential effectiveness of digital speech processing and pattern recognition techniques for the automatic recognition of gender from speech. In part I Coarse Analysis [K. Wu and D. G. Childers, J. Acoust. Soc. Am. 90 (1991)] various feature vectors and distance measures were examined to determine their appropriateness for recognizing a speaker’s gender from vowels, unvoiced fricatives, and voiced fricatives. One recognition scheme based on feature vectors extracted from vowels achieved 100% correct recognition of the speaker’s gender using a database of 52 speakers (27 male and 25 female). In this paper a detailed, fine analysis of the characteristics of vowels is performed, including formant frequencies, bandwidths, and amplitudes, as well as speaker fundamental frequency of voicing. The fine analysis used a pitch synchronous closed‐phase analysis technique. Detailed formant features, including frequencies, bandwidths, and amplitudes, were extracted by a closed‐phase weighted recursive least‐squares method that employed a variable forgetting factor, i.e., WRLS‐VFF. The electroglottograph signal was used to locate the closed‐phase portion of the speech signal. A two‐way statistical analysis of variance (ANOVA) was performed to test the differences between gender features. The relative importance of grouped vowel features was evaluated by a pattern recognition approach. Numerous interesting results were obtained, including the fact that the second formant frequency was a slightly better recognizer of gender than fundamental frequency, giving 98.1% versus 96.2% correct recognition, respectively. The statistical tests indicated that the spectra for female speakers had a steeper slope (or tilt) than that for males. The results suggest that redundant gender information was imbedded in the fundamental frequency and vocal tract resonance characteristics.
The feature vectors for female voices were observed to have higher within‐group variations than those for male voices. The data in this study were also used to replicate portions of the Peterson and Barney [J. Acoust. Soc. Am. 24, 175–184 (1952)] study of vowels for male and female speakers.
TL;DR: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance.
Abstract: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance. The method and apparatus provide user correction actions representing the accuracy of a speech recognition, dynamically, during the recognition of unknown incoming speech utterances and after training of the system. The quality values are updated, during the speech recognition process, for at least a portion of those reference patterns used during the speech recognition process. Reference patterns having low quality values, indicative of either inaccurate representation of the unknown speech or non-use, can be deleted so long as the reference pattern is not needed, for example, where the reference pattern is the last instance of a known word or phrase. Various methods and apparatus are provided for determining when reference patterns can be deleted from or added to the reference memory, and when the scores or values associated with a reference pattern should be increased or decreased to represent the "goodness" of the reference pattern in recognizing speech.
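The bookkeeping this abstract describes (a quality value per reference pattern, updated from user corrections, with low-quality patterns deleted unless they are the last instance of a word) can be sketched as a small data structure. This is a hypothetical illustration, not the patented method: the class names, the unit step size, and the deletion threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare patterns by identity, not by field values
class ReferencePattern:
    word: str
    features: list
    quality: float = 0.0


class ReferenceMemory:
    """Hypothetical store of reference patterns with quality values."""

    def __init__(self, delete_below=-2.0):
        self.patterns = []
        self.delete_below = delete_below  # assumed deletion threshold

    def add(self, word, features, quality=0.0):
        pattern = ReferencePattern(word, features, quality)
        self.patterns.append(pattern)
        return pattern

    def update(self, pattern, recognition_was_correct, step=1.0):
        # Raise the quality when the user accepts the recognition,
        # lower it when the user corrects the recognition.
        pattern.quality += step if recognition_was_correct else -step

    def prune(self):
        # Visit the worst patterns first; never delete the last
        # remaining instance of a word.
        for pattern in sorted(self.patterns, key=lambda p: p.quality):
            if pattern.quality >= self.delete_below:
                break
            same_word = [p for p in self.patterns if p.word == pattern.word]
            if len(same_word) > 1:
                self.patterns.remove(pattern)
```

A pattern that keeps being corrected drifts below the threshold and is removed by `prune()`, while a word with a single stored pattern always keeps it, however low its quality.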
TL;DR: A method for segregating speech from speakers engaged in dialogs employs a distance measure between speech segments used in conjunction with a clustering algorithm to perform the segregation.
Abstract: A method for segregating speech from speakers engaged in dialogs is described. The method, assuming no prior knowledge of the speakers, employs a distance measure between speech segments used in conjunction with a clustering algorithm to perform the segregation. Properties of the distance measure are discussed, and an air traffic control application is described.
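The combination of a segment-to-segment distance and a clustering algorithm can be sketched as follows. The abstract does not say which distance or which clustering algorithm is used, so this sketch substitutes plain Euclidean distance between hypothetical per-segment feature vectors and average-linkage agglomerative clustering. The number of speakers (two) is also an assumption made for the dialog case.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, segments):
    """Average-linkage distance between two clusters of segment indices."""
    total = sum(euclidean(segments[i], segments[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def segregate(segments, n_speakers=2):
    """Merge the closest clusters until n_speakers clusters remain."""
    clusters = [[i] for i in range(len(segments))]
    while len(clusters) > n_speakers:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], segments)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Four made-up segment feature vectors from an alternating dialog.
segments = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
speaker_groups = segregate(segments)  # [[0, 2], [1, 3]]
```

No speaker models are trained in advance, which matches the abstract's "no prior knowledge of the speakers"; only the target number of clusters is supplied.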
TL;DR: Based on a new similarity model for the voice excitation process, a novel pitch determination procedure is derived that has infinite (super) resolution, better accuracy than the difference limen for F0, robustness to noise, reliability, and modest computational complexity.
Abstract: Based on a new similarity model for the voice excitation process, a novel pitch determination procedure is derived. The unique features of the proposed algorithm are infinite (super) resolution, better accuracy than the difference limen for F0, robustness to noise, reliability, and modest computational complexity. The algorithm is instrumental to speech processing applications which require pitch synchronous spectral analysis. The computational complexity of the proposed algorithm is well within the capacity of modern digital signal processing (DSP) technology and therefore can be implemented in real time.
TL;DR: A speech enhancement technique is proposed based on principal component analysis and a new criterion for the selection of the parsimonious number of components for noise-free signal regeneration that has an improved performance compared to existing techniques.
TL;DR: In this paper, a CELP speech processor utilizes an organized, non-overlapping, algebraic codebook containing a predetermined number of vectors, uniformly distributed over a multi-dimensional sphere to generate a remaining speech residual.
Abstract: Apparatus and method for encoding speech using a codebook excited linear predictive (CELP) speech processor and an algebraic codebook for use therewith. The CELP speech processor receives a digital speech input representative of human speech and performs linear predictive code analysis and perceptual weighting filtering to produce short term speech information and long term speech information. The CELP speech processor utilizes an organized, non-overlapping, algebraic codebook containing a predetermined number of vectors, uniformly distributed over a multi-dimensional sphere, to generate a remaining speech residual. The short term speech information, long term speech information and remaining speech residual are combinable to form a quality reproduction of the digital speech input.
TL;DR: "Coarse" acoustic coefficients (autocorrelation, linear prediction, cepstrum, and reflection) were used to form test and reference templates for vowels, voiced fricatives, and unvoiced fricatives; the results implied that the gender information is time invariant, phoneme independent, and speaker independent for a given gender.
Abstract: The purpose of this research was to investigate the potential effectiveness of digital speech processing and pattern recognition techniques for the automatic recognition of gender from speech segments. In this paper "coarse" acoustic coefficients (autocorrelation, linear prediction, cepstrum, and reflection) were used to form test and reference templates for vowels, voiced fricatives, and unvoiced fricatives. The effects of different distance measures, filter orders, recognition schemes, and vowels and fricatives were comparatively assessed to determine their effectiveness for the task of gender recognition from speech segments. The results showed that most of the acoustic parameters worked well for gender recognition. A within-gender and within-subject averaging technique was important for generating appropriate test and reference templates. The Euclidean distance measure appeared to be the most robust as well as the simplest of the distance measures. The results from this study implied that the gender information is time invariant, phoneme independent, and speaker independent for a given gender. One recognition scheme achieved 100% correct speaker gender classification for a database of 52 talkers (27 male and 25 female). In part II of this paper [D.G. Childers and K. Wu, J. Acoust. Soc. Am. 90, 1841-1856 (1991); hereafter referred to as paper II] the detailed features of ten vowels that appeared responsible for distinguishing a speaker's gender were examined statistically. Included in paper II is a replication of part of the classical study of Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] of vowel characteristics.
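The template approach in this abstract (averaging feature vectors into reference templates, then comparing a test vector with each template using the Euclidean distance) can be sketched as below. The feature values are made up, and the helper names are hypothetical. The study's actual coefficients, filter orders, and recognition schemes are not reproduced here.

```python
import math


def average_template(vectors):
    """Element-wise mean of several equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_gender(test_vector, reference_templates):
    """Return the label of the nearest reference template."""
    return min(
        reference_templates,
        key=lambda label: euclidean(test_vector, reference_templates[label]),
    )


# Made-up 2-coefficient vectors standing in for within-gender averages.
references = {
    "male": average_template([[1.0, 2.0], [3.0, 4.0]]),
    "female": average_template([[7.0, 8.0], [9.0, 10.0]]),
}
label = classify_gender([2.5, 3.5], references)  # "male"
```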
TL;DR: The integration of speech and natural language processing in Phi DM-Dialog and its cost-based scheme of ambiguity resolution are discussed and its simultaneous interpretation capability, made possible by an incremental parsing and generation algorithm, is examined.
Abstract: Phi DM-Dialog, one of the first experimental speech-to-speech systems and the first to demonstrate simultaneous interpretation possibilities, is described. An overview is given of the model behind Phi DM-Dialog. It consists of a memory network for representing various knowledge levels and markers for inferencing. The markers have rich information content. The integration of speech and natural language processing in Phi DM-Dialog and its cost-based scheme of ambiguity resolution are discussed. Its simultaneous interpretation capability, which is made possible by an incremental parsing and generation algorithm, is examined. Prototype system results are reported.
TL;DR: An input speech signal is encoded by an adaptive quantizer which quantizes the predicted residual signal between the digital input speech signal, and prediction signals provided by predictors and a shaped quantization noise provided by a noise shaping filter.
Abstract: An input speech signal is encoded by an adaptive quantizer which quantizes the predicted residual signal between the digital input speech signal, and prediction signals provided by predictors and a shaped quantization noise provided by a noise shaping filter. An inverse quantizer, to which the encoded speech signal is supplied, is provided for noise shaping and local decoding. A noise shaping filter makes the spectrum of the quantization noise similar to that of the original digital input speech signal by using the shaping factors. The shaping factors are changed depending upon the prediction gain (e.g., the ratio of input speech signal to predicted residual signal, or the prediction coefficients). On a decoding side of the system there are an inverse quantizer, predictors, and a post noise shaping filter. The shaping factors for the post noise shaping filter are similarly changed depending upon the prediction gain.
TL;DR: A detailed exposition of the main areas of signal processing, this book is divided into three sections: one-dimensional signal processing and digital filters; two-dimensional signal processing and image processing; and pattern recognition.
Abstract: A detailed exposition of the main areas of signal processing, this book is divided into three sections: one-dimensional signal processing and digital filters; two-dimensional signal processing and image processing; and pattern recognition. Among the more specific topics covered are: analog filters; discrete systems and signals; non-recursive filters; FFT; IIR design; quantization effects; and hardware and software design. There is also material on system stability, picture enhancement and restoration, and parallel processing methods, as well as a comprehensive treatment of syntactic methods, parsing and neural networks.
TL;DR: A modular software TTS (text-to-speech) system for Greek with good intelligibility and quality of speech and the possibility of being further improved by extending its linguistic knowledge is presented.
Abstract: A modular software TTS (text-to-speech) system for Greek with good intelligibility and quality of speech and the possibility of being further improved by extending its linguistic knowledge is presented. The system has several peculiarities in comparison to most systems for other languages, combining the advantages of formant synthesis with those of diphone synthesis. In addition to the text normalizer (including numbers) and the sophisticated text preprocessor, the system uses composite speech segments besides phonemes which are concatenated, using a dynamically-adjusted-in-range, sigmoid function. The segments are coded in a novel scheme aiding the rules which manipulate voice onset and duration times. A declined line with its ending part dependent on the punctuation mark or function word, which fluctuates according to the stressed points and unvoiced (voiced) consonant and plosive locations, controls the intonation of the input text. The system is to be improved by incorporating elaborate prosodic rules resulting from syntactic analysis of the text.
TL;DR: A method for automatic segmentation of speech into phones is described, where the incoming utterance is split up into more or less stationary parts, and these stationary parts are labelled as phones using the phonetic transcription of the utterance.
Abstract: A method for automatic segmentation of speech into phones is described. The incoming utterance is split up into more or less stationary parts, and these stationary parts are labelled as phones using the phonetic transcription of the utterance. An implicit segmentation algorithm splits up the utterance into segments on the basis of the degree of similarity between the frequency spectra of neighboring frames. An explicit algorithm does the same, but on the basis of the degree of similarity between the frequency spectra of the frames in the utterance and reference spectra. A combination algorithm compares the two segmentation results and produces the final segmentation. Automatically determined phone boundaries are compared with manually determined ones. The result of a perception test is described.
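The implicit segmentation step, which places boundaries where neighboring frames stop resembling each other, can be sketched in a few lines. The abstract does not name the similarity measure, so cosine similarity between magnitude spectra is used here as a stand-in. The threshold of 0.9 and the function names are likewise assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two spectra; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def implicit_boundaries(frames, threshold=0.9):
    """Return each index i where frame i differs enough from frame i - 1."""
    return [
        i
        for i in range(1, len(frames))
        if cosine_similarity(frames[i - 1], frames[i]) < threshold
    ]


# Four made-up 3-bin spectra: two similar frames, then two of another shape.
frames = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
boundaries = implicit_boundaries(frames)  # [2]
```

An explicit algorithm in the abstract's sense would instead compare each frame with reference spectra for the phones in the transcription; that part is not sketched here.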
TL;DR: The algorithm is similar to the cepstral smoothing approach for formant extraction using homomorphic deconvolution but the logarithmic operation is replaced by ()' operation and the additive and high resolution properties of group delay functions are exploited to emphasize formant peaks.
TL;DR: A speech coder apparatus operates to compress speech signals to a low bit rate and includes a continuous speech recognizer (CSR) which has a memory for storing templates.
Abstract: A speech coder apparatus operates to compress speech signals to a low bit rate. The apparatus includes a continuous speech recognizer (CSR) which has a memory for storing templates. Input speech is processed by the CSR where information in the speech is compared against the templates to provide an output digital signal indicative of recognized words, which signal is transmitted along a first path. There is further included a front end processor which is also responsive to the input speech signal for providing output digitized speech samples during a given frame interval. A side information encoder circuit responds to the output from the front end processor to provide at the output of the encoder a parameter signal indicative of the value of the pitch and word duration for each word as recognized by the CSR unit. The output of the encoder is transmitted as a second signal. There is a receiver which includes a synthesizer responsive to the first and second transmitted signals for providing an output synthesized signal for each recognized word where the pitch, duration and amplitude of the synthesized signal are changed according to the parameter signal to preserve the quality of the synthesized speech.
TL;DR: In this article, an adaptive filtering technique is applied to sequences of energy estimates in each of two signal channels, one channel containing speech and environmental noise and the other channel containing primarily the same environmental noise.
Abstract: A digital signal processing system applies an adaptive filtering technique to sequences of energy estimates in each of two signal channels, one channel containing speech and environmental noise and the other channel containing primarily the same environmental noise. From the channel containing primarily environmental noise, a prediction is made of the energy of that noise in the channel containing both the speech and that noise, so that the noise can be extracted from the mixture of speech and noise. The result is that the speech will be more easily recognizable by either human listeners or speech recognition systems.
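The idea of predicting the noise energy in the speech channel from the noise-only channel, then removing it, can be sketched with a single adaptive weight. The abstract does not state which adaptive algorithm is used, so this sketch assumes a one-tap normalized LMS update on per-frame energy estimates. The step size and the function name are hypothetical.

```python
def cancel_noise_energy(primary, reference, step=0.1, eps=1e-9):
    """Subtract predicted noise energy from the primary channel.

    primary:   per-frame energy of the speech-plus-noise channel
    reference: per-frame energy of the mostly-noise channel
    Returns the cleaned energy sequence and the final adaptive weight.
    """
    weight = 0.0
    cleaned = []
    for p, r in zip(primary, reference):
        predicted_noise = weight * r
        error = p - predicted_noise          # estimate of the speech energy
        cleaned.append(max(error, 0.0))      # an energy cannot be negative
        # Normalized LMS update of the single weight.
        weight += step * error * r / (r * r + eps)
    return cleaned, weight


# Noise-only example: the primary channel carries twice the reference energy,
# so the weight should approach 2 and the cleaned energy should approach 0.
cleaned, weight = cancel_noise_energy([2.0] * 200, [1.0] * 200)
```

With speech present, the error term retains the speech energy that the reference channel cannot predict, which is the quantity passed on to a listener or recognizer.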
TL;DR: In this paper, a secure narrowband digital conferencing system is proposed, which uses a multipulse or a code-excited linear predictive (CELP) speech processing algorithm for coding the speech signals of the respective participants.
Abstract: A secure narrowband digital conferencing system is capable of handling multiple speakers simultaneously and in a full duplex mode. The system uses a multipulse or a code-excited linear predictive (CELP) speech processing algorithm for coding the speech signals of the respective participants. A conference director receives the multipulse or CELP encrypted voice signal streams over normal telephone links, decrypts them, then synthesizes a composite speech signal and uses an analysis-by-synthesis algorithm to compress it, and then encrypts the composite signal and transmits it back to all the participants.
TL;DR: This chapter discusses Digital Signal Processing methods, Information Theory and Probability Models, and some Useful Practical Classes of Random Processes.
Abstract: Preface; Acknowledgement; Symbols; Abbreviations.
Part I Basic Digital Signal Processing
1 Introduction. 1.1 Signals and Information; 1.2 Signal Processing Methods; 1.3 Applications of Digital Signal Processing; 1.4 Summary
2 Fourier Analysis and Synthesis. 2.1 Introduction; 2.2 Fourier Series: Representation of Periodic Signals; 2.3 Fourier Transform: Representation of Nonperiodic Signals; 2.4 Discrete Fourier Transform; 2.5 Short-Time Fourier Transform; 2.6 Fast Fourier Transform (FFT); 2.7 2-D Discrete Fourier Transform (2-D DFT); 2.8 Discrete Cosine Transform (DCT); 2.9 Some Applications of the Fourier Transform; 2.10 Summary
3 z-Transform. 3.1 Introduction; 3.2 Derivation of the z-Transform; 3.3 The z-Plane and the Unit Circle; 3.4 Properties of z-Transform; 3.5 z-Transfer Function, Poles (Resonance) and Zeros (Anti-resonance); 3.6 z-Transform of Analysis of Exponential Transient Signals; 3.7 Inverse z-Transform; 3.8 Summary
4 Digital Filters. 4.1 Introduction; 4.2 Linear Time-Invariant Digital Filters; 4.3 Recursive and Non-Recursive Filters; 4.4 Filtering Operation: Sum of Vector Products, A Comparison of Convolution and Correlation; 4.5 Filter Structures: Direct, Cascade and Parallel Forms; 4.6 Linear Phase FIR Filters; 4.7 Design of Digital FIR Filter-banks; 4.8 Quadrature Mirror Sub-band Filters; 4.9 Design of Infinite Impulse Response (IIR) Filters by Pole-zero Placements; 4.10 Issues in the Design and Implementation of a Digital Filter; 4.11 Summary
5 Sampling and Quantisation. 5.1 Introduction; 5.2 Sampling a Continuous-Time Signal; 5.3 Quantisation; 5.4 Sampling Rate Conversion: Interpolation and Decimation; 5.5 Summary
Part II Model-Based Signal Processing
6 Information Theory and Probability Models. 6.1 Introduction: Probability and Information Models; 6.2 Random Processes; 6.3 Probability Models of Random Signals; 6.4 Information Models; 6.5 Stationary and Non-Stationary Random Processes; 6.6 Statistics (Expected Values) of a Random Process; 6.7 Some Useful Practical Classes of Random Processes; 6.8 Transformation of a Random Process; 6.9 Search Engines: Citation Ranking; 6.10 Summary
7 Bayesian Inference. 7.1 Bayesian Estimation Theory: Basic Definitions; 7.2 Bayesian Estimation; 7.3 Expectation Maximisation Method; 7.4 Cramer-Rao Bound on the Minimum Estimator Variance; 7.5 Design of Gaussian Mixture Models (GMM); 7.6 Bayesian Classification; 7.7 Modelling the Space of a Random Process; 7.8 Summary
8 Least Square Error, Wiener-Kolmogorov Filters. 8.1 Least Square Error Estimation: Wiener-Kolmogorov Filter; 8.2 Block-Data Formulation of the Wiener Filter; 8.3 Interpretation of Wiener Filter as Projection in Vector Space; 8.4 Analysis of the Least Mean Square Error Signal; 8.5 Formulation of Wiener Filters in the Frequency Domain; 8.6 Some Applications of Wiener Filters; 8.7 Implementation of Wiener Filters; 8.8 Summary
9 Adaptive Filters: Kalman, RLS, LMS. 9.1 Introduction; 9.2 State-Space Kalman Filters; 9.3 Sample Adaptive Filters; 9.4 Recursive Least Square (RLS) Adaptive Filters; 9.5 The Steepest-Descent Method; 9.6 LMS Filter; 9.7 Summary
10 Linear Prediction Models. 10.1 Linear Prediction Coding; 10.2 Forward, Backward and Lattice Predictors; 10.3 Short-Term and Long-Term Predictors; 10.4 MAP Estimation of Predictor Coefficients; 10.5 Formant-Tracking LP Models; 10.6 Sub-Band Linear Prediction Model; 10.7 Signal Restoration Using Linear Prediction Models; 10.8 Summary
11 Hidden Markov Models. 11.1 Statistical Models for Non-Stationary Processes; 11.2 Hidden Markov Models; 11.3 Training Hidden Markov Models; 11.4 Decoding Signals Using Hidden Markov Models; 11.5 HMM in DNA and Protein Sequences; 11.6 HMMs for Modelling Speech and Noise; 11.7 Summary
12 Eigenvector Analysis, Principal Component Analysis and Independent Component Analysis. 12.1 Introduction - Linear Systems and Eigenanalysis; 12.2 Eigenvectors and Eigenvalues; 12.3 Principal Component Analysis (PCA); 12.4 Independent Component Analysis; 12.5 Summary
Part III Applications of Digital Signal Processing to Speech, Music and Telecommunications
13 Music Signal Processing and Auditory Perception. 13.1 Introduction; 13.2 Musical Notes, Intervals and Scales; 13.3 Musical Instruments; 13.4 Review of Basic Physics of Sounds; 13.5 Music Signal Features and Models; 13.6 Anatomy of the Ear and the Hearing Process; 13.7 Psychoacoustics of Hearing; 13.8 Music Coding (Compression); 13.9 High Quality Audio Coding: MPEG Audio Layer-3 (MP3); 13.10 Stereo Music Coding; 13.11 Summary
14 Speech Processing. 14.1 Speech Communication; 14.2 Acoustic Theory of Speech: The Source-filter Model; 14.3 Speech Models and Features; 14.4 Linear Prediction Models of Speech; 14.5 Harmonic Plus Noise Model of Speech; 14.6 Fundamental Frequency (Pitch) Information; 14.7 Speech Coding; 14.8 Speech Recognition; 14.9 Summary
15 Speech Enhancement. 15.1 Introduction; 15.2 Single-Input Speech Enhancement Methods; 15.3 Speech Bandwidth Extension - Spectral Extrapolation; 15.4 Interpolation of Lost Speech Segments - Packet Loss Concealment; 15.5 Multi-Input Speech Enhancement Methods; 15.6 Speech Distortion Measurements; 15.7 Summary
16 Echo Cancellation. 16.1 Introduction: Acoustic and Hybrid Echo; 16.2 Telephone Line Hybrid Echo; 16.3 Hybrid (Telephone Line) Echo Suppression; 16.4 Adaptive Echo Cancellation; 16.5 Acoustic Echo; 16.6 Sub-Band Acoustic Echo Cancellation; 16.7 Echo Cancellation with Linear Prediction Pre-whitening; 16.8 Multi-Input Multi-Output Echo Cancellation; 16.9 Summary
17 Channel Equalisation and Blind Deconvolution. 17.1 Introduction; 17.2 Blind Equalisation Using Channel Input Power Spectrum; 17.3 Equalisation Based on Linear Prediction Models; 17.4 Bayesian Blind Deconvolution and Equalisation; 17.5 Blind Equalisation for Digital Communication Channels; 17.6 Equalisation Based on Higher-Order Statistics; 17.7 Summary
18 Signal Processing in Mobile Communication. 18.1 Introduction to Cellular Communication; 18.2 Communication Signal Processing in Mobile Systems; 18.3 Capacity, Noise, and Spectral Efficiency; 18.4 Multi-path and Fading in Mobile Communication; 18.5 Smart Antennas - Space-Time Signal Processing; 18.6 Summary
Index
TL;DR: Findings are interpreted as showing that an age-related reduction in working memory efficiency contributes to age differences in processing discourse for memory.
Abstract: Adult age differences in processing speech were examined with a dual-task paradigm. Subjects listened to spoken passages for later recall while performing a concurrent reaction time task intended to index cognitive capacity usage on the speech memory task. Age differences in secondary task decision latencies were eliminated when subgroups of young and older subjects were matched on working memory span. These findings are interpreted as showing that an age-related reduction in working memory efficiency contributes to age differences in processing discourse for memory.
TL;DR: To estimate the fundamental frequency (f0) of pseudoperiodical sounds over a wide band of possible f0 values, a theoretical model based on maximum likelihood estimation of f0 is proposed.
Abstract: In order to estimate the fundamental frequency (f0) of pseudoperiodical sounds with a wide band of possible f0, a theoretical model based on a maximum likelihood for f0 is proposed. The model is simplified to make it fast enough for extensive tests. The resulting algorithm is tested on musical speech sounds. As a musical application, an instrument follower based on the algorithm and operating in real time is implemented.
TL;DR: The techniques and experiments described are the first demonstration of a complete system that accepts speech messages as input and produces an estimated message class as output; the results demonstrate the feasibility of the technology and illustrate the need for further work.
Abstract: The components of a speech message information retrieval system include an acoustic front end, which provides an incomplete transcription of a spoken message, and a message classifier, which interprets the incomplete transcription and classifies the message according to message category. The techniques and experiments described are concerned with the integration of these components and represent the first demonstration of a complete system that accepts speech messages as input and produces an estimated message class as output. The complete system has been implemented on special-purpose digital signal processing hardware and demonstrated using live speech input. The results obtained on a conversational speech task have demonstrated the feasibility of the technology and also illustrate the need for further work. Even with a perfect acoustic front end, a message classification accuracy of only 78% was obtained with a 126-keyword vocabulary.
TL;DR: A novel method of coding voiced speech is introduced, which transmits an encoded prototype waveform at 20-30 ms intervals, and is quantized using analysis-by-synthesis methods, which results in excellent speech quality at rates between 3.0 and 4.0 kb/s.
Abstract: A major source of audible distortion in current low-bit-rate speech coding algorithms is an inaccurate degree of periodicity of the voiced speech signal. If the correlations between neighboring pitch cycles are accurately reproduced, these audible distortions can be reduced significantly. To this purpose, a novel method of coding voiced speech is introduced, which transmits an encoded prototype waveform at 20-30 ms intervals. The prototype waveform describes a pitch cycle representative for the interval, and is quantized using analysis-by-synthesis methods. The speech signal is reconstructed by concatenation of interpolated prototype waveforms. The short-term and the long-term correlations between pitch cycles can be controlled explicitly. Unquantized reconstructed speech is virtually indistinguishable from the original signal. The method results in excellent speech quality at rates between 3.0 and 4.0 kb/s.
TL;DR: A new method based on the assumption that, for voiced speech, a perceptually accurate speech signal can be reconstructed from a description of the waveform of a single, representative pitch cycle per interval of 20-30 ms is presented, which retains the natural quality of coders which encode the entire waveform, but requires a bit rate close to that of the parametric coders.
TL;DR: The feasibility of processing the Fourier transform (FT) phase of a speech signal to derive the smooth log magnitude spectrum corresponding to the vocal tract system is demonstrated, and a technique to extract the vocal tract component of the group delay function is proposed by using the spectral properties of the excitation signal.