TL;DR: A speech processing algorithm was developed to create more salient versions of the rapidly changing elements in the acoustic waveform of speech that have been shown to be deficiently processed by language-learning impaired (LLI) children.
Abstract: A speech processing algorithm was developed to create more salient versions of the rapidly changing elements in the acoustic waveform of speech that have been shown to be deficiently processed by language-learning impaired (LLI) children. LLI children received extensive daily training, over a 4-week period, with listening exercises in which all speech was translated into this synthetic form. They also received daily training with computer "games" designed to adaptively drive improvements in temporal processing thresholds. Significant improvements in speech discrimination and language comprehension abilities were demonstrated in two independent groups of LLI children.
TL;DR: The genetic algorithm is introduced as an emerging optimization algorithm for signal processing and a number of applications, such as IIR adaptive filtering, time delay estimation, active noise control, and speech processing, that are being successfully implemented are described.
Abstract: This article introduces the genetic algorithm (GA) as an emerging optimization algorithm for signal processing. After a discussion of traditional optimization techniques, it reviews the fundamental operations of a simple GA and discusses procedures to improve its functionality. The properties of the GA that relate to signal processing are summarized, and a number of applications, such as IIR adaptive filtering, time delay estimation, active noise control, and speech processing, that are being successfully implemented are described.
TL;DR: A new approach is then developed which achieves a trade-off between effective noise reduction and low computational load for real-time operations and demonstrates that the subjective and objective results are much better than existing methods.
Abstract: This paper addresses the problem of single microphone frequency domain speech enhancement in noisy environments. The main characteristics of available frequency domain noise reduction algorithms are presented. We have confirmed that the a priori SNR estimation leads to the best subjective results. According to these conclusions, a new approach is then developed which achieves a trade-off between effective noise reduction and low computational load for real-time operations. The obtained solutions demonstrate that the subjective and objective results are much better than existing methods.
TL;DR: Four approaches for automatic language identification of speech utterances are compared: Gaussian mixture model (GMM) classification; single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition (PPR).
Abstract: We have compared the performance of four approaches for automatic language identification of speech utterances: Gaussian mixture model (GMM) classification; single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition (PPR). These approaches, which span a wide range of training requirements and levels of recognition complexity, were evaluated with the Oregon Graduate Institute Multi-Language Telephone Speech Corpus. Systems containing phone recognizers performed better than the simpler GMM classifier. The top-performing system was parallel PRLM, which exhibited an error rate of 2% for 45-s utterances and 5% for 10-s utterances in two-language, closed-set, forced-choice classification. The error rate for 11-language, closed-set, forced-choice classification was 11% for 45-s utterances and 21% for 10-s utterances.
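The PRLM idea in the abstract above (phone recognition followed by language-dependent, interpolated n-gram modeling) can be illustrated with a small sketch. It assumes a phone recognizer has already produced phone strings; the class name `InterpolatedBigram`, the interpolation weight of 0.7, the add-one smoothing of the unigram term, and the toy phone data are all illustrative assumptions, not details from the paper.

```python
import math
from collections import Counter
from typing import Dict, Sequence


class InterpolatedBigram:
    """Bigram phone model interpolated with an add-one smoothed unigram model."""

    def __init__(self, sequences: Sequence[Sequence[str]],
                 phone_set: Sequence[str], lam: float = 0.7) -> None:
        self.lam = lam  # assumed interpolation weight for the bigram term
        self.vocab_size = len(phone_set)
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        for seq in sequences:
            self.unigrams.update(seq)
            for prev, cur in zip(seq, seq[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
        self.total = sum(self.unigrams.values())

    def log_likelihood(self, seq: Sequence[str]) -> float:
        score = 0.0
        for prev, cur in zip(seq, seq[1:]):
            # Smoothed unigram probability is never zero, so the log is defined.
            p_uni = (self.unigrams[cur] + 1) / (self.total + self.vocab_size)
            seen = self.contexts[prev]
            p_bi = self.bigrams[(prev, cur)] / seen if seen else 0.0
            score += math.log(self.lam * p_bi + (1.0 - self.lam) * p_uni)
        return score


def identify_language(phones: Sequence[str],
                      models: Dict[str, InterpolatedBigram]) -> str:
    """Pick the language whose phone n-gram model scores the string highest."""
    return max(models, key=lambda lang: models[lang].log_likelihood(phones))


# Toy example with a hypothetical three-phone inventory.
phone_set = ["a", "b", "c"]
models = {
    "lang1": InterpolatedBigram([["a", "b", "a", "b", "a", "b"]], phone_set),
    "lang2": InterpolatedBigram([["a", "c", "a", "c", "a", "c"]], phone_set),
}
guess = identify_language(["a", "b", "a", "b"], models)
```

Parallel PRLM, as described in the abstract, would run several such pipelines, one per phone recognizer, and combine their scores; that combination step is not shown here.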
TL;DR: A technique which is successful at discriminating speech from music on broadcast FM radio is described, which provides the capability to robustly distinguish the two classes and runs easily in real time.
Abstract: We describe a technique which is successful at discriminating speech from music on broadcast FM radio. The computational simplicity of the approach could lend itself to wide application including the ability to automatically change channels when commercials appear. The algorithm provides the capability to robustly distinguish the two classes and runs easily in real time. Experimental results to date show performance approaching 98% correct classification.
TL;DR: In this paper, a cochlear implant system includes an external portion that senses acoustic signals and converts them to electrical signals, and an implant portion that generates electrical stimuli, modulated and classified in response to the sensed acoustic signals, and intended for direct electrical stimulation of the auditory nerve.
Abstract: A cochlear implant system includes an implant portion and an external portion. The external portion performs at least the function of sensing acoustic signals and converting such sensed signals to electrical signals. The implant portion performs at least the function of generating electrical stimuli, modulated and classified in response to the sensed acoustic signals, and intended for direct electrical stimulation of the auditory nerve in accordance with a selected speech processing strategy. Control data defines the selected speech processing strategy, i.e., the pulsatile stimulation pattern to be used by the implantable portion. Such control data is transmitted to and stored within the implantable portion of the system only once, when a particular speech processing strategy is selected, thereby eliminating the need to continually resend such speech-processing-defining data over a bandwidth-limited link between the implantable and external portions of the system. The control data that defines the selected speech processing strategy is stored in a stimulation template (also referred to as a “pulse table”), which template or table is stored digitally within the implanted portion of the system. Weighting coefficients (or weighting factors) are stored in the template or table at specified locations to define the speech processing strategy. For example, the columns of the template or table may be used to represent the different current sources, or “stimulus channels”, of the implanted portion, and the rows may be used to represent intervals of time. The “stimulus channels” and increments of time thus form the two ordinates of the table, and the table thus consists of a modest number of intervals whose total duration defines a complete “cycle” of stimulation.
The instantaneous current flow to be generated by the implanted portion is defined at the beginning of each stimulation cycle by multiplying the weighting factor stored in a particular location within the pulse table by modulation data derived from the sensed acoustic signal. Only modulation data need be sent to the implanted portion on a continuous (real time) basis for the cochlear implant system to function.
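A minimal sketch of the pulse-table mechanism described above, assuming rows are time intervals and columns are stimulus channels. The table values, the function name `stimulation_cycle`, and the per-channel modulation list are illustrative assumptions and are not taken from the patent.

```python
from typing import List

# Hypothetical pulse table: each row is one time interval of the stimulation
# cycle, each column is one stimulus channel (current source). The stored
# values are weighting coefficients; here each channel gets a biphasic pulse.
PULSE_TABLE: List[List[float]] = [
    [1.0, 0.0],
    [-1.0, 0.0],
    [0.0, 1.0],
    [0.0, -1.0],
]


def stimulation_cycle(table: List[List[float]],
                      modulation: List[float]) -> List[List[float]]:
    """Return the current for every (interval, channel) pair of one cycle.

    Each stored weight is multiplied by the modulation value of its channel.
    In this sketch the modulation list stands in for the only data that has
    to be sent to the implant in real time.
    """
    if any(len(row) != len(modulation) for row in table):
        raise ValueError("one modulation value is needed per channel")
    return [[w * m for w, m in zip(row, modulation)] for row in table]


currents = stimulation_cycle(PULSE_TABLE, [0.5, 0.25])
```

Because the table is stored once, changing the speech processing strategy in this sketch amounts to replacing `PULSE_TABLE`, while each new cycle only needs a fresh `modulation` list.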
TL;DR: Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described, including the affine transform, which is a feature transformation approach that integrates mismatch to simultaneously combat both channel and noise distortion.
Abstract: The future commercialization of speaker- and speech-recognition technology is impeded by the large degradation in system performance due to environmental differences between training and testing conditions. This is known as the "mismatched condition." Studies have shown [1] that most contemporary systems achieve good recognition performance if the conditions during training are similar to those during operation (matched conditions). Frequently, mismatched conditions are present in which the performance is dramatically degraded as compared to the ideal matched conditions. A common example of this mismatch is when training is done on clean speech and testing is performed on noise- or channel-corrupted speech. Robust speech techniques [2] attempt to maintain the performance of a speech processing system under such diverse conditions of operation. This article presents an overview of current speaker-recognition systems and the problems encountered in operation, and it focuses on the front-end feature extraction process of robust speech techniques as a method of improvement. Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described. Also described is the affine transform, which is a feature transformation approach that integrates mismatch to simultaneously combat both channel and noise distortion.
TL;DR: An efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filter-bank in mel-frequency cepstrum feature analysis are presented.
Abstract: In an effort to reduce the degradation in speech recognition performance caused by variation in vocal tract shape among speakers, a frequency warping approach to speaker normalization is investigated. A set of low complexity, maximum likelihood based frequency warping procedures have been applied to speaker normalization for a telephone based connected digit recognition task. This paper presents an efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filter-bank in mel-frequency cepstrum feature analysis. An experimental study comparing these techniques to other well-known techniques for reducing variability is described. The results showed that frequency warping was consistently able to reduce word error rate by 20% even for very short utterances.
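The two ingredients named in the abstract above, a linear warping factor chosen by a likelihood criterion and a filter bank modified to apply it, can be sketched as follows. This is an illustration under stated assumptions: the common 2595/700 mel formula, the direction of the scaling (centers multiplied by `alpha`), the candidate grid, and the toy `score` function standing in for a model log-likelihood are all hypothetical and not taken from the paper.

```python
import math
from typing import Callable, List, Sequence


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def warped_centers(num_filters: int, f_max: float, alpha: float) -> List[float]:
    """Mel-spaced filter center frequencies, scaled linearly by alpha.

    Centers that would exceed f_max after warping are clipped to f_max.
    """
    mel_max = hz_to_mel(f_max)
    centers = [mel_to_hz(mel_max * (i + 1) / (num_filters + 1))
               for i in range(num_filters)]
    return [min(alpha * c, f_max) for c in centers]


def best_warp(alphas: Sequence[float], score: Callable[[float], float]) -> float:
    """Grid search: return the candidate warp factor with the highest score."""
    return max(alphas, key=score)


# Toy usage: the score below is a placeholder for the likelihood of a
# speaker's warped features under an acoustic model.
grid = [0.88, 0.92, 0.96, 1.0, 1.04, 1.08, 1.12]
chosen = best_warp(grid, lambda a: -(a - 1.04) ** 2)
base = warped_centers(3, 4000.0, 1.0)
narrowed = warped_centers(3, 4000.0, 0.9)
```

Warping the filter bank rather than the signal means the speech only needs to be re-analysed with a different set of center frequencies for each candidate factor.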
TL;DR: A more sophisticated handwriting recognition system that achieves a writer independent recognition rate of 94.5% on 3,823 unconstrained handwritten word samples from 18 writers covering a 32 word vocabulary is built.
Abstract: Hidden Markov model (HMM) based recognition of handwriting is now quite common, but the incorporation of HMMs into a complex stochastic language model for handwriting recognition is still in its infancy. We have taken advantage of developments in the speech processing field to build a more sophisticated handwriting recognition system. The pattern elements of the handwriting model are subcharacter stroke types modeled by HMMs. These HMMs are concatenated to form letter models, which are further embedded in a stochastic language model. In addition to better language modeling, we introduce new handwriting recognition features of various kinds. Some of these features have invariance properties, and some are segmental, covering a larger region of the input pattern. We have achieved a writer independent recognition rate of 94.5% on 3,823 unconstrained handwritten word samples from 18 writers covering a 32 word vocabulary.
TL;DR: It is suggested that recent studies based on a Source Generator Framework can provide a viable foundation in which to establish robust speech recognition techniques, and three novel approaches for signal enhancement and stress equalization are considered to address the issue of recognition under noisy stressful conditions.
TL;DR: In this article, signals are accepted corresponding to interspersed speech elements, including text elements corresponding to text to be recognized and command elements corresponding to commands to be executed; the recognized elements are acted on in a manner which depends on whether they represent text or commands.
Abstract: In a method for use in recognizing continuous speech, signals are accepted corresponding to interspersed speech elements including text elements corresponding to text to be recognized and command elements corresponding to commands to be executed. The elements are recognized. The recognized elements are acted on in a manner which depends on whether they represent text or commands.
TL;DR: Experimental results show that this vocabulary-independent discriminative utterance verification method significantly outperforms a baseline method commonly used in wordspotting tasks.
Abstract: An integral part of any deployable speech recognition system is the capability to detect if the input speech does not contain any of the words in the recognizer vocabulary set. This capability, which is called utterance verification (or keyword recognition and nonkeyword rejection), is therefore becoming increasingly important as speech recognition systems continue to migrate from the laboratory to actual applications. We present a framework and a method for vocabulary independent utterance verification in subword-based speech recognition. The verification process is cast as a statistical hypothesis test, where vocabulary independence is accomplished through a two-stage verification process: subword-level verification followed by string-level verification. A verification function is defined and discriminatively trained to perform subword-level verification. String-level verification is accomplished by defining and evaluating an overall string-level log likelihood ratio that is a function of the subword-level verification scores. Experimental results show that this vocabulary-independent discriminative utterance verification method significantly outperforms a baseline method commonly used in wordspotting tasks.
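The two-stage structure described above (subword-level verification scores combined into a string-level log likelihood ratio that is then tested) might be sketched as below. The abstract does not give the combining function, so the arithmetic mean used here, the zero threshold, and the function names are assumptions made purely for illustration.

```python
from typing import Sequence


def string_level_score(subword_llrs: Sequence[float]) -> float:
    """Combine subword-level log likelihood ratios into one string-level score.

    Assumption: a plain arithmetic mean; the paper defines its own function
    of the subword-level verification scores.
    """
    if not subword_llrs:
        raise ValueError("at least one subword score is required")
    return sum(subword_llrs) / len(subword_llrs)


def accept_utterance(subword_llrs: Sequence[float], threshold: float = 0.0) -> bool:
    """Hypothesis test: accept the recognized string if the score clears the threshold."""
    return string_level_score(subword_llrs) >= threshold


in_vocabulary = accept_utterance([1.5, 0.5, 1.0])
out_of_vocabulary = accept_utterance([0.5, -2.0, -1.5])
```

Raising `threshold` in this sketch rejects more out-of-vocabulary input at the cost of rejecting more valid utterances, which is the usual trade-off in a hypothesis-testing formulation.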
TL;DR: A new system for warp scale selection which uses a simple generic voiced speech model to rapidly select appropriate frequency scales and is sufficiently streamlined that it can be moved completely into the front-end processing.
Abstract: This paper reports on a simplified system for determining vocal tract normalization. Such normalization has led to significant gains in recognition accuracy by reducing variability among speakers and allowing the pooling of training data and the construction of sharper models. But standard methods for determining the warp scale have been extremely cumbersome, generally requiring multiple recognition passes. We present a new system for warp scale selection which uses a simple generic voiced speech model to rapidly select appropriate frequency scales. The selection is sufficiently streamlined that it can be moved completely into the front-end processing. Using this system on a standard test of the Switchboard Corpus, we have achieved relative reductions in word error rates of 12% over unnormalized gender-independent models and 6% over our best unnormalized gender-dependent models.
TL;DR: A computer system for user speech actuation of access to stored information, the system including a central processing unit, a memory and a user input/output interface including a microphone for input of user speech utterances and audible sound signal processing circuitry, and a file system for accessing and storing information in the memory of the computer.
Abstract: A computer system for user speech actuation of access to stored information, the system including a central processing unit, a memory and a user input/output interface including a microphone for input of user speech utterances and audible sound signal processing circuitry, and a file system for accessing and storing information in the memory of the computer. A speech recognition processor operating on the computer system recognizes words based on the input speech utterances of the user in accordance with a set of language/acoustic model and speech recognition search parameters. Software running on the CPU scans a document accessed by a web browser to form a web triggered word set from a selected subset of information in the document. The language/acoustic model and speech recognition search parameters are modified dynamically using the web triggered word set, and used by the speech recognition processor for generating a word string for input to the browser to initiate a change in the information accessed.
TL;DR: The SBR method, integrated into a discrete density HMM, is applied to telephone speech recognition where the contamination due to extraneous signal components is assumed to be unknown and to enable real-time implementation, a sequential method for the estimation of the bias is presented.
Abstract: An acoustical mismatch between the training and the testing conditions of hidden Markov model (HMM)-based speech recognition systems often causes a severe degradation in the recognition performance. In telephone speech recognition, for example, undesirable signal components due to ambient noise and channel distortion, as well as due to different variations of telephone handsets, render the recognizer unusable for real-world applications. This paper presents a signal bias removal (SBR) method based on maximum likelihood estimation for the minimization of these undesirable effects. The proposed method is readily applicable in various architectures, i.e., discrete (vector-quantization based), semicontinuous and continuous density HMM. In this paper, the SBR method, integrated into a discrete density HMM, is applied to telephone speech recognition where the contamination due to extraneous signal components is assumed to be unknown. To enable real-time implementation, a sequential method for the estimation of the bias is presented. Experimental results for speaker-independent connected digit recognition show a reduction in the per digit error rate by up to 41% and 14% during mismatched and matched training and testing conditions, respectively.
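A rough sketch of bias estimation against a vector-quantization codebook, in the spirit of the discrete-density setting mentioned above. It assumes the unknown contamination acts as a constant additive bias on the feature vectors and estimates it batch-wise as the mean residual between each frame and its nearest centroid, repeated for a few iterations. This formulation, the function names, and the toy data are assumptions; the sequential estimator described in the abstract is not shown.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> Vector:
    """Return the codebook centroid closest to x in squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c)))


def estimate_bias(frames: Sequence[Vector], codebook: Sequence[Vector],
                  iterations: int = 3) -> Vector:
    """Iteratively estimate an additive bias shared by all frames."""
    dim = len(frames[0])
    bias = [0.0] * dim
    for _ in range(iterations):
        residual_sum = [0.0] * dim
        for x in frames:
            # Quantize the frame after removing the current bias estimate.
            cleaned = [a - b for a, b in zip(x, bias)]
            centroid = nearest(codebook, cleaned)
            for d in range(dim):
                residual_sum[d] += x[d] - centroid[d]
        bias = [s / len(frames) for s in residual_sum]
    return bias


def remove_bias(frames: Sequence[Vector], bias: Vector) -> List[Vector]:
    return [[a - b for a, b in zip(x, bias)] for x in frames]


# Toy data: clean centroids shifted by a hypothetical bias of (0.5, -0.5).
codebook = [[0.0, 0.0], [4.0, 4.0]]
frames = [[0.5, -0.5], [4.5, 3.5], [0.5, -0.5], [4.5, 3.5]]
bias = estimate_bias(frames, codebook)
cleaned_frames = remove_bias(frames, bias)
```

In this toy case the frames map back onto the codebook centroids once the estimated bias is subtracted.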
TL;DR: A speech signal transmitting receiving apparatus, such as a portable telephone set, includes a speech signal transmitting encoding circuit, a noise domain detection unit, a noise level detection unit and a controller.
Abstract: A speech signal transmitting receiving apparatus, such as a portable telephone set, includes a speech signal transmitting encoding circuit, a noise domain detection unit, a noise level detection unit and a controller. The speech signal transmitting encoding circuit compresses input speech signals by digital signal processing at a high efficiency. The noise domain detection unit detects the noise domain using an analytic pattern produced by the speech signal transmitting encoding circuit. The noise level detection unit detects the noise level of the noise domain detected by the noise domain detection unit. The controller controls the received sound volume responsive to the noise level detected by the noise level detection unit.
TL;DR: The database structure and techniques adopted to improve the performance of a Discrete Hidden Markov Model (DHMM) labeler used to assign initial phoneme labels to the elements of the Nemours database are described.
Abstract: The Nemours database is a collection of 814 short nonsense sentences; 74 sentences spoken by each of 11 male speakers with varying degrees of dysarthria. Additionally, the database contains two connected-speech paragraphs produced by each of the 11 speakers. The database was designed to test the intelligibility of dysarthric speech before and after enhancement by various signal processing methods, and is available on CD-ROM. It can also be used to investigate general characteristics of dysarthric speech such as production error patterns. The entire database has been marked at the word level and sentences for 10 of the 11 talkers have been marked at the phoneme level as well. The paper describes the database structure and techniques adopted to improve the performance of a Discrete Hidden Markov Model (DHMM) labeler used to assign initial phoneme labels to the elements of the database. These techniques may be useful in the design of automatic recognition systems for persons with speech disorders, especially when limited amounts of training data are available.
TL;DR: A new highly parallel approach to automatic recognition of speech, inspired by Fletcher's early research on the articulation index, and based on independent probability estimates in several sub-bands of the available speech spectrum, is presented.
Abstract: A new highly parallel approach to automatic recognition of speech, inspired by Fletcher's early research on the articulation index, and based on independent probability estimates in several sub-bands of the available speech spectrum, is presented. The approach is especially suitable for situations when part of the spectrum of speech is corrupted. In such cases, it can yield an order-of-magnitude improvement in the error rate over a conventional full-band recognizer.
TL;DR: In this article, a speaker specific and non-speaker specific method and apparatus is provided for enabling speech-based remote control and for recognizing the speech of an unspecified speaker at extremely high recognition rates regardless of the speaker's age, sex, or individual speech mannerisms.
Abstract: Bifurcated speaker specific and non-speaker specific method and apparatus is provided for enabling speech-based remote control and for recognizing the speech of an unspecified speaker at extremely high recognition rates regardless of the speaker's age, sex, or individual speech mannerisms. A device main unit is provided with a speech recognition processor for recognizing speech and taking an appropriate action, and with a user terminal containing specific speaker capture and/or preprocessing capabilities. The user terminal exchanges data with the speech recognition processor using radio transmission. The user terminal may be provided with a conversion rule generator that compares the speech of a user with previously compiled standard speech feature data and, based on this comparison result, generates a conversion rule for converting the speaker's speech feature parameters to corresponding standard speaker's feature information. The speech recognition processor, in turn, may reference the conversion rule developed in the user terminal and perform speech recognition based on the input speech feature parameters that have been converted above.
TL;DR: In this article, the authors proposed a formalized aliasing approach to improve the quality of electronic speech synthesis using pre-recorded segments of speech to fill in for other missing segments.
Abstract: The present invention improves upon electronic speech synthesis using pre-recorded segments of speech to fill in for other missing segments of speech. The formalized aliasing approach of the present invention overcomes the ad hoc aliasing approach of the prior art which oftentimes generated less than satisfactory speech synthesis sound output. By formalizing the relationship between missing speech sound samples and available speech sound samples, the present invention provides a structured approach to aliasing which results in improved synthetic speech sound quality. Further, the formalized aliasing approach of the present invention can be used to lessen storage requirements for speech sound samples by only storing as many sound samples as memory capacity can support.
TL;DR: A language model is introduced that predicts disfluencies probabilistically and uses an edited, fluent context to predict following words and finds that the model reduces the word perplexity in the neighborhood of disfluency events; however, overall differences are small and have no significant impact on the recognition accuracy.
Abstract: Speech disfluencies (such as filled pauses, repetitions, restarts) are among the characteristics distinguishing spontaneous speech from planned or read speech. We introduce a language model that predicts disfluencies probabilistically and uses an edited, fluent context to predict following words. The model is based on a generalization of the standard N-gram language model. It uses dynamic programming to compute the probability of a word sequence, taking into account possible hidden disfluency events. We analyze the model's performance for various disfluency types on the Switchboard corpus. We find that the model reduces the word perplexity in the neighborhood of disfluency events; however, overall differences are small and have no significant impact on the recognition accuracy. We also note that for modeling of the most frequent type of disfluency, filled pauses, a segmentation of utterances into linguistic (rather than acoustic) units is required. Our analysis illustrates a generally useful technique for language model evaluation based on local perplexity comparisons.
TL;DR: The authors highlight two broader domains surrounding specific attributions of emotion and the specific features of speech that underlie them, and argue for caution over compartmentalising these broader domains.
Abstract: The authors highlight two broader domains surrounding specific attributions of emotion and the specific features of speech that underlie them, and argue for caution over compartmentalising these broader domains. It seems to be a general rule that variations in what we call the augmented prosodic domain (APD) are emotive, perhaps because they signal departure from a reference point corresponding to a well-controlled, neutral state. The studies show that various departures from that reference point are reflected in the APD, including central and sensory impairments (schizophrenia and deafness) as well as emotion. Intuitively it seems right to acknowledge that departures from well-controlled neutrality are highly confusable, and it is unclear that phonetics should try to draw those distinctions more sharply than listeners tend to. A system called ASSESS automatically measures properties in the APD, opening the way to explore it in an empirical spirit.
TL;DR: In this article, a system and method for identifying the phoneme sound types that are contained within an audio speech signal is disclosed, which includes a microphone (12) and associated conditioning circuitry (14).
Abstract: A system and method for identifying the phoneme sound types that are contained within an audio speech signal is disclosed. The system includes a microphone (12) and associated conditioning circuitry (14), for receiving an audio speech signal and converting it to a representative electrical signal. The electrical signal is then sampled and converted to a digital audio signal with an analog-to-digital converter (34). The digital audio signal is input to a programmable digital sound processor (18), which digitally processes the sound so as to extract various time domain and frequency domain sound characteristics. These characteristics are input to a programmable host sound processor (20) which compares the sound characteristics to standard sound data. Based on this comparison, the host sound processor (20) identifies the specific phoneme sounds that are contained within the audio speech signal. The programmable host sound processor (20) further includes linguistic processing program methods to convert the phoneme sounds into English words or other natural language words. These words are input to a host processor (22), which then utilizes the words as either data or commands.
TL;DR: A system and method for performing speaker adaptation in a speech recognition system which includes a set of reference models corresponding to speech data from a plurality of speakers that is adapted to account for speech pattern idiosyncrasies of the new speaker, thereby reducing the error rate of the speech recognition system.
Abstract: A system and method for performing speaker adaptation in a speech recognition system which includes a set of reference models corresponding to speech data from a plurality of speakers. The speech data is represented by a plurality of acoustic models and corresponding sub-events, and each sub-event includes one or more observations of speech data. A degree of lateral tying is computed between each pair of sub-events, wherein the degree of tying indicates the degree to which a first observation in a first sub-event contributes to the remaining sub-events. When adaptation data from a new speaker becomes available, a new observation from adaptation data is assigned to one of the sub-events. Each of the sub-events is then populated with the observations contained in the assigned sub-event based on the degree of lateral tying that was computed between each pair of sub-events. The reference models corresponding to the populated sub-events are then adapted to account for speech pattern idiosyncrasies of the new speaker, thereby reducing the error rate of the speech recognition system.
TL;DR: In this paper, a language identification and verification system is described whereby language identification is determined by finding the closest match of a speech utterance to multiple speaker sets, and a language decision is arrived at based on a closest match between the unknown speech features and speech features for such well matched reference speakers in a particular language.
Abstract: A language identification and verification system is described whereby language identification is determined by finding the closest match of a speech utterance to multiple speaker sets. The language identification and verification system is implemented through use of a speaker identification/verification system as a baseline to find a set of well matched speakers in each of a plurality of languages. A comparison of unknown speech to speech features from such well-matched speakers is then made and a language decision is arrived at based on a closest match between the unknown speech features and speech features for such well matched reference speakers in a particular language. To avoid a problem associated with prior-art language identification systems, wherein speech features are based on short-term spectral features determined at a system frame rate, thereby seriously limiting the resolution and accuracy of such prior-art systems, the invention uses speech features derived from vocalic or syllabic nuclei, from which related phonetic speech features may then be extracted. Detection of such vocalic centers or syllabic nuclei is accomplished using a trained back-error propagation multi-level neural network.
TL;DR: This work shows how knowledge sources in the course of a (man-machine) dialogue may be utilized in a stochastic framework to improve speech understanding.
Abstract: In the course of a (man-machine) dialogue, the system's belief concerning the user's intention is continuously being built up. Moreover, restricting the discourse to a narrow application domain further constrains the variety of possible user reactions. We show how these knowledge sources may be utilized in a stochastic framework to improve speech understanding. On field test data collected with our automatic exchange board prototype PADIS, a relative reduction of attribute errors by 27% was obtained.
TL;DR: Several speech features are considered as potential stress-sensitive relayers using a previously established stressed speech database (SUSAS) and a neural network-based classifier is formulated based on an extended delta-bar-delta learning rule.
Abstract: It is well known that the variability in speech production due to task-induced stress contributes significantly to loss in speech processing algorithm performance. If an algorithm could be formulated that detects the presence of stress in speech, then such knowledge could be used to monitor speaker state, improve the naturalness of speech coding algorithms, or increase the robustness of speech recognizers. The goal in this study is to consider several speech features as potential stress-sensitive relayers using a previously established stressed speech database (SUSAS). The following speech parameters are considered: mel, delta-mel, delta-delta-mel, auto-correlation-mel, and cross-correlation-mel cepstral parameters. Next, an algorithm for speaker-dependent stress classification is formulated for the 11 stress conditions: angry, clear, cond50, cond70, fast, Lombard, loud, normal, question, slow, and soft. It is suggested that additional feature variations beyond neutral conditions reflect the perturbation of vocal tract articulator movement under stressed conditions. Given a robust set of features, a neural network-based classifier is formulated based on an extended delta-bar-delta learning rule. The performance is considered for the following three test scenarios: monopartition (nontargeted) and tripartition (both nontargeted and targeted) input feature vectors.
TL;DR: It is shown that a clean speech VQ codebook is more effective in providing intraframe constraints and, hence, better convergence of the iterative filtering scheme.
Abstract: Speech enhancement using iterative Wiener filtering has been shown to require interframe and intraframe constraints in all-pole parameter estimation. We show that a clean speech VQ codebook is more effective in providing intraframe constraints and, hence, better convergence of the iterative filtering scheme. Satisfactory speech enhancement results are obtained with a small codebook of 128, and the algorithm is effective for both white noise and pink noise up to 0 dB SNR.