TL;DR: The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity, and the Gaussian mixture speaker model is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
Abstract: This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initialization, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
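To make the idea of scoring an utterance against per-speaker GMMs concrete, here is a minimal sketch. It assumes diagonal covariances and a maximum-likelihood decision over speakers, neither of which is stated in the abstract; the function names, the `(weight, mean, variance)` tuple layout, and the toy models are hypothetical and are not taken from the paper.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Sum over frames of log(sum_i w_i * N(x; mean_i, var_i)).

    gmm is a list of (weight, mean, var) tuples (assumed layout).
    """
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

def identify(frames, speaker_models):
    """Return the speaker whose GMM gives the highest log-likelihood."""
    return max(
        speaker_models,
        key=lambda name: gmm_log_likelihood(frames, speaker_models[name]),
    )

# Toy, made-up models with 2-dimensional features.
models = {
    "speaker_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "speaker_b": [(1.0, [10.0, 10.0], [1.0, 1.0])],
}
test_frames = [[0.1, -0.2], [1.9, 2.1]]
best = identify(test_frames, models)
```

In a real system the feature vectors would be spectral features extracted from speech frames and the model parameters would be trained per speaker; both are outside the scope of this sketch.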
TL;DR: High performance speaker identification and verification systems based on Gaussian mixture speaker models: robust, statistically based representations of speaker identity, evaluated on four publicly available speech databases.
TL;DR: The survey indicates that the essential points in noisy speech recognition consist of incorporating time and frequency correlations, giving more importance to high SNR portions of speech in decision making, exploiting task-specific a priori knowledge both of speech and of noise, using class-dependent processing, and including auditory models in speech processing.
TL;DR: In this article, a knowledge-based speech recognition apparatus and methods are provided for translating an input speech signal to text, which employ a largely speaker independent dictionary based upon the application of phonological and phonetic/acoustic rules to generate acoustic event transcriptions against which the series of hypothesized acoustic feature vectors are compared to select word choices.
Abstract: Knowledge based speech recognition apparatus and methods are provided for translating an input speech signal to text. The speech recognition apparatus captures an input speech signal, segments it based on the detection of pitch period, and generates a series of hypothesized acoustic feature vectors for the input speech signal that characterizes the signal in terms of primary acoustic events, detectable vowel sounds and other acoustic features. The apparatus and methods employ a largely speaker-independent dictionary based upon the application of phonological and phonetic/acoustic rules to generate acoustic event transcriptions against which the series of hypothesized acoustic feature vectors are compared to select word choices. Local and global syntactic analysis of the word choices is provided to enhance the recognition capability of the methods and apparatus.
TL;DR: The motivation for the corpus, the processes undertaken in its construction and the utilities needed as support tools are described, and comparative results on its evaluation tasks for British and American English are presented.
Abstract: A significant new speech corpus of British English has been recorded at Cambridge University. Derived from the Wall Street Journal text corpus, WSJCAM0 constitutes one of the largest corpora of spoken British English currently in existence. It has been specifically designed for the construction and evaluation of speaker-independent speech recognition systems. The database consists of 140 speakers each speaking about 110 utterances. This paper describes the motivation for the corpus, the processes undertaken in its construction and the utilities needed as support tools. All utterance transcriptions have been verified and a phonetic dictionary has been developed to cover the training data and evaluation tasks. Two evaluation tasks have been defined using standard 5000 word bigram and 20000 word trigram language models. The paper concludes with comparative results on these tasks for British and American English.
TL;DR: A new adaptive filtering algorithm called fast affine projections (FAP) is presented, which combines LMS-like complexity and memory requirements (low) with RLS-like convergence (fast) for the important case where the excitation signal is speech.
Abstract: This paper discusses a new adaptive filtering algorithm called fast affine projections (FAP). FAP's key features include LMS-like complexity and memory requirements (low), and RLS-like convergence (fast) for the important case where the excitation signal is speech. Another of FAP's important features is that it causes no delay in the input or output signals. In addition, the algorithm is easily regularized, resulting in robust performance even for highly colored excitation signals. The combination of these features makes FAP an excellent candidate for the adaptive filter in the acoustic echo cancellation problem. A simple, low complexity numerical stabilization method for the algorithm is also introduced.
TL;DR: Two new techniques are presented to estimate the noise spectra or the noise characteristics for noisy speech signals and can be combined with a nonlinear spectral subtraction scheme to enhance noisy speech and to improve the performance of speech recognition systems.
Abstract: Two new techniques are presented to estimate the noise spectra or the noise characteristics for noisy speech signals. No explicit speech pause detection is required. Past noisy segments of just about 400 ms duration are needed for the estimation. Thus the algorithm is able to quickly adapt to slowly varying noise levels or slowly changing noise spectra. These techniques can be combined with a nonlinear spectral subtraction scheme. The ability can be shown to enhance noisy speech and to improve the performance of speech recognition systems. Another application is the realization of a robust voice activity detection.
TL;DR: In this paper, a speech coding system employing an adaptive codebook model of periodicity is augmented with a pitch-predictive filter (PPF), which has a delay equal to the integer component of the pitch-period and a gain which is adaptive based on a measure of the periodicity of the speech signal.
Abstract: A speech coding system employing an adaptive codebook model of periodicity is augmented with a pitch-predictive filter (PPF). This PPF has a delay equal to the integer component of the pitch-period and a gain which is adaptive based on a measure of periodicity of the speech signal. In accordance with an embodiment of the present invention, speech processing systems which include a first portion comprising an adaptive codebook and corresponding adaptive codebook amplifier and a second portion comprising a fixed codebook coupled to a pitch filter, are adapted to delay the adaptive codebook gain; determine the pitch filter gain based on the delayed adaptive codebook gain, and amplify samples of a signal in the pitch filter based on said determined pitch filter gain. The adaptive codebook gain is delayed for one subframe. The pitch filter gain equals the delayed adaptive codebook gain, except when the adaptive codebook gain is either less than 0.2 or greater than 0.8, in which cases the pitch filter gain is set equal to 0.2 or 0.8, respectively.
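The gain rule in the last sentences of this abstract can be sketched as a one-subframe delay followed by a clamp to the range [0.2, 0.8]. The function names and the `initial_gain` used for the first subframe (which has no predecessor) are assumptions for illustration and are not specified in the abstract.

```python
def pitch_filter_gain(delayed_acb_gain, low=0.2, high=0.8):
    """Clamp the delayed adaptive codebook gain to [low, high]."""
    return min(max(delayed_acb_gain, low), high)

def pitch_filter_gains(acb_gains, initial_gain=0.2):
    """Pitch filter gain per subframe, from the previous subframe's
    adaptive codebook gain.

    initial_gain stands in for the missing predecessor of the first
    subframe (an assumption, not taken from the abstract).
    """
    delayed = [initial_gain] + list(acb_gains[:-1])  # one-subframe delay
    return [pitch_filter_gain(g) for g in delayed]

# Toy adaptive codebook gains for four consecutive subframes.
gains = pitch_filter_gains([0.1, 0.5, 1.2, 0.9])
```

Here the delayed gains are 0.2, 0.1, 0.5 and 1.2, so the second value is raised to the 0.2 floor and the last is capped at 0.8.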
TL;DR: Whereas 9-month-olds appear to be capable of integrating sequential and suprasegmental information in forming wordlike (multisyllabic) phonological percepts, 6-month-olds are not, and the emergence of integrative abilities portends increased efficiency in speech processing and may contribute to the formation of an initial lexicon.
Abstract: 5 studies examined contributions of syllable-ordering and rhythmic properties of syllable strings to 6- and 9-month-old infants' speech segmentation. A pair of methods measuring complementary properties of representational units was used: a noise detection task sensitive to perceived cohesiveness of pairs of syllables, and a discrimination maintenance task sensitive to compactness of representations of syllable pairs. For 9-month-olds, results show that a key pair of syllables was represented as a unit when the grouping of these syllables was supported by correlated regularities of ordering and rhythm in the set of stimulus strings, but not when such grouping was supported by only rhythmic or only syllable-ordering regularity. For 6-month-olds, results show that a key pair of syllables was represented as a unit whenever grouping was supported by rhythmic regularity in the stimulus strings, regardless of whether syllable-ordering regularity was also present. Thus, whereas 9-month-olds appear to be capable of integrating sequential and suprasegmental information in forming wordlike (multisyllabic) phonological percepts, 6-month-olds are not. The emergence of integrative abilities portends increased efficiency in speech processing and may contribute to the formation and use of an initial lexicon.
TL;DR: It is shown that the nonlinear adaptive predictor outperforms the traditional linear adaptive scheme in a significant way for the case of a speech signal.
Abstract: We describe a computationally efficient scheme for the nonlinear adaptive prediction of nonstationary signals whose generation is governed by a nonlinear dynamical mechanism. The complete predictor consists of two subsections. One performs a nonlinear mapping from the input space to an intermediate space with the aim of linearizing the input signal, and the other performs a linear mapping from the new space to the output space. The nonlinear subsection consists of a pipelined recurrent neural network (PRNN), and the linear section consists of a conventional tapped-delay-line (TDL) filter. The nonlinear adaptive predictor described is of general application. The dynamic behavior of the predictor is demonstrated for the case of a speech signal; for this application, it is shown that the nonlinear adaptive predictor outperforms the traditional linear adaptive scheme in a significant way.
TL;DR: In this article, a word tagging and editing system for speech recognition receives recognized speech text from a speech recognition engine, and creates tagging information that follows the speech text as it is received by a word processing program or other program.
Abstract: A word tagging and editing system for speech recognition receives recognized speech text from a speech recognition engine, and creates tagging information that follows the speech text as it is received by a word processing program or other program. The body of text to be edited in connection with the word processing program may be selected and cut and pasted and otherwise manipulated, and the tags follow the speech text. A word may be selected by a user, and the tag information used to point to a sound bite within the audio data file created initially by the speech recognition engine. The sound bite may be replayed to the user through a speaker. The practical results include that the user may confirm the correctness of a particular recognized word, in real time whilst editing text in the word processor. If the recognition is manually corrected, the correction information may be supplied to the engine for use in updating a user profile for the user who dictated the audio that was recognized. Particular tagging approaches are employed depending on the particular word processor being used.
TL;DR: The need for multidisciplinary research, for development of shared corpora and related resources, and for computational support and rapid communication among researchers is reviewed, along with the expected benefits of this technology.
Abstract: A spoken language system combines speech recognition, natural language processing and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language systems range from simple tasks, such as retrieving information from an existing database (traffic reports, airline schedules), to interactive problem solving tasks involving complex planning and reasoning (travel planning, traffic routing), to support for multilingual interactions. We examine eight key areas in which basic research is needed to produce spoken language systems: (1) robust speech recognition; (2) automatic training and adaptation; (3) spontaneous speech; (4) dialogue models; (5) natural language response generation; (6) speech synthesis and speech generation; (7) multilingual systems; and (8) interactive multimodal systems. In each area, we identify key research challenges, the infrastructure needed to support research, and the expected benefits. We conclude by reviewing the need for multidisciplinary research, for development of shared corpora and related resources, for computational support and for rapid communication among researchers. The successful development of this technology will increase accessibility of computers to a wide range of users, will facilitate multinational communication and trade, and will create new research specialties and jobs in this rapidly expanding area.
TL;DR: In this paper, the subjectively annoying "swishing" or "waterfall" effects encountered in conventional LPC speech processing systems are reduced or eliminated by calculating the LPC coefficients for each noise interval from the samples of that noise interval and of a plurality of preceding signal intervals.
Abstract: In methods and apparatus for processing a speech signal comprising a plurality of successive signal intervals, each signal interval containing no speech sounds is classified as a noise interval, and LPC coefficients are calculated for each noise interval based on the samples of that noise interval and on the samples of a plurality of preceding signal intervals. When noise intervals encoded using LPC coefficients calculated as described above are reconstructed, the subjectively annoying "swishing" or "waterfall" effects encountered in conventional LPC speech processing systems are reduced or eliminated.
TL;DR: In this article, a speech recognition system provides a user with graphical and textual feedback; the textual feedback is displayed in windows but occupies little of the available display space and is displayed only for a short period of time.
Abstract: A speech recognition system provides a user with graphical and textual feedback. The textual feedback is displayed in windows but occupies little of the available display space and is displayed only for a short period of time. The graphical feedback is displayed in a designated notification area and does not obscure any other displayed items. The feedback provided by the speech recognition system may indicate a current mode of operation of the speech recognition system as well as a state of processing of audio input by the speech recognition system.
TL;DR: The use of some of the proposed measures as a reference benchmark to evaluate the intrinsic complexity of a given database under a given protocol is suggested as a conclusion to this work.
TL;DR: A modular system and method is provided for encoding and decoding of speech signals using voicing probability determination and the use of the system in the generation of a variety of voice effects.
Abstract: A modular system and method is provided for encoding and decoding of speech signals using voicing probability determination. The continuous input speech is divided into time segments of a predetermined length. For each segment the encoder of the system computes the signal pitch and a parameter which is related to the relative content of voiced and unvoiced portions in the spectrum of the signal, which is expressed as a ratio Pv, defined as a voicing probability. The voiced portion of the signal spectrum, as determined by the parameter Pv, is encoded using a set of harmonically related amplitudes corresponding to the estimated pitch. The unvoiced portion of the signal is processed in a separate processing branch which uses a modified linear predictive coding algorithm. Parameters representing both the voiced and the unvoiced portions of a speech segment are combined in data packets for transmission. In the decoder, speech is synthesized from the transmitted parameters representing voiced and unvoiced portions of the speech in a reverse order. Boundary conditions between voiced and unvoiced segments are established to ensure amplitude and phase continuity for improved output speech quality. Perceptually smooth transition between frames is ensured by using an overlap and add method of synthesis. Also disclosed is the use of the system in the generation of a variety of voice effects.
TL;DR: A model of spectral shape analysis in the central auditory system is developed based on neurophysiological mappings in the primary auditory cortex and on results from psychoacoustical experiments in human subjects, showing that this representation is equivalent to performing an affine wavelet transform of the spectral pattern.
Abstract: A model of spectral shape analysis in the central auditory system is developed based on neurophysiological mappings in the primary auditory cortex and on results from psychoacoustical experiments in human subjects. The model suggests that the auditory system analyzes an input spectral pattern along three independent dimensions: a logarithmic frequency axis, a local symmetry axis, and a local spectral bandwidth axis. It is shown that this representation is equivalent to performing an affine wavelet transform of the spectral pattern and preserving both the magnitude (a measure of the scale or local bandwidth of the spectrum) and phase (a measure of the local symmetry of the spectrum). Such an analysis is in the spirit of the cepstral analysis commonly used in speech recognition systems, the major difference being that the double Fourier-like transformation that the auditory system employs is carried out in a local fashion. Examples of such a representation for various speech and synthetic signals are discussed, together with its potential significance and applications for speech and audio processing.
TL;DR: A new algorithm is proposed for foreign accent classification of American English using a source generator framework, and it is shown that as accent-sensitive word count increases, the ability to correctly classify accent also increases, achieving an overall classification rate of 92% among four accent classes.
Abstract: Speaker accent is an important issue in the formulation of robust speaker independent recognition systems. Knowledge gained from a reliable accent classification approach could improve overall recognition performance. In this paper, a new algorithm is proposed for foreign accent classification of American English. A series of experimental studies are considered which focus on establishing how speech production is varied to convey accent. The proposed method uses a source generator framework, recently proposed for analysis and recognition of speech under stress [5]. An accent-sensitive database is established using speakers of American English with foreign language accents. An initial version of the classification algorithm classified speaker accent from among four different accents with an accuracy of 81.5% in the case of unknown text, and 88.9% assuming known text. Finally, it is shown that as accent-sensitive word count increases, the ability to correctly classify accent also increases, achieving an overall classification rate of 92% among four accent classes.
TL;DR: A new search strategy particularly effective for very large vocabulary word recognition, performs a tree based, time synchronous, left-to-right beam search that develops time-dependent acoustic and phonetic hypotheses.
Abstract: The paper presents a fast segmental Viterbi algorithm, a new search strategy particularly effective for very large vocabulary word recognition. It performs a tree based, time synchronous, left-to-right beam search that develops time-dependent acoustic and phonetic hypotheses. At any given time, it makes active a sub-word unit associated to an arc of a lexical tree only if that time is likely to be the boundary between the current and the next unit. This new technique, tested with a vocabulary of 188892 directory entries, achieves the same results obtained with the Viterbi algorithm, with a 35% speedup. Results are also presented for a 718 word, speaker independent continuous speech recognition task.
TL;DR: It is suggested that phone rate is a more meaningful measure of speech rate than the more common word rate, and it is found that when data sets are clustered according to the phone rate metric, recognition errors increase when the phone rate is more than 1 standard deviation greater than the mean.
Abstract: It is well known that a higher-than-normal speech rate will cause the rate of recognition errors in large vocabulary automatic speech recognition (ASR) systems to increase. In this paper we attempt to identify and correct for errors due to fast speech. We first suggest that phone rate is a more meaningful measure of speech rate than the more common word rate. We find that when data sets are clustered according to the phone rate metric, recognition errors increase when the phone rate is more than 1 standard deviation greater than the mean. We propose three methods to improve the recognition accuracy of fast speech, each addressing different aspects of performance degradation. The first method is an implementation of Baum-Welch codebook adaptation. The second method is based on the adaptation of HMM state-transition probabilities. In the third method, the pronunciation dictionaries are modified using rule-based techniques and compound words are added. We compare improvements in recognition accuracy for each method using data sets clustered according to the phone rate metric. Adaptation of the HMM state-transition probabilities to fast speech improves recognition of fast speech by a relative amount of 4 to 6 percent.
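The clustering criterion mentioned in this abstract (phone rate more than 1 standard deviation above the mean) can be illustrated with a short sketch. The function name, the use of the population standard deviation, and the toy phone rates are assumptions for illustration; the abstract does not say how the statistics were computed.

```python
from statistics import mean, pstdev

def fast_speech_indices(phone_rates, num_std=1.0):
    """Indices of utterances whose phone rate exceeds mean + num_std * std.

    Uses the population standard deviation (an assumption).
    """
    threshold = mean(phone_rates) + num_std * pstdev(phone_rates)
    return [i for i, rate in enumerate(phone_rates) if rate > threshold]

# Made-up phone rates (phones per second) for five utterances.
rates = [10, 11, 12, 11, 20]
fast = fast_speech_indices(rates)
```

With these toy values the mean is 12.8 and the threshold is roughly 16.46, so only the last utterance is flagged as fast speech.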
TL;DR: In this paper, an attempt was made to compensate for the systematic variability in the SWITCHBOARD corpus due to different vocal tract lengths of various speakers by warping the spectrum of each speaker linearly over a 20% range and finding the maximum a posteriori probability of the data given the warp.
Abstract: The performance of speech recognition systems is often improved by accounting explicitly for sources of variability in the data. In the SWITCHBOARD corpus, studied during the 1994 CAIP workshop [Frontiers in Speech Processing Workshop II, CAIP (August 1994)], an attempt was made to compensate for the systematic variability due to different vocal tract lengths of various speakers. The method found a maximum probability parameter for each speaker which mapped an acoustic model to the mean of the models taken from a homogeneous speaker population. The underlying acoustic model was that of a straight tube, and the parameter estimation was accomplished by warping the spectrum of each speaker linearly over a 20% range (actually accomplished by digitally resampling the data), and finding the maximum a posteriori probability of the data given the warp. The technique produces statistically significant improvements in accuracy on a speech transcription task using each of four different speech recognition systems. The best parametrizations were later found to correlate well with vocal tract estimates computed manually from spectrograms.
TL;DR: In this article, a CS-ACELP decoder generates a speech excitation signal selectively based on output signals from its adaptive codebook and fixed codebook portions when the decoder fails to receive reliably at least a portion of a current frame of compressed speech information.
Abstract: A CELP speech decoder includes a first portion comprising an adaptive codebook and a second portion comprising a fixed codebook. The CS-ACELP decoder generates a speech excitation signal selectively based on output signals from said first and second portions when said decoder fails to receive reliably at least a portion of a current frame of compressed speech information. The decoder does this by classifying the speech signal to be generated as periodic (voiced) or non-periodic (unvoiced) and then generating an excitation signal based on this classification. If the speech signal is classified as periodic, the excitation signal is generated based on the output signal from the first portion and not on the output signal from the second portion. If the speech signal is classified as non-periodic, the excitation signal is generated based on the output signal from said second portion and not on the output signal from said first portion.
TL;DR: In this article, a voice activity detector uses an energy estimate to detect the presence of speech in a received speech signal in a noise environment, and a set of high pass filters are used to filter the signal based upon the background noise level.
Abstract: A method and apparatus for improving sound quality in a digital cellular radio system receiver. A voice activity detector uses an energy estimate to detect the presence of speech in a received speech signal in a noise environment. When no speech is present the system attenuates the signal and inserts low pass filtered white noise. In addition, a set of high pass filters are used to filter the signal based upon the background noise level. This high pass filtering is applied to the signal regardless of whether speech is present. Thus, a combination of signal attenuation with insertion of low pass filtered white noise during periods of non-speech, along with high pass filtering of the signal, improves sound quality when decoding speech which has been encoded in a noisy environment.
TL;DR: A computerized system time aligns frames of spoken training data against models of the speech sounds; automatically selects different sets of phonetic context classifications which divide the speech sound models into speech sound groups aligned against acoustically similar frames; creates model components from the frames aligned against speech sound groups with related classifications; and uses these model components to build a separate model for each related speech sound group.
Abstract: A computerized system time aligns frames of spoken training data against models of the speech sounds; automatically selects different sets of phonetic context classifications which divide the speech sound models into speech sound groups aligned against acoustically similar frames; creates model components from the frames aligned against speech sound groups with related classifications; and uses these model components to build a separate model for each related speech sound group. A decision tree classifies speech sounds into such groups, and related speech sound groups descend from common tree nodes. New speech samples time aligned against a given speech sound group's model update models of related speech sound groups, decreasing the training data required to adapt the system. The phonetic context classifications can be based on knowledge of which contextual features are associated with acoustic similarity. The computerized system samples speech sounds using a first, larger, parameter set; automatically selects combinations of phonetic context classifications which divide the speech sounds into groups whose frames are acoustically similar, such as by use of a decision tree; selects a second, smaller, set of parameters based on that set's ability to separate the frames aligned with each speech sound group, such as by use of linear discriminant analysis; and then uses these new parameters to represent frames and speech sound models. Then, using the new parameters, a decision tree classifier can be used to re-classify the speech sounds and to calculate new acoustic models for the resulting groups of speech sounds.
TL;DR: An instantaneous context switching speech recognition system is disclosed which enables a speech recognition application to be changed without loading new pattern matching data into the system.
Abstract: An instantaneous context switching speech recognition system is disclosed which enables a speech recognition application to be changed without loading new pattern matching data into the system. Selectable pointer maps are included in the memory of the system which selectively change the relationship between words and phonemes between a first application context and the pattern matching logic to a second application context and the pattern matching logic.
TL;DR: The Whisper (Windows Highly Intelligent Speech Recognizer) represents significantly improved recognition efficiency, usability, and accuracy, when compared with the Sphinx-II system.
Abstract: Since January 1993, the authors have been working to refine and extend Sphinx-II technologies in order to develop practical speech recognition at Microsoft. The result of that work has been the Whisper (Windows Highly Intelligent Speech Recognizer). Whisper represents significantly improved recognition efficiency, usability, and accuracy, when compared with the Sphinx-II system. In addition Whisper offers speech input capabilities for Microsoft Windows and can be scaled to meet different PC platform configurations. It provides features such as continuous speech recognition, speaker-independence, on-line adaptation, noise robustness, dynamic vocabularies and grammars. For typical Windows Command-and-Control applications (less than 1000 words), Whisper provides a software only solution on PCs equipped with a 486DX, 4MB of memory, and a standard sound card and a desk-top microphone.
TL;DR: A speech circuit is disclosed which solves the serious problem of the degradation of the articulation of received speech voice in conventional circuits and permits pleasant communications at places where the background noise level is high.
Abstract: A speech circuit is disclosed which solves the serious problem of the degradation of the articulation of received speech voice in conventional circuits and permits pleasant communications at places where the background noise level is high. The circuit has a construction in which an input signal from a microphone is attenuated in correspondence to the background noise level to form a sidetone signal and a received speech signal from a speech channel is amplified in correspondence to the background noise level to form a new received speech signal.
TL;DR: In this paper, a method for providing described television services includes the steps of generating description data corresponding to an audiovisual program, converting the description data to a speech signal corresponding to the description data, synchronizing the speech signal with the audiovisual program using a time code signal from the audiovisual program, and mixing the synchronized speech signal with the audio track of the audiovisual program to create a combined audio signal.
Abstract: An apparatus for providing described television services includes a receiver for receiving description data corresponding to an audiovisual program; a text-to-speech converter for converting the description data into a speech signal corresponding to the description data; a memory device for receiving and storing the speech signal and a corresponding time code from the audiovisual program; a mixing circuit for retrieving the speech signal from the memory device and mixing the retrieved speech signal with the audio track of the audiovisual program to produce a combined audio signal; and a transmitter for simultaneously providing the combined speech signal and the audiovisual program to a viewer. The apparatus provides the combined speech signal to the viewer via the SAP channel. The apparatus may also include a translator for translating the description data into a foreign language prior to converting the description data into the speech signal. A method for providing described television services includes the steps of generating description data corresponding to an audiovisual program; converting the description data to a speech signal corresponding to the description data; synchronizing the speech signal with the audiovisual program using a time code signal from the audiovisual program; mixing the synchronized speech signal with the audio track of the audiovisual program to create a combined audio signal; and simultaneously transmitting the combined audio signal and the audiovisual program to the viewer.
TL;DR: The proposed algorithm is a variation of the well-known spectral subtraction method which is attractive because of its simplicity, but introduces an unnatural and unpleasant residual noise.
Abstract: This paper addresses the problem of the intelligibility enhancement of speech corrupted by additive background noise in a single channel system. The proposed algorithm uses a criterion based on the human perception. It is a variation of the well-known spectral subtraction method which is attractive because of its simplicity, but introduces an unnatural and unpleasant residual noise. The proposed approach incorporates in this method considerations about noise masking of the auditory system. It succeeds in finding the best trade-off between noise reduction and speech distortion in a perceptual sense. Simulations show perceptually very satisfactory results and objective measures indicate a quality improvement. The speech processed with this new algorithm sounds more pleasant to a human listener than those obtained by the classical methods. This shows the relevance to incorporate perceptual aspects in the enhancement process.
TL;DR: This paper provides a fast projection algorithm and a step size control to obtain the same steady-state excess mean squared error (MSE) for various projection orders.
Abstract: Of the many adaptive filtering algorithms, the normalized LMS (NLMS) algorithm is generally used in practice because of its simplicity. The computational complexity of the NLMS algorithm is low; however, convergence is very slow and tracking is poor for a colored input signal such as speech. The projection algorithm was proposed as a generalization of the NLMS algorithm. This paper provides a fast projection algorithm and a step size control to obtain the same steady-state excess mean squared error (MSE) for various projection orders. Computer simulations for colored noise and speech input signal confirm the effectiveness of the projection algorithm and the step size control.
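For readers unfamiliar with the baseline this abstract starts from, the following is a minimal sketch of the standard NLMS update, not of the paper's fast projection algorithm or its step size control. The function name, the step size `mu`, the regularization constant `eps`, and the two-tap toy system are all illustrative assumptions.

```python
import random

def nlms_identify(x, d, num_taps, mu=0.5, eps=1e-8):
    """Adapt FIR weights w so that w . [x[n], x[n-1], ...] tracks d[n]."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps  # most recent input samples, newest first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # filter output
        e = dn - y                                   # a priori error
        norm = eps + sum(bi * bi for bi in buf)      # input energy
        # Normalized LMS update: step scaled by the input energy.
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
    return w

# Toy system identification with a white, noise-free input.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
h = [0.5, -0.3]  # hypothetical unknown system
d = [h[0] * x[n] + h[1] * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]
w = nlms_identify(x, d, num_taps=2)
```

This toy uses a white input, for which NLMS converges quickly. The slow convergence the abstract describes arises for colored inputs such as speech, which is the case the projection algorithm is meant to address.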