TL;DR: This dissertation aims to provide a history of web exceptionalism from 1989 to 2002, a period chosen in order to explore its roots as well as specific cases up to and including the year in which descriptions of “Web 2.0” began to circulate.
TL;DR: It is argued that the favorable performance of the subharmonic-summation algorithm stems from its corresponding more closely with current pitch-perception theories than does the harmonic sieve.
Abstract: In order to account for the phenomenon of virtual pitch, various theories assume implicitly or explicitly that each spectral component introduces a series of subharmonics. The spectral-compression method for pitch determination can be viewed as a direct implementation of this principle. The widespread application of this principle in pitch determination is, however, impeded by numerical problems with respect to accuracy and computational efficiency. A modified algorithm is described that solves these problems. Its performance is tested for normal speech and "telephone" speech, i.e., speech high-pass filtered at 300 Hz. The algorithm outperforms the harmonic-sieve method for pitch determination, while its computational requirements are about the same. The algorithm is described in terms of nonlinear system theory, i.e., subharmonic summation. It is argued that the favorable performance of the subharmonic-summation algorithm stems from its corresponding more closely with current pitch-perception theories than does the harmonic sieve.
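The subharmonic-summation principle can be illustrated with a short sketch. This is a simplified illustration and not the modified algorithm the abstract describes: it scores each candidate pitch by summing the amplitude spectrum at the candidate's harmonics with geometrically decaying weights. The function name `shs_pitch`, the search range, the number of harmonics, and the decay factor of 0.84 are all assumptions chosen for the example.

```python
import numpy as np

def shs_pitch(signal, fs, f_min=50.0, f_max=500.0, n_harmonics=15, decay=0.84):
    """Estimate pitch (Hz) by summing spectral amplitude at harmonics of each candidate.

    All default parameter values are illustrative assumptions.
    """
    windowed = signal * np.hanning(len(signal))
    n_fft = 4 * len(signal)  # zero-padding gives a finer frequency grid
    spectrum = np.abs(np.fft.rfft(windowed, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    candidates = np.arange(f_min, f_max, 1.0)  # 1 Hz candidate grid
    scores = np.zeros_like(candidates)
    for n in range(1, n_harmonics + 1):
        # Harmonics above the Nyquist frequency contribute nothing (right=0.0).
        harmonic_amplitude = np.interp(n * candidates, freqs, spectrum, right=0.0)
        scores += decay ** (n - 1) * harmonic_amplitude
    return float(candidates[np.argmax(scores)])
```

Because every harmonic votes for the candidates of which it is a multiple, the sketch can recover a fundamental that is absent from the signal, which is the situation for "telephone" speech high-pass filtered at 300 Hz.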
TL;DR: A first attempt to develop and test run an observation procedure for assessing the syntactic and morphological development of adult learners of English as a second language (ESL) as evidenced in spontaneous speech production is reported on.
Abstract: This article reports on a first attempt to develop and test run an observation procedure for assessing the syntactic and morphological development of adult learners of English as a second language (ESL) as evidenced in spontaneous speech production. The procedure is based on the profile analysis approach, which was first developed by Crystal, Fletcher, and Gorman (1976) for the assessment of impaired speech (English) and later adapted to the assessment of second language development (German) by Clahsen (1985). The theoretical basis of the procedure is the multidimensional model of second language acquisition (SLA) developed by Meisel, Clahsen, and Pienemann (1981) and extended to ESL acquisition by Pienemann and Johnston (1987a). According to the model, invariant developmental stages in the acquisition of certain syntactic and morphological elements in German and English can be predicted and explained in terms of hierarchically ordered speech processing constraints. In order to assess the developmental stage of ESL learners, an observation form was drawn up, incorporating a selection of morphosyntactic features whose presence or absence in a taped sample of natural speech was monitored by assessors. The ratings made by the assessors were then compared to those assigned through a detailed linguistic analysis to test the feasibility of using a “shorthand” version of a profile analysis. Analysis of the outcomes of the test run revealed significant correlations between the assessments and the linguistic analysis. But some variation was found in the assessors' ability to apply the assessment criteria, and the extent of agreement between the assessors' observations and the linguistic analysis was less than would be acceptable in the given theoretical framework. However, the source of these problems was identified through the first test run and suggestions were made for further refining the procedure to improve its accuracy.
TL;DR: The authors explore the trade-off between packing information into sequences of feature vectors and being able to model them accurately and investigate a method of parameter estimation which is designed to cope with inaccurate modeling assumptions.
Abstract: The acoustic-modelling problem in automatic speech recognition is examined from an information theoretic point of view. This problem is to design a speech-recognition system which can extract from the speech waveform as much information as possible about the corresponding word sequence. The information extraction process is factored into two steps: a signal-processing step which converts a speech waveform into a sequence of informative acoustic feature vectors, and a step which models such a sequence. The authors are primarily concerned with the use of hidden Markov models to model sequences of feature vectors which lie in a continuous space. They explore the trade-off between packing information into such sequences and being able to model them accurately. The difficulty of developing accurate models of continuous-parameter sequences is addressed by investigating a method of parameter estimation which is designed to cope with inaccurate modeling assumptions.
TL;DR: An improved version of a previously described automatic lipreading system has been developed which uses vector quantization, dynamic time warping, and a new heuristic distance measure to improve acoustic speech recognition.
Abstract: Current acoustic speech recognition technology performs well with very small vocabularies in noise or with large vocabularies in very low noise. Accurate acoustic speech recognition in noise with vocabularies over 100 words has yet to be achieved. Humans frequently lipread the visible facial speech articulations to enhance speech recognition, especially when the acoustic signal is degraded by noise or hearing impairment. Automatic lipreading has been found to significantly improve acoustic speech recognition and could be advantageous in noisy environments such as offices, aircraft and factories. An improved version of a previously described automatic lipreading system has been developed which uses vector quantization, dynamic time warping, and a new heuristic distance measure. This paper presents visual speech recognition results from multiple speakers under optimal conditions. Results from combined acoustic and visual speech recognition are also presented which show significantly improved performance compared to the acoustic recognition system alone.
TL;DR: In this paper, the authors employ a multiple-stage, delayed-decision adaptive digital signal processing algorithm implemented through the use of commonly available electronic circuit components to examine audio signal frames having harmonic content to identify voiced phonemes and determine whether the signal frame contains primarily speech or noise.
Abstract: A voice operated switch employs digital signal processing techniques to examine audio signal frames having harmonic content to identify voiced phonemes and to determine whether the signal frame contains primarily speech or noise. The method and apparatus employ a multiple-stage, delayed-decision adaptive digital signal processing algorithm implemented through the use of commonly available electronic circuit components. Specifically the method and apparatus comprise a plurality of stages, including (1) a low-pass filter to limit examination of input signals to below about one kHz, (2) a digital center-clipped autocorrelation processor which recognizes that the presence of periodic components of the input signal below and above a peak-related threshold identifies a frame as containing speech or noise, and (3) a nonlinear filtering processor which includes nonlinear smoothing of the frame-level decisions and incorporates a delay, and further incorporates a forward and backward decision extension at the speech-segment level of several tenths of milliseconds to determine whether adjacent frames are primarily speech or primarily noise.
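The center-clipped autocorrelation stage, stage (2) above, can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the patented circuit: the function name `is_voiced_frame`, the clipping ratio, the decision threshold, and the pitch search range are all hypothetical values, and the low-pass filter and the nonlinear smoothing stages are omitted.

```python
import numpy as np

def is_voiced_frame(frame, fs, clip_ratio=0.3, peak_threshold=0.4,
                    f_min=60.0, f_max=400.0):
    """Return True when the center-clipped frame shows a strong periodic component.

    All default parameter values are illustrative assumptions.
    """
    frame = frame - np.mean(frame)
    clip_level = clip_ratio * np.max(np.abs(frame))  # threshold tied to the frame peak
    # Center clipping: zero the samples inside +/- clip_level, shift the rest toward zero.
    clipped = np.where(frame > clip_level, frame - clip_level,
                       np.where(frame < -clip_level, frame + clip_level, 0.0))
    energy = float(np.dot(clipped, clipped))
    if energy == 0.0:
        return False
    # Autocorrelation at non-negative lags.
    acf = np.correlate(clipped, clipped, mode="full")[len(clipped) - 1:]
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(acf) - 1)
    normalized_peak = np.max(acf[lag_min:lag_max + 1]) / energy
    return bool(normalized_peak > peak_threshold)
```

A periodic frame produces a large autocorrelation peak at its pitch lag, while a noise frame does not, so the normalized peak serves as the frame-level speech/noise decision that a later stage could smooth.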
TL;DR: The most effective estimation technique for packets containing 16 ms of speech in a pulse-code-modulation format is pitch waveform replication, which extends the acceptable ratio of missing packets to 10%.
Abstract: Missing packets are a major cause of impairment in packet voice networks. While it is easiest to allow these gaps in received speech to appear as silent intervals in reconstructed speech, speech quality is improved by filling the gaps with estimates of the transmitted waveform. Several estimation techniques have been investigated for packets containing 16 ms of speech in a pulse-code-modulation format. The simplest method, packet repetition, extends the acceptable ratio of missing packets from 2% to 5%. Here, acceptability is defined as a mean opinion score midway between fair and good on a five-point opinion scale. The most effective estimation technique (although not the most complex) is pitch waveform replication. It extends the acceptable ratio of missing packets to 10%.
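The two named gap-filling techniques can be sketched as below. The abstract does not say how the pitch period is found, so the normalized cross-correlation search here is an assumption, as are the 8 kHz sampling rate (giving 128 samples per 16 ms packet), the lag range, and all function names.

```python
import numpy as np

PACKET_LEN = 128  # 16 ms of speech at an assumed 8 kHz sampling rate

def repeat_packet(previous_packet):
    """Packet repetition: fill the gap with a copy of the last received packet."""
    return previous_packet.copy()

def estimate_pitch_period(history, min_lag=20, max_lag=160):
    """Pick the lag at which the most recent samples best match earlier ones."""
    if len(history) < 2 * max_lag:
        raise ValueError("history must hold at least 2 * max_lag samples")
    window = history[-max_lag:]
    best_lag, best_score = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        earlier = history[-max_lag - lag:-lag]
        denom = np.linalg.norm(window) * np.linalg.norm(earlier) + 1e-12
        score = np.dot(window, earlier) / denom
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def replicate_pitch_waveform(history, gap_len=PACKET_LEN):
    """Pitch waveform replication: tile the last pitch cycle across the gap."""
    period = estimate_pitch_period(history)
    last_cycle = history[-period:]
    repeats = int(np.ceil(gap_len / period))
    return np.tile(last_cycle, repeats)[:gap_len]
```

For a steadily voiced segment the replicated cycles continue the waveform in phase, whereas a repeated packet generally restarts at an arbitrary phase, which is one plausible reason for the difference in acceptable loss ratios reported above.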
TL;DR: On a 997-word task using a bigram grammar, SPHINX achieved a word accuracy of 93%.
Abstract: SPHINX, the first large-vocabulary speaker-independent continuous-speech recognizer, is described. SPHINX is a hidden-Markov-model (HMM)-based recognizer using multiple codebooks of various LPC-derived features. Two types of HMMs are used in SPHINX: context-independent phone models and function-word-dependent phone models. On a 997-word task using a bigram grammar, SPHINX achieved a word accuracy of 93%. This demonstrates the feasibility of speaker-independent continuous-speech recognition, and the appropriateness of hidden Markov models for such a task.
TL;DR: In this paper, a system for generating high quality speech uses coarticulated speech segment data extracted from spoken carrier syllables and digitally compressed for storage using adaptive differential pulse code modulation (ADPCM).
Abstract: A system (87) for generating high quality speech uses coarticulated speech segment data extracted from spoken carrier syllables and digitally compressed for storage using adaptive differential pulse code modulation (ADPCM). The system includes a programmed digital microprocessor (89) with an associated read only memory (91) containing the compressed coarticulated speech segment library, random access memory (93) containing system variables and the sequence of coarticulated speech segments required to generate a desired spoken message, and text to speech chip (95) which provides the sequence of coarticulated speech segments to the RAM (93). The microprocessor (89) operates in accordance with a program stored in ROM (91) to recover the compressed coarticulated speech segment data stored in ROM (91) in a sequence called for by the text to speech chip (95), to reconstruct or "blow back" the stored ADPCM data to PCM data, and to concatenate the PCM data into waveforms to produce a real time digital speech waveform. The digital speech waveform is converted to an analog signal via digital to analog converter (97), amplified in amplifier (99) and applied to an audio speaker (101) which generates a high quality spoken message. In the preferred embodiment of the invention, the coarticulated speech segments are diphones.
TL;DR: The results indicated that under instructions of speeded responding, listeners could, on some trials, ignore some later occurring contextual information within the word that specified rate and lexical status, but could not ignore speaking rate entirely.
Abstract: Among the contextual factors known to play a role in segmental perception are the rate at which the speech was produced and the lexical status of the item, that is, whether it is a meaningful word of the language. In a series of experiments on the word-initial /b/-/p/ voicing distinction, we investigated the conditions under which these factors operate during speech processing. The results indicated that under instructions of speeded responding, listeners could, on some trials, ignore some later occurring contextual information within the word that specified rate and lexical status. Importantly, however, they could not ignore speaking rate entirely. Although they could base their decision on only the early portion of the word, when doing so they treated the word as if it were physically short--that is to say, as if there were no later occurring information specifying a slower rate. This suggests that listeners always take account of rate when identifying the voicing value of a consonant, but precisely which information within the word is used to specify rate can vary with task demands.
TL;DR: A probabilistic mixture model is described for a frame (the short-term spectrum) of speech to be used in speech recognition; the authors model the energy in each frequency band as the larger of the separate energies of signal and noise in the band.
Abstract: A probabilistic mixture model is described for a frame (the short-term spectrum) of speech to be used in speech recognition. Each component of the mixture is regarded as a prototype for the labeling phase of a hidden Markov model based speech recognition system. Since the ambient noise during recognition can differ from the ambient noise present in the training data, the model is designed for convenient updating in changing noise. Based on the observation that the energy in a frequency band is at any fixed time dominated either by signal energy or by noise energy, the authors model the energy as the larger of the separate energies of signal and noise in the band. Statistical algorithms are given for training this as a hidden variables model. The hidden variables are the prototype identities and the separate signal and noise components. A series of speech recognition experiments that successfully utilize this model is also discussed.
TL;DR: An adaptive method that utilizes the ordering property of the LSP parameters is introduced and a combination of this adaptive algorithm with nonuniform-step-size quantization is shown to be very effective for encoding the LSP parameters.
Abstract: The performance of several algorithms for the quantization of the LSP (line spectrum polar) parameters is studied. An adaptive method that utilizes the ordering property of the LSP parameters is introduced. A combination of this adaptive algorithm with nonuniform-step-size quantization is shown to be very effective for encoding the LSP parameters. The performance of the different quantization schemes is studied on several sequences of speech samples. For the spectral distortion measure, appropriate performance comparisons between the different quantization schemes are rendered.
TL;DR: For patients with indications of poor nerve survival, test scores were significantly higher with the interleaved pulses processors, and it is believed that the substantial release from channel interactions provided by nonsimultaneous stimuli and a fast enough rotation among the channels to support adequate temporal and spectral resolution of perceived speech sounds were responsible.
Abstract: A wide variety of speech processing strategies for multichannel auditory prostheses were compared in studies of two patients implanted with the UCSF electrode array. Each strategy was evaluated using tests of vowel and consonant confusions, with and without lipreading. Included among the strategies were the compressed analog processor of the present UCSF/Storz prosthesis and a group of interleaved pulses processors in which the amplitudes of nonsimultaneous pulses code the spectral variations of speech. For these patients, each with indications of poor nerve survival, test scores were significantly higher with the interleaved pulses processors. We believe this superior performance was a result of (1) the substantial release from channel interactions provided by nonsimultaneous stimuli and (2) a fast enough rotation among the channels to support adequate temporal and spectral resolution of perceived speech sounds.
TL;DR: The authors propose a text-to-speech synthesis method based on automatic synthesis unit generation techniques using a natural speech database; the generated units are more consistent than those obtained through other methods, with the result that more intelligible speech can be reconstructed.
Abstract: The authors propose a text-to-speech synthesis method based on automatic synthesis unit generation techniques using a natural speech database. They have termed the automatic procedure context oriented clustering (COC). Using the COC procedure, 627 phonetic synthesis units were generated automatically based on 432 words uttered by a male speaker. This systematic approach has several advantages. First, as synthesis units can be generated automatically without any a priori phonological knowledge, it is easy to change the number of units and voices. Second, following from this, the technique can be applied to any language. Third, the generation of allophonic synthesis units is not dependent on human decisions but on the statistical characteristics of spectral parameters in natural speech. Thus, the generated units are more consistent than those obtained through other methods, with the result that more intelligible speech can be reconstructed.
TL;DR: In this paper, a new approach to the extraction of formant information from speech signals is presented, which exploits the additive and high resolution properties of group delay functions to resolve even closely spaced formants.
Abstract: A new approach to the extraction of formant information from speech signals is presented. The method exploits the additive and high resolution properties of group delay functions to resolve even closely spaced formants. The group delay function (or the negative derivative of Fourier transform phase) is derived for a minimum phase signal, which in turn is derived from the Fourier transform magnitude of the speech signal. The method is shown to give highly consistent estimation of formants without resorting to any modelling approach for smoothing the magnitude spectrum.
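The chain described above (magnitude spectrum, then minimum-phase signal, then group delay) can be sketched with the cepstrum. This is only an illustration under assumptions and not the authors' implementation: it uses the standard result that folding the real cepstrum gives the complex cepstrum of the minimum-phase equivalent, and that the group delay is then the real part of the Fourier transform of that cepstrum weighted by its index. The function name `min_phase_group_delay` and the FFT size are hypothetical.

```python
import numpy as np

def min_phase_group_delay(frame, n_fft=1024):
    """Group delay of the minimum-phase signal implied by the frame's magnitude spectrum.

    n_fft is assumed to be even and at least as long as the frame.
    """
    magnitude = np.abs(np.fft.fft(frame, n=n_fft))
    log_magnitude = np.log(magnitude + 1e-12)  # small offset avoids log(0)
    real_cepstrum = np.fft.ifft(log_magnitude).real

    # Fold the real cepstrum onto non-negative quefrencies (minimum-phase equivalent).
    folded = np.zeros(n_fft)
    folded[0] = real_cepstrum[0]
    folded[1:n_fft // 2] = 2.0 * real_cepstrum[1:n_fft // 2]
    folded[n_fft // 2] = real_cepstrum[n_fft // 2]

    # Negative derivative of the phase: sum over n of n * c[n] * cos(w * n).
    n = np.arange(n_fft)
    group_delay = np.fft.fft(n * folded).real
    return group_delay[: n_fft // 2 + 1]  # bins from 0 to the Nyquist frequency
```

Because the phases of cascaded resonators add, their group delays add as well, so each formant shows up as its own peak rather than being merged as can happen in a smoothed magnitude spectrum.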
TL;DR: It has been found that, when SIVP is combined with scalar quantization, the LPC parameter bit rate required to achieve high-quality synthetic speech is only 1300 b/s, and with vector quantization the bit rate can be reduced even further (to 1000 b/s) without introducing any perceivable quantization noise in the reconstructed speech.
Abstract: An efficient, low-complexity method called switched-adaptive interframe vector prediction (SIVP) has been developed for linear predictive coding (LPC) of spectral parameters in the development of low-bit-rate speech coding systems. SIVP utilizes vector linear prediction to exploit the high frame-to-frame redundancy present in the successive frames of LPC parameters. When SIVP is combined with scalar quantization, it has been found that the LPC parameter bit rate required to achieve high-quality synthetic speech is only 1300 b/s. With vector quantization, the bit rate can be reduced even further (to 1000 b/s) without introducing any perceivable quantization noise in the reconstructed speech.
TL;DR: It is pointed out that in the analysis of transient signals such as those encountered in speech, or in certain kinds of image processing, standard Fourier analysis is often unsatisfactory because the basis functions of the Fourier analysis extend over infinite time, whereas the signals to be analyzed are short-time transients.
Abstract: It is pointed out that in the analysis of transient signals such as those encountered in speech, or in certain kinds of image processing, standard Fourier analysis is often unsatisfactory because the basis functions of the Fourier analysis (sines, cosines, complex exponentials) extend over infinite time, whereas the signals to be analyzed are short-time transients. Reference is made to a method for dealing with transient signals which has recently appeared in the literature. The basis functions are referred to as wavelets, and they utilize time compression (or dilation) rather than a variation of frequency of the modulated sinusoid. Hence, all the wavelets have the same number of cycles. The analyzing wavelets must satisfy a few simple conditions, but are not otherwise specified. There is a wide latitude in the choice of these functions and they can be tailored to specific applications. The wavelets are founded on rigorous mathematical theory, and the expansions are robust. They are applied to detect ventricular delayed potentials (VLP) in the electrocardiogram.
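The remark that all wavelets have the same number of cycles can be checked with a small sketch. The Morlet-like prototype below (a Gaussian envelope times a cosine) is an assumed choice made only for illustration, since the abstract leaves the analyzing wavelet unspecified; the function names and the support of plus or minus four scale units are likewise assumptions.

```python
import numpy as np

def mother_wavelet(u):
    """Morlet-like prototype: Gaussian envelope times a cosine (an assumed choice)."""
    return np.exp(-0.5 * u ** 2) * np.cos(5.0 * u)

def dilated_wavelet(t, scale):
    """Dilate (scale > 1) or compress (scale < 1) the prototype in time."""
    return mother_wavelet(t / scale) / np.sqrt(scale)

def count_zero_crossings(values):
    signs = np.sign(values)
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

def crossings_at_scale(scale, n_points=2001):
    """Count sign changes of the wavelet over its effective support at this scale."""
    t = np.linspace(-4.0 * scale, 4.0 * scale, n_points)
    return count_zero_crossings(dilated_wavelet(t, scale))
```

Calling `crossings_at_scale` with scales such as 0.5, 1.0 and 2.0 gives the same count each time: changing the scale stretches or squeezes the waveform in time without adding or removing oscillations, unlike a fixed window modulated by sinusoids of different frequencies.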
TL;DR: The results of informal listening tests indicate that this system can achieve high quality for both clean and noisy speech, as the MBE speech is extremely robust to the presence of background noise in speech.
Abstract: A speech model, referred to as the multiband excitation (MBE) speech model, has been shown to be capable of synthesizing speech without the artifacts common to model-based speech systems and has been used to develop a 4.8 kb/s speech coder. This system was developed using several new approaches to quantize the MBE model parameters. These techniques were designed to utilize additional redundancy amongst these parameters, thereby permitting more efficient quantization. The results of informal listening tests indicate that this system can achieve high quality for both clean and noisy speech, as the MBE speech is extremely robust to the presence of background noise in speech.
TL;DR: Results indicated that amplitude processing was associated with 10% to 12% improvement in intelligibility at the lower level but failed to yield any significant effect at the high level of presentation, while increasing consonant duration provided no benefit at the low level but gave modest benefit at 95 dB.
Abstract: This study reports the effects on speech intelligibility of two types of digital speech processing: amplitude enhancement of consonants to produce near-zero consonant/vowel intensity ratios and inc...
TL;DR: Evidence is reviewed that human infants and non-human animals exhibit auditory perceptual categories that conform to those defined by the phonetic categories of language, suggesting the possibility that in evolutionary history the ability to perceive rudimentary speech categories preceded the ability to produce articulate speech.
Abstract: Among topics related to the evolution of language, the evolution of speech is particularly fascinating. Early theorists believed that it was the ability to produce articulate speech that set the stage for the evolution of the «special» speech processing abilities that exist in modern-day humans. Prior to the evolution of speech production, speech processing abilities were presumed not to exist. The data reviewed here support a different view. Two lines of evidence, one from young human infants and the other from infrahuman species, neither of whom can produce articulate speech, show that in the absence of speech production capabilities, the perception of speech sounds is robust and sophisticated. Human infants and non-human animals evidence auditory perceptual categories that conform to those defined by the phonetic categories of language. These findings suggest the possibility that in evolutionary history the ability to perceive rudimentary speech categories preceded the ability to produce articulate speech. This in turn suggests that it may be audition that structured, at least initially, the formation of phonetic categories.
TL;DR: The vowel-separation test, and subjective listening, suggest that harmonic selection, which is the more computationally expensive method, produces the more effective voice separation.
Abstract: Two signal‐processing algorithms, designed to separate the voiced speech of two talkers speaking simultaneously at similar intensities in a single channel, were compared and evaluated. Both algorithms exploit the harmonic structure of voiced speech and require a difference in fundamental frequency (F0) between the voices to operate successfully. One attenuates the interfering voice by filtering the cepstrum of the combined signal. The other uses the method of harmonic selection [T. W. Parsons, J. Acoust. Soc. Am. 60, 911–918 (1976)] to resynthesize the target voice from fragmentary spectral information. Two perceptual evaluations were carried out. One involved the separation of pairs of vowels synthesized on static F0’s; the other involved the recovery of consonant–vowel (CV) words masked by a synthesized vowel. Normal‐hearing listeners and four listeners with moderate‐to‐severe, bilateral, symmetrical, sensorineural hearing impairments were tested. All listeners showed increased accuracy of identificatio...
TL;DR: A microprocessor-based real-time speech recognition system is described that is able to produce orthographic transcriptions for arbitrary words or phrases uttered in Finnish or Japanese and can also be used as a large-vocabulary isolated word recognizer.
Abstract: A microprocessor-based real-time speech recognition system is described. It is able to produce orthographic transcriptions for arbitrary words or phrases uttered in Finnish or Japanese. It can also be used as a large-vocabulary isolated word recognizer. The acoustic processor of the system transcribing speech into phonemes is based on neural network principles. The so-called phonotopic maps constructed by a self-organizing process are employed. The coarticulation effects in phonetic transcriptions are compensated by means of automatically derived rules which describe the morphology of errors at the acoustic processor output. Without applying any language model, the recognition result is correct up to 92 or even 97 per cent referring to individual letters.
TL;DR: Quality assessment methodologies for speech waveform coding, source coding, and speech synthesis by rule from the viewpoints of naturalness and intelligibility are reviewed.
Abstract: The concept of speech quality assessment is examined. Quality assessment methodologies for speech waveform coding, source coding, and speech synthesis by rule from the viewpoints of naturalness and intelligibility are reviewed. Both subjective and objective measures are considered.
TL;DR: A speech processing system having an encoder comprising apparatus for receiving successive samples of PCM (pulse code modulated) encoded speech signals and apparatus for applying sequential groups of the PCM encoded speech signals as primary vector signals to an encoder code book memory for selecting code words stored in the memory most closely approximating the vector signals.
Abstract: A speech processing system having an encoder comprising apparatus for receiving successive samples of PCM (pulse code modulated) encoded speech signals and apparatus for applying sequential groups of the PCM encoded speech signals as primary vector signals to an encoder code book memory for selecting code words stored in the memory most closely approximating the vector signals. Apparatus is included for outputting to an output line the selected code words at a first bit rate. Further apparatus converts the selected code words to converted vector signals. The primary vector signals and converted vector signals are compared and difference signals result. The difference signals are quantized and quantized error signals are provided thereby. The quantized error signals are applied to the output line at the first or at a second bit rate to enable the processing system to transmit the speech signals at an effective bit rate constituting double the first bit rate or the sum of the first and second bit rate, or the quantized error signals are inhibited from being applied to the line. The processing system can thereby transmit the speech signals at an effective rate of the first bit rate, thus increasing the traffic carrying capacity of the line.
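The two-stage arrangement described above (a code book search followed by optional quantization of the difference signal) can be sketched as follows. This is a hedged illustration rather than the patented apparatus: the abstract does not say how the difference signals are quantized, so the second code book used here is an assumption, as are the squared-error distance and all function and parameter names.

```python
import numpy as np

def nearest_codeword(vector, codebook):
    """Return the index and value of the code word closest to the vector."""
    distances = np.sum((codebook - vector) ** 2, axis=1)
    index = int(np.argmin(distances))
    return index, codebook[index]

def encode_two_stage(vector, primary_codebook, error_codebook, send_error=True):
    """Encode a vector; the error index is None when the second stage is inhibited."""
    primary_index, approximation = nearest_codeword(vector, primary_codebook)
    if not send_error:
        return primary_index, None  # lower effective bit rate, more line capacity
    difference = vector - approximation
    error_index, _ = nearest_codeword(difference, error_codebook)
    return primary_index, error_index

def decode_two_stage(primary_index, error_index, primary_codebook, error_codebook):
    """Rebuild the vector from the primary code word plus the optional correction."""
    decoded = primary_codebook[primary_index].copy()
    if error_index is not None:
        decoded = decoded + error_codebook[error_index]
    return decoded
```

Sending both indices costs more bits per vector but reduces the reconstruction error, while inhibiting the error index corresponds to transmitting at the first bit rate only.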
TL;DR: The possibility of undoing the effects of coarticulation is the major contribution of this work, and the identification of corrected targets is therefore possible with no further contextual rules.
Abstract: The automatic recognition of continuous speech may use a symbolic representation of the acoustic signal in order to facilitate lexical access. The allophones of the language form a practical set of symbols. A major issue is a reliable localisation of these units in the speech stream and their identification. Localisation is obtained using a robust implementation of temporal decomposition, a technique originally proposed by Atal (1983), for speech coding. Speech is decomposed in terms of overlapping events characterized by both a spectral target and a time-limited interpolation function. An undershot target may be reestimated using neighbours and the associated functions. The possibility of undoing the effects of coarticulation is the major contribution of this work. The identification of these corrected targets is therefore possible with no further contextual rules. The recognition of spelled surnames (letters of the alphabet) is used for evaluation. 76% of correct phones allow 70% of correct letters.
TL;DR: A set of iterative speech enhancement techniques using spectral constraints is extended and evaluated to determine their usefulness as preprocessors for recognition in extremely noisy environments in the vicinity of 0 dB SNR.
Abstract: A set of iterative speech enhancement techniques using spectral constraints is extended and evaluated. The approaches apply inter- and intraframe spectral constraints to ensure optimum speech quality across all classes of speech. Constraints are applied on the basis of the presence of perceptually important speech characteristics found during the enhancement procedure. Results show improvement over past techniques for additive white noise distortions. Three points are addressed in the present study. First, a convenient and consistent terminating point for the iterative technique is presented which was previously unavailable. Second, the techniques have been generalized to allow for slowly varying, colored noise. Finally, a comparative evaluation has been performed to determine their usefulness as preprocessors for recognition in extremely noisy environments in the vicinity of 0 dB SNR.
TL;DR: In this paper, a plurality of processors, including template processors, are used for automatic speech recognition, in which there are stored templates representative of both speech and non-speech sounds.
Abstract: An apparatus for automatic speech recognition includes a plurality of processors, including template processors in which there are stored templates representative of both speech and non-speech sounds. Incoming sounds are continuously converted into digital signals in respective frames representative of speech and non-speech sounds, respectively. Sequences of such frames are compared with both the speech and non-speech templates to determine the closest matches. Endpoints of respective speech utterances are determined in response to the detection of respective non-speech-speech-non-speech sequences, whereupon such speech utterances are processed to recognize the same.
TL;DR: A low cost speech recognition system generates frames of received speech having binary feature components. The received speech frames are compared with reference templates, and error values representing the difference between the received speech and the reference templates are generated.
Abstract: A low cost speech recognition system generates frames of received speech having binary feature components. The received speech frames are compared (18) with reference templates (22), and error values representing the difference between the received speech and the reference templates (22) are generated. At the end of an utterance, if one template resulted in a sufficiently small error value, the word represented by that template is selected (26) as the recognized word.
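The compare-and-select step described above can be sketched as follows. This is a minimal illustration under assumptions: the abstract does not define the error value, so counting mismatched binary features (a Hamming distance) is a guess, and the sketch assumes the utterance and each template have the same number of frames, ignoring any time alignment. The function name `recognize_word` and the `max_error` parameter are hypothetical.

```python
import numpy as np

def recognize_word(utterance, templates, max_error):
    """Return the word whose template differs least from the utterance, or None.

    utterance: 2-D array of 0/1 features, one row per frame.
    templates: dict mapping each word to an array of the same layout.
    max_error: largest mismatch count accepted as a recognition (assumed rule).
    """
    best_word, best_error = None, None
    for word, template in templates.items():
        if template.shape != utterance.shape:
            continue  # this sketch performs no time alignment
        error = int(np.count_nonzero(utterance != template))
        if best_error is None or error < best_error:
            best_word, best_error = word, error
    if best_error is not None and best_error <= max_error:
        return best_word
    return None  # no template was close enough, so nothing is recognized
```

Binary features keep the comparison down to counting mismatches, which fits the low-cost aim, and the rejection threshold stops a poor match from being reported as a word.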