TL;DR: Correlations of several objective measures with three subjective rating scales (signal distortion, noise distortion, and overall quality) are evaluated, and several new composite objective measures are proposed by combining the individual objective measures using nonparametric and parametric regression analysis techniques.
Abstract: In this paper, we evaluate the performance of several objective measures in terms of predicting the quality of noisy speech enhanced by noise suppression algorithms. The objective measures considered a wide range of distortions introduced by four types of real-world noise at two signal-to-noise ratio levels by four classes of speech enhancement algorithms: spectral subtractive, subspace, statistical-model based, and Wiener algorithms. The subjective quality ratings were obtained using the ITU-T P.835 methodology designed to evaluate the quality of enhanced speech along three dimensions: signal distortion, noise distortion, and overall quality. This paper reports on the evaluation of correlations of several objective measures with these three subjective rating scales. Several new composite objective measures are also proposed by combining the individual objective measures using nonparametric and parametric regression analysis techniques.
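As a rough illustration of the kind of evaluation described above, the following sketch computes a Pearson correlation between one objective measure and one subjective rating scale. The function name `pearson` and all score values are hypothetical and are not taken from the paper, whose exact figures of merit are not stated in the abstract.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-condition scores (illustrative values, not data from the paper).
objective_scores = [1.8, 2.1, 2.6, 3.0, 3.4]   # one objective measure
overall_quality = [2.0, 2.2, 2.9, 3.1, 3.8]    # mean subjective overall-quality rating

r = pearson(objective_scores, overall_quality)
print(f"correlation with overall quality: {r:.3f}")
```

The same function could be applied to each of the three rating scales in turn; a composite measure would replace `objective_scores` with the output of a regression over several measures.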
TL;DR: This article reviews Microphone Array Signal Processing by Jacob Benesty, Jingdong Chen, and Yiteng Huang, Berlin, 2008.
Abstract: This article reviews Microphone Array Signal Processing by Jacob Benesty, Jingdong Chen, and Yiteng Huang, Berlin, 2008. 240 pp. Price: $119 (hardcover). ISBN: 3540786112
TL;DR: A novel estimation algorithm is presented that demonstrates high accuracy on a variety of databases and studies the impact of the maximum approximation in training and transcription, the interaction of model size parameters, n-best list generation, confidence measures, and phoneme-to-grapheme conversion.
TL;DR: It is proposed that the function of the STS varies depending on the nature of network coactivations with different regions in the frontal cortex and medial-temporal lobe, more in keeping with the notion that the same brain region can support different cognitive operations depending on task-dependent network connections.
Abstract: The superior temporal sulcus (STS) is the chameleon of the human brain. Several research areas claim the STS as the host brain region for their particular behavior of interest. Some see it as one of the core structures for theory of mind. For others, it is the main region for audiovisual integration. It plays an important role in biological motion perception, but is also claimed to be essential for speech processing and processing of faces. We review the foci of activations in the STS from multiple functional magnetic resonance imaging studies, focusing on theory of mind, audiovisual integration, motion processing, speech processing, and face processing. The results indicate a differentiation of the STS region in an anterior portion, mainly involved in speech processing, and a posterior portion recruited by cognitive demands of all these different research areas. The latter finding argues against a strict functional subdivision of the STS. In line with anatomical evidence from tracer studies, we propose that the function of the STS varies depending on the nature of network coactivations with different regions in the frontal cortex and medial-temporal lobe. This view is more in keeping with the notion that the same brain region can support different cognitive operations depending on task-dependent network connections, emphasizing the role of network connectivity analysis in neuroimaging.
TL;DR: This contribution presents a recently collected database of spontaneous emotional speech in German which is being made available to the research community and provides emotion labels for a great part of the data.
Abstract: The lack of publicly available annotated databases is one of the major barriers to research advances on emotional information processing. In this contribution we present a recently collected database of spontaneous emotional speech in German which is being made available to the research community. The database consists of 12 hours of audio-visual recordings of the German TV talk show “Vera am Mittag”, segmented into broadcasts, dialogue acts and utterances. This corpus contains spontaneous and very emotional speech recorded from unscripted, authentic discussions between the guests of the talk show. In addition to the audio-visual data and the segmented utterances we provide emotion labels for a great part of the data. The emotion labels are given on a continuous valued scale for three emotion primitives: valence, activation and dominance, using a large number of human evaluators. Such data is of great interest to all research groups working on spontaneous speech analysis, emotion recognition in both speech and facial expression, natural language understanding, and robust speech recognition.
TL;DR: This paper explored whether Spanish-learning children's early experiences with language predict efficiency in real-time comprehension and vocabulary learning and found that the influences of caregiver speech on speed of word recognition and vocabulary were largely overlapping.
Abstract: It is well established that variation in caregivers' speech is associated with language outcomes, yet little is known about the learning principles that mediate these effects. This longitudinal study (n = 27) explores whether Spanish-learning children's early experiences with language predict efficiency in real-time comprehension and vocabulary learning. Measures of mothers' speech at 18 months were examined in relation to children's speech processing efficiency and reported vocabulary at 18 and 24 months. Children of mothers who provided more input at 18 months knew more words and were faster in word recognition at 24 months. Moreover, multiple regression analyses indicated that the influences of caregiver speech on speed of word recognition and vocabulary were largely overlapping. This study provides the first evidence that input shapes children's lexical processing efficiency and that vocabulary growth and increasing facility in spoken word comprehension work together to support the uptake of the information that rich input affords the young language learner.
TL;DR: It is shown how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream by exploiting the structure of repeating patterns within the speech signal.
Abstract: We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.
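The abstract does not name the dynamic programming technique it builds on. Assuming it is dynamic time warping (DTW), the sketch below shows the classic whole-sequence form, not the segmental variant the authors describe. The one-dimensional toy sequences stand in for acoustic feature vectors, and all names are hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW alignment cost between two 1-D feature sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local frame-to-frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Toy "utterances": the first two share a pattern at different speaking rates.
same_word_slow = [1, 1, 2, 3, 3, 2, 1]
same_word_fast = [1, 2, 3, 2, 1]
other_word = [3, 3, 1, 1, 3]

print(dtw_distance(same_word_slow, same_word_fast))
print(dtw_distance(same_word_slow, other_word))
```

A low alignment cost marks two stretches of audio as candidate repetitions of the same pattern; a segmental variant would search for such low-cost alignments between subsequences rather than whole utterances.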
TL;DR: The looking-while-listening methodology uses real-time measures of the time course of young children's gaze patterns in response to speech; because gaze is coded frame-by-frame, response latencies in each 5-min experiment can be coded with millisecond precision on multiple trials over multiple items, based on data from thousands of frames.
Abstract: The “looking-while-listening” methodology uses real-time measures of the time course of young children’s gaze patterns in response to speech. This procedure is low in task demands and does not require automated eyetracking technology, similar to “preferential-looking” procedures. However, the looking-while-listening methodology differs critically from preferential-looking procedures in the methods used for data reduction and analysis, yielding high-resolution measures of speech processing from moment to moment, rather than relying on summary measures of looking preference. Because children’s gaze patterns are time-locked to speech and coded frame-by-frame, response latencies in each 5-min experiment can be coded with millisecond precision on multiple trials over multiple items, based on data from thousands of frames in each experiment. The meticulous procedures required in the collection, reduction, and multiple levels of analysis of such detailed data are demanding, but well worth the effort, revealing a dynamic and nuanced picture of young children’s developing skill in finding meaning in spoken language.
TL;DR: It is shown that in the context of noise reduction the squared PCC has many appealing properties and can be used as an optimization cost function to derive many optimal and suboptimal noise-reduction filters.
Abstract: Noise reduction, which aims at estimating a clean speech from noisy observations, has attracted a considerable amount of research and engineering attention over the past few decades. In the single-channel scenario, an estimate of the clean speech can be obtained by passing the noisy signal picked up by the microphone through a linear filter/transformation. The core issue, then, is how to find an optimal filter/transformation such that, after the filtering process, the signal-to-noise ratio (SNR) is improved but the desired speech signal is not noticeably distorted. Most of the existing optimal filters (such as the Wiener filter and subspace transformation) are formulated from the mean-square error (MSE) criterion. However, with the MSE formulation, many desired properties of the optimal noise-reduction filters such as the SNR behavior cannot be seen. In this paper, we present a new criterion based on the Pearson correlation coefficient (PCC). We show that in the context of noise reduction the squared PCC (SPCC) has many appealing properties and can be used as an optimization cost function to derive many optimal and suboptimal noise-reduction filters. The clear advantage of using the SPCC over the MSE is that the noise-reduction performance (in terms of the SNR improvement and speech distortion) of the resulting optimal filters can be easily analyzed. This shows that, as far as noise reduction is concerned, the SPCC-based cost function serves as a more natural criterion to optimize as compared to the MSE.
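To make the link between the SPCC and the SNR concrete, here is a short worked derivation for the unprocessed noisy signal. It assumes an additive model $y = x + v$ with zero-mean clean speech $x$ and noise $v$ that are uncorrelated; the symbols $\sigma_x^2$, $\sigma_v^2$, and $\rho$ are notation chosen for this sketch and may differ from the paper's.

```latex
\rho^2(x, y)
  = \frac{E^2[xy]}{E[x^2]\,E[y^2]}
  = \frac{\sigma_x^4}{\sigma_x^2\,(\sigma_x^2 + \sigma_v^2)}
  = \frac{\mathrm{SNR}}{1 + \mathrm{SNR}},
\qquad
\mathrm{SNR} = \frac{\sigma_x^2}{\sigma_v^2}.
```

Under these assumptions the SPCC between clean and noisy speech is a monotonically increasing function of the SNR, which suggests why a correlation-based criterion exposes SNR behavior more directly than the MSE does.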
TL;DR: It is argued that moving forward will require abandoning the dichotomy argument in favour of a more integrated approach, and that there is now extensive evidence that spectral and temporal acoustical properties predict the relative specialization of right and left auditory cortices.
Abstract: The idea that speech processing relies on unique, encapsulated, domain-specific mechanisms has been around for some time. Another well-known idea, often espoused as being in opposition to the first proposal, is that processing of speech sounds entails general-purpose neural mechanisms sensitive to the acoustic features that are present in speech. Here, we suggest that these dichotomous views need not be mutually exclusive. Specifically, there is now extensive evidence that spectral and temporal acoustical properties predict the relative specialization of right and left auditory cortices, and that this is a parsimonious way to account not only for the processing of speech sounds, but also for non-speech sounds such as musical tones. We also point out that there is equally compelling evidence that neural responses elicited by speech sounds can differ depending on more abstract, linguistically relevant properties of a stimulus (such as whether it forms part of one's language or not). Tonal languages provide a particularly valuable window to understand the interplay between these processes. The key to reconciling these phenomena probably lies in understanding the interactions between afferent pathways that carry stimulus information, with top-down processing mechanisms that modulate these processes. Although we are still far from the point of having a complete picture, we argue that moving forward will require us to abandon the dichotomy argument in favour of a more integrated approach.
TL;DR: An overview is provided of ERPs frequently used to examine the processing of speech and other sound stimuli, which include the P1–N1–P2 complex, acoustic change complex, mismatch negativity, and P3 responses.
Abstract: Speech-evoked auditory event-related potentials (ERPs) provide insight into the neural mechanisms underlying speech processing. For this reason, ERPs are of great value to hearing scientists and audiologists. This article will provide an overview of ERPs frequently used to examine the processing of speech and other sound stimuli. These ERPs include the P1-N1-P2 complex, acoustic change complex, mismatch negativity, and P3 responses. In addition, we focus on the application of these speech-evoked potentials for the assessment of (1) the effects of hearing loss on the neural encoding of speech allowing for behavioral detection and discrimination; (2) improvements in the neural processing of speech with amplification (hearing aids, cochlear implants); and (3) the impact of auditory training on the neural processing of speech. Studies in these three areas are reviewed and implications for audiologists are discussed.
TL;DR: A speech processing system is described that exploits statistical modeling and formal logic to receive and process speech input, which may represent data to be received, such as dictation, or commands to be processed by an operating system, application or process.
Abstract: A speech processing system which exploits statistical modeling and formal logic to receive and process speech input, which may represent data to be received, such as dictation, or commands to be processed by an operating system, application or process. A command dictionary and dynamic grammars are used in processing speech input to identify, disambiguate and extract commands. The logical processing scheme ensures that putative commands are complete and unambiguous before processing. Context sensitivity may be employed to differentiate data and commands. A multi faceted graphic user interface may be provided for interaction with a user to speech enable interaction with applications and processes that do not necessarily have native support for speech input.
TL;DR: Right-hemisphere auditory cortex was 100% more accurate in following contours of the speech envelope and had a 33% larger response magnitude while following the envelope compared with the left hemisphere, providing evidence that the right hemisphere plays a specific and important role in speech processing and supporting the hypothesis that acoustic processing of speech involves the decomposition of the signal into constituent temporal features by rate-specialized neurons in right- and left-hemisphere auditory cortex.
Abstract: Cortical analysis of speech has long been considered the domain of left-hemisphere auditory areas. A recent hypothesis poses that cortical processing of acoustic signals, including speech, is mediated bilaterally based on the component rates inherent to the speech signal. In support of this hypothesis, previous studies have shown that slow temporal features (3-5 Hz) in nonspeech acoustic signals lateralize to right-hemisphere auditory areas, whereas rapid temporal features (20-50 Hz) lateralize to the left hemisphere. These results were obtained using nonspeech stimuli, and it is not known whether right-hemisphere auditory cortex is dominant for coding the slow temporal features in speech known as the speech envelope. Here we show strong right-hemisphere dominance for coding the speech envelope, which represents syllable patterns and is critical for normal speech perception. Right-hemisphere auditory cortex was 100% more accurate in following contours of the speech envelope and had a 33% larger response magnitude while following the envelope compared with the left hemisphere. Asymmetries were evident regardless of the ear of stimulation despite dominance of contralateral connections in ascending auditory pathways. Results provide evidence that the right hemisphere plays a specific and important role in speech processing and support the hypothesis that acoustic processing of speech involves the decomposition of the signal into constituent temporal features by rate-specialized neurons in right- and left-hemisphere auditory cortex.
TL;DR: Increased accuracy in pitch tracking after training is found, including a decrease in the number of pitch-tracking errors and a refinement in the energy devoted to encoding pitch, as native English-speaking adults learn to incorporate foreign speech sounds in word identification.
Abstract: Peripheral and central structures along the auditory pathway contribute to speech processing and learning. However, because speech requires the use of functionally and acoustically complex sounds which necessitates high sensory and cognitive demands, long-term exposure and experience using these sounds is often attributed to the neocortex with little emphasis placed on subcortical structures. The present study examines changes in the auditory brainstem, specifically the frequency following response (FFR), as native English-speaking adults learn to incorporate foreign speech sounds (lexical pitch patterns) in word identification. The FFR presumably originates from the auditory midbrain and can be elicited preattentively. We measured FFRs to the trained pitch patterns before and after training. Measures of pitch tracking were then derived from the FFR signals. We found increased accuracy in pitch tracking after training, including a decrease in the number of pitch-tracking errors and a refinement in the energy devoted to encoding pitch. Most interestingly, this change in pitch-tracking accuracy only occurred in the most acoustically complex pitch contour (dipping contour), which is also the least familiar to our English-speaking subjects. These results not only demonstrate the contribution of the brainstem in language learning and its plasticity in adulthood but also demonstrate the specificity of this contribution (i.e., changes in encoding only occur in specific, least familiar stimuli, not all stimuli). Our findings complement existing data showing cortical changes after second-language learning, and are consistent with models suggesting that brainstem changes resulting from perceptual learning are most apparent when acuity in encoding is most needed.
TL;DR: A database of dysarthric speech produced by 19 speakers with cerebral palsy provides a fundamental resource for automatic speech recognition development for people with neuromotor disability.
Abstract: This paper describes a database of dysarthric speech produced by 19 speakers with cerebral palsy. Speech materials consist of 765 isolated words per speaker: 300 distinct uncommon words and 3 repetitions of digits, computer commands, radio alphabet and common words. Data is recorded through an 8-microphone array and one digital video camera. Our database provides a fundamental resource for automatic speech recognition development for people with neuromotor disability. Research on articulation errors in dysarthria will benefit clinical treatments and contribute to our knowledge of neuromotor mechanisms in speech production. Data files are available via secure ftp upon request.
TL;DR: Direct evidence to the often implied functional distinction of the two cerebral hemispheres in speech processing is supplied: increases in the temporal detail of the signal were most effective in driving brain activation of the left anterolateral superior temporal sulcus, whereas the right homolog areas exhibited stronger sensitivity to the variations in spectral detail.
Abstract: Speech comprehension has been shown to be a strikingly bilateral process, but the differential contributions of the subfields of left and right auditory cortices have remained elusive. The hypothesis that left auditory areas engage predominantly in decoding fast temporal perturbations of a signal whereas the right areas are relatively more driven by changes of the frequency spectrum has not been directly tested in speech or music. This brain-imaging study independently manipulated the speech signal itself along the spectral and the temporal domain using noise-band vocoding. In a parametric design with five temporal and five spectral degradation levels in word comprehension, a functional distinction of the left and right auditory association cortices emerged: increases in the temporal detail of the signal were most effective in driving brain activation of the left anterolateral superior temporal sulcus (STS), whereas the right homolog areas exhibited stronger sensitivity to the variations in spectral detail. In accordance with behavioral measures of speech comprehension acquired in parallel, change of spectral detail exhibited a stronger coupling with the STS BOLD signal. The relative pattern of lateralization (quantified using lateralization quotients) proved reliable in a jack-knifed iterative reanalysis of the group functional magnetic resonance imaging model. This study supplies direct evidence to the often implied functional distinction of the two cerebral hemispheres in speech processing. Applying direct manipulations to the speech signal rather than to low-level surrogates, the results lend plausibility to the notion of complementary roles for the left and right superior temporal sulci in comprehending the speech signal.
TL;DR: Analysis of the discriminating feature sets used in the study clearly indicates that glottal descriptors are vital components of vocal affect analysis.
Abstract: This work is motivated by the current lack of objective tools for clinical analysis of emotional disorders. This study involves the examination of a large breadth of objectively measurable features for use in discriminating depressed speech. Analysis is based on features related to prosodics, the vocal tract, and parameters extracted directly from the glottal waveform. Discrimination of the depressed speech was based on a feature selection strategy utilizing the following combinations of feature domains: prosodic measures alone, prosodic and vocal tract measures, prosodic and glottal measures, and all three domains. The combination of glottal and prosodic features produced better discrimination overall than the combination of prosodic and vocal tract features. Analysis of the discriminating feature sets used in the study clearly indicates that glottal descriptors are vital components of vocal affect analysis.
TL;DR: This tutorial examines the problem area, its methods, successes and failures, focusing on the nature of the speech signal and techniques to accomplish useful data reduction, and compares it with other areas of PR.
TL;DR: In this article, improved capabilities for a mobile environment speech processing facility are described for the entering of text into a software application resident on a mobile communication facility, where recorded speech may be presented by the user using the mobile communications facility's resident capture facility.
Abstract: In embodiments of the present invention improved capabilities are described for a mobile environment speech processing facility. The present invention may provide for the entering of text into a software application resident on a mobile communication facility, where recorded speech may be presented by the user using the mobile communications facility's resident capture facility. Transmission of the recording may be provided through a wireless communication facility to a speech recognition facility, and may be accompanied by information related to the software application. Results may be generated utilizing the speech recognition facility that may be independent of structured grammar, and may be based at least in part on the information relating to the software application and the recording. The results may then be transmitted to the mobile communications facility, where they may be loaded into the software application.
TL;DR: A new approach is presented for extracting and representing prosodic features directly from the speech signal; a syllable-like unit is chosen as the basic unit for representing the prosodic characteristics.
TL;DR: As a group, hearing-impaired subjects benefited less than normal-hearing subjects from the additional temporal fine structure (TFS) information that was available as the cut-off channel (CO) increased, which may partially explain why subjects with cochlear hearing loss get less benefit from listening in a fluctuating background.
Abstract: Speech reception thresholds (SRTs) were measured with a competing talker background for signals processed to contain variable amounts of temporal fine structure (TFS) information, using nine normal-hearing and nine hearing-impaired subjects. Signals (speech and background talker) were bandpass filtered into channels. Channel signals for channel numbers above a “cut-off channel” (CO) were vocoded to remove TFS information, while channel signals for channel numbers of CO and below were left unprocessed. Signals from all channels were combined. As a group, hearing-impaired subjects benefited less than normal-hearing subjects from the additional TFS information that was available as CO increased. The amount of benefit varied between hearing-impaired individuals, with some showing no improvement in SRT and one showing an improvement similar to that for normal-hearing subjects. The reduced ability to take advantage of TFS information in speech may partially explain why subjects with cochlear hearing loss get less benefit from listening in a fluctuating background than normal-hearing subjects. TFS information may be important in identifying the temporal “dips” in such a background.
TL;DR: A speech recognition method includes a model selection step, which selects a recognition model and translation dictionary information based on characteristic information of input speech, and a speech recognition step, which translates input speech into text data based on the selected recognition model.
Abstract: A speech recognition method includes a model selection step, which selects a recognition model and translation dictionary information based on characteristic information of input speech; a speech recognition step, which translates input speech into text data based on the selected recognition model; and a translation step, which translates the text data based on the selected translation dictionary information.
TL;DR: An electronic device analyzes a voice communication for actionable speech using speech recognition; when actionable speech is detected, the electronic device may carry out a corresponding function, including storing information in a log or presenting one or more programs, services and/or control functions to the user.
Abstract: An electronic device (10, 16) analyzes a voice communication for actionable speech using speech recognition. When actionable speech is detected, the electronic device may carry out a corresponding function, including storing information in a log or presenting one or more programs, services and/or control functions to the user. The actionable speech may be predetermined commands and/or speech patterns that are detected using an expert system as potential command or data input to a program.
TL;DR: Algorithms for synthesizing speech used to identify media assets are presented. The algorithms may be implemented on a system including several dedicated render engines; the system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech.
Abstract: Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back.
TL;DR: In this article, the system includes a voice recognition unit and a speech processing server that work together to enable users to interact with the system using voice commands guided by navigation context sensitive voice prompts, and provide user-requested data in a verbalized format back to the users.
Abstract: A method for providing access to data via a voice interface. In one embodiment, the system includes a voice recognition unit and a speech processing server that work together to enable users to interact with the system using voice commands guided by navigation context sensitive voice prompts, and provide user-requested data in a verbalized format back to the users. Digitized voice waveform data are processed to determine the voice commands of the user. The system also uses a “grammar” that enables users to retrieve data using intuitive natural language speech queries. In response to such a query, a corresponding data query is generated by the system to retrieve one or more data sets corresponding to the query. The user is then enabled to browse the data that are returned through voice command navigation, wherein the system “reads” the data back to the user using text-to-speech (TTS) conversion and system prompts.
TL;DR: This paper builds an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates, and focuses on accent (prominence, or “stress”) and prosodic phrase boundary detection at the syllable level.
Abstract: With the advent of prosody annotation standards such as tones and break indices (ToBI), speech technologists and linguists alike have been interested in automatically detecting prosodic events in speech. This is because the prosodic tier provides an additional layer of information over the short-term segment-level features and lexical representation of an utterance. As the prosody of an utterance is closely tied to its syntactic and semantic content in addition to its lexical content, knowledge of the prosodic events within and across utterances can assist spoken language applications such as automatic speech recognition and translation. On the other hand, corpora annotated with prosodic events are useful for building natural-sounding speech synthesizers. In this paper, we build an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates. Following previous work in this area, we focus on accent (prominence, or “stress”) and prosodic phrase boundary detection at the syllable level. Our experiments achieved a performance rate of 86.75% agreement on the accent detection task, and 91.61% agreement on the phrase boundary detection task on the Boston University Radio News Corpus.
TL;DR: A theoretical analysis that models the number of correctly classified utterances as a hypergeometric random variable enables the derivation of an accurate estimate of the variance of the correct classification rate during cross-validation by employing a fast SFFS variant.
TL;DR: Investigation of a corpus of spoken Dutch consisting of interviews with 160 high-school teachers shows that speech tempo depends mainly on phrase length, due to anticipatory shortening, and on the speaker's country, with different speaking styles in The Netherlands and in Flanders.
Abstract: Speech tempo (articulation rate) varies both between and within speakers. The present study investigates several factors affecting tempo in a corpus of spoken Dutch, consisting of interviews with 160 high-school teachers. Speech tempo was observed for each phrase separately, and analyzed by means of multilevel modeling of the speaker's sex, age, country, and dialect region (between speakers) and length, sequential position of phrase, and autocorrelated tempo (within speakers). Results show that speech tempo in this corpus depends mainly on phrase length, due to anticipatory shortening, and on the speaker's country, with different speaking styles in The Netherlands (faster, less varied) and in Flanders (slower, more varied). Additional analyses showed that phrase length itself is shorter in The Netherlands than in Flanders, and decreases with speaker's age. Older speakers tend to vary their phrase length more (within speakers), perhaps due to their accumulated verbal proficiency.
TL;DR: A fundamental frequency (F(0)) tracking algorithm is presented that is extremely robust for both high quality and telephone speech, at signal to noise ratios ranging from clean speech to very noisy speech.
Abstract: In this paper, a fundamental frequency (F(0)) tracking algorithm is presented that is extremely robust for both high quality and telephone speech, at signal to noise ratios ranging from clean speech to very noisy speech. The algorithm is named "YAAPT," for "yet another algorithm for pitch tracking." The algorithm is based on a combination of time domain processing, using the normalized cross correlation, and frequency domain processing. Major steps include processing of the original acoustic signal and a nonlinearly processed version of the signal, the use of a new method for computing a modified autocorrelation function that incorporates information from multiple spectral harmonic peaks, peak picking to select multiple F(0) candidates and associated figures of merit, and extensive use of dynamic programming to find the "best" track among the multiple F(0) candidates. The algorithm was evaluated by using three databases and compared to three other published F(0) tracking algorithms by using both high quality and telephone speech for various noise conditions. For clean speech, the error rates obtained are comparable to those obtained with the best results reported for any other algorithm; for noisy telephone speech, the error rates obtained are lower than those obtained with other methods.
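The time-domain ingredient named in the abstract above, the normalized cross correlation, can be illustrated with a minimal Python sketch. This is not the YAAPT algorithm itself: it omits the nonlinear processing, the spectral-harmonic autocorrelation, and the dynamic programming stage. The function names (`nccf`, `f0_candidates`), the search range of 60 to 400 Hz, and the number of candidates are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def nccf(frame, window, min_lag, max_lag):
    """Normalized cross-correlation of frame[:window] with lagged segments.

    Hypothetical helper: returns (lags, values), with values in [-1, 1].
    """
    frame = np.asarray(frame, dtype=float)
    if len(frame) < window + max_lag:
        raise ValueError("frame must hold at least window + max_lag samples")
    ref = frame[:window]
    ref_energy = np.dot(ref, ref)
    lags = np.arange(min_lag, max_lag + 1)
    values = np.zeros(len(lags))
    for i, lag in enumerate(lags):
        seg = frame[lag:lag + window]
        denom = np.sqrt(ref_energy * np.dot(seg, seg))
        if denom > 0.0:  # leave 0 for silent segments
            values[i] = np.dot(ref, seg) / denom
    return lags, values

def f0_candidates(frame, sample_rate, window,
                  f0_min=60.0, f0_max=400.0, n_candidates=3):
    """Pick the strongest NCCF peaks as (F0 in Hz, figure of merit) pairs.

    The F0 range and candidate count are illustrative assumptions.
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    lags, values = nccf(frame, window, min_lag, max_lag)
    # Interior local maxima of the NCCF curve.
    peaks = [i for i in range(1, len(values) - 1)
             if values[i] > values[i - 1] and values[i] >= values[i + 1]]
    peaks.sort(key=lambda i: values[i], reverse=True)
    return [(sample_rate / lags[i], values[i]) for i in peaks[:n_candidates]]

# Usage: a synthetic 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = np.sin(2 * np.pi * 200.0 * np.arange(400) / sample_rate)
candidates = f0_candidates(tone, sample_rate, window=240)
```

For a purely periodic tone, the peaks at the true period and at its multiples have nearly equal merit, so the candidate list contains the true F0 together with subharmonics. A tracker would need a further stage to choose among such candidates across frames, which is the role the abstract assigns to dynamic programming.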
TL;DR: Algorithms are presented for synthesizing speech used to identify media assets. The algorithms may be implemented on a system including several dedicated render engines; the system may be part of a back end coupled to a front end that includes storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech.
Abstract: Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back.