TL;DR: This paper overviews emotional speech recognition with three goals in mind, among them to provide an up-to-date record of the available emotional speech data collections and to examine classification techniques that exploit timing information separately from those that ignore it.
TL;DR: An automatic speech recognition system, adapted for use with partially specified inputs, was used to identify consonants in noise; a transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.
Abstract: Do listeners process noisy speech by taking advantage of "glimpses" (spectrotemporal regions in which the target signal is least affected by the background)? This study used an automatic speech recognition system, adapted for use with partially specified inputs, to identify consonants in noise. Twelve masking conditions were chosen to create a range of glimpse sizes. Several different glimpsing models were employed, differing in the local signal-to-noise ratio (SNR) used for detection, the minimum glimpse size, and the use of information in the masked regions. Recognition results were compared with behavioral data. A quantitative analysis demonstrated that the proportion of the time-frequency plane glimpsed is a good predictor of intelligibility. Recognition scores in each noise condition confirmed that sufficient information exists in glimpses to support consonant identification. Close fits to listeners' performance were obtained at two local SNR thresholds: one at around 8 dB and another in the range -5 to -2 dB. A transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.
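As a rough illustration of the "proportion of the time-frequency plane glimpsed" described above, the sketch below counts the fraction of time-frequency cells whose local SNR exceeds a threshold. The function name, the dB-valued input grids, and the omission of a minimum glimpse size are assumptions made for illustration; this is not the study's actual glimpsing model.

```python
def glimpse_proportion(target_db, masker_db, local_snr_db=-5.0):
    """Fraction of time-frequency cells where target level exceeds
    masker level by more than local_snr_db (all values in dB)."""
    total = 0
    glimpsed = 0
    for target_row, masker_row in zip(target_db, masker_db):
        for t, m in zip(target_row, masker_row):
            total += 1
            if t - m > local_snr_db:
                glimpsed += 1
    return glimpsed / total if total else 0.0


# Toy 2x3 grids (rows = frequency channels, columns = time frames).
target = [[60.0, 40.0, 55.0], [30.0, 50.0, 45.0]]
masker = [[50.0, 50.0, 50.0], [50.0, 50.0, 50.0]]
print(glimpse_proportion(target, masker, local_snr_db=-5.0))
print(glimpse_proportion(target, masker, local_snr_db=8.0))
```

Raising the local SNR threshold shrinks the glimpsed area, which is why the choice of threshold matters when fitting listeners' performance.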
TL;DR: In this paper, the authors construct new classes of Parseval frames for a Hilbert space which allow signal reconstruction from the absolute value of the frame coefficients without using phase or its estimation.
TL;DR: This paper studies the quantitative performance behavior of the Wiener filter in the context of noise reduction and shows that in the single-channel case the a posteriori signal-to-noise ratio (SNR) (defined after the Wiener filter) is greater than or equal to the a priori SNR (defined before the Wiener filter), indicating that the Wiener filter is always able to achieve noise reduction.
Abstract: The problem of noise reduction has attracted a considerable amount of research attention over the past several decades. Among the numerous techniques that were developed, the optimal Wiener filter can be considered as one of the most fundamental noise reduction approaches, which has been delineated in different forms and adopted in various applications. Although it is not a secret that the Wiener filter may cause some detrimental effects to the speech signal (appreciable or even significant degradation in quality or intelligibility), few efforts have been reported to show the inherent relationship between noise reduction and speech distortion. By defining a speech-distortion index to measure the degree to which the speech signal is deformed and two noise-reduction factors to quantify the amount of noise being attenuated, this paper studies the quantitative performance behavior of the Wiener filter in the context of noise reduction. We show that in the single-channel case the a posteriori signal-to-noise ratio (SNR) (defined after the Wiener filter) is greater than or equal to the a priori SNR (defined before the Wiener filter), indicating that the Wiener filter is always able to achieve noise reduction. However, the amount of noise reduction is in general proportional to the amount of speech degradation. This may seem discouraging as we always expect an algorithm to have maximal noise reduction without much speech distortion. Fortunately, we show that speech distortion can be better managed in three different ways. If we have some a priori knowledge (such as the linear prediction coefficients) of the clean speech signal, this a priori knowledge can be exploited to achieve noise reduction while maintaining a low level of speech distortion. When no a priori knowledge is available, we can still achieve a better control of noise reduction and speech distortion by properly manipulating the Wiener filter, resulting in a suboptimal Wiener filter. 
When multiple microphone sensors are available, the multiple observations of the speech signal can be used to reduce noise with less or even no speech distortion.
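The claim that the SNR after Wiener filtering is never lower than the SNR before it can be checked numerically with a small sketch. It assumes a per-bin frequency-domain gain H = Ps / (Ps + Pn) with known speech and noise power spectra; this is a simplification chosen for illustration and not necessarily the formulation used in the paper, and the function name and toy spectra are hypothetical.

```python
import random


def wiener_snr(speech_psd, noise_psd):
    """Return (SNR before filtering, SNR after filtering) for a per-bin
    Wiener gain H = Ps / (Ps + Pn) applied to speech and noise alike."""
    gains = [s / (s + n) for s, n in zip(speech_psd, noise_psd)]
    snr_before = sum(speech_psd) / sum(noise_psd)
    speech_out = sum(g * g * s for g, s in zip(gains, speech_psd))
    noise_out = sum(g * g * n for g, n in zip(gains, noise_psd))
    return snr_before, speech_out / noise_out


# Toy spectra: speech power varies strongly across bins, noise is flatter.
random.seed(0)
speech = [random.uniform(0.01, 10.0) for _ in range(64)]
noise = [random.uniform(0.5, 2.0) for _ in range(64)]
snr_before, snr_after = wiener_snr(speech, noise)
print(snr_after >= snr_before)
```

The gain also attenuates the speech itself in low-SNR bins, which is the speech-distortion side of the trade-off the abstract describes.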
TL;DR: The proposed noise-estimation algorithm when integrated in speech enhancement was preferred over other noise-estimation algorithms, indicating that the local minimum estimation algorithm adapts very quickly to highly non-stationary noise environments.
TL;DR: This chapter discusses models of Speech Production and Hearing, performance of the Auditory Organs, and statistical properties of Speech Signals in the DFT Domain.
Abstract: 1 Introduction. 2 Models of Speech Production and Hearing. 2.1 Organs of Speech Production. 2.2 Characteristics of Speech Signals. 2.3 Model of Speech Production. 2.4 Anatomy of Hearing. 2.5 Performance of the Auditory Organs. Bibliography. 3 Spectral Transformations. 3.1 Fourier Transform of Continuous Signals. 3.2 Fourier Transform of Discrete Signals. 3.3 Linear Shift Invariant Systems. 3.4 The z-Transform. 3.5 The Discrete Fourier Transform. 3.6 Fast Convolution. 3.7 Cepstral Analysis. Bibliography. 4 Filter Banks for Spectral Analysis and Synthesis. 4.1 Spectral Analysis Using Narrow-Band Filters. 4.2 Polyphase Network Filter Banks. 4.3 Quadrature Mirror Filter Banks. Bibliography. 5 Stochastic Signals and Estimation. 5.1 Basic Concepts. 5.2 Expectations and Moments. 5.3 Bivariate Statistics. 5.4 Probability and Information. 5.5 Multivariate Statistics. 5.6 Stochastic Processes. 5.7 Estimation of Statistical Quantities by Time Averages. 5.8 Power Spectral Densities. 5.9 Estimation of the Power Spectral Density. 5.10 Statistical Properties of Speech Signals. 5.11 Statistical Properties of DFT Coefficients. 5.12 Optimal Estimation. Bibliography. 6 Linear Prediction. 6.1 Vocal Tract Models and Short-Term Prediction. 6.2 Optimal Prediction Coefficients for Stationary Signals. 6.3 Predictor Adaptation. 6.4 Long-Term Prediction. Bibliography. 7 Quantization. 7.1 Analog Samples and Digital Presentation. 7.2 Uniform Quantization. 7.3 Non-uniform Quantization. 7.4 Optimal Quantization. 7.5 Adaptive Quantization. 7.6 Vector Quantization. 7.6.1 Principle. Bibliography. 8 Speech Coding. 8.1 Classification of Speech Coding Algorithms. 8.2 Model-Based Predictive Coding. 8.3 Differential Waveform Coding. 8.4 Parametric Coding. 8.5 Hybrid Coding. 8.6 Adaptive Postfiltering. Bibliography. 9 Error Concealment and Softbit Decoding. 9.1 Hardbit Source Decoding. 9.2 Conventional Error Concealment. 9.3 Softbits and L-Values. 9.4 Softbit Source Decoding (SD). 9.5 Application to Model Parameters. 9.6 Further Improvements. Bibliography. 10 Bandwidth Extension of Speech Signals (BWE). 10.1 Narrowband versus Wideband Telephony. 10.2 Speech Coding with Integrated BWE. 10.3 BWE without Auxiliary Transmission. Bibliography. 11 Single and Dual Channel Noise Reduction. 11.1 Introduction. 11.2 Linear MMSE Estimators. 11.3 Speech Enhancement in the DFT Domain. 11.4 Optimal Non-Linear Estimators. 11.5 Joint Optimum Detection and Estimation of Speech. 11.6 Computation of Likelihood Ratios. 11.7 Estimation of the A Priori Probability of Speech Presence. 11.8 VAD and Noise Estimation Techniques. 11.9 Dual-Channel Noise Reduction. Bibliography. 12 Multi-Channel Noise Reduction. 12.1 Introduction. 12.2 Spatial Sampling of Sound Fields. 12.3 Beamforming. 12.4 Performance Measures and Spatial Aliasing. 12.5 Design of Fixed Beamformers. 12.6 Adaptive Beamformers. Bibliography. 13 Acoustic Echo Control. 13.1 The Echo Control Problem. 13.2 Evaluation Criteria. 13.3 The Wiener Solution. 13.4 The LMS and NLMS Algorithm. 13.5 Convergence Analysis and Control of the LMS Algorithm. 13.6 Geometric Projection Interpretation of the NLMS Algorithm. 13.7 The Affine Projection Algorithm. 13.8 Least-Squares and Recursive Least-Squares Algorithms. 13.9 Block Processing and Frequency-Domain Adaptive Filters. 13.9.1 Block LMS Algorithm. 13.10 Additional Measures for Echo Control. 13.11 Stereophonic Acoustic Echo Control. A Codec Standards. B Speech Quality Assessment. Bibliography.
TL;DR: Analysis of relationships between infants' early speech processing performance and later language and cognitive outcomes suggests speech segmentation ability is an important prerequisite for successful language development, and measures to detect language impairment at an earlier age offer potential.
Abstract: Two studies examined relationships between infants' early speech processing performance and later language and cognitive outcomes. Study 1 found that performance on speech segmentation tasks before 12 months of age related to expressive vocabulary at 24 months. However, performance on other tasks was not related to 2-year vocabulary. Study 2 assessed linguistic and cognitive skills at 4-6 years of age for children who had participated in segmentation studies as infants. Children who had been able to segment words from fluent speech scored higher on language measures, but not general IQ, as preschoolers. Results suggest that speech segmentation ability is an important prerequisite for successful language development, and they offer potential for developing measures to detect language impairment at an earlier age.
TL;DR: This review outlines historical backgrounds, architecture, underlying principles, and representative applications of STRAIGHT.
Abstract: STRAIGHT, a speech analysis, modification, and synthesis system, is an extension of the classical channel VOCODER that exploits the advantages of progress in information processing technologies and a new conceptualization of the role of repetitive structures in speech sounds. This review outlines historical backgrounds, architecture, underlying principles, and representative applications of STRAIGHT.
TL;DR: The results demonstrate that even when equally informative and discriminable, acoustic cues are not necessarily equally weighted in categorization; listeners exhibit biases when integrating multiple acoustic dimensions.
Abstract: The ability to integrate and weight information across dimensions is central to perception and is particularly important for speech categorization. The present experiments investigate cue weighting by training participants to categorize sounds drawn from a two-dimensional acoustic space defined by the center frequency (CF) and modulation frequency (MF) of frequency-modulated sine waves. These dimensions were psychophysically matched to be equally discriminable and, in the first experiment, were equally informative for accurate categorization. Nevertheless, listeners' category responses reflected a bias for use of CF. This bias remained even when the informativeness of CF was decreased by shifting distributions to create more overlap in CF. A reversal of weighting (MF over CF) was obtained when distribution variance was increased for CF. These results demonstrate that even when equally informative and discriminable, acoustic cues are not necessarily equally weighted in categorization; listeners exhibit biases when integrating multiple acoustic dimensions. Moreover, changes in weighting strategies can be affected by changes in input distribution parameters. This methodology provides potential insights into acquisition of speech sound categories, particularly second language categories. One implication is that ineffective cue weighting strategies for phonetic categories may be alleviated by manipulating variance of uninformative dimensions in training stimuli.
TL;DR: Results lead this cross-language study of the categorical nature of tone perception to adopt a memory-based, multistore model of perception in which categorization is domain-general but influenced by long-term categorical representations.
Abstract: Whether or not categorical perception results from the operation of a special, language-specific, speech mode remains controversial. In this cross-language (Mandarin Chinese, English) study of the categorical nature of tone perception, we compared native Mandarin and English speakers’ perception of a physical continuum of fundamental frequency contours ranging from a level to rising tone in both Mandarin speech and a homologous (nonspeech) harmonic tone. This design permits us to evaluate the effect of language experience by comparing Chinese and English groups; to determine whether categorical perception is speech-specific or domain-general by comparing speech to nonspeech stimuli for both groups; and to examine whether categorical perception involves a separate categorical process, distinct from regions of sensory discontinuity, by comparing speech to nonspeech stimuli for English listeners. Results show evidence of strong categorical perception of speech stimuli for Chinese but not English listeners. Categorical perception of nonspeech stimuli was comparable to that for speech stimuli for Chinese but weaker for English listeners, and perception of nonspeech stimuli was more categorical for English listeners than was perception of speech stimuli. These findings lead us to adopt a memory-based, multistore model of perception in which categorization is domain-general but influenced by long-term categorical representations.
TL;DR: The findings of these experiments indicate that regional accent normalization involves a short-term adjustment mechanism that develops as a certain amount of accented signal is available, resulting in a temporary perturbation in speech processing.
Abstract: The processing costs involved in regional accent normalization were evaluated by measuring differences in lexical decision latencies for targets placed at the end of sentences with different French regional accents. Over a series of 6 experiments, the authors examined the time course of comprehension disruption by manipulating the duration and presentation conditions of accented speech. Taken together, the findings of these experiments indicate that regional accent normalization involves a short-term adjustment mechanism that develops as a certain amount of accented signal is available, resulting in a temporary perturbation in speech processing.
TL;DR: Functional benefits from bilateral stimulation in 20 children ages 4-14 show that both groups perform similarly when speech reception thresholds are evaluated, but there appears to be benefit from wearing two devices compared with a single device that is significantly greater in the group with two CIs than in the bimodal group.
Abstract: This study evaluated functional benefits from bilateral stimulation in 20 children ages 4-14; 10 use two cochlear implants (CIs) and 10 use one CI and one hearing aid (HA). Localization acuity was measured with the minimum audible angle (MAA). Speech intelligibility was measured in quiet, and in the presence of 2-talker competing speech using the CRISP forced-choice test. Results show that both groups perform similarly when speech reception thresholds are evaluated. However, there appears to be benefit (improved MAA and speech thresholds) from wearing two devices compared with a single device that is significantly greater in the group with two CIs than in the bimodal group. Individual variability also suggests that some children perform similarly to normal-hearing children, while others clearly do not. Future advances in binaural fitting strategies and improved speech processing schemes that maximize binaural sensitivity will no doubt contribute to increasing the binaurally-driven advantages in persons with bilateral CIs.
TL;DR: In two studies the first formant of monosyllabic consonant-vowel-consonant words was shifted electronically and fed back to the participant very quickly so that participants perceived the modified speech as their own productions and appeared to more actively stabilize their productions from trial-to-trial.
Abstract: Auditory feedback during speech production is known to play a role in speech sound acquisition and is also important for the maintenance of accurate articulation. In two studies the first formant (F1) of monosyllabic consonant-vowel-consonant words (CVCs) was shifted electronically and fed back to the participant very quickly so that participants perceived the modified speech as their own productions. When feedback was shifted up (experiments 1 and 2) or down (experiment 1) participants compensated by producing F1 in the opposite frequency direction from baseline. The threshold size of manipulation that initiated a compensation in F1 was usually greater than 60 Hz. When normal feedback was returned, F1 did not return immediately to baseline but showed an exponential deadaptation pattern. Experiment 1 showed that this effect was not influenced by the direction of the F1 shift, with both raising and lowering of F1 exhibiting the same effects. Experiment 2 showed that manipulating the number of trials that F1 ...
TL;DR: In this article, a time-varying Wiener filter is used to specify the ratio of a target signal and a noisy mixture in a local time-frequency unit, which is then fed to a conventional speech recognizer operating in the cepstral domain.
TL;DR: The rapid formant compensations found here suggest that auditory feedback control is similar for both F0 and formants.
Abstract: Auditory feedback influences human speech production, as demonstrated by studies using rapid pitch and loudness changes. Feedback has also been investigated using the gradual manipulation of formants in adaptation studies with whispered speech. In the work reported here, the first formant of steady-state isolated vowels was unexpectedly altered within trials for voiced speech. This was achieved using a real-time formant tracking and filtering system developed for this purpose. The first formant of vowel /ɛ/ was manipulated 100% toward either /æ/ or /ɪ/, and participants responded by altering their production with average F1 compensation as large as 16.3% and 10.6% of the applied formant shift, respectively. Compensation was estimated to begin <460 ms after stimulus onset. The rapid formant compensations found here suggest that auditory feedback control is similar for both F0 and formants.
TL;DR: This paper proposes a class of VAD algorithms based on several statistical models, incorporating the complex Laplacian and Gamma probability density functions in addition to the Gaussian model in the analysis of statistical properties.
Abstract: One of the key issues in practical speech processing is to achieve robust voice activity detection (VAD) against the background noise. Most of the statistical model-based approaches have tried to employ the Gaussian assumption in the discrete Fourier transform (DFT) domain, which, however, deviates from the real observation. In this paper, we propose a class of VAD algorithms based on several statistical models. In addition to the Gaussian model, we also incorporate the complex Laplacian and Gamma probability density functions into our analysis of statistical properties. With goodness-of-fit tests, we analyze the statistical properties of the DFT spectra of the noisy speech under various noise conditions. Based on the statistical analysis, the likelihood ratio test under the given statistical models is established for the purpose of VAD. Since the statistical characteristics of the speech signal are differently affected by the noise types and levels, to cope with the time-varying environments, our approach is aimed at adaptively finding an appropriate statistical model in an online fashion. The performance of the proposed VAD approaches in both the stationary and nonstationary noise environments is evaluated with the aid of an objective measure.
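A likelihood ratio test of the kind mentioned above can be sketched for the Gaussian case only. The sketch assumes each DFT coefficient is zero-mean complex Gaussian with known noise and speech variances; the Laplacian and Gamma models and the online model selection from the paper are not covered, and the function names and the zero threshold are hypothetical.

```python
import math


def gaussian_log_lr(power, noise_var, speech_var):
    """Log-likelihood ratio (speech present vs. absent) for one DFT bin
    under a zero-mean complex Gaussian model; power is |X_k|^2."""
    xi = speech_var / noise_var   # ratio of speech to noise variance
    gamma = power / noise_var     # observed power relative to noise variance
    return gamma * xi / (1.0 + xi) - math.log(1.0 + xi)


def is_speech(frame_power, noise_var, speech_var, threshold=0.0):
    """Average the per-bin log-likelihood ratios and compare with a
    threshold (the default of 0.0 is an illustrative choice)."""
    llrs = [gaussian_log_lr(p, n, s)
            for p, n, s in zip(frame_power, noise_var, speech_var)]
    return sum(llrs) / len(llrs) > threshold


noise_var = [1.0] * 4
speech_var = [4.0] * 4
print(is_speech([1.0] * 4, noise_var, speech_var))  # power at the noise level
print(is_speech([5.0] * 4, noise_var, speech_var))  # power at speech-plus-noise level
```

In practice the variances would have to be estimated from the signal, and swapping in a different density changes only the per-bin log-likelihood ratio.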
TL;DR: A content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing to discriminate speech from nonspeech consisting of animal vocalizations, music, and environmental sounds is described.
Abstract: We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from nonspeech consisting of animal vocalizations, music, and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multilinear dimensionality reduction technique and classified by a support vector machine (SVM). Generalization of the system to signals with high levels of additive noise and reverberation is evaluated and compared to two existing approaches (Scheirer and Slaney, 2002 and Kingsbury et al., 2002). The results demonstrate the advantages of the auditory model over the other two systems, especially at low signal-to-noise ratios (SNRs) and high reverberation.
TL;DR: Tests of the smooth signal redundancy hypothesis with a very high-quality corpus collected for speech synthesis confirm the duration/language redundancy results achieved in previous work, and show a significant relationship between language redundancy factors and the first two formants, although these results vary considerably by vowel.
Abstract: The language redundancy of a syllable, measured by its predictability given its context and inherent frequency, has been shown to have a strong inverse relationship with syllabic duration. This relationship is predicted by the smooth signal redundancy hypothesis, which proposes that robust communication in a noisy environment can be achieved with an inverse relationship between language redundancy and the predictability given acoustic observations (acoustic redundancy). A general version of the hypothesis predicts similar relationships between the spectral characteristics of speech and language redundancy. However, investigating this claim is hampered by difficulties in measuring the spectral characteristics of speech within large conversational corpora, and difficulties in forming models of acoustic redundancy based on these spectral characteristics. This paper addresses these difficulties by testing the smooth signal redundancy hypothesis with a very high-quality corpus collected for speech synthesis, and presents both durational and spectral data from vowel nuclei on a vowel-by-vowel basis. Results confirm the duration/language redundancy results achieved in previous work, and show a significant relationship between language redundancy factors and the first two formants, although these results vary considerably by vowel. In general, however, vowels show increased centralization with increased language redundancy.
TL;DR: An overview of the various vocoder-centric processing strategies proposed for cochlear implants since the late 1990s is provided including the strategies used in different commercially available implant processors.
Abstract: The principles of the most recent cochlear implant processors are similar to those of the channel vocoder, originally used for transmitting speech over telephone lines with much less bandwidth than that required for transmitting the unprocessed speech signal. An overview of the various vocoder-centric processing strategies proposed for cochlear implants since the late 1990s is provided, including the strategies used in different commercially available implant processors. Special emphasis is placed on reviewing the strategies designed to enhance pitch information for potentially better music perception. The various noise suppression strategies proposed over the years based on multi-microphone and single-microphone inputs are also described.
TL;DR: A comparison with a recent enhancement algorithm is made on a corpus of speech utterances in a number of reverberant conditions, and the results show that the proposed algorithm performs substantially better.
Abstract: Under noise-free conditions, the quality of reverberant speech is dependent on two distinct perceptual components: coloration and long-term reverberation. They correspond to two physical variables: signal-to-reverberant energy ratio (SRR) and reverberation time, respectively. Inspired by this observation, we propose a two-stage reverberant speech enhancement algorithm using one microphone. In the first stage, an inverse filter is estimated to reduce coloration effects or increase SRR. The second stage employs spectral subtraction to minimize the influence of long-term reverberation. The proposed algorithm significantly improves the quality of reverberant speech. A comparison with a recent enhancement algorithm is made on a corpus of speech utterances in a number of reverberant conditions, and the results show that our algorithm performs substantially better.
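The second stage described above relies on spectral subtraction. A generic power-domain version for a single frame might look like the sketch below. How the long-term reverberation estimate is obtained is not stated in the abstract, so it is simply passed in as an input; the function name and the spectral floor are illustrative assumptions rather than the authors' algorithm.

```python
def spectral_subtract(power_spectrum, interference_estimate, floor=0.01):
    """Subtract an estimated interference power from each bin, keeping at
    least floor * original power so that no bin becomes negative."""
    return [max(p - e, floor * p)
            for p, e in zip(power_spectrum, interference_estimate)]


frame = [4.0, 1.0, 0.5]      # observed power per frequency bin
estimate = [1.0, 2.0, 0.25]  # assumed estimate of late-reverberation power
print(spectral_subtract(frame, estimate))
```

The floor keeps over-subtracted bins (the second bin here) at a small positive value instead of clipping them to zero.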
TL;DR: In this paper, a method for improving the quality of a speech signal extracted from a noisy acoustic environment is provided, where a signal separation process (180) is associated with a voice activity detector (185).
Abstract: A method for improving the quality of a speech signal extracted from a noisy acoustic environment is provided. In one approach, a signal separation process (180) is associated with a voice activity detector (185). The voice activity detector (185) is a two-channel (178,182) detector, which enables a particularly robust and accurate detection of voice activity. When speech is detected, the voice activity detector generates a control signal (411). The control signal (411) is used to activate, adjust, or control signal separation processes or post-processing operations (195) to improve the quality of the resulting speech signal. In another approach, a signal separation process (180) is provided as a learning stage (752) and an output stage (756). The learning stage (752) aggressively adjusts to current acoustic conditions and passes coefficients to the output stage (756). The output stage (756) adapts more slowly and generates a speech-content signal (181,770) and a noise dominant signal (407,773). When the learning stage (752) becomes unstable, only the learning stage (752) is reset, allowing the output stage (756) to continue outputting a high quality speech signal.
TL;DR: In this article, the authors proposed a speech enhancement system that is able to suppress highly non-stationary noise, which can be adapted to a hearing aid or a headset, using a speech model and a noise model having at least one shape and gain.
Abstract: A central aspect of the invention relates to a method of enhancing speech, the method comprising the steps of: receiving noisy speech comprising a clean speech component and a non-stationary noise component; providing a speech model; providing a noise model having at least one shape and a gain; dynamically modifying the noise model based on the speech model and the received noisy speech; and enhancing the noisy speech at least based on the modified noise model. Hereby is achieved a method of speech enhancement that is able to suppress highly non-stationary noise. Another aspect of the invention relates to a speech enhancement system that may be adapted to be used in a hearing system, such as a hearing aid or a headset.
TL;DR: Researchers examine how the infant brain processes verbal stimuli before learning to reveal a structural and functional organization close to what is described in adults and suggest a strong bias for speech processing in these regions that might guide infants as they discover the properties of their native language.
TL;DR: This study applies quantitative methods to the speech and music of England and France to reveal that music reflects patterns of durational contrast between successive vowels in spoken sentences, as well as patterns of pitch interval variability in speech.
Abstract: For over half a century, musicologists and linguists have suggested that the prosody of a culture's native language is reflected in the rhythms and melodies of its instrumental music. Testing this idea requires quantitative methods for comparing musical and spoken rhythm and melody. This study applies such methods to the speech and music of England and France. The results reveal that music reflects patterns of durational contrast between successive vowels in spoken sentences, as well as patterns of pitch interval variability in speech. The methods presented here are suitable for studying speech-music relations in a broad range of cultures.
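The abstract does not name the measure used for durational contrast between successive vowels. One widely used option is the normalized Pairwise Variability Index (nPVI); treating it as the measure here is an assumption, and the sketch is offered only to show what such a quantitative comparison can look like.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index: 100 times the mean of
    |d_k - d_(k+1)| divided by the mean of each adjacent pair."""
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    terms = [abs(a - b) / ((a + b) / 2.0)
             for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical vowel durations in milliseconds.
print(npvi([100.0, 100.0, 100.0]))  # identical durations, no contrast
print(npvi([100.0, 200.0, 100.0]))  # alternating long and short durations
```

Because each difference is normalized by the local mean, the index does not depend on overall speaking rate or tempo, so the same function can be applied to vowel durations in speech and to note durations in music.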
TL;DR: A fast-acting level control circuit for the cGC filter is described and it is shown how psychophysical data involving two-tone suppression and compression can be used to estimate the parameter values for this dynamic version of the cGC filter (referred to as the "dcGC" filter).
Abstract: It is now common to use knowledge about human auditory processing in the development of audio signal processors. Until recently, however, such systems were limited by their linearity. The auditory filter system is known to be level-dependent as evidenced by psychophysical data on masking, compression, and two-tone suppression. However, there were no analysis/synthesis schemes with nonlinear filterbanks. This paper describes such a scheme based on the compressive gammachirp (cGC) auditory filter. It was developed to extend the gammatone filter concept to accommodate the changes in psychophysical filter shape that are observed to occur with changes in stimulus level in simultaneous, tone-in-noise masking. In models of simultaneous noise masking, the temporal dynamics of the filtering can be ignored. Analysis/synthesis systems, however, are intended for use with speech sounds where the glottal cycle can be long with respect to auditory time constants, and so they require specification of the temporal dynamics of the auditory filter. In this paper, we describe a fast-acting level control circuit for the cGC filter and show how psychophysical data involving two-tone suppression and compression can be used to estimate the parameter values for this dynamic version of the cGC filter (referred to as the "dcGC" filter). One important advantage of analysis/synthesis systems with a dcGC filterbank is that they can inherit previously refined signal processing algorithms developed with conventional short-time Fourier transforms (STFTs) and linear filterbanks.
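For orientation, the basic (static, level-independent) gammachirp impulse response that extends the gammatone can be sketched as follows. This is not the compressive or dynamic filter described in the paper: the level control circuit is not modeled, and the default parameter values (order 4, bandwidth factor 1.019, chirp term -2.0) are illustrative assumptions.

```python
import math


def erb_hz(freq_hz):
    """Equivalent rectangular bandwidth in Hz (a commonly used approximation)."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)


def gammachirp(t, fr=1000.0, n=4, b=1.019, c=-2.0, amp=1.0, phase=0.0):
    """Gammachirp impulse response at time t in seconds.
    With c = 0 the chirp term vanishes and this is a gammatone."""
    if t <= 0.0:
        return 0.0
    envelope = amp * t ** (n - 1) * math.exp(-2.0 * math.pi * b * erb_hz(fr) * t)
    return envelope * math.cos(2.0 * math.pi * fr * t + c * math.log(t) + phase)


# Sample 10 ms of the response at 16 kHz (the sampling rate is an assumption).
fs = 16000
response = [gammachirp(k / fs) for k in range(160)]
print(len(response))
```

The `c * math.log(t)` term makes the instantaneous frequency glide over time, which is what produces the asymmetric filter shape that a gammatone cannot represent.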
TL;DR: It is proposed that phonological short-term memory (pSTM) arises from the cycling of information between two phonological buffers, one involved in speech perception and one in speech production, and that models of speech perception and production, and the understanding of their neural bases, will benefit from incorporating these processes.
Abstract: Traditionally, models of speech comprehension and production do not depend on concepts and processes from the phonological short-term memory (pSTM) literature. Likewise, in working memory research, pSTM is considered to be a language-independent system that facilitates language acquisition rather than speech processing per se. We discuss couplings between pSTM, speech perception and speech production, and we propose that pSTM arises from the cycling of information between two phonological buffers, one involved in speech perception and one in speech production. We discuss the specific role of these processes in speech processing, and argue that models of speech perception and production, and our understanding of their neural bases, will benefit from incorporating them.
TL;DR: The ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages, uses a parallel multilingual database consisting of over 600 000 sentences that cover a broad range of travel-related conversations.
Abstract: In this paper, we describe the ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages (Japanese and Chinese). There are three main modules of our S2ST system: large-vocabulary continuous speech recognition, machine text-to-text (T2T) translation, and text-to-speech synthesis. All of them are multilingual and are designed using state-of-the-art technologies developed at ATR. A corpus-based statistical machine learning framework forms the basis of our system design. We use a parallel multilingual database consisting of over 600 000 sentences that cover a broad range of travel-related conversations. Recent evaluation of the overall system showed that speech-to-speech translation quality is high, being at the level of a person having a Test of English for International Communication (TOEIC) score of 750 out of the perfect score of 990.
TL;DR: Online measures of speech processing are used in a looking-while-listening procedure and suggest familiar frames may enable the infant to 'listen ahead' more efficiently for the focused word at the end of the sentence.
Abstract: In child-directed speech (CDS), adults often use utterances with very few words; many include short, frequently used sentence frames, while others consist of a single word in isolation. Do such features of CDS provide perceptual advantages for the child? Based on descriptive analyses of parental speech, some researchers argue that isolated words should help infants in word recognition by facilitating segmentation, while others predict no advantage. To address this question directly, we used online measures of speech processing in a looking-while-listening procedure. In two experiments, 18-month-olds were presented with familiar object names in isolation and in a sentence frame. Infants were 120 ms slower to interpret target words in isolation than when the same words were preceded by a familiar carrier phrase, suggesting that the sentence frame facilitated word recognition. Familiar frames may enable the infant to ‘listen ahead’ more efficiently for the focused word at the end of the sentence.
TL;DR: The present findings suggest that activation of the neural speech representations in the left STSp might be a pre-requisite for hearing sounds as speech.
TL;DR: The improvement of speech in noise and melody recognition is linked to the ability to distinguish fine pitch differences as the result of preserved residual low-frequency acoustic hearing.
Abstract: Aim: This communication details the latest preliminary results from an ongoing multicenter single-subject design clinical trial of the Iowa/Nucleus Hybrid 10-mm cochlear implant. Selection criteria, surgical strategies used for hearing preservation, and the benefits of preserved residual low-frequency hearing, improved word understanding in noise, and music appreciation are described. Patients and Methods: The device has been implanted in 48 individuals with residual low-frequency hearing. Results: Hearing preservation has been accomplished in 46/48 subjects. Acoustic speech perception has also been preserved. Combined acoustic plus electric speech processing has enabled most of this group of volunteers to gain improved word understanding as compared to their preoperative hearing with bilateral hearing aids. A subset of subjects with 12 months or more experience demonstrates CNC word understanding continues to improve more than 24 months after implantation. Improved word understanding in noise is also a benefit of acoustic plus electric speech processing. Conclusions: The improvement of speech in noise and melody recognition is linked to the ability to distinguish fine pitch differences as the result of preserved residual low-frequency acoustic hearing. Both of these measures are very important in real life to the hearing impaired. Preservation of residual low-frequency hearing should be considered when expanding candidate selection criteria for standard cochlear implants.