Triphone

Papers published on a yearly basis

Papers

Journal Article•10.1109/TASL.2011.2134090•

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

George E. Dahl¹, Dong Yu², Li Deng², Alex Acero²•Institutions (2)

University of Toronto¹, Microsoft²

01 Jan 2012-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture trains the DNN to produce a distribution over senones (tied triphone states) as its output, and can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.

Abstract: We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
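As a rough illustration of the hybrid idea in this abstract, the sketch below turns DNN outputs into a distribution over senones and then into scaled log-likelihoods that an HMM decoder could consume. Dividing the posterior by the senone prior is the usual hybrid-system convention; the function names, the three-senone example, and the prior values are assumptions made for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    """Turn raw DNN outputs into a probability distribution over senones."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_log_likelihoods(logits, senone_priors):
    """Return log p(senone | frame) - log p(senone) for each senone.

    The result is proportional to log p(frame | senone), which is the
    quantity an HMM decoder expects in place of a GMM likelihood.
    """
    posteriors = softmax(logits)
    return [math.log(post) - math.log(prior)
            for post, prior in zip(posteriors, senone_priors)]

# Hypothetical example: three senones, equal DNN outputs, unequal priors.
example_logits = [0.0, 0.0, 0.0]
example_priors = [0.5, 0.25, 0.25]
example_scores = scaled_log_likelihoods(example_logits, example_priors)
```

With equal posteriors, the rarer senones receive higher scaled scores, which shows why the prior matters when the training data is unevenly distributed across senones.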

3,656 citations

Proceedings Article•10.1109/ASRU.2011.6163899•

Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription

Frank Seide¹, Gang Li¹, Xie Chen¹, Dong Yu¹•Institutions (1)

Microsoft¹

01 Dec 2011

TL;DR: This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.

Abstract: We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. Recently, we had shown that for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third—from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs with HLDA features, to 18.5%—using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers.
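The "as much as one third" figure can be checked from the two word error rates quoted in this abstract. The helper below is a minimal sketch; the function name is hypothetical, and only the 27.4% and 18.5% values come from the text.

```python
def relative_reduction(baseline_rate, improved_rate):
    """Fraction of the baseline error rate that was removed."""
    return (baseline_rate - improved_rate) / baseline_rate

# Word error rates quoted in the abstract, in percent.
gmm_wer = 27.4
dnn_wer = 18.5

absolute_drop = gmm_wer - dnn_wer                    # percentage points
relative_drop = relative_reduction(gmm_wer, dnn_wer)  # fraction of baseline

print(f"absolute: {absolute_drop:.1f} points, relative: {relative_drop:.1%}")
```

The relative reduction comes out at roughly 32.5%, which is consistent with the abstract's description of a reduction of about one third.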

773 citations

Journal Article•10.1109/29.45616•

An overview of the SPHINX speech recognition system

Kai-Fu Lee¹, H.-W. Hon¹, Raj Reddy¹•Institutions (1)

Carnegie Mellon University¹

01 May 1990-IEEE Transactions on Acoustics, Speech, and Signal Processing

TL;DR: SPHINX is a system that demonstrates the feasibility of accurate, large-vocabulary, speaker-independent, continuous speech recognition, based on discrete hidden Markov models with LPC- (linear-predictive-coding) derived parameters.

Abstract: A description is given of SPHINX, a system that demonstrates the feasibility of accurate, large-vocabulary, speaker-independent, continuous speech recognition. SPHINX is based on discrete hidden Markov models (HMMs) with LPC- (linear-predictive-coding) derived parameters. To provide speaker independence, knowledge was added to these HMMs in several ways: multiple codebooks of fixed-width parameters, and an enhanced recognizer with carefully designed models and word-duration modeling. To deal with coarticulation in continuous speech, yet still adequately represent a large vocabulary, two new subword speech units are introduced: function-word-dependent phone models and generalized triphone models. With grammars of perplexity 997, 60, and 20, SPHINX attained word accuracies of 71, 94, and 96%, respectively, on a 997-word task.
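For readers unfamiliar with the unit itself, a triphone is a phone model conditioned on its left and right neighbours. The sketch below expands a phone sequence into plain triphone labels. The `left-center+right` notation, the function name, and the example phone sequence are assumptions chosen for illustration; the sketch does not implement the clustering step that turns triphones into SPHINX's generalized triphones.

```python
def to_triphones(phones):
    """Label each phone with its left and right context where one exists."""
    labels = []
    for i, center in enumerate(phones):
        label = center
        if i > 0:
            # Prepend the preceding phone as left context.
            label = f"{phones[i - 1]}-{label}"
        if i < len(phones) - 1:
            # Append the following phone as right context.
            label = f"{label}+{phones[i + 1]}"
        labels.append(label)
    return labels

# Hypothetical three-phone word.
example_labels = to_triphones(["k", "ae", "t"])
```

Here the middle phone becomes `k-ae+t`, while the edge phones keep only the context that exists inside the sequence. Because the number of distinct labels grows quickly with the phone inventory, systems typically merge similar contexts, which is the motivation for generalized or tied triphones.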

515 citations

Journal Article•10.1006/CSLA.2001.0182•

Large scale discriminative training of hidden Markov models for speech recognition

Philip C. Woodland¹, Daniel Povey¹•Institutions (1)

University of Cambridge¹

01 Jan 2002-Computer Speech & Language

TL;DR: It is shown that HMMs trained with MMIE benefit as much as MLE-trained HMMs from applying model adaptation using maximum likelihood linear regression (MLLR), which has allowed the straightforward integration of MMIE-trained HMMs into complex multi-pass systems for transcription of conversational telephone speech.

396 citations

Patent•10.1121/1.420245•

Single tree method for grammar directed, very large vocabulary speech recognizer

Richard Schwartz, Long Nguyen

19 Jan 1994-Journal of the Acoustical Society of America

TL;DR: The invention provides a method of large vocabulary speech recognition that employs a single tree-structured phonetic hidden Markov model (HMM) at each frame of a time-synchronous process, and phonetic context information is exploited, even before the complete context of a phoneme is known.

Abstract: The invention provides a method of large vocabulary speech recognition that employs a single tree-structured phonetic hidden Markov model (HMM) at each frame of a time-synchronous process. A grammar probability is utilized upon recognition of each phoneme of a word, before recognition of the entire word is complete. Thus, grammar probabilities are exploited as early as possible during recognition of a word. At each frame of the recognition process, a grammar probability is determined for the transition from the most likely preceding grammar state to a set of words that share at least one common phoneme. The grammar probability is combined with accumulating phonetic evidence to provide a measure of the likelihood that a state in the HMM will lead to the word most likely to have been spoken. In a preferred embodiment, phonetic context information is exploited, even before the complete context of a phoneme is known. Instead of an exact triphone model, wherein the phonemes previous and subsequent to a phoneme are considered, a composite triphone model is used that exploits partial phonetic context information to provide a phonetic model that is more accurate than a phonetic model that ignores context. In another preferred embodiment, the single phonetic tree method is used as the forward pass of a forward/backward recognition process, wherein the backward pass employs a recognition process other than the single phonetic tree method.

278 citations

Year	Papers
2021	10
2020	7
2019	22
2018	14
2017	9
2016	18
