TL;DR: WFSTs provide a common and natural representation for hidden Markov models (HMMs), context-dependency, pronunciation dictionaries, grammars, and alternative recognition outputs, and general transducer operations combine these representations flexibly and efficiently.
TL;DR: It is shown that HMMs trained with MMIE benefit as much as MLE-trained HMMs from model adaptation using maximum likelihood linear regression (MLLR), which has allowed the straightforward integration of MMIE-trained HMMs into complex multi-pass systems for transcription of conversational telephone speech.
TL;DR: This paper presents an approach to recognition confidence scoring and a set of techniques for integrating confidence scores into the understanding and dialogue components of a speech understanding system and demonstrates a relative reduction in concept error rate.
TL;DR: This paper presents SPoT, a trainable sentence planner, and a new methodology for automatically training SPoT on the basis of feedback provided by human judges, and shows that SPoT performs better than the rule-based systems and the baselines, and as well as the hand-crafted system.
TL;DR: A direct meaning-to-speech mapping eliminates the need to analyze linguistic structure for synthesis in the Mercury flight reservation system, a mixed-initiative spoken dialogue system that supports both voice-only interaction and multi-modal interaction augmenting spoken inputs with typing or clicking at a displayed Web page.
TL;DR: This paper studies how decisions for word ordering and word choice in surface natural language generation can be automatically learned from annotated data, in order to find the highest-probability word sequence that is consistent with the rules and conditions of the grammar.
TL;DR: This paper defines two alternatives to the familiar perplexity statistic, respectively acoustic perplexity and the synthetic acoustic word error rate, and shows how to compute these statistics by effectively synthesizing a large acoustic corpus.
TL;DR: This paper reports on the application of across-word context-dependent acoustic phoneme models in a single-pass large vocabulary continuous speech recognizer and derives a formal specification of across-word word graphs, which are a good representation of the active search space.
TL;DR: A generation system for spoken dialogue that not only produces coherent, informative and responsive dialogue contributions, but also explicitly models human styles of interaction is described.
TL;DR: This paper describes how language generation and speech synthesis for spoken dialog systems can be efficiently integrated under a weighted finite-state transducer architecture and shows that introducing flexible targets in generation leads to more natural-sounding synthesis.
TL;DR: This research is motivated by several goals: improving the quality of synthesis by using the generator to provide information about the purpose, meaning, and linguistic structure of the utterance to the synthesis process, and making it possible to very quickly customize systems that generate spoken language to individual users, sets of users, or new domains.
TL;DR: Three groups of features are investigated: semantic, syntactic, and surface features produced by SURGE, a general-purpose surface natural language generator for English; deep semantic and discourse features that are available during the domain modeling and content planning phases of generation; and information-based measures statistically derived from text.
TL;DR: Investigation into the use of phonologically-constrained morphological analysis (PCMA) in language modelling for continuous speech recognition shows that PCMA leads to smaller but more generative pronunciation lexicons, and that it does not weaken the quality of the acoustic decoding measured in terms of recognition lattices.
TL;DR: This paper discusses and compares ways in which information from natural language generation can be used to compute prosody in a concept-to-speech system, focusing on the automatic marking of contrastive accents on the basis of information about the preceding discourse.
TL;DR: A new form of factorial HMM that makes use of transformation streams is introduced; it is a generalization of the standard factorial HMM and other related schemes in speech processing.
TL;DR: It is shown that a simple statistical model alone can generate appropriate language for a spoken dialog system, and a promising avenue for using a statistical approach in future NLG systems is described.
TL;DR: A number of decoding strategies for large vocabulary continuous speech recognition (LVCSR) are examined from the viewpoint of their search space representation, and the main approaches are compared and some prospective views are formulated regarding possible future avenues.
TL;DR: Improvements in recognition accuracy due to multiple microphones, HMM training on contaminated speech, and incremental adaptation are additive on a connected-digits task, and the results show that unsupervised incremental adaptation benefits from starting from models trained using contaminated speech.
TL;DR: A spoken language generation system is presented that learns to describe objects in computer-generated visual scenes; it generates syntactically well-formed compound adjective noun phrases as well as relative spatial clauses, and its output was comparable to human-generated descriptions.
TL;DR: Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for the success of the approach, and that the acoustic models can be initialized with as little as 10 minutes of manually annotated data.
TL;DR: The proposed algorithm, called structural MAPLR (SMAPLR), has been evaluated on the Spoke3 1993 test set of the WSJ task and it is shown that SMAPLR reduces the risk of overtraining and exploits the adaptation data much more efficiently than MLLR, leading to a significant reduction of the word error rate for any amount of adaptation data.