Scispace (Formerly Typeset)
Showing papers in "International Journal of Speech Technology in 2003"
Journal Article•10.1023/A:1023474405658•
Evolution of the Information-Retrieval System for Blind and Visually-Impaired People

Simon Dobrišek1, Jerneja Gros1, Boštjan Vesnicer1, Nikola Pavešić1•
University of Ljubljana1
01 Jul 2003-International Journal of Speech Technology
TL;DR: In the latest version of the system all the modules of the early version are being integrated into the user interface, which has some basic web-browsing functionalities and a text-to-speech screen-reader function controlled by the mouse as well.
Abstract: Blind and visually-impaired people face many problems in interacting with information retrieval systems. State-of-the-art spoken language technology offers potential to overcome many of them. In the mid-nineties our research group decided to develop an information retrieval system suitable for Slovene-speaking blind and visually-impaired people. A voice-driven text-to-speech dialogue system was developed for reading Slovenian texts obtained from the Electronic Information System of the Association of Slovenian Blind and Visually Impaired Persons Societies. The evolution of the system is presented. The early version of the system was designed to deal explicitly with the Electronic Information System where the available text corpora are stored in a plain text file format without any, or with just some, basic non-standard tagging. Further improvements to the system became possible with the decision to transfer the available corpora to the new web portal, exclusively dedicated to blind and visually-impaired users. The text files were reformatted into common HTML/XML pages, which comply with the basic recommendations set by the Web Access Initiative. In the latest version of the system all the modules of the early version are being integrated into the user interface, which has some basic web-browsing functionalities and a text-to-speech screen-reader function controlled by the mouse as well.

95 citations

Journal Article•10.1023/A:1023426522496•
Context-Independent Multilingual Emotion Recognition from Speech Signals

Vladimir Hozjan1, Zdravko Kacic1•
University of Maribor1
01 Jul 2003-International Journal of Speech Technology
TL;DR: Among speaker-dependent, monolingual, and multilingual emotion recognition, the difference between emotion recognition with all high-level features and emotion recognition with database-specific emotional features is smallest for multilingual emotion recognition (3.84%).
Abstract: This paper presents and discusses an analysis of multilingual emotion recognition from speech with database-specific emotional features. Recognition was performed on English, Slovenian, Spanish, and French InterFace emotional speech databases. The InterFace databases included several neutral speaking styles and six emotions: disgust, surprise, joy, fear, anger and sadness. Speech features for emotion recognition were determined in two steps. In the first step, low-level features were defined and in the second high-level features were calculated from low-level features. Low-level features are composed from pitch, derivative of pitch, energy, derivative of energy, and duration of speech segments. High-level features are statistical presentations of low-level features. Database-specific emotional features were selected from high-level features that contain the most information about emotions in speech. Speaker-dependent and monolingual emotion recognisers were defined, as well as multilingual recognisers. Emotion recognition was performed using artificial neural networks. The achieved recognition accuracy was highest for speaker-dependent emotion recognition, smaller for monolingual emotion recognition and smallest for multilingual recognition. The database-specific emotional features are most convenient for use in multilingual emotion recognition. Among speaker-dependent, monolingual, and multilingual emotion recognition, the difference between emotion recognition with all high-level features and emotion recognition with database-specific emotional features is smallest for multilingual emotion recognition—3.84%.

76 citations
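
The two-step feature extraction described in the abstract above (low-level contours first, then high-level statistics computed from them) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the particular statistics (mean, standard deviation, minimum, maximum, range), the function names, and the toy contours are all assumptions, and the segment-duration features mentioned in the abstract are omitted.

```python
from statistics import mean, pstdev


def derivative(contour):
    """First difference of a contour (frame-to-frame change)."""
    return [b - a for a, b in zip(contour, contour[1:])]


def high_level_features(name, contour):
    """Summarise one low-level contour with simple statistics (assumed set)."""
    return {
        f"{name}_mean": mean(contour),
        f"{name}_std": pstdev(contour),
        f"{name}_min": min(contour),
        f"{name}_max": max(contour),
        f"{name}_range": max(contour) - min(contour),
    }


def extract_features(pitch, energy):
    """Step 1: build low-level contours. Step 2: summarise each one."""
    low_level = {
        "pitch": pitch,
        "d_pitch": derivative(pitch),
        "energy": energy,
        "d_energy": derivative(energy),
    }
    features = {}
    for name, contour in low_level.items():
        features.update(high_level_features(name, contour))
    return features


# Toy contours, invented for illustration only.
pitch_hz = [120.0, 125.0, 131.0, 128.0]
energy_db = [60.0, 62.0, 61.0, 59.0]
features = extract_features(pitch_hz, energy_db)
```

A feature-selection step, such as the database-specific selection the abstract mentions, would then keep only a subset of the keys in `features` before training a classifier.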

Journal Article•10.1023/A:1021052023237•
Rare Events and Closed Domains: Two Delicate Concepts in Speech Synthesis

Bernd Möbius1•
University of Stuttgart1
01 Jan 2003-International Journal of Speech Technology
TL;DR: This paper discusses the problems pertinent to rare events in four components of speech synthesis systems: in linguistic text analysis, where productive word formation processes generate a potentially unbounded lexicon and cause heavily skewed word frequency distributions; in syllabification, where some syllables occur very frequently but most phonotactically possible syllables are very infrequent; in speech timing, where most constellations of factors affecting segmental duration are sparsely or not at all represented in training databases; and in unit selection synthesis, where the uneven distribution of speech unit frequencies poses challenges to speech corpus design.
Abstract: One of the most serious challenges for speech synthesis is the systematic treatment of events in language and speech that are known to have low frequencies of occurrence. The problems that extremely unbalanced frequency distributions pose for rule-based or data-driven models are often underestimated or even unrecognized. This paper discusses the problems pertinent to rare events in four components of speech synthesis systems: in linguistic text analysis, where productive word formation processes generate a potentially unbounded lexicon and cause heavily skewed word frequency distributions; in syllabification, where some syllables occur very frequently but most phonotactically possible syllables are very infrequent; in speech timing, where most constellations of factors affecting segmental duration are sparsely or not at all represented in training databases; and in unit selection synthesis, where the uneven distribution of speech unit frequencies poses challenges to speech corpus design. Currently available techniques for coping with the problem of rare or unseen events in each of these components are reviewed. Finally, a distinction is made between a strictly closed domain with a fixed vocabulary and a merely restricted domain with loopholes for unseen words and names, and the consequences of the respective type of domain for appropriate synthesis strategies are discussed.

68 citations

Journal Article•10.1023/A:1021091720511•
Close Shadowing Natural Versus Synthetic Speech

Gérard Bailly1•
Centre national de la recherche scientifique1
01 Jan 2003-International Journal of Speech Technology
TL;DR: Preliminary results show that speakers are able to follow natural stimuli with an average delay of 70 ms whereas this delay typically exceeds 100 ms for stimuli produced by text-to-speech systems.
Abstract: Close shadowing experiments involving natural and synthetic stimuli are described. Preliminary results show that speakers are able to follow natural stimuli with an average delay of 70 ms whereas this delay typically exceeds 100 ms for stimuli produced by text-to-speech systems. A complementary experiment shows that this contrast is mainly due to the inappropriate or impoverished prosody generated by actual text-to-speech systems.

42 citations

Journal Article•10.1023/A:1022382413579•
To Mix or Not to Mix Synthetic Speech and Human Speech? Contrasting Impact on Judge-Rated Task Performance versus Self-Rated Performance and Attitudinal Responses

Li Gong, Jennifer Lai1•
IBM1
01 Apr 2003-International Journal of Speech Technology
TL;DR: A consistency framework drawn from human psychological processing is offered to explain the difference in task performance, and cognitive processing and attitudinal response are differentiated.
Abstract: Since it is impractical to prerecord human speech for dynamic content such as email messages and news, many commercial speech applications use recorded human speech for fixed content (e.g. system prompts) and synthetic speech for dynamic content. However, mixing human speech and synthetic speech may not be optimal from a consistency perspective. A two-condition between-participants experiment (N = 24) was conducted to compare two versions of a telephony application for Personal Information Management (PIM). In the first condition, all the system output was delivered with synthetic speech. In the second condition, users heard a mix of human speech and synthetic speech. Users managed several email and calendar tasks. Users' task performance was rated by two independent judges. Their self-ratings of task performance and attitudinal responses were also measured by means of questionnaires. Users interacting with the interface that used only synthetic speech performed the task significantly better, while users interacting with the mixed-speech interface thought they did better and had more positive attitudinal responses. A consistency framework drawn from human psychological processing is offered to explain the difference in task performance. Cognitive processing and attitudinal response are differentiated. Design implications and directions for future research are suggested.

37 citations

Journal Article•10.1023/A:1023462002932•
Spoken Language Resources at LUKS of the University of Ljubljana

Jerneja Gros1, Simon Dobrišek1, Janez Žibert1, Nikola Pavešić1•
University of Ljubljana1
01 Jul 2003-International Journal of Speech Technology
TL;DR: This paper presents the Slovene-language spoken resources that were acquired at the Laboratory of Artificial Perception, Systems and Cybernetics (LUKS) at the Faculty of Electrical Engineering, University of Ljubljana over the past ten years.
Abstract: This paper presents the Slovene-language spoken resources that were acquired at the Laboratory of Artificial Perception, Systems and Cybernetics (LUKS) at the Faculty of Electrical Engineering, University of Ljubljana over the past ten years. The resources consist of: • isolated-spoken-word corpora designed for phonetic research of the Slovene spoken language; • read-speech corpora from dialogues relating to air flight information; • isolated-word corpora, designed for studying the Slovene spoken diphthongs; • Slovene diphone corpora used for text-to-speech synthesis systems; • a weather forecast speech database, as an attempt to capture radio and television broadcast news in the Slovene language; and • read- and spontaneous-speech corpora used to study the effects of the psycho physical conditions of the speakers on their speech characteristics. All the resources are accompanied by relevant text transcriptions, lexicons and various segmentation labels. The read-speech corpora relating to the air flight information domain also are annotated prosodically and semantically. The words in the orthographic transcription were automatically tagged for their lemma and morphosyntactic description. Many of the mentioned speech resources are freely available for basic research purposes in speech technology and linguistics. In this paper we describe all the resources in more detail and give a brief description of their use in the spoken language technology products developed at LUKS.

35 citations

Journal Article•10.1023/A:1022378312670•
Speech-Based Disclosure Systems: Effects of Modality, Gender of Prompt, and Gender of User

Clifford Nass1, Erica Robles1, Charles Heenan1, Hilary Bienstock1, Marissa Treinen1 •
Stanford University1
01 Apr 2003-International Journal of Speech Technology
TL;DR: This experiment explores the role interface design plays in maximizing disclosure of personal information: participants were asked to disclose personal information to a telephone-based speech user interface (SUI) in a 3 (recorded speech vs. synthesized speech vs. text-based interface) by 2 (gender of participant) by 2 (gender of voice) between-participants experiment (with no voice manipulation in the text conditions).
Abstract: Disclosure of personal information is valuable to individuals, governments, and corporations. This experiment explores the role interface design plays in maximizing disclosure. Participants (N = 100) were asked to disclose personal information to a telephone-based speech user interface (SUI) in a 3 (recorded speech vs. synthesized speech vs. text-based interface) by 2 (gender of participant) by 2 (gender of voice) between-participants experiment (with no voice manipulation in the text conditions). Synthetic speech participants exhibited significantly less disclosure and less comfort with the system than text-based or recorded-speech participants. Females were more sensitive to differences between synthetic and recorded speech. There were significant interactions between modality and gender of speech, while there were no gender identification effects. Implications for the design of speech-based information-gathering systems are outlined.

32 citations

Journal Article•10.1023/A:1022326328600•
Say it Like You Mean it: Priming for Structure in Caller Responses to a Spoken Dialog System

Tony Sheeder1, Jennifer Balogh1•
Nuance Communications1
01 Apr 2003-International Journal of Speech Technology
TL;DR: Findings indicate that examples encouraging a more natural structure, when presented prior to the initial query, result in significantly improved routing performance.
Abstract: In this paper we report results of a study undertaken to evaluate the initial prompts of ‘open prompt’ style call-routing applications. Specifically, we examined how placement and phrasing of examples in the initial query affected caller responses and routing success. We looked at the comparative effectiveness of placing examples before and after the initial query and of phrasing these examples such that they promoted either a succinct structure in the form of a keyword or phrase, or a more complex but natural structure in the form of a question or statement. Findings indicate that examples encouraging a more natural structure, when presented prior to the initial query, result in significantly improved routing performance. We discuss this result in the context of using initial prompts to prime for desired structure in caller responses.

28 citations

Journal Article•10.1023/A:1025761017833•
Speech Database Design for a Concatenative Text-to-Speech Synthesis System for Individuals with Communication Disorders

Akemi Iida1, Nick Campbell•
Keio University1
01 Oct 2003-International Journal of Speech Technology
TL;DR: This paper reports on a case study of the development of a VOCA using recordings of Japanese read speech from an individual with amyotrophic lateral sclerosis, and designed a speech database that could reproduce the characteristics of natural utterances in both general and specific situations.
Abstract: ATR's CHATR is a corpus-based text-to-speech (TTS) synthesis system that selects concatenation units from a natural speech database. The system's approach enables us to create a voice output communication aid (VOCA) using the voices of individuals who are anticipating the loss of phonatory functions. The advantage of CHATR is that individuals can use their own voice for communication even after vocal loss. This paper reports on a case study of the development of a VOCA using recordings of Japanese read speech (i.e., oral reading) from an individual with amyotrophic lateral sclerosis (ALS). In addition to using the individual's speech, we designed a speech database that could reproduce the characteristics of natural utterances in both general and specific situations. We created three speech corpora in Japanese to synthesize ordinary daily speech (i.e., in a normal speaking style): (1) a phonetically balanced sentence set, to assure that the system was able to synthesize all speech sounds; (2) readings of manuscripts, written by the same individual, for synthesizing talks regularly given as a source of natural intonation, articulation and voice quality; and (3) words and short phrases, to provide daily vocabulary entries for reproducing natural utterances in predictable situations. By combining one or more corpora, we were able to create four kinds of source database for CHATR synthesis. Using each source database, we synthesized speech from six test sentences. We selected the source database to use by observing selected units of synthesized speech and by performing perceptual experiments where we presented the speech to 20 Japanese native speakers. Analyzing the results of both observations and evaluations, we selected a source database compiled from all corpora. Incorporating CHATR, the selected source database, and an input acceleration function, we developed a VOCA for the individual to use in his daily life. We also created emotional speech source databases designed for loading separately to the VOCA in addition to the compiled speech database.

26 citations

Journal Article•10.1023/A:1021095805490•
Hierarchical structure and word strength prediction of Mandarin prosody

Greg Kochanski1, Chilin Shih1, Hongyan Jing1•
Alcatel-Lucent1
01 Jan 2003-International Journal of Speech Technology
TL;DR: An automatic learning system for Mandarin prosody is built that allows for quantitative measurements of prosodic strengths, and reveals strong alternating metrical patterns in words, and suggests that the speaker uses word strength to mark a hierarchy of sentence, clause, phrase, and word boundaries.
Abstract: We use Stem-ML to build an automatic learning system for Mandarin prosody that allows us to make quantitative measurements of prosodic strengths. Stem-ML is a phenomenological model of the muscle dynamics and planning process that controls the tension of the vocal folds. Because Stem-ML describes the interactions between nearby tones or accents, we were able to use a highly constrained model with only one accent template for each lexical tone category, and a single prosodic strength per word. The model accurately reproduces the intonation of the speaker, capturing 87% of the variance of the speech's fundamental frequency, f0. The result reveals strong alternating metrical patterns in words, and suggests that the speaker uses word strength to mark a hierarchy of sentence, clause, phrase, and word boundaries.

25 citations

Journal Article•10.1023/A:1023466103841•
Modelling Highly Inflected Slovenian Language

Mirjam Sepesy Maučec1, Tomaž Rotovnik1, Melita Zemljak1•
University of Maribor1
01 Jul 2003-International Journal of Speech Technology
TL;DR: This paper concerns the development of statistical language models of the Slovenian language for use in an automatic speech recognition system, and a novel variation on the N-gram modelling theme is examined where, instead of using words, modelling units are chosen to be stems and endings.
Abstract: This paper concerns the development of statistical language models of the Slovenian language for use in an automatic speech recognition system. The proposed techniques are language-independent and can be applied to other highly inflected Slavic languages. The large number of unique words in inflected languages is identified as the primary reason for performance degradation. This article discusses the concept of word-formation in the Slovenian language, which is also common to all Slavic languages. The main problems are outlined for word-based language models. A novel variation on the N-gram modelling theme is examined where, instead of using words, modelling units are chosen to be stems and endings. Only data-driven algorithms are employed, which decompose words automatically. A significant reduction in the OOV rate results when using stems and endings for modelling the Slovenian language. The final part of this article focuses on building a speech recogniser. Two different decoding strategies have been employed: one-pass and two-pass search decoders. Language modelling experiments have been performed using the VECER newswire text corpus, and recognition experiments have been conducted using the SNABI Slovenian speech database. The new language model resulted in the reduction of the OOV rate by 64%, and the recognition accuracy was improved by 4.34%.
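
To make the stem-and-ending idea in the abstract above concrete, the sketch below splits words into sub-word units and compares out-of-vocabulary (OOV) rates at the word level and the unit level. It is only an illustration under stated assumptions: the paper decomposes words with data-driven algorithms, whereas the `ENDINGS` tuple here is hand-picked, and the function names and the tiny train/test word lists are hypothetical.

```python
# Hand-picked endings: a stand-in for the automatically learned decomposition.
ENDINGS = ("a", "e", "i")


def split_word(word, endings=ENDINGS, min_stem=2):
    """Split a word into [stem, '+ending'] using the longest matching ending."""
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return [word[: -len(ending)], "+" + ending]
    return [word]  # no known ending: keep the word whole


def to_units(words, endings=ENDINGS):
    """Flatten a word sequence into a stem/ending unit sequence."""
    return [unit for word in words for unit in split_word(word, endings)]


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens never seen in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


# Toy data: inflected forms of two Slovenian nouns (hiša, miza).
train = ["hiša", "hiše", "miza", "mizi"]
test = ["hiši", "mize"]

word_oov = oov_rate(train, test)
unit_oov = oov_rate(to_units(train), to_units(test))
```

On this toy data both test words are unseen as whole words, yet every stem and ending in them occurs in the training units, which is the effect behind the OOV reduction the abstract reports. An N-gram model would then be estimated over the unit sequence returned by `to_units` instead of over words.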
Journal Article•10.1023/A:1025752731945•
A Text-to-Speech Platform for Variable Length Optimal Unit Searching Using Perception Based Cost Functions

Minkyu Lee1, Daniel P. Lopresti1, Joseph P. Olive1•
Alcatel-Lucent1
01 Oct 2003-International Journal of Speech Technology
TL;DR: This paper describes a new corpus-based Bell Labs' TTS system that encompasses large acoustic inventories with a rich set of annotations, models and data structures for representing and managing such inventories, and proposes a new method for setting weights in the cost functions based on a perceptual preference test.
Abstract: In concatenative Text-to-Speech, the size of the speech corpus is closely related to synthetic speech quality. In this paper, we describe our work on a new corpus-based Bell Labs' TTS system. This encompasses large acoustic inventories with a rich set of annotations, models and data structures for representing and managing such inventories, and an optimal unit selection algorithm that accommodates a broad range of possible cost criteria. We also propose a new method for setting weights in the cost functions based on a perceptual preference test. Our results show that this approach can successfully predict human preference patterns. Synthetic speech using weights determined in this manner consistently demonstrates smoother transitions and higher voice quality than speech using manually set weights.
Journal Article•10.1023/A:1021099922328•
Evaluating the Quality of an Integrated Model of German Prosody

Hansjörg Mixdorff1,2, Oliver Jokisch2•
Humboldt University of Berlin1, Dresden University of Technology2
01 Jan 2003-International Journal of Speech Technology
TL;DR: The results indicate that the integrated model generally receives better ratings than degraded stimuli with comparable durational and F0 deviations from the original, and an important outcome is the observation that the accuracy of the predicted syllable durations appears to be a stronger factor with respect to the perceived quality than the accuracy of the predicted F0 contour.
Abstract: The perceived quality of synthetic speech strongly depends on its prosodic naturalness. Departing from earlier works by Mixdorff on a linguistically motivated model of German intonation based on the Fujisaki model, an integrated approach to predicting F0 along with syllable duration and energy was developed. The current paper first presents some statistical results concerning the relationship between linguistic and phonetic information underlying an utterance and its prosodic features. These results were employed for training the MFN-based integrated prosodic model predicting syllable duration and energy along with syllable-aligned Fujisaki control parameters. The paper then focusses on the method of perceptual evaluation developed, comparing resynthesis stimuli created by controlled prosodic degrading of natural speech with stimuli created using the integrated model. The results indicate that the integrated model generally receives better ratings than degraded stimuli with comparable durational and F0 deviations from the original. An important outcome is the observation that the accuracy of the predicted syllable durations appears to be a stronger factor with respect to the perceived quality than the accuracy of the predicted F0 contour.
Journal Article•10.1023/A:1022338631326•
A Speech-Based Human-Computer Interaction System for Automating Directory Assistance Services

Kallirroi Georgila1, Kyriakos N. Sgarbas1, Anastasios Tsopanoglou, Nikos Fakotakis1, George Kokkinakis1 •
University of Patras1
01 Apr 2003-International Journal of Speech Technology
TL;DR: A spoken dialogue system for automating DAS with the use of Directed Acyclic Word Graphs and context-dependent phonological rules resulted in search space reduction and therefore in faster response, and also in improved accuracy.
Abstract: The automation of Directory Assistance Services (DAS) through speech is one of the most difficult and demanding applications of human-computer interaction because it deals with very large vocabulary recognition issues. In this paper, we present a spoken dialogue system for automating DAS. Taking into account the major difficulties of this endeavor, a stepwise approach was adopted. In particular, two prototypes D1.1 (basic approach) and D1.2 (improved version) were developed successively. The results of D1.1 evaluation were used to refine D1.1 and gradually led to D1.2 that was also improved using a feedback approach. Furthermore, the system was extended and optimized so that it can be utilized in real-world conditions. We describe the general architecture and the three stages of the system's development in detail. Evaluation results concerning both the speech recognizer's accuracy and the overall system's performance are provided for all prototypes. Finally, we focus on techniques that handle large vocabulary recognition issues. The use of Directed Acyclic Word Graphs (DAWGs) and context-dependent phonological rules resulted in search space reduction and therefore in faster response, and also in improved accuracy.
Journal Article•10.1023/A:1021043804581•
The Role of Duration Models and Symbolic Representation for Timing in Synthetic Speech

Caren Brinckmann, Jürgen Trouvain
01 Jan 2003-International Journal of Speech Technology
TL;DR: It is important to derive an appropriate phonological symbolic representation in order to improve timing in synthetic speech before fine-tuning the duration prediction.
Abstract: In order to determine priorities for the improvement of timing in synthetic speech this study looks at the role of segmental duration prediction and the role of phonological symbolic representation in the perceptual quality of a text-to-speech system. In perception experiments using German speech synthesis, two standard duration models (Klatt rules and CART) were tested. The input to these models consisted of a symbolic representation which was either derived from a database or a text-to-speech system. Results of the perception experiments show that different duration models can only be distinguished when the symbolic representation is appropriate. Considering the relative importance of the symbolic representation, post-lexical segmental rules were investigated with the outcome that listeners differ in their preferences regarding the degree of segmental reduction. As a conclusion, before fine-tuning the duration prediction, it is important to derive an appropriate phonological symbolic representation in order to improve timing in synthetic speech.
Journal Article•10.1023/A:1025704800086•
Optimal Utterance Selection for Unit Selection Speech Synthesis Databases

Alan W. Black1, Kevin A. Lenzo1•
Carnegie Mellon University1
01 Oct 2003-International Journal of Speech Technology
TL;DR: This paper describes techniques for finding an optimal data set for building high quality unit-selection speech synthesis inventories: some simple techniques as well as a more complex acoustic modeling technique based on the database speaker's acoustic characteristics.
Abstract: This paper describes techniques to find an optimal data set for building high quality unit-selection speech synthesis inventories. As the quality of unit-selection speech synthesis is dependent on the coverage of the database used in the selection, it is important to select the right data to record. In this paper we describe some simple techniques as well as a more complex acoustic modeling technique based on the database speaker's acoustic characteristics. Results of a simple evaluation procedure are presented justifying the technique.
Journal Article•10.1023/A:1021060308216•
Prosodic Phrasing: Machine and Human Evaluation

Céu Viana1, Luis Oliveira2, Ana Isabel Mata1•
University of Lisbon1, Instituto Superior Técnico2
01 Jan 2003-International Journal of Speech Technology
TL;DR: In this paper, the authors describe a set of experiments aiming at the construction and evaluation of a new phrasing module for European Portuguese text-to-speech synthesis, using classification and regression trees learned from hand-labeled texts.
Abstract: This paper describes a set of experiments aiming at the construction and evaluation of a new phrasing module for European Portuguese text-to-speech synthesis, using classification and regression trees learned from hand-labelled texts. Using the assessment criteria of matching boundary predictions against the corresponding labelled ones, the best solution achieves an overall performance of 91.9%, with 86.3% of correctly assigned breaks and 4.3% of false insertions. Although in absolute terms such scores may be considered surprisingly good given the size of the training set, the total number of exact matches at the sentence level is much lower (22%). This suggested a more formal experiment to test the acceptability of the predicted phrasing in the judgement of human evaluators. As the model was not trained on a labelled speech corpus but on hand-labelled texts, the reference phrasing needed also to be assessed. The evaluation experiment involved 90 participants who were asked to grade both the automatic and the reference phrasings, and also to express their opinion on where the breaks should be placed. As expected, the results showed a large variability among the subjects in their acceptance of a specific sentence partition, and criteria had to be defined to summarise the data from the different evaluators. With the adopted criteria, the performance of the automatic assignment procedure at the sentence level is better rated by human evaluators than by simple matching with the reference corpus (78% vs. 22%, respectively).
Journal Article•10.1023/A:1025717202812•
Prosody Evaluation as a Diagnostic Process: Subjective vs. Objective Measurements

Albert Rilliard1, Véronique Aubergé1•
Stendhal University1
01 Jan 2003-International Journal of Speech Technology
TL;DR: This study points out the ability of listeners to retrieve pertinent information on the basis of pure prosodic stimuli and underlines which acoustic cues are used by listeners to judge the adequacy of prosody in performing a given function such as demarcation.
Abstract: A set of perception experiments, using reiterant and lexicalised speech, was designed to perform a diagnostic of the relative implication of prosody in the segmentation and hierarchisation of speech. Both natural and synthetic intonation were evaluated. Then, two distance measures—correlation and root-mean-square distance on the acoustic parameters (F0, syllabic duration and intensity)—were applied to match the perception results. This objective vs. subjective comparison underlines which acoustic cues are used by listeners to judge the adequacy of prosody in performing a given function such as demarcation. Results can be summarized by a scale of the perceptual distance between two demarcation functions. This study also points out the ability of listeners to retrieve pertinent information on the basis of pure prosodic stimuli.
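
The two objective distance measures named in the abstract above, correlation and root-mean-square distance, can be sketched for a single acoustic parameter as follows. This is a minimal illustration, assuming two contours already aligned syllable by syllable; the function names and the F0 values are invented and do not come from the paper.

```python
import math


def rms_distance(a, b):
    """Root-mean-square distance between two equally long contours."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def correlation(a, b):
    """Pearson correlation between two equally long contours."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Invented per-syllable F0 values in Hz: the synthetic contour has the same
# shape as the natural one but is shifted upwards by 10 Hz.
natural_f0 = [100.0, 120.0, 110.0, 90.0]
synthetic_f0 = [110.0, 130.0, 120.0, 100.0]

shape_similarity = correlation(natural_f0, synthetic_f0)
level_difference = rms_distance(natural_f0, synthetic_f0)
```

In this toy case the correlation is 1.0 while the RMS distance is 10 Hz, which shows why the two measures are complementary: correlation captures agreement in contour shape, and RMS distance captures the absolute offset.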
Journal Article•10.1023/A:1023470304749•
SPEAKER (GOVOREC): A Complete Slovenian Text-to Speech System

Tomaz Sef1, Matjaz Gams1•
Jožef Stefan Institute1
01 Jul 2003-International Journal of Speech Technology
TL;DR: The SPEAKER system, capable of automatic conversion of any Slovenian text into speech, was developed at the “Jožef Stefan” Institute and was awarded with the first prize for innovation in the field of life improvements for people with disabilities.
Abstract: While text-to-speech (TTS) systems for major world languages are quite advanced, smaller languages, like our Slovenian language, lack quality TTS synthesis. At the “Jožef Stefan” Institute a system called SPEAKER (GOVOREC) has been developed. It is capable of automatic conversion of any Slovenian text into speech. The different phases of the synthesis task are performed by several sequentially operating independent modules: text analysis, prosody generation and segmental concatenation. The first module is comprised of text normalization and grapheme-to-phoneme conversion tasks. In order to generate rules for our synthesis scheme, data were collected by analysing the readings of ten speakers, five males and five females. A two-level approach has been used for duration modeling, and a so-called superpositional approach for pitch modeling. A speech waveform is synthesized using unit selection-based methods and a concatenative TD-PSOLA or HNM+ technique. The system was first implemented in the EMA employment agent, which provides information about available jobs in Slovenia and is now used by members of the Slovenian Foundation for the Blind and Vision-Impaired. Then, it was given free of charge to all people with disabilities. The system was awarded with the first prize for innovation in the field of life improvements for people with disabilities (given by the Government Office for the Disabled and Chronically Sick of the Republic of Slovenia). SPEAKER is freely accessible for non-commercial purposes through the Internet. Currently, several leading Slovenian telecommunication companies are testing the system for providing information (e-mail, short messaging service—SMS, weather reports, traffic information) through mobile phones.
Journal Article•10.1023/A:1022334530417•
The MSIIA Experiment: Using Speech to Enhance Human Performance on a Cognitive Task

[...]

Laurie Damianos1, Dan Loehr1, Carl Burke1, Steve Hansen1, Michael Viszmeg1 •
Mitre Corporation1
01 Apr 2003-International Journal of Speech Technology
TL;DR: The results show that the availability of speech may lead to improved performance of expert domain users on more complicated tasks, and quantitative results suggest that people could potentially identify images faster with speech.
Abstract: We performed an exploratory study to examine the effects of speech-enabled input on a cognitive task involving analysis and annotation of objects in aerial reconnaissance videos. We added speech to an information fusion system to allow for hands-free annotation in order to examine the effect on efficiency, quality, task success, and user satisfaction. We hypothesized that speech recognition could be a cognitive-enabling technology by reducing the mental load of instrument manipulation and freeing up resources for the task at hand.
Journal Article•10.1023/A:1021056124145•
A Metrical Model of Prosody for Multilingual TTS

[...]

Alex I. C. Monaghan
01 Jan 2003-International Journal of Speech Technology
TL;DR: The model of prosody used in the Aculab TTS system is entirely knowledge-based, with no stochastic components, and is specifically designed for multilingual use: it currently handles several Germanic and Romance languages.
Abstract: The model of prosody used in the Aculab TTS system is unusual in several respects. Firstly, it is based firmly on current metrical theories of prosody. Secondly, it is entirely knowledge-based: there are no stochastic components in the model. Thirdly, it makes use of a quasi-random element to avoid the predictability of conventional synthetic prosody. Fourthly, it is specifically designed for multilingual use: it currently handles several Germanic and Romance languages.
Journal Article•10.1023/A:1023422421588•
Labeling of Symbolic Prosody Breaks for the Slovenian Language

[...]

Janez Stergar1, Vladimir Hozjan1, Bogomir Horvat1•
University of Maribor1
01 Jul 2003-International Journal of Speech Technology
TL;DR: A new interactive tool for word level prosody labeling (major/minor breaks) is presented together with a new semi-automatic approach for determining prosody breaks, achieving a high consistency in labeling and reducing the time needed for hand labeling.
Abstract: This paper presents data-driven modelling for the prediction of word-level prosody breaks in the Slovenian language. Automatic learning techniques depend on the construction of a large, appropriately labeled corpus. This labeling can be done either automatically or by hand. While automatic labeling can be less accurate than hand labeling, the latter is very time consuming and, in some cases, inconsistent. Therefore, a new interactive tool for word-level prosody labeling (major/minor breaks) is presented together with a new semi-automatic approach for determining prosody breaks. This interactive tool combines the advantages of hand labeling and automatic labeling by achieving a high consistency in labeling and reducing the time needed for hand labeling. The labeled Slovenian corpus has been used to train our phrase break prediction module, implementing a neural network (NN) structure. Experiments for the data-driven prediction of major = minor and major/minor phrase breaks were performed. The prediction accuracy achieved marks state-of-the-art word-level prosody break prediction for the Slovenian language and is comparable to the prediction accuracy of other approaches in which more complex NN structures (Muller et al., 2000) or other prediction methods (Black and Taylor, 1997) were applied, and a much larger corpus was used for training. The overall prediction accuracy achieved is 94% for major = minor breaks and over 98/92% for major/minor phrase breaks, respectively.
Journal Article•10.1023/A:1025765101903•
Modeling of Vocal Styles Using Portable Features and Placement Rules

[...]

Chilin Shih1, Greg Kochanski1•
Alcatel-Lucent1
01 Oct 2003-International Journal of Speech Technology
TL;DR: This paper uses an articulator-based model, Stem-ML, to resolve conflicts between intended accents or embellishments and their environment, and presents several examples to illustrate the modeling of accents and phrase curves, as well as the usefulness of style/content separation, and the similarity between speech and music.
Abstract: This paper presents a mathematical description of style in speech and singing. These styles are represented as a set of portable prosodic features along with a set of rules to choose where the features are to be applied. Speakers and singers make creative choices to express their personal style, which may involve specific phrase curve, accent shape, or, similarly, musical embellishment. Therefore a quantitative model of style needs to support unconstrained accent and phrase curve description, and to solve potential conflicts that arise from this freedom. Our current implementation modifies two acoustic parameters: f0 and amplitude. We use an articulator-based model, Stem-ML, to resolve conflicts between intended accents or embellishments and their environment. We present several examples to illustrate the modeling of accents and phrase curves, as well as the usefulness of style/content separation, and the similarity between speech and music.
Journal Article•10.1023/A:1023414119770•
“LentInfo” Information—Providing System for the Festival Lent Programme

[...]

Andrej Žgank1, Matej Rojc1•
University of Maribor1
01 Jul 2003-International Journal of Speech Technology
TL;DR: This paper presents an application, “LentInfo”, which is a system used to provide information about programmes for the Festival Lent in Slovenia, based on a Hidden Markov Model speech recogniser, and the dialogue construction and management is done using the CSDP (Common Spoken Dialogue Platform) dialogue management system.
Abstract: This paper presents an application, “LentInfo”, which is a system used to provide information about programmes for the Festival Lent in Slovenia. The Festival Lent consists of different open-air theatre and music performances and draws more than 400,000 visitors per year. This application is based on a Hidden Markov Model (HMM) speech recogniser, and the dialogue construction and management is done using the CSDP (Common Spoken Dialogue Platform) dialogue management system. It is represented as a finite-state structure. The dialogue can be specified in a script using simple syntax description. The dialogue manager is multi-application oriented, so it can easily be upgraded for new applications. If some new concepts are needed, only new actions need be added to the existing ones. Currently, prompt messages are prerecorded, but it is also possible to include a speech synthesis system depending on the needs of the application. Error recovery during the dialogue is done with user confirmation of the recognised input speech. The results are presented for tests performed in the year 2001. The results are analyzed according to the phone type (fixed/mobile), signal to noise ratio, dialogue path, etc. Although some calls were carried out using mobile phones from noisy festival places, the performance of the system decreased only slightly under these conditions.
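The finite-state dialogue structure with user confirmation described in this abstract can be illustrated with a minimal sketch. It is not the CSDP script syntax or the LentInfo implementation: the state names, event names and the `run_dialogue` helper are all hypothetical and chosen only to show how a rejected confirmation loops back to the question.

```python
# Illustrative finite-state dialogue with a confirmation step.
# All state and event names are hypothetical, not taken from CSDP or LentInfo.
TRANSITIONS = {
    ("ask_query", "recognised"): "confirm",  # recogniser returned a hypothesis
    ("confirm", "yes"): "give_info",         # user confirmed the hypothesis
    ("confirm", "no"): "ask_query",          # error recovery: ask again
    ("give_info", "done"): "end",
}


def run_dialogue(events, start="ask_query"):
    """Follow the transition table and return the list of visited states."""
    state = start
    path = [state]
    for event in events:
        # An event with no transition from the current state leaves it unchanged.
        state = TRANSITIONS.get((state, event), state)
        path.append(state)
    return path


# One misrecognition followed by a successful confirmation.
path = run_dialogue(["recognised", "no", "recognised", "yes", "done"])
print(path)
```

Adding a new concept in this sketch means adding rows to the transition table, which loosely mirrors the abstract's point that only new actions need to be added for new applications.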
Journal Article•10.1023/A:1023418220679•
Efficient Development of Lexical Language Resources and their Representation

[...]

Matej Rojc1, Zdravko Kacic1•
University of Maribor1
01 Jul 2003-International Journal of Speech Technology
TL;DR: A system architecture for the rapid construction of morphologic and phonetic lexicons, two of the most important written language resources for the development of ASR (automatic speech recognition) and TTS (text-to-speech) systems, is presented.
Abstract: Statistical approaches in speech technology, whether used for statistical language models, trees, hidden Markov models or neural networks, represent the driving forces for the creation of language resources (LR), e.g., text corpora, pronunciation and morphology lexicons, and speech databases. This paper presents a system architecture for the rapid construction of morphologic and phonetic lexicons, two of the most important written language resources for the development of ASR (automatic speech recognition) and TTS (text-to-speech) systems. The presented architecture is modular and is particularly suitable for the development of written language resources for inflectional languages. In this paper an implementation is presented for the Slovenian language. The integrated graphic user interface focuses on the morphological and phonetic aspects of language and allows experts to produce good performances during analysis. In multilingual TTS systems, many extensive external written language resources are used, especially in the text processing part. It is very important, therefore, that representation of these resources is time and space efficient. It is also very important that language resources for new languages can be easily incorporated into the system, without modifying the common algorithms developed for multiple languages. In this regard the use of large external language resources (e.g., morphology and phonetic lexicons) represents an important problem because of the required space and slow look-up time. This paper presents a method and its results for compiling large lexicons, using examples for compiling German phonetic and morphology lexicons (CISLEX), and Slovenian phonetic (SIflex) and morphology (SImlex) lexicons, into corresponding finite-state transducers (FSTs).
The German lexicons consisted of about 300,000 words, SIflex consisted of about 60,000 and SImlex of about 600,000 words (where 40,000 words were used for representation using finite-state transducers). Representation of large lexicons using finite-state transducers is mainly motivated by considerations of space and time efficiency. A great reduction in size and optimal access time was achieved for all lexicons. The starting size for the German phonetic lexicon was 12.53 MB and 18.49 MB for the morphology lexicon. The starting size for the Slovenian phonetic lexicon was 1.8 MB and 1.4 MB for the morphology lexicon. The final size of the corresponding FSTs was 2.78 MB for the German phonetic lexicon, 6.33 MB for the German morphology lexicon, 253 KB for SIflex and 662 KB for the SImlex lexicon. The achieved look-up time is optimal, since it only depends on the length of the input word and not on the size of the lexicon. Integration of lexicons for new languages into the multilingual TTS system is easy when using such representations and does not require any changes in the algorithms used for such lexicons.
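The claim that look-up time depends only on the length of the input word can be illustrated with a small sketch. It uses a plain character trie as a simplified stand-in for a finite-state transducer, so it is not the authors' compilation method and does not reproduce the size reduction they report; the `Lexicon` class and the two entries are hypothetical placeholders.

```python
from typing import Dict, Optional


class TrieNode:
    """One state; outgoing edges are labeled with single characters."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.output: Optional[str] = None  # transcription stored at a final state


class Lexicon:
    """Character trie mapping words to transcriptions (simplified stand-in for an FST)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str, transcription: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.output = transcription

    def lookup(self, word: str) -> Optional[str]:
        # One edge is followed per input character, so the number of steps
        # is bounded by len(word), however many entries the lexicon holds.
        node = self.root
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                return None
            node = nxt
        return node.output


lexicon = Lexicon()
lexicon.add("haus", "h aU s")    # hypothetical entry, not taken from CISLEX
lexicon.add("hand", "h a n t")   # hypothetical entry, not taken from CISLEX
print(lexicon.lookup("haus"))
```

A real FST additionally shares common suffixes and attaches outputs to transitions, which is where most of the space saving comes from; the trie only shares prefixes.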
Journal Article•10.1023/A:1025708916924•
The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching

[...]

Marc Schröder, Jürgen Trouvain1•
Saarland University1
01 Oct 2003-International Journal of Speech Technology
TL;DR: The usefulness of the modular and transparent design approach is illustrated with an early prototype of an interface for emotional speech synthesis and examples of how this interface can be put to use in research, development and teaching.
Abstract: This paper introduces the German text-to-speech synthesis system MARY. The system's main features, namely a modular design and an XML-based system-internal data representation, are pointed out, and the properties of the individual modules are briefly presented. An interface allowing the user to access and modify intermediate processing steps without the need for a technical understanding of the system is described, along with examples of how this interface can be put to use in research, development and teaching. The usefulness of the modular and transparent design approach is further illustrated with an early prototype of an interface for emotional speech synthesis.
Journal Article•10.1023/A:1022342732234•
Speech Error Correction: The Story of the Alternates List

[...]

Kevin Larson1, David Mowatt1•
Microsoft1
31 Mar 2003-International Journal of Speech Technology
TL;DR: This work examined the use of four different error correction mechanisms and found the two error correction methods that users were most successful with were redictation and selection of a list of alternatives (“the alternates list”).
Abstract: Error correction with speech recognition products is extraordinarily difficult for users. Users spend much more time correcting errors than they spend dictating new text. In order to find ways to improve users' error correction experience, we examined the use of four different error correction mechanisms. The two error correction methods that users were most successful with were redictation and selection of a list of alternatives (“the alternates list”). Users rated the latter as the more satisfying method. User satisfaction with the alternates list was surprising as it was not a terribly accurate error correction method. On the Tablet PC we made several interface enhancements to facilitate the use of the alternates which included the use of (1) strong modes, (2) a push-to-talk model for microphone control, (3) a lighter weight alternates list which was easier to open and dismiss. Users performed transcription tasks with this new interface and we examined which error correction methods people preferred. Users of the new interface no longer compounded error upon error and were far more likely to use the alternates list than was the case for users of pre-existing interfaces. Users were very likely to switch modes from the alternates list to redictation when the alternates list did not contain the target word.
Journal Article•10.1023/A:1025700715107•
Audiovisual Speech Synthesis

[...]

Gérard Bailly1, Maxime Berar1, Frédéric Elisei1, Matthias Odisio1•
Centre national de la recherche scientifique1
01 Oct 2003-International Journal of Speech Technology
TL;DR: This paper presents the main approaches used to synthesize talking faces, provides greater detail on a handful of these approaches, and attempts to distinguish between facial synthesis itself (the manner in which facial movements are rendered on a computer screen) and the way these movements may be controlled and predicted using phonetic input.
Abstract: This paper presents the main approaches used to synthesize talking faces, and provides greater detail on a handful of these approaches. An attempt is made to distinguish between facial synthesis itself (i.e. the manner in which facial movements are rendered on a computer screen), and the way these movements may be controlled and predicted using phonetic input. The two main synthesis techniques (model-based vs. image-based) are contrasted and presented by a brief description of the most illustrative existing systems. The challenging issues—evaluation, data acquisition and modeling—that may drive future models are also discussed and illustrated by our current work at ICP.
Journal Article•10.1023/A:1022390615396•
Expanding the MOS: Development and Psychometric Evaluation of the MOS-R and MOS-X

[...]

Melanie D. Polkosky1, James R. Lewis1•
IBM1
01 Apr 2003-International Journal of Speech Technology
TL;DR: This paper documents the motivation, method, and results of six experiments conducted from 1999 to 2002 that investigated the psychometric properties of the MOS and expanded the range of speech characteristics it evaluates, resulting in the MOS-Revised (MOS-R).
Abstract: The Mean Opinion Scale (MOS) is a questionnaire used to obtain listeners' subjective assessments of synthetic speech. This paper documents the motivation, method, and results of six experiments conducted from 1999 to 2002 that investigated the psychometric properties of the MOS and expanded the range of speech characteristics it evaluates. Our initial experiments documented the reliability, validity, sensitivity, and factor structure of the P.L. Salza et al. (Acta Acustica, Vol. 82, pp. 650–656, 1996) MOS and used psychometric principles to revise and improve the scale. This work resulted in the MOS-Revised (MOS-R). Four subsequent experiments expanded the MOS-R beyond its previous focus on Intelligibility and Naturalness, to include measurement of the Prosody and Social Impression of synthetic voices. As a result of this work, we created the MOS-Expanded (MOS-X), a rating scale shown to be reliable, valid, and sensitive for high-quality evaluation of synthetic speech in applied industrial settings.
Journal Article•10.1023/A:1023410018862•
Efficient Noise Robust Feature Extraction Algorithms for Distributed Speech Recognition (DSR) Systems

[...]

Bojan Kotnik1, Damjan Vlaj1, Bogomir Horvat1•
University of Maribor1
01 Jul 2003-International Journal of Speech Technology
TL;DR: Two innovative front-end processing techniques for noise robust speech recognition are presented and compared and include different forms of frame-attenuation, improvement of spectral subtraction based on minimum statistics, as well as a mel-cepstrum feature extraction procedure.
Abstract: The evolution of robust speech recognition systems that maintain a high level of recognition accuracy in difficult and dynamically-varying acoustical environments is becoming increasingly important as speech recognition technology becomes a more integral part of mobile applications. In distributed speech recognition (DSR) architecture the recogniser's front-end is located in the terminal and is connected over a data network to a remote back-end recognition server. The terminal performs the feature parameter extraction, or the front-end of the speech recognition system. These features are transmitted over a data channel to the remote back-end recogniser. DSR provides particular benefits for the applications of mobile devices such as improved recognition performance compared to using the voice channel and ubiquitous access from different networks with a guaranteed level of recognition performance. A feature extraction algorithm integrated into the DSR system is required to operate in real-time as well as with the lowest possible computational costs.
