Proceedings Article, DOI: 10.1109/SLT.2008.4777890
Efficient data selection for machine translation
Arindam Mandal, Dimitra Vergyri, Wen Wang, Jing Zheng, Andreas Stolcke, Gokhan Tur, Dilek Hakkani-Tur, Necip Fazil Ayan
- 01 Dec 2008
- pp 261-264
TL;DR: This paper introduces two methods for efficient selection of training data to be translated by humans and shows that one-fifth of the additional training data can achieve similar or better translation performance, compared to that of using all available data.
Abstract: The performance of statistical machine translation (SMT) systems relies on the availability of a large parallel corpus, which is used to estimate translation probabilities. However, generating such a corpus is a long and expensive process. In this paper, we introduce two methods for efficient selection of training data to be translated by humans. Our methods are motivated by active learning and aim to choose new data that adds maximal information to the currently available data pool. The first method uses a measure of disagreement between multiple SMT systems, whereas the second uses a perplexity criterion. We performed experiments on Chinese-English data in multiple domains and test sets. Our results show that we can select only one-fifth of the additional training data and achieve similar or better translation performance, compared to that of using all available data.
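The perplexity criterion mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the language model or the exact ranking rule, so the add-one-smoothed unigram model, the whitespace tokenization, the choice to rank candidates by highest perplexity, and the names `train_unigram`, `perplexity`, and `select_by_perplexity` are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count tokens in the current data pool (whitespace tokenization is an assumption)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def perplexity(sentence, counts, total):
    """Per-token perplexity under an add-one-smoothed unigram model."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def select_by_perplexity(pool, candidates, k):
    """Return the k candidates that the pool-trained model finds least predictable."""
    counts, total = train_unigram(pool)
    ranked = sorted(
        candidates,
        key=lambda s: perplexity(s, counts, total),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data: the pool stands in for the already-translated corpus.
pool = ["the cat sat on the mat", "the dog sat on the rug"]
candidates = ["the cat sat on the rug", "stock markets fell sharply today"]
selected = select_by_perplexity(pool, candidates, k=1)
```

In this toy run the second candidate is ranked first because none of its tokens occur in the pool, which is the intuition behind choosing data that adds the most information to what is already available.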
Citations
Discriminative Corpus Weight Estimation for Machine Translation
Spyros Matsoukas, Antti-Veikko Rosti, Bing Zhang
- 06 Aug 2009
TL;DR: Describes a novel approach for automatically detecting and down-weighting certain parts of the training corpus by assigning a weight to each sentence in the training bitext so as to optimize a discriminative objective function on a designated tuning set.
Instance Selection for Machine Translation using Feature Decay Algorithms
Ergun Bicici, Deniz Yuret
- 30 Jul 2011
TL;DR: It is shown that the feature decay rate has a very strong effect on the final translation quality whereas the initial feature values, inclusion of higher order features, or sentence length normalizations do not.
Optimizing instance selection for statistical machine translation with feature decay algorithms
Ergun Bicici, Deniz Yuret
TL;DR: FDA5 reduces the time to build a statistical machine translation system to about half with 1M words, using only 3% of the space for the phrase table and 8% of the overall space compared with a baseline system that uses all of the available training data, while differing from that baseline by only 0.58 BLEU points in out-of-domain translation.
Survey of data-selection methods in statistical machine translation
TL;DR: A comparative overview of research in statistical machine translation is provided based on application scenario, feature functions and search method.
The Regression Model of Machine Translation
Mehmet Ergun Biçici
- 01 Jan 2012
TL;DR: The results demonstrate that sparse regression models are better than L2 regularized regression for statistical machine translation in predicting target features, estimating word alignments, creating phrase tables, and generating translation outputs.