TL;DR: An open-source toolkit for statistical machine translation whose novel contributions are support for linguistically motivated factors, confusion network decoding, and efficient data formats for translation models and language models.
Abstract: We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.
TL;DR: The empirical results suggest that the highest levels of performance can be obtained through relatively simple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translation.
Abstract: We propose a new phrase-based translation model and decoding algorithm that enables us to evaluate and compare several previously proposed phrase-based translation models. Within our framework, we carry out a large number of experiments to better understand and explain why phrase-based models outperform word-based models. Our empirical results, which hold for all examined language pairs, suggest that the highest levels of performance can be obtained through relatively simple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translations. Surprisingly, learning phrases longer than three words and learning phrases from high-accuracy word-level alignment models do not have a strong impact on performance. Learning only syntactically motivated phrases degrades the performance of our systems.
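The abstract above mentions heuristic learning of phrase translations from word-based alignments without giving details. As a rough illustration only, the following Python sketch uses one common formulation of that idea: a phrase pair is kept when no word inside it is aligned to a word outside it. The function name `extract_phrases`, the `max_len` parameter, and the toy sentence pair are assumptions made for this example, not taken from the paper. The sketch also omits the usual extension over unaligned words and the lexical weighting step.

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """Collect (source phrase, target phrase) pairs consistent with a word alignment.

    src, tgt   -- lists of tokens
    alignment  -- set of (source index, target index) links
    max_len    -- maximum phrase length on either side (hypothetical default)
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any source word in the span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue  # this simplified sketch skips fully unaligned spans
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency check: no target word in [j1, j2] may link
            # to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy example (invented data): a monotone one-to-one alignment.
src = ["das", "Haus", "ist", "klein"]
tgt = ["the", "house", "is", "small"]
alignment = {(0, 0), (1, 1), (2, 2), (3, 3)}
for pair in sorted(extract_phrases(src, tgt, alignment, max_len=2)):
    print(pair)
```

With `max_len=2` the toy example yields the four single-word pairs plus the three adjacent two-word pairs. Raising `max_len` admits longer phrases, which the abstract reports has little effect beyond three words.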
TL;DR: A corpus of parallel text in 11 languages is collected from the proceedings of the European Parliament; the paper focuses on its acquisition and its application as training data for statistical machine translation (SMT).
Abstract: We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.
TL;DR: This paper proposes a novel method for phrase-based statistical machine translation based on the use of a pivot language; measured by BLEU, the approach significantly outperforms the standard model trained on a small bilingual corpus.
Abstract: This paper proposes a novel method for phrase-based statistical machine translation based on the use of a pivot language. To translate between languages L_s and L_t with limited bilingual resources, we bring in a third language, L_p, called the pivot language. For the language pairs L_s-L_p and L_p-L_t, there exist large bilingual corpora. Using only L_s-L_p and L_p-L_t bilingual corpora, we can build a translation model for L_s-L_t. The advantage of this method lies in the fact that we can perform translation between L_s and L_t even if there is no bilingual corpus available for this language pair. Using BLEU as a metric, our pivot language approach significantly outperforms the standard model trained on a small bilingual corpus. Moreover, with a small L_s-L_t bilingual corpus available, our method can further improve translation quality by using the additional L_s-L_p and L_p-L_t bilingual corpora.
TL;DR: The phrase translation strategy significantly outperformed the sentence translation strategy and its relative performance was 0.92 to 0.97 compared to directly trained SMT systems.
Abstract: We compare two pivot strategies for phrase-based statistical machine translation (SMT), namely phrase translation and sentence translation. The phrase translation strategy means that we directly construct a phrase translation table (phrase-table) of the source and target language pair from two phrase-tables; one constructed from the source language and English and one constructed from English and the target language. We then use that phrase-table in a phrase-based SMT system. The sentence translation strategy means that we first translate a source language sentence into n English sentences and then translate these n sentences into target language sentences separately. Then, we select the highest scoring sentence from these target sentences. We conducted controlled experiments using the Europarl corpus to evaluate the performance of these pivot strategies as compared to directly trained SMT systems. The phrase translation strategy significantly outperformed the sentence translation strategy. Its relative performance was 0.92 to 0.97 compared to directly trained SMT systems.
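The phrase translation strategy described above builds a source-target phrase-table from a source-pivot table and a pivot-target table. A minimal Python sketch of one way to do this follows, assuming each table is a nested dictionary of conditional probabilities and that a source-target score is obtained by summing the product of the two probabilities over shared pivot phrases. That scoring rule, the function name `pivot_phrase_table`, and the toy probabilities are assumptions for illustration; the abstract does not specify how the scores are combined.

```python
from collections import defaultdict


def pivot_phrase_table(src_to_pivot, pivot_to_tgt):
    """Combine p(pivot | src) and p(tgt | pivot) into p(tgt | src).

    Both inputs map a phrase to a dict of {translation: probability}.
    Scores are summed over every pivot phrase that links a source
    phrase to a target phrase (an assumed combination rule).
    """
    table = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_to_pivot.items():
        for p, prob_sp in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for t, prob_pt in pivot_to_tgt.get(p, {}).items():
                table[s][t] += prob_sp * prob_pt
    return {s: dict(ts) for s, ts in table.items()}


# Toy tables (invented numbers): French -> English -> German.
fr_en = {"maison": {"house": 0.75, "home": 0.25}}
en_de = {"house": {"Haus": 1.0}, "home": {"Haus": 0.5, "Heim": 0.5}}
print(pivot_phrase_table(fr_en, en_de))
# → {'maison': {'Haus': 0.875, 'Heim': 0.125}}
```

Source phrases whose pivot translations never appear in the second table are dropped from the result, which is one reason a pivoted table can be smaller than a directly trained one.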