TL;DR: Meteor Universal brings language specific evaluation to previously unsupported target languages by automatically extracting linguistic resources from the bitext used to train MT systems and using a universal parameter set learned from pooling human judgments of translation quality from several language directions.
Abstract: This paper describes Meteor Universal, released for the 2014 ACL Workshop on Statistical Machine Translation. Meteor Universal brings language specific evaluation to previously unsupported target languages by (1) automatically extracting linguistic resources (paraphrase tables and function word lists) from the bitext used to train MT systems and (2) using a universal parameter set learned from pooling human judgments of translation quality from several language directions. Meteor Universal is shown to significantly outperform baseline BLEU on two new languages, Russian (WMT13) and Hindi (WMT14).
TL;DR: The results of the WMT14 shared tasks, which included a standard news translation task, a separate medical translationtask, a task for run-time estimation of machine translation quality, and a metrics task, are presented.
Abstract: This paper presents the results of the WMT14 shared tasks, which included a standard news translation task, a separate medical translation task, a task for run-time estimation of machine translation quality, and a metrics task. This year, 143 machine translation systems from 23 institutions were submitted to the ten translation directions in the standard translation task. An additional 6 anonymized systems were included, and were then evaluated both automatically and manually. The quality estimation task had four subtasks, with a total of 10 teams, submitting 57 entries
TL;DR: Three of them are first proposed in this paper and they correlate better with human judgments on the sentence-level than other smoothing techniques, and the performance of using them in statistical machine transla- tion tuning is compared.
Abstract: BLEU is the de facto standard machine translation (MT) evaluation metric. How- ever, because BLEU computes a geo- metric mean of n-gram precisions, it of- ten correlates poorly with human judg- ment on the sentence-level. There- fore, several smoothing techniques have been proposed. This paper systemati- cally compares 7 smoothing techniques for sentence-level BLEU. Three of them are first proposed in this paper, and they correlate better with human judgments on the sentence-level than other smoothing techniques. Moreover, we also compare the performance of using the 7 smoothing techniques in statistical machine transla- tion tuning.
TL;DR: The joint submission of Fondazione Bruno Kessler, Universitat Politde Val` encia and University of Edinburgh to the Quality Estimation tasks of the Workshop on Statistical Machine Translation 2014 is described.
Abstract: This paper describes the joint submission of Fondazione Bruno Kessler, Universitat Politde Val` encia and University of Edinburgh to the Quality Estimation tasks of the Workshop on Statistical Machine Translation 2014. We present our submis- sions for Task 1.2, 1.3 and 2. Our systems ranked first for Task 1.2 and for the Binary and Level1 settings in Task 2.
TL;DR: A new version of Phrasal, an open-source toolkit for statistical phrasebased machine translation, is presented, which includes features that support emerging research trends such as tuning with large feature sets, and web-based interactive machine translation.
Abstract: We present a new version of Phrasal, an open-source toolkit for statistical phrasebased machine translation. This revision includes features that support emerging research trends such as (a) tuning with large feature sets, (b) tuning on large datasets like thebitext, and(c)web-basedinteractivemachine translation. A direct comparison with Moses shows favorable results in terms of decoding speed and tuning time.
TL;DR: UEDIN’s phrase-based submissions to the translation and medical translation shared tasks of the 2014 Workshop on Statistical Machine Translation (WMT) are described.
Abstract: This paper describes the University of Edinburgh’s (UEDIN) phrase-based submissions to the translation and medical translation shared tasks of the 2014 Workshop on Statistical Machine Translation (WMT). We participated in all language pairs. We have improved upon our 2013 system by i) using generalized representations, specifically automatic word clusters for translations out of English, ii) using unsupervised character-based models to translate unknown words in RussianEnglish and Hindi-English pairs, iii) synthesizing Hindi data from closely-related Urdu data, and iv) building huge language on the common crawl corpus.
TL;DR: Novel automatic metrics for machine translation evaluation that use discourse structure and convolution kernels to compare the discourse tree of an automatic translation with that of the human reference are presented.
Abstract: We present novel automatic metrics for machine translation evaluation that use discourse structure and convolution kernels to compare the discourse tree of an automatic translation with that of the human reference. We experiment with five transformations and augmentations of a base discourse tree representation based on the rhetorical structure theory, and we combine the kernel scores for each of them into a single score. Finally, we add other metrics from the ASIYA MT evaluation toolkit, and we tune the weights of the combination on actual human judgments. Experiments on the WMT12 and WMT13 metrics shared task datasets show correlation with human judgments that outperforms what the best systems that participated in these years achieved, both at the segment and at the system level.
TL;DR: This paper describes the string-to-tree systems built at the University of Edinburgh for the WMT 2014 shared translation task and improved the English-German system through target-side compound splitting, morphosyntactic constraints, and refinements to parse tree annotation.
Abstract: This paper describes the string-to-tree systems built at the University of Edinburgh for the WMT 2014 shared translation task. We developed systems for English-German, Czech-English, FrenchEnglish, German-English, Hindi-English, and Russian-English. This year we improved our English-German system through target-side compound splitting, morphosyntactic constraints, and refinements to parse tree annotation; we addressed the out-of-vocabulary problem using transliteration for Hindi and Russian and using morphological reduction for Russian; we improved our GermanEnglish system through tree binarization; and we reduced system development time by filtering the tuning sets.
TL;DR: It is shown that the use of number, case and Tree Adjoining Grammar information as factors helps to improve English-Hindi translation, primarily by generating morphological inflections correctly.
Abstract: In this paper, we describe our EnglishHindi and Hindi-English statistical systems submitted to the WMT14 shared task. The core components of our translation systems are phrase based (Hindi-English) and factored (English-Hindi) SMT systems. We show that the use of number, case and Tree Adjoining Grammar information as factors helps to improve English-Hindi translation, primarily by generating morphological inflections correctly. We show improvements to the translation systems using pre-procesing and post-processing components. To overcome the structural divergence between English and Hindi, we preorder the source side sentence to conform to the target language word order. Since parallel corpus is limited, many words are not translated. We translate out-of-vocabulary words and transliterate named entities in a post-processing stage. We also investigate ranking of translations from multiple systems to select the best translation.
TL;DR: The Word-level QE system for WMT 2014 shared task on Spanish-English pair is described, optimized by several ways: tuning the classification threshold, combining with WMT 2013 data, and refining using Feature Selection strategy on the development set, before dealing with the test set for submission.
Abstract: This paper describes our Word-level QE system for WMT 2014 shared task on Spanish-English pair. Compared to WMT 2013, this year's task is different due to the lack of SMT setting information and additional resources. We report how we overcome this challenge to retain most of the important features which performed well last year in our system. Novel features related to the availability of multiple systems output (new point of this year) are also proposed and experimented along with baseline set. The system is optimized by several ways: tuning the classification threshold, combining with WMT 2013 data, and refining using Feature Selection strategy on our development set, before dealing with the test set for submission.
TL;DR: Three research institutes involved in the EU-BRIDGE project combined their individual machine translation systems and participated with a joint setup in the shared translation task of the evaluation campaign at the ACL 2014 Eighth Workshop on Statistical Machine Translation.
Abstract: This paper describes one of the collaborative efforts within EU-BRIDGE to further advance the state of the art in machine translation between two European language pairs, German→English and English→German. Three research institutes involved in the EU-BRIDGE project combined their individual machine translation systems and participated with a joint setup in the shared translation task of the evaluation campaign at the ACL 2014 Eighth Workshop on Statistical Machine Translation (WMT 2014). We combined up to nine different machine translation engines via system combination. RWTH Aachen University, the University of Edinburgh, and Karlsruhe Institute of Technology developed several individual systems which serve as system combination input. We devoted special attention to building syntax-based systems and combining them with the phrasebased ones. The joint setups yield empirical gains of up to 1.6 points in BLEU and 1.0 points in TER on the WMT newstest2013 test set compared to the best single systems.
TL;DR: The quality of the crowdsourced corpus is significantly better than existing automatic methods: it obtains an performance comparable to expert annotations when used in MERT tuning of a microblog MT system; and training a parallel sentence classifier with it leads also to improved results.
Abstract: High-quality parallel data is crucial for a range of multilingual applications, from tuning and evaluating machine translation systems to cross-lingual annotation projection. Unfortunately, automatically obtained parallel data (which is available in relative abundance) tends to be quite noisy. To obtain high-quality parallel data, we introduce a crowdsourcing paradigm in which workers with only basic bilingual proficiency identify translations from an automatically extracted corpus of parallel microblog messages. For less than $350, we obtained over 5000 parallel segments in five language pairs. Evaluated against expert annotations, the quality of the crowdsourced corpus is significantly better than existing automatic methods: it obtains an performance comparable to expert annotations when used in MERT tuning of a microblog MT system; and training a parallel sentence classifier with it leads also to improved results. The crowdsourced corpora will be made available in http://www.cs.cmu.edu/ ~lingwang/microtopia/.
TL;DR: The two UPC submissions to the WMT14 Metrics Shared Task take advantage of novel metrics that consider linguistic structures, lexical relationships, and semantics to compare both source and reference translation against the candidate translation.
Abstract: This paper describes the UPC submissions to the WMT14 Metrics Shared Task: UPCIPA and UPC-STOUT. These metrics use a collection of evaluation measures integrated in ASIYA, a toolkit for machine translation evaluation. In addition to some standard metrics, the two submissions take advantage of novel metrics that consider linguistic structures, lexical relationships, and semantics to compare both source and reference translation against the candidate translation. The new metrics are available for several target languages other than English. In the the official WMT14 evaluation, UPC-IPA and UPC-STOUT scored above the average in 7 out of 9 language pairs at the system level and 8 out of 9 at the segment level.
TL;DR: The Manawi system showed the potential of improving translation quality by incorporating multiple NLP tools within the MT pipeline by introducing a novel filter method based on sentence-alignment features.
Abstract: We describe the Manawi 1 (mAnEv) system submitted to the 2014 WMT translation shared task. We participated in the English-Hindi (EN-HI) and Hindi-English (HI-EN) language pair and achieved 0.792 for the Translation Error Rate (TER) score 2 for EN-HI, the lowest among the competing systems. Our main innovations are (i) the usage of outputs from NLP tools, viz. billingual multi-word expression extractor and named-entity recognizer to improve SMT quality and (ii) the introduction of a novel filter method based on sentence-alignment features. The Manawi system showed the potential of improving translation quality by incorporating multiple NLP tools within the MT pipeline.
TL;DR: Ref referential translation machines remove the need to access any SMT system specific information or prior knowledge of the training data or models used when generating the translations and achieve the top performance in WMT13 quality estimation task (QET13).
Abstract: We use referential translation machines (RTM) for quality estimation of translation outputs. RTMs are a computational model for identifying the translation acts between any two data sets with respect to interpretants selected in the same domain, which are effective when making monolingual and bilingual similarity judgments. RTMs achieve top performance in automatic, accurate, and language independent prediction of sentence-level and word-level statistical machine translation (SMT) quality. RTMs remove the need to access any SMT system specific information or prior knowledge of the training data or models used when generating the translations and achieve the top performance in WMT13 quality estimation task (QET13). We improve our RTM models with the Parallel FDA5 instance selection model, with additional features for predicting the translation performance, and with improved learning models. We develop RTM models for each WMT14 QET (QET14) subtask, obtain improvements over QET13 results, and rank 1st in all of the tasks and subtasks of QET14.
TL;DR: Extended features are introduced, which are more specific than dense features yet more general than lexicalized sparse features, and yield robust BLEU gains for both Arabic-English and Chinese-English relative to a strong feature-rich baseline.
Abstract: Scalable discriminative training methods are now broadly available for estimating phrase-based, feature-rich translation models. However, the sparse feature sets typically appearing in research evaluations are less attractive than standard dense features such as language and translation model probabilities: they often overfit, do not generalize, or require complex and slow feature extractors. This paper introduces extended features, which are more specific than dense features yet more general than lexicalized sparse features. Large-scale experiments show that extended features yield robust BLEU gains for both Arabic-English (+1.05) and Chinese-English (+0.67) relative to a strong feature-rich baseline. We also specialize the feature set to specific datadomains, identifyanobjectivefunction that is less prone to overfitting, and release fast, scalable, and language-independent tools for implementing the features.
TL;DR: It is shown that combining information from both local and global test contexts helps to improve lexical selection and outperforms a baseline system by up to 1.15 BLEU.
Abstract: Despite its potential to improve lexical selection, most state-of-the-art machine translation systems take only minimal contextual information into account. We capture context with a topic model over distributional profiles built from the context words of each translation unit. Topic distributions are inferred for each translation unit and used to adapt the translation model dynamically to a given test context by measuring their similarity. We show that combining information from both local and global test contexts helps to improve lexical selection and outperforms a baseline system by up to 1.15 BLEU. We test our topic-adapted model on a diverse data set containing documents from three different domains and achieve competitive performance in comparison with two supervised domain-adapted systems.
TL;DR: Inventions include: a label coarsening scheme for syntactic tree-to-tree translation, a host of new discriminative features, several modules to create “synthetic translation options” that can generalize beyond what is directly observed in the training data, and a method of combining the output of multiple word aligners to uncover extra phrase pairs and grammar rules.
Abstract: We describe the CMU systems submitted to the 2014 WMT shared translation task. We participated in two language pairs, German–English and Hindi–English. Our innovations include: a label coarsening scheme for syntactic tree-to-tree translation, a host of new discriminative features, several modules to create “synthetic translation options” that can generalize beyond what is directly observed in the training data, and a method of combining the output of multiple word aligners to uncover extra phrase pairs and grammar rules.
TL;DR: LIMSI participation to the WMT’14 Shared Task on Quality Estimation; the system relies on a random forest classifier, an ensemble method that has been shown to be very competitive for this kind of task, when only a few dense and continuous features are used.
Abstract: This paper describes LIMSI participation to the WMT’14 Shared Task on Quality Estimation; we took part to the wordlevel quality estimation task for English to Spanish translations. Our system relies on a random forest classifier, an ensemble method that has been shown to be very competitive for this kind of task, when only a few dense and continuous features are used. Notably, only 16 features are used in our experiments. These features describe, on the one hand, the quality of the association between the source sentence and each target word and, on the other hand, the fluency of the hypothesis. Since the evaluation criterion is the f1 measure, a specific tuning strategy is proposed to select the optimal values for the hyper-parameters. Overall, our system achieves a 0.67 f1 score on a randomly extracted test set.
TL;DR: These submissions use the framework of Multi-task Gaussian Processes, where they combine multiple datasets in a multi-task setting to speed up training and prediction by providing sensible sparse approximations.
Abstract: We describe our systems for the WMT14 Shared Task on Quality Estimation (subtasks 1.1, 1.2 and 1.3). Our submissions use the framework of Multi-task Gaussian Processes, where we combine multiple datasets in a multi-task setting. Due to the large size of our datasets we also experiment with Sparse Gaussian Processes, which aim to speed up training and prediction by providing sensible sparse approximations.
TL;DR: This paper describes the statistical machine translation (SMT) systems developed at RWTH Aachen University for the German!English translation task of the ACL 2014 Eighth Workshop on Statistical Machine Translation (WMT 2014).
Abstract: This paper describes the statistical machine translation (SMT) systems developed at RWTH Aachen University for the German!English translation task of the ACL 2014 Eighth Workshop on Statistical Machine Translation (WMT 2014). Both hierarchical and phrase-based SMT systems are applied employing hierarchical phrase reordering and word class language models. For the phrase-based system, we run discriminative phrase training. In addition, we describe our preprocessing pipeline for German!English.
TL;DR: An effective technique to easily augment GHKM-style syntax-based machine translation systems (Galley et al., 2006) with phrase pairs that do not comply with any syntactic well-formedness constraints is presented.
Abstract: We present an effective technique to easily augment GHKM-style syntax-based machine translation systems (Galley et al., 2006) with phrase pairs that do not comply with any syntactic well-formedness constraints. Non-syntactic phrase pairs are distinguished from syntactic ones in order to avoid harming effects. We apply our technique in state-of-the-art string-totree and tree-to-string setups. For tree-tostring translation, we furthermore investigate novel approaches for translating with source-syntax GHKM rules in association with input tree constraints and input tree features.
TL;DR: VERTa, a linguistically-motivated metric that combines linguistic features at different levels, is presented and the linguistic motivation on which the metric is based is provided, as well as the different modules in VERTa and how they are combined.
Abstract: In this paper we present VERTa, a linguistically-motivated metric that combines linguistic features at different levels. We provide the linguistic motivation on which the metric is based, as well as describe the different modules in VERTa and how they are combined. Finally, we describe the two versions of VERTa, VERTa-EQ and VERTa-W, sent to WMT14 and report results obtained in the experiments conducted with the WMT12 and WMT13 data into English.
TL;DR: The AFRL statistical MT system and the improvements that were developed during the WMT14 evaluation campaign are described and the efforts to make use of monolingual English speakers to correct the output of machine translation are described.
Abstract: This paper describes the AFRL statistical MT system and the improvements that were developed during the WMT14 evaluation campaign. As part of these efforts we experimented with a number of extensions to the standard phrase-based model that improve performance on Russian to English and Hindi to English translation tasks. In addition, we describe our efforts to make use of monolingual English speakers to correct the output of machine translation, and present the results of monolingual postediting of the entire 3003 sentences of the WMT14 Russian-English test set.
TL;DR: The English!Czech and English!Hindi submissions for this year’s WMT translation task are presented and reverse self-training to acquire more parallel data and with modeling target-side morphology are experimented with.
Abstract: We present our English!Czech and English!Hindi submissions for this year’s WMT translation task. For English!Czech, we build upon last year’s CHIMERA and evaluate several setups. English!Hindi is a new language pair for this year. We experimented with reverse self-training to acquire more (synthetic) parallel data and with modeling target-side morphology.
TL;DR: A number of simple and effective techniques to adapt statistical machine translation systems in the medical domain and these systems achieve the best BLEU scores for Czech-English, EnglishGerman, French-English language pairs and the second best Blemish scores for reminding pairs are explored.
Abstract: This paper explores a number of simple and effective techniques to adapt statistical machine translation (SMT) systems in the medical domain. Comparative experiments are conducted on large corpora for six language pairs. We not only compare each adapted system with the baseline, but also combine them to further improve the domain-specific systems. Finally, we attend the WMT2014 medical summary sentence translation constrained task and our systems achieve the best BLEU scores for Czech-English, EnglishGerman, French-English language pairs and the second best BLEU scores for reminding pairs.
TL;DR: LIMSI’s submission to the first medical translation task at WMT’14 is described and results for EnglishFrench on the subtask of sentence translation from summaries of medical articles are reported.
Abstract: This paper describes LIMSI’s submission to the first medical translation task at WMT’14. We report results for EnglishFrench on the subtask of sentence translation from summaries of medical articles. Our main submission uses a combination of NCODE (n-gram-based) and MOSES (phrase-based) output and continuous-space language models used in a post-processing step for each system. Other characteristics of our submission include: the use of sampling for building MOSES’ phrase table; the implementation of the vector space model proposed by Chen et al. (2013); adaptation of the POStagger used by NCODE to the medical domain; and a report of error analysis based on the typology of Vilar et al. (2006).
TL;DR: This paper describes the DCU submission to WMT 2014 on German-English translation task, which uses phrasebased translation model with several popular techniques, including Lexicalized Reordering Model, Operation Sequence Model and Language Model interpolation.
Abstract: This paper describes the DCU submission to WMT 2014 on German-English translation task. Our system uses phrasebased translation model with several popular techniques, including Lexicalized Reordering Model, Operation Sequence Model and Language Model interpolation. Our final submission is the result of system combination on several systems which have different pre-processing and alignments.
TL;DR: This paper describes the Yandex School of Data Analysis Russian-English system and proposes a {simple yet practical} algorithm to transform Russian sentence into a more easily translatable form before decoding.
Abstract: This paper describes the Yandex School of Data Analysis Russian-English system submitted to the ACL 2014 Ninth Workshop on Statistical Machine Translation shared translation task. We start with the system that we developed last year and investigate a few methods that were successful at the previous translation task including unpruned language model, operation sequence model and the new reparameterization of IBM Model 2. Next we propose a {simple yet practical} algorithm to transform Russian sentence into a more easily translatable form before decoding. The algorithm is based on the linguistic intuition of native Russian speakers, also fluent in English.