TL;DR: The proposed use of character n-gram F-score for automatic evaluation of machine translation output shows very promising results, especially for the CHRF3 score – for translation from English, this variant showed the highest segment-level correlations outperforming even the best metrics on the WMT14 shared evaluation task.
Abstract: We propose the use of character n-gram F-score for automatic evaluation of machine translation output. Character n-grams have already been used as a part of more complex metrics, but their individual potential has not been investigated yet. We report system-level correlations with human rankings for 6-gram F1-score (CHRF) on the WMT12, WMT13 and WMT14 data as well as segment-level correlation for 6-gram F1 (CHRF) and F3-scores (CHRF3) on WMT14 data for all available target languages. The results are very promising, especially for the CHRF3 score: for translation from English, this variant showed the highest segment-level correlations, outperforming even the best metrics on the WMT14 shared evaluation task.
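The character n-gram F-score described above can be illustrated with a minimal sketch. It assumes the common formulation in which n-gram precision and recall are averaged over n = 1..6 and combined as F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The function names, the choice to drop spaces, and the choice to skip n-gram orders longer than either string are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams; spaces are dropped (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=1.0):
    """Sketch of a character n-gram F-score; beta=3 weights recall more heavily."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # assumption: skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

With `beta=3.0` the score penalizes missing reference material more than extra hypothesis material, so a hypothesis that is too short scores lower than it does with `beta=1.0`.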
TL;DR: The WMT15 shared tasks included a standard news translation task, a metrics task, a tuning task, a task for run-time estimation of machine translation quality, and an automatic post-editing task.
Abstract: This paper presents the results of the WMT15 shared tasks, which included a standard news translation task, a metrics task, a tuning task, a task for run-time estimation of machine translation quality, and an automatic post-editing task. This year, 68 machine translation systems from 24 institutions were submitted to the ten translation directions in the standard translation task. An additional 7 anonymized systems were included, and were then evaluated both automatically and manually. The quality estimation task had three subtasks, with a total of 10 teams, submitting 34 entries. The pilot automatic post-editing task had a total of 4 teams, submitting 7 entries.
TL;DR: This paper presents the results of the WMT15 Metrics Shared Task, in which participants scored the outputs of the MT systems involved in the WMT15 Shared Translation Task; the scores were evaluated by system-level and segment-level correlation with human judgments.
Abstract: This paper presents the results of the WMT15 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT15 Shared Translation Task. We collected scores of 46 metrics from 11 research groups. In addition to that, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system level correlation (how well each metric’s scores correlate with WMT15 official manual ranking of systems) and in terms of segment level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence).
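The segment-level criterion above (how often a metric agrees with humans when comparing two translations of the same sentence) can be sketched as a Kendall's tau-like statistic over pairwise judgments. The function name, the input layout, and the treatment of metric ties as disagreements are assumptions of this sketch; the task's official tie handling may differ.

```python
def pairwise_agreement(human_pairs, metric_scores):
    """Kendall's tau-like agreement between human pairwise preferences and a metric.

    human_pairs: list of (better_id, worse_id) tuples from human judgments.
    metric_scores: dict mapping translation id to the metric's score.
    """
    concordant = 0
    discordant = 0
    for better, worse in human_pairs:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        else:
            # Assumption: a metric tie counts as a disagreement.
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```

A value of 1.0 means the metric ordered every judged pair the same way as the human annotators, and -1.0 means it reversed every pair.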
TL;DR: This paper describes the “FBK English-Spanish Automatic Post-editing (APE)” systems submitted to the APE shared task at the WMT 2015 and introduces some novel task-specific dense features through which improvements over the default setup of these approaches are observed.
Abstract: In this paper, we describe the “FBK English-Spanish Automatic Post-editing (APE)” systems submitted to the APE shared task at the WMT 2015. We explore the most widely used statistical APE technique (monolingual) and its most significant variant (context-aware). In this exploration, we introduce some novel task-specific dense features through which we observe improvements over the default setup of these approaches. We show these features are useful to prune the phrase table in order to remove unreliable rules and help the decoder to select useful translation options during decoding. Our primary APE system submitted at this shared task performs significantly better than the standard APE baseline.
TL;DR: The main changes introduced this year are: extending the learning-to-rank trained sentence level metric to the corpus level, incorporating syntactic ingredients based on dependency trees, and a technique for finding parameters of BEER that avoid “gaming of the metric” during tuning.
Abstract: We describe the submissions of ILLC UvA to the metrics and tuning tasks on WMT15. Both submissions are based on the BEER evaluation metric originally presented on WMT14 (Stanojevic and Sima’an, 2014a). The main changes introduced this year are: (i) extending the learning-to-rank trained sentence level metric to the corpus level (but still decomposable to sentence level), (ii) incorporating syntactic ingredients based on dependency trees, and (iii) a technique for finding parameters of BEER that avoid “gaming of the metric” during tuning.
TL;DR: The authors' systems outperform the baseline as well as many other submissions for Tasks 1 and 2 of the WMT15 Shared Task on Quality Estimation and the best performing system (SHEF-W2V) only uses features learned in an unsupervised fashion.
Abstract: We describe our systems for Tasks 1 and 2 of the WMT15 Shared Task on Quality Estimation. Our submissions use (i) a continuous space language model to extract additional features for Task 1 (SHEF-GP, SHEF-SVM), (ii) a continuous bag-of-words model to produce word embeddings as features for Task 2 (SHEF-W2V) and (iii) a combination of features produced by QuEst++ and a feature produced with word embedding models (SHEF-QuEst++). Our systems outperform the baseline as well as many other submissions. The results are especially encouraging for Task 2, where our best performing system (SHEF-W2V) only uses features learned in an unsupervised fashion.
TL;DR: This paper set up phrase-based statistical machine translation systems for all ten language pairs of this year’s evaluation campaign, which are English paired with Czech, Finnish, French, German, and Russian in both translation directions.
Abstract: This paper describes the submission of the University of Edinburgh and the Johns Hopkins University for the shared translation task of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015). We set up phrase-based statistical machine translation systems for all ten language pairs of this year’s evaluation campaign, which are English paired with Czech, Finnish, French, German, and Russian in both translation directions. Novel research directions we investigated include: neural network language models and bilingual neural network language models, a comprehensive use of word classes, and sparse lexicalized reordering features.
TL;DR: This paper takes a closer look at the MT evaluation process from a glass-box perspective using eye-tracking and suggests that to have consistent and cost effective MT evaluations, it is better to use monolinguals with only target language information.
Abstract: In this paper, we take a closer look at the MT evaluation process from a glass-box perspective using eye-tracking. We analyze two aspects of the evaluation task: the background of evaluators (monolingual or bilingual) and the sources of information available, and we evaluate them using time and consistency as criteria. Our findings show that monolinguals are slower but more consistent than bilinguals, especially when only target language information is available. When exposed to various sources of information, evaluators in general take more time and in the case of monolinguals, there is a drop in consistency. Our findings suggest that to have consistent and cost effective MT evaluations, it is better to use monolinguals with only target language information.
TL;DR: A metric based on dense vector spaces and Long Short Term Memory (LSTM) networks, which are types of Recurrent Neural Networks (RNNs), was submitted to the WMT-15 metrics task and is the best performing metric overall according to Spearman and Pearson (Pre-TrueSkill) and second best according to Pearson (TrueSkill) system level correlation.
Abstract: This paper presents our metric (UoWLSTM) submitted in the WMT-15 metrics task. Many state-of-the-art Machine Translation (MT) evaluation metrics are complex, involve extensive external resources (e.g. for paraphrasing) and require tuning to achieve the best results. We use a metric based on dense vector spaces and Long Short Term Memory (LSTM) networks, which are types of Recurrent Neural Networks (RNNs). For WMT15 our new metric is the best performing metric overall according to Spearman and Pearson (Pre-TrueSkill) and second best according to Pearson (TrueSkill) system level correlation.
TL;DR: The USAAR-SAPE English-Spanish Automatic Post-Editing (APE) system submitted to the APE Task organized in the Workshop on Statistical Machine Translation (WMT) in 2015 was able to improve upon the baseline MT system output by incorporating Phrase-Based Statistical MT (PBSMT) technique into the monolingual Statistical APE task (SAPE).
Abstract: We describe the USAAR-SAPE English-Spanish Automatic Post-Editing (APE) system submitted to the APE Task organized in the Workshop on Statistical Machine Translation (WMT) in 2015. Our system was able to improve upon the baseline MT system output by incorporating Phrase-Based Statistical MT (PBSMT) technique into the monolingual Statistical APE task (SAPE). The reported final submission crucially involves hybrid word alignment. The SAPE system takes raw Spanish Machine Translation (MT) output provided by the shared task organizers and produces post-edited Spanish text. The parallel data consist of English Text, raw machine translated Spanish output, and their corresponding manually post-edited versions. The major goal of the task is to reduce the post-editing effort by improving the quality of the MT output in terms of fluency and adequacy.
TL;DR: The results on WMT data sets show that fuzzy n-gram matching improves correlations to human evaluation especially for highly compounding languages.
Abstract: This paper describes the LeBLEU evaluation score for machine translation, submitted to the WMT15 Metrics Shared Task. LeBLEU extends the popular BLEU score to consider fuzzy matches between word n-grams. While there are several variants of BLEU that allow non-exact matches between words, either by character-based distance measures or by morphological preprocessing, none of them use fuzzy comparison between longer chunks of text. The results on WMT data sets show that fuzzy n-gram matching improves correlations to human evaluation, especially for highly compounding languages.
TL;DR: The approach presented here is learning a Bayesian Ridge Regressor using document skip-gram embeddings in order to automatically evaluate Machine Translation (MT) output by predicting semantic adequacy scores.
Abstract: This paper describes USAAR’s submission to the metrics shared task of the Workshop on Statistical Machine Translation (WMT) in 2015. The goal of our submission is to take advantage of the semantic overlap between hypothesis and reference translation for predicting MT output adequacy using language independent document embeddings. The approach presented here is learning a Bayesian Ridge Regressor using document skip-gram embeddings in order to automatically evaluate Machine Translation (MT) output by predicting semantic adequacy scores. The evaluation of our submission, measured by the correlation with human judgements, shows promising results on system-level scores.
TL;DR: The Universitat d’Alacant submissions for the machine translation quality estimation (MTQE) shared task in WMT 2015 are described; the team participated in the word-level MTQE sub-task.
Abstract: This paper describes the Universitat d’Alacant submissions (labelled as UAlacant) for the machine translation quality estimation (MTQE) shared task in WMT 2015, where we participated in the word-level MTQE sub-task. The method we used to produce our submissions uses external sources of bilingual information as a black box to spot sub-segment correspondences between a source segment S and the translation hypothesis T produced by a machine translation system. This is done by segmenting both S and T into overlapping sub-segments of variable length and translating them in both translation directions, using the available sources of bilingual information on the fly. For our submissions, two sources of bilingual information were used: machine translation (Apertium and Google Translate) and the bilingual concordancer Reverso Context. After obtaining the sub-segment correspondences, a collection of features is extracted from them, which are then used by a binary classifier to obtain the final “GOOD” or “BAD” word-level quality labels. We prepared two submissions for this year’s edition of WMT 2015: one using the features produced by our system, and one combining them with the baseline features published by the organisers of the task, which were ranked third and first for the sub-task, respectively.
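The segmentation step described above, splitting a segment into overlapping sub-segments of variable length, can be sketched as follows. The function name, the word-level tokenization, and the `max_len` cap are assumptions for illustration; the abstract does not state the lengths actually used.

```python
def subsegments(tokens, max_len):
    """Return (start_index, words) for every contiguous sub-segment of 1..max_len words."""
    result = []
    for length in range(1, max_len + 1):
        for start in range(len(tokens) - length + 1):
            result.append((start, tokens[start:start + length]))
    return result
```

Keeping the start index alongside each sub-segment makes it possible to map a matched sub-segment back to the individual words it covers, which is what a word-level labelling step would need.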
TL;DR: ParFDA, a parallel implementation of feature decay algorithms (FDA) developed for fast deployment, obtains results close to the top, with an average difference of 3.176 BLEU points, while using significantly fewer resources for building SMT systems.
Abstract: We build parallel FDA5 (ParFDA) Moses statistical machine translation (SMT) systems for all language pairs in the workshop on statistical machine translation (Bojar et al., 2015) (WMT15) translation task and obtain results close to the top with an average of 3.176 BLEU points difference using significantly less resources for building SMT systems. ParFDA is a parallel implementation of feature decay algorithms (FDA) developed for fast deploy
TL;DR: This paper describes the UdS-Sant English‐German Hybrid Machine Translation system submitted to the Translation Task organized in the Workshop on Statistical Machine Translation (WMT) 2015 and brings improvements over the baseline system by incorporating additional knowledge such as extracted bilingual named entities and bilingual phrase pairs induced from example-based methods.
Abstract: This paper describes the UdS-Sant English‐German Hybrid Machine Translation (MT) system submitted to the Translation Task organized in the Workshop on Statistical Machine Translation (WMT) 2015. Our proposed hybrid system brings improvements over the baseline system by incorporating additional knowledge such as extracted bilingual named entities and bilingual phrase pairs induced from example-based methods. The reported final submission is the result of a hybrid system obtained from confusion network based system combination that combines the best performance of each individual system in a multi-engine pipeline.
TL;DR: This paper describes the submission of the UGENT-LT3 SCATE system to the WMT15 Shared Task on Quality Estimation (QE), viz. English-Spanish word and sentence-level QE.
Abstract: This paper describes the submission of the UGENT-LT3 SCATE system to the WMT15 Shared Task on Quality Estimation (QE), viz. English-Spanish word and sentence-level QE. We conceived QE as a supervised Machine Learning (ML) problem and designed additional features and combined these with the baseline feature set to estimate quality. The sentence-level QE system re-uses the word-level predictions of the word-level QE system. We experimented with different learning methods and observed improvements over the baseline system for word-level QE with the use of the new features and by combining learning methods into ensembles. For sentence-level QE we show that using a single feature based on word-level predictions can perform better than the baseline system and that using this in combination with additional features led to further improvements in performance.
TL;DR: This paper presents the machine translation systems submitted by the Abu-MaTran project for the Finnish‐English language pair at the WMT 2015 translation task, which are the top performing English-to-Finnish unconstrained (all automatic metrics) and constrained (BLEU), and Finnish-to-English constrained (TER) systems.
Abstract: This paper presents the machine translation systems submitted by the Abu-MaTran project for the Finnish‐English language pair at the WMT 2015 translation task. We tackle the lack of resources and complex morphology of the Finnish language by (i) crawling parallel and monolingual data from the Web and (ii) applying rule-based and unsupervised methods for morphological segmentation. Several statistical machine translation approaches are evaluated and then combined to obtain our final submissions, which are the top performing English-to-Finnish unconstrained (all automatic metrics) and constrained (BLEU), and Finnish-to-English constrained (TER) systems.
TL;DR: This work presents a method that improves data selection by combining a hybrid word/part-of-speech representation for corpora, with the idea of distinguishing between rare and frequent events.
Abstract: We present a method that improves data selection by combining a hybrid word/part-of-speech representation for corpora, with the idea of distinguishing between rare and frequent events. We validate our approach using data selection for machine translation, and show that it maintains or improves BLEU and TER translation scores while substantially improving vocabulary coverage and reducing data selection model size. Paradoxically, the coverage improvement is achieved by abstracting away over 97% of the total training corpus vocabulary using simple part-of-speech tags during the data selection process.
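One way to picture a hybrid word/part-of-speech representation is to keep frequent words as they are and replace rare words with their POS tags. The sketch below assumes exactly that reading; the function name, the `min_count` threshold, and the pre-tagged input format are hypothetical and not taken from the paper.

```python
from collections import Counter


def hybridize(tagged_sentences, min_count):
    """Replace words seen fewer than min_count times with their POS tag.

    tagged_sentences: list of sentences, each a list of (word, pos_tag) pairs.
    """
    counts = Counter(word for sent in tagged_sentences for word, _ in sent)
    return [
        [word if counts[word] >= min_count else tag for word, tag in sent]
        for sent in tagged_sentences
    ]
```

Under this reading, most of the vocabulary (the long tail of rare word types) collapses into a small tag set, which is consistent with the reported reduction in data selection model size.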
TL;DR: It is shown that referential translation machines pioneer a language independent approach to all similarity tasks and remove the need to access any task or domain specific information or resource.
Abstract: We use referential translation machines (RTMs) for predicting translation performance. RTMs pioneer a language independent approach to all similarity tasks and remove the need to access any task or domain specific information or resource. We improve our RTM models with the
TL;DR: It is shown, by carefully analyzing the failure of the two systems submitted by LIMSI to the WMT’15 Shared Task on Automatic Post-Editing, that this counterperformance mainly results from the inconsistency in the annotations.
Abstract: This paper describes the two systems submitted by LIMSI to the WMT’15 Shared Task on Automatic Post-Editing. The first one relies on a reformulation of the APE task as a Machine Translation task; the second implements a simple rule-based approach. Neither of these two systems manages to improve the automatic translation. We show, by carefully analyzing the failure of our systems, that this counterperformance mainly results from the inconsistency in the annotations.
TL;DR: This paper describes the statistical machine translation system developed at RWTH Aachen University for the German→English translation task of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015).
Abstract: This paper describes the statistical machine translation system developed at RWTH Aachen University for the German→English translation task of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015). A phrase-based machine translation system was applied and augmented with hierarchical phrase reordering and word class language models. Further, we ran discriminative maximum expected BLEU training for our system. In addition, we utilized multiple feed-forward neural network language and translation models and a recurrent neural network language model for reranking.
TL;DR: LIMSI’s submissions to the shared WMT’15 translation task are described, including a tailored normalization of Russian to translate into English, and a two-step process to translate first into simplified Russian, followed by a conversion into inflected Russian.
Abstract: This paper describes LIMSI’s submissions to the shared WMT’15 translation task. We report results for French-English, Russian-English in both directions, as well as for Finnish-into-English. Our submissions use NCODE and MOSES along with continuous space translation models in a post-processing step. The main novelties of this year’s participation are the following: for Russian-English, we investigate a tailored normalization of Russian to translate into English, and a two-step process to translate first into simplified Russian, followed by a conversion into inflected Russian. For French-English, the challenge is domain adaptation, for which only monolingual corpora are available. Finally, for the Finnish-to-English task, we explore unsupervised morphological segmentation to reduce the sparsity of data induced by the rich morphology on the Finnish side.
TL;DR: This paper proposes to increase the size of the training corpus by using the post-edited and reference corpora during the training step and performs a linear regression of the feature space against scores in the range [0..1].
Abstract: We describe our system for the WMT2015 Shared Task on Quality Estimation, Task 1, sentence-level prediction of post-editing effort. We use baseline features, Latent Semantic Indexing based features and features based on pseudo-references. An SVM algorithm is used to estimate the linear regression between the feature vectors and the HTER score. We use a selection algorithm to set aside unneeded features. Our best system achieves a Mean Absolute Error of 13.34 on the official test set, while the official baseline system achieves 14.82.
TL;DR: It is shown that context-sensitive synonym selection increases the correlation of the Meteor metric with human judgments of translation quality on the WMT14 data.
Abstract: This paper describes Meteor-WSD and RATATOUILLE, the LIMSI submissions to the WMT15 metrics shared task. Meteor-WSD extends synonym mapping to languages other than English based on alignments and gives credit to semantically adequate translations in context. We show that context-sensitive synonym selection increases the correlation of the Meteor metric with human judgments of translation quality on the WMT14 data. RATATOUILLE combines Meteor-WSD with nine other metrics for evaluation and outperforms the best metric (BEER) involved in its computation.
TL;DR: This work presents a new technique to train the log-linear model based on the ListNet algorithm that scales to many features, considers the whole list and not single entries during learning and can also be applied to more complex models than a log-linear combination.
Abstract: The log-linear combination of different features is an important component of SMT systems. It allows for the easy integration of models into the system and is used during decoding as well as for n-best list rescoring. With the recent success of more complex models like neural network-based translation models, n-best list rescoring is again attracting more attention. In this work, we present a new technique to train the log-linear model based on the ListNet algorithm. This technique scales to many features, considers the whole list and not single entries during learning and can also be applied to more complex models than a log-linear combination. Using the new learning approach, we improve the translation quality of a large-scale system by 0.8 BLEU points during rescoring and generate translations which are up to 0.3 BLEU points better than other learning techniques such as MERT or MIRA.
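To make the list-wise idea concrete, the sketch below computes the top-one ListNet loss for a single n-best list: both the target scores and the log-linear model scores are turned into distributions with a softmax, and the loss is their cross-entropy. This is the generic ListNet formulation, not necessarily the exact setup of the paper; the function names and the use of a per-hypothesis quality score as the target are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(weights, feature_lists, target_scores):
    """Top-one ListNet loss for one n-best list under a log-linear model.

    feature_lists: one feature vector per hypothesis in the n-best list.
    target_scores: one quality score per hypothesis (assumed, e.g. a sentence-level metric).
    """
    model_scores = [
        sum(w * f for w, f in zip(weights, feats)) for feats in feature_lists
    ]
    p_target = softmax(target_scores)
    p_model = softmax(model_scores)
    # Cross-entropy between the target and model top-one distributions.
    return -sum(t * math.log(m) for t, m in zip(p_target, p_model))
```

Because the loss is computed over the whole list at once, every hypothesis contributes to the gradient, in contrast to approaches that only compare single entries or pairs.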
TL;DR: Baseline systems for Finnish-English and English-Finnish machine translation using standard phrase-based and factored models including morphological features are described, and the effectiveness of morphological pre-processing of Finnish is demonstrated.
Abstract: This paper describes baseline systems for Finnish-English and English-Finnish machine translation using standard phrase-based and factored models including morphological features. We experiment with compound splitting and morphological segmentation and study the effect of adding noisy out-of-domain data to the parallel and the monolingual training data. Our results stress the importance of training data and demonstrate the effectiveness of morphological pre-processing of Finnish.
TL;DR: This paper presents the results of the WMT15 Tuning Shared Task, which provided the participants of this task with a complete machine translation system and asked them to tune its internal parameters (feature weights).
Abstract: This paper presents the results of the WMT15 Tuning Shared Task. We provided the participants of this task with a complete machine translation system and asked them to tune its internal parameters (feature weights). The tuned systems were used to translate the test set and the outputs were manually ranked for translation quality. We received 4 submissions in the English-Czech and 6 in the Czech-English translation direction. In addition, we ran 3 baseline setups, tuning the parameters with standard optimizers for BLEU score.
TL;DR: DFKI participated in the shared translation task of WMT 2015 with the German-English language pair in each translation direction using an experimental hybrid system based on three systems: a statistical Moses system, a commercial rule-based system, and a serial coupling of the two.
Abstract: DFKI participated in the shared translation task of WMT 2015 with the German-English language pair in each translation direction. The submissions were generated using an experimental hybrid system based on three systems: a statistical Moses system, a commercial rule-based system, and a serial coupling of the two where the output of the rule-based system is further translated by Moses trained on parallel text consisting of the rule-based output and the original target language. The outputs of the three systems are combined using two methods: (a) an empirical selection mechanism based on grammatical features (primary submission) and (b) IBM1 models based on POS 4-grams (contrastive submission).
TL;DR: In this paper, the authors enhance the traditional confusion network system combination approach with an additional model trained by a neural network, which gives system combination the option to prefer other systems at different word positions even for the same sentence.
Abstract: In this paper, we enhance the traditional confusion network system combination approach with an additional model trained by a neural network. This work is motivated by the fact that the commonly used binary system voting models only assign each input system a global weight which is responsible for the global impact of each input system on all translations. This prevents individual systems with low system weights from having influence on the system combination output, although in some situations this could be helpful. Further, words which have only been seen by one or few systems rarely have a chance of being present in the combined output. We train a local system voting model by a neural network which is based on the words themselves and the combinatorial occurrences of the different system outputs. This gives system combination the option to prefer other systems at different word positions even for the same sentence.