(Meta-) Evaluation of Machine Translation
Chris Callison-Burch, Cameron Shaw Fordyce, Philipp Koehn, Christof Monz, Josh Schroeder +4 more
- 23 Jun 2007
- pp 136-158
TL;DR: An extensive human evaluation was carried out not only to rank the different MT systems, but also to perform higher-level analysis of the evaluation process, revealing surprising facts about the most commonly used methodologies.
Abstract: This paper evaluates the translation quality of machine translation systems for 8 language pairs: translating French, German, Spanish, and Czech to English and back. We carried out an extensive human evaluation which allowed us not only to rank the different MT systems, but also to perform higher-level analysis of the evaluation process. We measured timing and intra- and inter-annotator agreement for three types of subjective evaluation. We measured the correlation of automatic evaluation metrics with human judgments. This meta-evaluation reveals surprising facts about the most commonly used methodologies.
Citations
What's the Difference Between Professional Human and Machine Translation? A Blind Multi-language Study on Domain-specific MT.
Lukas Fischer, Samuel Läubli +1 more
TL;DR: A blind multi-language study compares professional human translation (HT) and machine translation (MT) errors, finding similar error rates in HT and MT, with MT requiring post-editing effort in only two out of three language pairs.
Choose Your Own Adventure: Paired Suggestions in Collaborative Writing for Evaluating Story Generation Models.
Elizabeth Clark, Noah A. Smith +1 more
- 01 Jun 2021
TL;DR: This work presents Choose Your Own Adventure, a collaborative writing setup for pairwise model evaluation, in which two models generate suggestions for people as they write a short story; writers are asked to choose one of the two suggestions, and the authors observe which model’s suggestions the writers prefer.
•Posted Content
A Review of Human Evaluation for Style Transfer
TL;DR: The authors reviewed and summarized human evaluation practices described in 97 style transfer papers with respect to three main evaluation aspects: style transfer, meaning preservation, and fluency, and found that protocols for human evaluations are often underspecified and not standardized, which hampers the reproducibility of research in this field and progress toward better human and automatic evaluation methods.
Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy
Brian Thompson, Nitika Mathur, Daniel Deutsch, Huda Khayrallah +3 more
TL;DR: Researchers propose Soft Pairwise Accuracy (SPA), a meta-metric that incorporates statistical significance when comparing human and automatic metric judgments, improving stability and statistical significance over existing methods; SPA was selected as the official metric for the 2024 WMT metric shared task.
Is Machine Translation Getting Better over Time?
Yvette Graham, Timothy Baldwin, Alistair Moffat, Justin Zobel +3 more
- 01 Apr 2014
TL;DR: A large-scale crowd-sourcing experiment is carried out to estimate the degree to which state-of-the-art performance in machine translation has increased over the past five years, with Czech-to-English translation standing out as the language pair achieving the most substantial gains.
References
The measurement of observer agreement for categorical data
J. R. Landis, Gary G. Koch +1 more
TL;DR: A general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies is presented. Tests for interobserver bias are presented in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics.
Bleu: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu +3 more
- 06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
•Proceedings Article
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Satanjeev Banerjee, Alon Lavie +1 more
- 01 Jun 2005
TL;DR: METEOR is described, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations and can be easily extended to include more advanced matching strategies.
A systematic comparison of various statistical alignment models
Franz Josef Och, Hermann Ney +1 more
TL;DR: An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.
Statistical phrase-based translation
Philipp Koehn, Franz Josef Och, Daniel Marcu +2 more
- 27 May 2003
TL;DR: The empirical results suggest that the highest levels of performance can be obtained through relatively simple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translation.