(Meta-) Evaluation of Machine Translation

doi:10.3115/1626355.1626373

Open Access • Proceedings Article • 10.3115/1626355.1626373

Chris Callison-Burch, +4 more

- 23 Jun 2007

- pp 136-158

502

TL;DR: An extensive human evaluation was carried out not only to rank the different MT systems, but also to perform higher-level analysis of the evaluation process, revealing surprising facts about the most commonly used methodologies.

Citations

10.48550/arxiv.2006.04781

What's the Difference Between Professional Human and Machine Translation? A Blind Multi-language Study on Domain-specific MT.

Lukas Fischer, +1 more

TL;DR: A blind multi-language study compares professional human translation (HT) and machine translation (MT) errors, finding similar error rates in HT and MT, with MT requiring post-editing effort in only two out of three language pairs.

Proceedings Article•10.18653/V1/2021.NAACL-MAIN.279

Choose Your Own Adventure: Paired Suggestions in Collaborative Writing for Evaluating Story Generation Models.

Elizabeth Clark, +1 more

- 01 Jun 2021

TL;DR: This work presents Choose Your Own Adventure, a collaborative writing setup for pairwise model evaluation, where two models generate suggestions to people as they write a short story; writers are asked to choose one of the two suggestions, and they observe which model’s suggestions they prefer.

•Posted Content

A Review of Human Evaluation for Style Transfer

Eleftheria Briakou, +4 more

- 09 Jun 2021

- arXiv: Computation and Language

TL;DR: The authors reviewed and summarized human evaluation practices described in 97 style transfer papers with respect to three main evaluation aspects: style transfer, meaning preservation, and fluency, and found that protocols for human evaluations are often underspecified and not standardized, which hampers the reproducibility of research in this field and progress toward better human and automatic evaluation methods.

Journal Article•10.48550/arxiv.2409.09598

Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy

Brian Thompson, +3 more

- 15 Sep 2024

- arXiv.org

TL;DR: Researchers propose Soft Pairwise Accuracy (SPA), a meta-metric that incorporates statistical significance when comparing human and automatic metric judgments, improving stability and statistical significance over existing methods. SPA was selected as the official metric for the 2024 WMT metric shared task.

•Proceedings Article•10.3115/V1/E14-1047

Is Machine Translation Getting Better over Time?

Yvette Graham, +3 more

- 01 Apr 2014

TL;DR: A large-scale crowd-sourcing experiment is carried out to estimate the degree to which state-of-the-art performance in machine translation has increased over the past five years, with Czech-to-English translation standing out as the language pair achieving the most substantial gains.

References

Journal Article•10.2307/2529310

The measurement of observer agreement for categorical data

J. R. Landis, +1 more

- 01 Mar 1977

- Biometrics

TL;DR: A general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies is presented. Tests for interobserver bias are presented in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics.

76.1K

•Proceedings Article•10.3115/1073083.1073135

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

- 06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

28.9K

•Proceedings Article

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Satanjeev Banerjee, +1 more

- 01 Jun 2005

TL;DR: METEOR is described, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations and can be easily extended to include more advanced matching strategies.

5.9K

•Journal Article•10.1162/089120103321337421

A systematic comparison of various statistical alignment models

Franz Josef Och, +1 more

- 01 Mar 2003

- Computational Linguistics

TL;DR: An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.

4.6K

•Proceedings Article•10.3115/1073445.1073462

Statistical phrase-based translation

Philipp Koehn, +2 more

- 27 May 2003

TL;DR: The empirical results suggest that the highest levels of performance can be obtained through relatively simple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translation.

4.1K