(Meta-) Evaluation of Machine Translation

doi:10.3115/1626355.1626373

Open Access•Proceedings Article•10.3115/1626355.1626373

Chris Callison-Burch, +4 more

- 23 Jun 2007

- pp 136-158

502

TL;DR: An extensive human evaluation was carried out not only to rank the different MT systems, but also to perform higher-level analysis of the evaluation process, revealing surprising facts about the most commonly used methodologies.

Citations

Proceedings Article•10.18653/V1/W15-3056

Improving evaluation and optimization of MT systems against MEANT

Chi-kiu Lo, +2 more

- 01 Sep 2015

TL;DR: It is shown that, consistent with MEANT-tuned systems that translate into Chinese, MEANT-tuned MT systems that translate into English also outperform BLEU-tuned systems across commonly used MT evaluation metrics, even in BLEU.

Journal Article•10.1162/tacl_a_00561

Less is More: Mitigate Spurious Correlations for Open-Domain Dialogue Response Generation Models by Causal Discovery

Tao Feng, +2 more

- 02 Mar 2023

- Transactions of the Association for Comp...

TL;DR: In this article, a model-agnostic method for training and inference using a conditional independence classifier is proposed; the classifier is trained by a constrained self-training method, coined ConSTrain, to overcome data sparsity.

Book Chapter•10.1007/978-3-319-30298-0_69

Automatic Metrics for Machine Translation Evaluation and Minority Languages

Daša Munková, +1 more

- 01 Jan 2016

TL;DR: The results of the reliability analysis showed that these automatic metrics for MT evaluation are reliable and valid, whereby the validity and reliability were verified for one translation direction: from the minority language into English.

Journal Article•10.1017/S1351324919000469

How to evaluate machine translation: A review of automated and human metrics

Eirini Chatzikoumi

- 01 Mar 2020

- Natural Language Engineering

TL;DR: The most up-to-date and influential automated, semiautomated and human metrics used to evaluate the quality of machine translation (MT) output are presented, providing the necessary background for MT evaluation projects.

Proceedings Article

The Authenticity Gap in Human Evaluation

Kawin Ethayarajh, +1 more

- 24 May 2022

TL;DR: This paper proposed a new human evaluation protocol called system-level probabilistic assessment (SPA), which can recover the ordering of GPT-3 models by size, with statistically significant results.

References

Journal Article•10.2307/2529310

The measurement of observer agreement for categorical data

J. R. Landis, +1 more

- 01 Mar 1977

- Biometrics

TL;DR: A general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies is presented; tests for interobserver bias are presented in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics.

76.1K

Proceedings Article•10.3115/1073083.1073135

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

- 06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

28.9K

Proceedings Article

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Satanjeev Banerjee, +1 more

- 01 Jun 2005

TL;DR: METEOR is described, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations and can be easily extended to include more advanced matching strategies.

5.9K

Journal Article•10.1162/089120103321337421

A systematic comparison of various statistical alignment models

Franz Josef Och, +1 more

- 01 Mar 2003

- Computational Linguistics

TL;DR: An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.

4.6K

Proceedings Article•10.3115/1073445.1073462

Statistical phrase-based translation

Philipp Koehn, +2 more

- 27 May 2003

TL;DR: The empirical results suggest that the highest levels of performance can be obtained through relatively simple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translation.

4.1K