Evaluation of Text Generation: A Survey

Open Access • Posted Content

- 26 Jun 2020

371

TL;DR: This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.

Citations

•Proceedings Article•10.18653/v1/2022.humeval-1.7

Toward More Effective Human Evaluation for Machine Translation

01 Jan 2022

TL;DR: This paper investigates a simple way to reduce the cost of human evaluation: annotating fewer text segments and using a sampling approach to predict the score for the complete test set. It shows that information from document membership and automatic metrics improves the estimates over a pure random sampling baseline.

5

•Posted Content

Meta Adaptive Neural Ranking with Contrastive Synthetic Supervision.

Si Sun, +7 more

- 29 Dec 2020

- arXiv: Information Retrieval

TL;DR: Contrastive query generation (ContrastQG) synthesizes more informative queries as in-domain weak relevance labels, which meta adaptive learning to rank (MetaLTR) then filters to better generalize neural rankers to the target few-shot domain.

5

Journal Article•10.48550/arXiv.2205.11930

How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory

Kawin Ethayarajh, +1 more

- arXiv.org

TL;DR: This work identifies the implicit assumptions that the standard human evaluation protocol for NLG makes about annotators and suggests improvements to make the protocol more theoretically sound. Even in its improved form, however, the protocol cannot be used to evaluate open-ended tasks like story generation.

5

Journal Article•10.48550/arxiv.2401.17139

Large Language Model Evaluation via Matrix Entropy

Lai Wei, +4 more

- 30 Jan 2024

- arXiv.org

TL;DR: Matrix entropy is a novel metric for evaluating LLMs that quantifies data compression proficiency and alignment quality. It complements the traditional loss scaling law and provides insight into the intrinsic capabilities of LLMs.

5

Journal Article•10.7717/peerj-cs.1905

Diffusion models in text generation: a survey

Qiuhua Yi, +5 more

- 23 Feb 2024

- PeerJ

TL;DR: This survey reviews applications of diffusion models in text generation, compares them with autoregressive pre-trained models, and highlights their advantages, limitations, and future research directions, including faster sampling and multi-modal text generation.

5

References

Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

99K

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.

94.2K

Proceedings Article•10.3115/V1/D14-1162

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

- 01 Oct 2014

TL;DR: A new global log-bilinear regression model combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

41.6K

•Journal Article•10.1177/001316446002000104

A Coefficient of Agreement for Nominal Scales

Jacob Cohen

- 01 Apr 1960

- Educational and Psychological Measuremen...

TL;DR: This article presents a coefficient of agreement (kappa) for judges who independently categorize a sample of units on a nominal scale. The coefficient measures the degree of agreement beyond what chance would produce, that is, the extent to which the judgments are reproducible (reliable).

41.1K

•Proceedings Article•10.3115/1073083.1073135

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

- 06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

28.9K

Related Papers (5)

Bleu: a Method for Automatic Evaluation of Machine Translation

ROUGE: A Package for Automatic Evaluation of Summaries

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

RoBERTa: A Robustly Optimized BERT Pretraining Approach