Evaluation of Text Generation: A Survey

Open Access•Posted Content

- 26 Jun 2020

371

TL;DR: This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.

Citations

Proceedings Article

DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering

Pei Ke, +6 more

- 13 Jul 2023

TL;DR: DecompEval decomposes a devised instruction-style question about the quality of generated text into sub-questions that measure the quality of each sentence. The sub-questions, together with the answers generated by pre-trained language models, are then used as evidence to obtain the evaluation result.

5

Journal Article•10.18653/v1/2022.emnlp-demos.35

FALTE: A Toolkit for Fine-grained Annotation for Long Text Evaluation

Tanya Goyal, +2 more

- 01 Jan 2022

TL;DR: FALTE is a web-based annotation toolkit designed to streamline the evaluation of long text generation models. It allows researchers to collect fine-grained judgments of text quality from crowdworkers using an error taxonomy specific to the downstream task.

5

Posted Content

Hurdles to Progress in Long-form Question Answering

Kalpesh Krishna, +2 more

- 10 Mar 2021

- arXiv: Computation and Language

TL;DR: The authors used sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 Long Form Question Answering (LFQA) dataset.

5

Journal Article•10.1109/bigdata59044.2023.10386778

Autocompletion of Chief Complaints in the Electronic Health Records using Large Language Models

K. M. S. Islam, +3 more

- 15 Dec 2023

TL;DR: This study demonstrates that BioGPT can serve as the basis for an effective autocompletion tool for generating chief complaint (CC) documentation in healthcare settings, and shows that BioGPT-Large outperforms the other models.

5

Proceedings Article

SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes

Wenda Xu, +4 more

- 19 Dec 2022

TL;DR: The authors proposed SEScore2, a self-supervised approach for training a model-based metric for text generation evaluation, which synthesizes realistic model mistakes by perturbing sentences retrieved from a corpus.

5

References

Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through constant error carousels within special units.

99K

Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposed a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.

94.2K

Proceedings Article•10.3115/V1/D14-1162

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

- 01 Oct 2014

TL;DR: A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

41.6K

Journal Article•10.1177/001316446002000104

A Coefficient of Agreement for Nominal Scales

Jacob Cohen

- 01 Apr 1960

- Educational and Psychological Measuremen...

TL;DR: In this article, the author presents a procedure for having two or more judges independently categorize a sample of units and for determining the degree, significance, and sampling stability of their agreement, that is, the extent to which these judgments are reproducible (reliable).

41.1K

Proceedings Article•10.3115/1073083.1073135

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

- 06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

28.9K

Related Papers (5)

Bleu: a Method for Automatic Evaluation of Machine Translation

ROUGE: A Package for Automatic Evaluation of Summaries

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

RoBERTa: A Robustly Optimized BERT Pretraining Approach