Open Access · Posted Content
Evaluation of Text Generation: A Survey
TL;DR: This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
Abstract: The paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years. We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics. For each category, we discuss the progress that has been made and the challenges still being faced, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models. We then present two examples of task-specific NLG evaluation, for automatic text summarization and long text generation, and conclude the paper by proposing future research directions.
Citations
Proceedings Article
DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering
Pei Ke, Fei Huang, Fei Mi, Yasheng Wang, Qun Liu, Xiaoyan Zhu, Minlie Huang
- 13 Jul 2023
TL;DR: DecompEval decomposes an instruction-style question about the quality of a generated text into sub-questions that measure the quality of each sentence; the sub-questions, together with answers generated by pre-trained language models, then serve as evidence for the final evaluation result.
FALTE: A Toolkit for Fine-grained Annotation for Long Text Evaluation
Tanya Goyal, Junyi Jessy Li, Greg Durrett
- 01 Jan 2022
TL;DR: FALTE is a web-based annotation toolkit designed to streamline the evaluation of long text generation models. It allows researchers to collect fine-grained judgments of text quality from crowdworkers using an error taxonomy specific to the downstream task.
Posted Content
Hurdles to Progress in Long-form Question Answering
TL;DR: The authors used sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 Long Form Question Answering (LFQA) dataset.
Autocompletion of Chief Complaints in the Electronic Health Records using Large Language Models
K. M. S. Islam, Ayesha Siddika Nipu, Praveen Madiraju, Priya Deshpande
- 15 Dec 2023
TL;DR: This study demonstrates that BioGPT can serve as the basis of an effective autocompletion tool for generating chief complaint (CC) documentation in healthcare settings, and shows that BioGPT-Large outperforms the other models evaluated.
Proceedings Article
SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes
Wenda Xu, Xian Qian, Mingxuan Wang, Lei Liu, William Yang Wang
- 19 Dec 2022
TL;DR: The authors proposed SEScore2, a self-supervised approach for training a model-based metric for text generation evaluation, which synthesizes realistic model mistakes by perturbing sentences retrieved from a corpus.
References
Long short-term memory
TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- 12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
Glove: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
- 01 Oct 2014
TL;DR: This paper introduces a new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
A Coefficient of agreement for nominal Scales
TL;DR: In this article, the author presents a procedure in which two or more judges independently categorize a sample of units, together with a coefficient for determining the degree and significance of their agreement, i.e., the extent to which these judgments are reproducible (reliable).
Bleu: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
- 06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.