Open Access, Posted Content
Evaluation of Text Generation: A Survey
TL;DR: This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
Abstract: The paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years. We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics. For each category, we discuss the progress that has been made and the challenges still being faced, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models. We then present two examples for task-specific NLG evaluations for automatic text summarization and long text generation, and conclude the paper by proposing future research directions.
Citations
Toward More Effective Human Evaluation for Machine Translation
01 Jan 2022
TL;DR: This paper investigated a simple way to reduce cost by reducing the number of text segments that must be annotated in order to accurately predict a score for a complete test set using a sampling approach, demonstrating that information from document membership and automatic metrics can help improve estimates compared to a pure random sampling baseline.
•Posted Content
Meta Adaptive Neural Ranking with Contrastive Synthetic Supervision.
Si Sun, Yingzhuo Qian, Zhenghao Liu, Chenyan Xiong, Kaitao Zhang, Jie Bao, Zhiyuan Liu, Paul N. Bennett
TL;DR: In this article, contrastive query generation (ContrastQG) is used to synthesize more informative queries as in-domain weak relevance labels, which are then filtered with meta adaptive learning to rank (MetaLTR) to better generalize neural rankers to the target few-shot domain.
How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory
Kawin Ethayarajh, Dan Jurafsky
TL;DR: This work identifies the implicit assumptions that the standard human evaluation protocol makes about annotators and suggests improvements to make the protocol more theoretically sound; even in its improved form, however, the protocol cannot be used to evaluate open-ended tasks like story generation.
Large Language Model Evaluation via Matrix Entropy
Lai Wei, Zhiquan Tan, Chenghai Li, Jindong Wang, Weiran Huang
TL;DR: Matrix entropy is a novel metric for evaluating LLMs that quantifies data compression proficiency and alignment quality. It complements traditional loss scaling law and provides insights into the intrinsic capabilities of LLMs.
Diffusion models in text generation: a survey
Qiuhua Yi, Xiangfan Chen, Chenwei Zhang, Zehai Zhou, Linan Zhu, Xiangjie Kong
TL;DR: This survey reviews diffusion models' applications in text generation, comparing them to autoregressive-based pre-training models, highlighting advantages, limitations, and future research directions, including improving sampling speed and exploring multi-modal text generation.
References
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
•Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
Glove: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
01 Oct 2014
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
A Coefficient of agreement for nominal Scales
TL;DR: In this article, the author presents a procedure in which two or more judges independently categorize a sample of units, and then determines the degree, significance, and sampling stability of their agreement, in order to assess the extent to which such judgments are reproducible, i.e., reliable.
Bleu: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
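As an illustration of the kind of untrained automatic metric the BLEU entry above refers to, the following is a minimal sketch of a BLEU-style score: clipped (modified) n-gram precision combined by a geometric mean and scaled by a brevity penalty. It is a simplified sentence-level version with no smoothing, whereas BLEU was defined over a whole test corpus. The function names (`ngrams`, `modified_precision`, `sentence_bleu`) and the whitespace tokenization in the usage note are assumptions made for this example, not an API from the paper or from any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision.

    Each candidate n-gram count is capped by the largest count of that
    n-gram in any single reference, so repeating a common word is not rewarded.
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed, uniformly weighted BLEU-style score for one sentence.

    Assumption of this sketch: any zero n-gram precision yields a score of 0.
    """
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Effective reference length: the reference length closest to the
    # candidate length, preferring the shorter one on ties.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)
```

For example, with the references `"the cat is on the mat".split()` and `"there is a cat on the mat".split()`, the degenerate candidate `"the the the the the the the".split()` gets a unigram `modified_precision` of 2/7, because "the" occurs at most twice in any single reference. Established implementations additionally handle smoothing, tokenization, and corpus-level aggregation, which this sketch leaves out.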