Journal Article, DOI: 10.48550/arxiv.2310.15773
BLESS: Benchmarking Large Language Models on Sentence Simplification
Tannon Kew,Alison Chi,Laura Vásquez-Rodríguez,Sweta Agrawal,Dennis Aumiller,Fernando Emilio Alva Manchego,Matthew Shardlow
TL;DR: The evaluation indicates that the best LLMs, despite not being trained on text simplification (TS), perform comparably with state-of-the-art TS baselines, and certain LLMs demonstrate a greater range and diversity of edit operations.
Abstract: We present BLESS, a comprehensive performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS). We examine how well off-the-shelf LLMs can solve this challenging task, assessing a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our analysis considers a suite of automatic metrics as well as a large-scale quantitative investigation into the types of common edit operations performed by the different models. Furthermore, we perform a manual qualitative analysis on a subset of model outputs to better gauge the quality of the generated simplifications. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines. Additionally, we find that certain LLMs demonstrate a greater range and diversity of edit operations. Our performance benchmark will be available as a resource for the development of future TS methods and evaluation metrics.
Figures

Figure 3: Distribution of token-level edit operations produced by the best-performing LLMs. 
Table 4: Description of Open-Weight models. Model type "D" refers to decoder-only models, "E-D" for models based on an encoder-decoder architecture. 
Table 5: Pricing information for OpenAI’s API models. Here we report the total costs incurred for all three inference prompts and three seeded runs, totalling nine inference runs per dataset. Prices listed correspond to those for the API-based models available from April through June, 2023. All prices are in USD. 
Table 1: Dataset Statistics. C: Complex; S: Simple; R: References. TER refers to Translation Error Rate, a measurement of the average edit distance between the source and reference texts (see https://www.cs.umd.edu/~snover/tercom).
Table 3: Results of our manual analysis. The annotation schema includes the following annotation features: S↑: accepted simplification, MP↑: meaning preserved, L+: lexical simplification, P+: paraphrasing, R+: reordering (no changes), D+: deletion, Sp+: sentence splitting, H↓: hallucination. 
Figure 5: Token-level edit operations computed for all models and test sets using prompt 2. For most models, the edit operations performed in ASSET and NEWSELA reflect those in the gold reference simplifications. However, on the MED-EASI dataset, we observe a sudden spike in insertions from all LLMs except for OpenAI and Flan models. These additions indicate the presence of potentially unrelated hallucinated tokens and endless generations, which aligns with the low BERTScore results. We attribute this failure case to MED-EASI being a challenging domain that lies outside the distribution of most general-purpose models.
Citations
Impact of Large Language Model Assistance on Patients Reading Clinical Notes: A Mixed-Methods Study
Niklas Mannhardt,Elizabeth Bondi-Kelly,Barbara Lam,Chloe O'Connell,Mercy Asiedu,Hussein Mozannar,Monica Agrawal,Alejandro Buendia,Tatiana Urman,Irbaz Bin Riaz,Catherine E. Ricciardi,Marzyeh Ghassemi,David Sontag
TL;DR: A patient-facing tool to make clinical notes more readable, leveraging large language models (LLMs) to simplify, extract information from, and add context to notes, demonstrates the potential of LLMs to improve patients' experience with clinical notes at a lower burden to clinicians.
Do Text Simplification Systems Preserve Meaning? A Human Evaluation via Reading Comprehension
Sweta Agrawal,Marine Carpuat
TL;DR: This work introduces a human evaluation framework to assess whether simplified texts preserve meaning using reading comprehension questions and investigates how existing TS evaluation metrics and automatic question-answering systems approximate the human judgments.
An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human Assessment
TL;DR: An in-depth evaluation of GPT-4 in sentence simplification with error-based human assessment reveals that while GPT-4 generally outperforms the state of the art, it remains limited in lexical paraphrasing. The study also finds that existing automatic metrics lack the sensitivity to assess the generally high-quality simplifications produced by GPT-4.
Edit-Constrained Decoding for Sentence Simplification
Tatsuya Zetsu,Yuki Arase,Tomoyuki Kajiwara
- 28 Sep 2024
TL;DR: This study proposes edit-constrained decoding for sentence simplification, introducing stricter constraints that replicate edit operations, outperforming previous methods on three English corpora, improving sentence simplification efficacy and accuracy.
Out of Sesame Street: A Study of Portuguese Legal Named Entity Recognition Through In-Context Learning
Rafael C. Nunes,André Suslik Spritzer,Carla Maria Dal Sasso Freitas,Dennis Giovani Balreira
- 01 Jan 2024
References
•Proceedings Article
Attention is All you Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Lukasz Kaiser,Illia Polosukhin
- 12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
•Proceedings Article
Language Models are Few-Shot Learners
Tom B. Brown,Benjamin Mann,Nick Ryder,Melanie Subbiah,Jared Kaplan,Prafulla Dhariwal,Arvind Neelakantan,Pranav Shyam,Girish Sastry,Amanda Askell,Sandhini Agarwal,Ariel Herbert-Voss,Gretchen Krueger,Thomas Henighan,Rewon Child,Aditya Ramesh,Daniel M. Ziegler,Jeffrey Wu,Clemens Winter,Christopher Hesse,Mark Chen,Eric Sigler,Mateusz Litwin,Scott Gray,Benjamin Chess,Jack Clark,Christopher Berner,Samuel McCandlish,Alec Radford,Ilya Sutskever,Dario Amodei
- 28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
•Posted Content
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel,Noam Shazeer,Adam Roberts,Katherine Lee,Sharan Narang,Michael Matena,Yanqi Zhou,Wei Li,Peter J. Liu
TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Michael Lewis,Yinhan Liu,Naman Goyal,Marjan Ghazvininejad,Abdelrahman Mohamed,Omer Levy,Veselin Stoyanov,Luke Zettlemoyer
- 01 Jul 2020
TL;DR: BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.
Training language models to follow instructions with human feedback
Long Ouyang,Jeffrey Wu,Xu Jiang,Diogo Almeida,Carroll L. Wainwright,Pamela Mishkin,Chong Zhang,Sandhini Agarwal,Katarina Slama,Alex Ray,John Schulman,Jacob Hilton,Fraser Kelton,Luke E. Miller,Maddie Simens,Amanda Askell,Peter Welinder,Paul F. Christiano,Jan Leike,Ryan Lowe
- 04 Mar 2022
TL;DR: The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, yielding improvements in truthfulness and reductions in toxic output generation with minimal performance regressions on public NLP datasets.