BLESS: Benchmarking Large Language Models on Sentence Simplification

Journal Article•10.48550/arxiv.2310.15773

Tannon Kew, +77 more

- 24 Oct 2023

- arXiv.org

- Vol. abs/2310.15773

10

TL;DR: The evaluation indicates that the best LLMs, despite not being trained on text simplification (TS), perform comparably with state-of-the-art TS baselines, and certain LLMs demonstrate a greater range and diversity of edit operations.

Figures

Figure 3: Distribution of token-level edit operations produced by the best-performing LLMs.

Table 4: Description of Open-Weight models. Model type "D" refers to decoder-only models, "E-D" for models based on an encoder-decoder architecture.

Table 5: Pricing information for OpenAI’s API models. Here we report the total costs incurred for all three inference prompts and three seeded runs, totalling nine inference runs per dataset. Prices listed correspond to those for the API-based models available from April through June, 2023. All prices are in USD.

Table 1: Dataset Statistics. C: Complex; S: Simple; R: References. TER refers to Translation Error Rate, a measurement of the average edit distance between the source and reference texts (see https://www.cs.umd.edu/~snover/tercom).

Table 3: Results of our manual analysis. The annotation schema includes the following annotation features: S↑: accepted simplification, MP↑: meaning preserved, L+: lexical simplification, P+: paraphrasing, R+: reordering (no changes), D+: deletion, Sp+: sentence splitting, H↓: hallucination.

Figure 5: Token-level edit operations computed for all models and test sets using prompt 2. For most models, the edit operations performed in ASSET and NEWSELA reflect those in the gold reference simplifications. However, on the MED-EASI dataset, we observe a sudden spike in insertions from all LLMs except for OpenAI and Flan models. These additions indicate the presence of potentially unrelated hallucinated tokens and endless generations, which aligns with the low BERTScore results. We regard this failure case to be related to the fact that MED-EASI presents a challenging domain which is out of the distribution of most general-purpose models.
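The token-level edit operations referred to in Figures 3 and 5 can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes whitespace tokenization and uses Python's standard `difflib.SequenceMatcher` as a stand-in alignment method, and the function name `token_edit_counts` is hypothetical. It counts how many source tokens are kept or deleted and how many output tokens are inserted, which is the kind of tally in which a spike in insertions would show up.

```python
from collections import Counter
from difflib import SequenceMatcher


def token_edit_counts(source: str, output: str) -> Counter:
    """Count keep / delete / insert operations between two sentences.

    Assumption: tokens are whitespace-separated words, and the alignment
    comes from difflib rather than from the tool used in the paper.
    """
    src = source.split()
    out = output.split()
    counts = Counter()
    matcher = SequenceMatcher(a=src, b=out, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            counts["keep"] += i2 - i1
        elif tag == "delete":
            counts["delete"] += i2 - i1
        elif tag == "insert":
            counts["insert"] += j2 - j1
        else:
            # "replace" is treated as deleting the source span
            # and inserting the output span.
            counts["delete"] += i2 - i1
            counts["insert"] += j2 - j1
    return counts


if __name__ == "__main__":
    print(token_edit_counts("the cat sat on the mat", "the cat sat"))
```

By construction, kept plus deleted tokens equal the source length, and kept plus inserted tokens equal the output length, so an output that keeps growing with unrelated text shows up as a large insert count.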

Citations

Journal Article•10.48550/arxiv.2401.09637

Impact of Large Language Model Assistance on Patients Reading Clinical Notes: A Mixed-Methods Study

Niklas Mannhardt, +12 more

- 17 Jan 2024

- arXiv.org

TL;DR: A patient-facing tool to make clinical notes more readable, leveraging large language models (LLMs) to simplify, extract information from, and add context to notes, demonstrates the potential of LLMs to improve patients' experience with clinical notes at a lower burden to clinicians.

4

Journal Article•10.48550/arxiv.2312.10126

Do Text Simplification Systems Preserve Meaning? A Human Evaluation via Reading Comprehension

Sweta Agrawal, +1 more

- 15 Dec 2023

- arXiv.org

TL;DR: This work introduces a human evaluation framework to assess whether simplified texts preserve meaning using reading comprehension questions and investigates how existing TS evaluation metrics and automatic question-answering systems approximate the human judgments.

2

Journal Article•10.48550/arxiv.2403.04963

An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human Assessment

Xuanxin Wu, +1 more

- 08 Mar 2024

- arXiv.org

TL;DR: An error-based human assessment reveals that while GPT-4 generally outperforms the state of the art, it still shows limitations in lexical paraphrasing. The study also finds that existing automatic metrics lack the sensitivity to assess the overall high-quality simplifications produced by GPT-4.

2

Journal Article•10.48550/arxiv.2409.19247

Edit-Constrained Decoding for Sentence Simplification

Tatsuya Zetsu, +2 more

- 28 Sep 2024

TL;DR: This study proposes edit-constrained decoding for sentence simplification, introducing stricter constraints that replicate edit operations, outperforming previous methods on three English corpora, improving sentence simplification efficacy and accuracy.

Journal Article•10.5220/0012624700003690

Out of Sesame Street: A Study of Portuguese Legal Named Entity Recognition Through In-Context Learning

Rafael C. Nunes, +3 more

- 01 Jan 2024

References

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.

94.2K

•Proceedings Article

Language Models are Few-Shot Learners

Tom B. Brown, +30 more

- 28 May 2020

TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

25.2K

•Posted Content

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, +8 more

- 23 Oct 2019

- arXiv: Learning

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

12.9K

•Proceedings Article•10.18653/V1/2020.ACL-MAIN.703

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Michael Lewis, +7 more

- 01 Jul 2020

TL;DR: BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.

11.5K

•Proceedings Article•10.48550/arXiv.2203.02155

Training language models to follow instructions with human feedback

Long Ouyang, +19 more

- 04 Mar 2022

TL;DR: The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, yielding improvements in truthfulness and reductions in toxic output generation with minimal performance regressions on public NLP datasets.

7.1K
