A Survey on Evaluation of Large Language Models

doi:10.48550/arXiv.2307.03109

Journal Article•10.48550/arXiv.2307.03109

Yupeng Chang, +13 more

- 06 Jul 2023

- arXiv.org

- Vol. abs/2307.03109

688

TL;DR: This paper comprehensively reviews evaluation methods for large language models, giving an overview from the perspective of evaluation tasks that encompasses general natural language processing tasks, reasoning, medical usage, education, natural and social sciences, agent applications, and other areas.

Abstract: Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the societal level for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.

Figures

Table 8. Summary of new LLMs evaluation protocols.

Table 7. Summary of existing LLMs evaluation benchmarks (ordered by the name of the first author).

Table 5. Summary of evaluations on medical applications based on the three aspects: Medical queries, Medical assistants, and Medical examination (ordered by the name of the first author).

Table 3. Summary of LLMs evaluation on robustness, ethics, biases, and trustworthiness (ordered by the name of the first author).

Table 2. Summary of evaluation on natural language processing tasks: NLU (Natural Language Understanding, including SA (Sentiment Analysis), TC (Text Classification), NLI (Natural Language Inference) and other NLU tasks), Reasoning, NLG (Natural Language Generation, including Summ. (Summarization), Dlg. (Dialogue), Tran (Translation), QA (Question Answering) and other NLG tasks), and Multilingual tasks (ordered by the name of the first author).

Citations

Journal Article•10.48550/arxiv.2308.11432

A Survey on Large Language Model based Autonomous Agents

Lei Wang, +12 more

- 22 Aug 2023

- arXiv.org

TL;DR: This paper presents a systematic review of the field of LLM-based autonomous agents from a holistic perspective and proposes a unified framework that encompasses a majority of the previous work.

564

Journal Article•10.48550/arxiv.2309.01219

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, +14 more

- 03 Sep 2023

- arXiv.org

TL;DR: This paper presents taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyzes existing approaches aiming at mitigating LLM hallucination, and discusses potential directions for future research.

367

Journal Article•10.48550/arxiv.2309.00770

Bias and Fairness in Large Language Models: A Survey

Isabel O. Gallegos, +7 more

- 02 Sep 2023

- arXiv.org

TL;DR: This paper consolidates, formalizes, and expands notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs, and proposes three intuitive taxonomies: two for bias evaluation and datasets, and one for mitigation.

210

Journal Article•10.48550/arxiv.2312.02003

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

Yifan Yao, +5 more

- 04 Dec 2023

- arXiv.org

TL;DR: This work investigates how LLMs positively impact security and privacy, the potential risks and threats associated with their use, and the inherent vulnerabilities within LLMs, and identifies areas that require further research efforts.

200

Journal Article•10.48550/arxiv.2402.06196

Large Language Models: A Survey

Shervin Minaee, +6 more

- 09 Feb 2024

- arXiv.org

TL;DR: This paper reviews some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), discusses their characteristics, contributions, and limitations, and gives an overview of techniques developed to build and augment LLMs.

194

References

Journal Article•10.4174/astr.2023.104.5.269

ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models

Namkee Oh, +2 more

- 28 Apr 2023

- Annals of surgical treatment and researc...

TL;DR: In this article, the authors evaluated the performance of ChatGPT, specifically the GPT-3.5 and GPT-4 models, in understanding complex surgical clinical information and its potential implications for surgical education and training.

126

Posted Content•10.48550/arxiv.2305.15771

On the Planning Abilities of Large Language Models -- A Critical Investigation

- 25 May 2023

TL;DR: In this paper, the authors evaluate the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks. They demonstrate that LLM-generated plans can improve the search process for underlying sound planners, and additionally show that external verifiers can provide feedback on generated plans and back-prompt the LLM for better plan generation.

116

Journal Article•10.48550/arXiv.2306.04528

PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Jindong Wang, +9 more

- 07 Jun 2023

- arXiv.org

TL;DR: The authors used a plethora of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic, and found that contemporary large language models are vulnerable to adversarial prompts.

106

Journal Article•10.48550/arXiv.2205.12255

TALM: Tool Augmented Language Models

Aaron Thomas Parisi, +2 more

- 24 May 2022

- arXiv.org

TL;DR: TALM is presented, combining a text-only approach to augment language models with non-differentiable tools, and an iterative “self-play” technique to bootstrap performance starting from few tool demonstrations, suggesting that Tool Augmented Language Models are a promising direction to enrich LMs’ capabilities, with less dependence on scale.

101

Journal Article•10.1021/acs.jcim.3c00285

Do Large Language Models Understand Chemistry? A Conversation with ChatGPT

André S. Pimentel

- 16 Mar 2023

- Journal of Chemical Information and Mode...

TL;DR: This article addressed the question of how well ChatGPT understands chemistry by posing five simple tasks in different sub-areas of chemistry and found that it is not the best model for chemistry.

100
