A Survey on Evaluation of Large Language Models

doi:10.48550/arXiv.2307.03109

Journal Article10.48550/arXiv.2307.03109

A Survey on Evaluation of Large Language Models

Yu-Chu Chang, +13 more

- 06 Jul 2023

- arXiv.org

- Vol. abs/2307.03109

688

TL;DR: A comprehensive review of the evaluation methods for large language models can be found in this paper , where the authors provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, education, natural and social sciences, agent applications, and other areas.

Abstract: Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, educations, natural and social sciences, agent applications, and other areas. Secondly, we answer the `where' and `how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.

Chat with Paper

AI Agents for this Paper

Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps

Figures

Table 8. Summary of new LLMs evaluation protocols.

Table 7. Summary of existing LLMs evaluation benchmarks (ordered by the name of the first author).

Table 5. Summary of evaluations on medical applications based on the three aspects: Medical queries, Medical assistants, and Medical examination (ordered by the name of the first author).

Table 3. Summary of LLMs evaluation on robustness, ethics, biases, and trustworthiness (ordered by the name of the first author).

Table 2. Summary of evaluation on natural language processing tasks: NLU (Natural Language Understanding, including SA (Sentiment Analysis), TC (Text Classification), NLI (Natural Language Inference) and other NLU tasks), Reasoning, NLG (Natural Language Generation, including Summ. (Summarization), Dlg. (Dialogue), Tran (Translation), QA (Question Answering) and other NLG tasks), and Multilingual tasks (ordered by the name of the first author).

Citations

Journal Article•10.48550/arxiv.2308.11432

A Survey on Large Language Model based Autonomous Agents

Lei Wang, +12 more

- 22 Aug 2023

- arXiv.org

TL;DR: A systematic review of the field of LLM-based autonomous agents from a holistic perspective, and proposes a unified framework that encompasses a majority of the previous work.

...read moreread less

564

Journal Article•10.48550/arxiv.2309.01219

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, +14 more

- 03 Sep 2023

- arXiv.org

TL;DR: This paper presents taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyzes existing approaches aiming at mitigating LLm hallucination, and discusses potential directions for future research.

...read moreread less

367

Journal Article•10.48550/arxiv.2309.00770

Bias and Fairness in Large Language Models: A Survey

Isabel O. Gallegos, +7 more

- 02 Sep 2023

- arXiv.org

TL;DR: This paper consolidates, formalize, and expand notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs, and proposes three intuitive taxonomies, two for bias evaluation and datasets, and one for mitigation.

...read moreread less

210

Journal Article•10.48550/arxiv.2312.02003

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

Yifan Yao, +5 more

- 04 Dec 2023

- arXiv.org

TL;DR: This work investigates how LLMs positively impact security and privacy, potential risks and threats associated with their use, and inherent vulnerabilities within LLMs, and identifies areas that require further research efforts.

...read moreread less

200

Journal Article•10.48550/arxiv.2402.06196

Large Language Models: A Survey

Shervin Minaee, +6 more

- 09 Feb 2024

- arXiv.org

TL;DR: This paper reviews some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discusses their characteristics, contributions and limitations, and gives an overview of techniques developed to build, and augment LLMs.

...read moreread less

194

...

Expand

References

Journal Article•10.48550/arxiv.2309.01219

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, +14 more

- 03 Sep 2023

- arXiv.org

TL;DR: This paper presents taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyzes existing approaches aiming at mitigating LLm hallucination, and discusses potential directions for future research.

...read moreread less

367

Journal Article•10.48550/arXiv.2305.01210

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, +2 more

- 02 May 2023

- arXiv.org

TL;DR: EvalPlus as mentioned in this paper augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM-and mutation-based strategies.

...read moreread less

367

•Proceedings Article•10.18653/V1/2021.NAACL-MAIN.324

Dynabench: Rethinking Benchmarking in NLP.

Douwe Kiela, +18 more

- 01 Jun 2021

TL;DR: It is argued that Dynabench addresses a critical need in the community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.

...read moreread less

367

Journal Article•10.48550/arxiv.2307.16789

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, +17 more

- 31 Jul 2023

- arXiv.org

TL;DR: ToolLLM is introduced, a general tool-use framework encompassing data construction, model training, and evaluation, and a novel depth-first search-based decision tree algorithm that enables LLMs to evaluate multiple reasoning traces and expand the search space.

...read moreread less

336

...

Expand

A Survey on Evaluation of Large Language Models

Chat with Paper

AI Agents for this Paper

Figures

Citations

A Survey on Large Language Model based Autonomous Agents

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Bias and Fairness in Large Language Models: A Survey

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

Large Language Models: A Survey

References

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Dynabench: Rethinking Benchmarking in NLP.

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Language Is Not All You Need: Aligning Perception with Language Models