Journal Article, DOI: 10.48550/arXiv.2307.03109
A Survey on Evaluation of Large Language Models
Yu-Chu Chang, Xu Wang, Jindong Wang, Yuanyi Wu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Weirong Ye, Yue Zhang, Philip S. Yu, Qian Yang, Xingxu Xie +13 more
TL;DR: This paper presents a comprehensive review of evaluation methods for large language models, providing an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, education, natural and social sciences, agent applications, and other areas.
Abstract: Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level but also at the societal level, for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.
Figures

Table 8. Summary of new LLMs evaluation protocols. 
Table 7. Summary of existing LLMs evaluation benchmarks (ordered by the name of the first author). 
Fig. 1. Structure of this paper. 
Table 5. Summary of evaluations on medical applications based on the three aspects: Medical queries, Medical assistants, and Medical examination (ordered by the name of the first author). 
Table 3. Summary of LLMs evaluation on robustness, ethics, biases, and trustworthiness (ordered by the name of the first author). 
Table 2. Summary of evaluation on natural language processing tasks: NLU (Natural Language Understanding, including SA (Sentiment Analysis), TC (Text Classification), NLI (Natural Language Inference) and other NLU tasks), Reasoning, NLG (Natural Language Generation, including Summ. (Summarization), Dlg. (Dialogue), Tran. (Translation), QA (Question Answering) and other NLG tasks), and Multilingual tasks (ordered by the name of the first author).
Citations
A Survey on Large Language Model based Autonomous Agents
Lei Wang, Cheng-jian Ma, Xueyang Feng, Zeyu Zhang, Hao-ran Yang, Jingsen Zhang, Zhi-Yang Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen +12 more
TL;DR: This paper presents a systematic review of the field of LLM-based autonomous agents from a holistic perspective and proposes a unified framework that encompasses a majority of the previous work.
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
Yue Zhang, Yafu Li, Leyang Cui, Deng Li Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi +14 more
TL;DR: This paper presents taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyzes existing approaches aiming at mitigating LLM hallucination, and discusses potential directions for future research.
Bias and Fairness in Large Language Models: A Survey
Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md. Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Ruiyi Zhang, Nesreen K. Ahmed +7 more
TL;DR: This paper consolidates, formalizes, and expands notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs, and proposes three intuitive taxonomies: two for bias evaluation and datasets, and one for mitigation.
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Eric Sun, Yue Zhang +5 more
TL;DR: This work investigates how LLMs positively impact security and privacy, potential risks and threats associated with their use, and inherent vulnerabilities within LLMs, and identifies areas that require further research efforts.
200
Large Language Models: A Survey
Shervin Minaee, Tomas Mikolov, Narjes Nikzad, M. Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao +6 more
TL;DR: This paper reviews some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), discusses their characteristics, contributions, and limitations, and gives an overview of techniques developed to build and augment LLMs.
194
References
ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models
TL;DR: In this article, the authors evaluated the performance of ChatGPT, specifically the GPT-3.5 and GPT-4 models, in understanding complex surgical clinical information and its potential implications for surgical education and training.
126
On the Planning Abilities of Large Language Models -- A Critical Investigation
25 May 2023
TL;DR: In this paper, the authors evaluate the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks. They demonstrate that LLM-generated plans can improve the search process for underlying sound planners, and that external verifiers can provide feedback on generated plans and back-prompt the LLM for better plan generation.
PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Weirong Ye, Neil Zhenqiang Gong, Yue Zhang, Xingxu Xie +9 more
TL;DR: The authors used a range of adversarial textual attacks targeting prompts at multiple levels (character, word, sentence, and semantic) and found that contemporary large language models are vulnerable to adversarial prompts.
TALM: Tool Augmented Language Models
TL;DR: This paper presents TALM, which combines a text-only approach to augmenting language models with non-differentiable tools and an iterative “self-play” technique to bootstrap performance starting from few tool demonstrations, suggesting that Tool Augmented Language Models are a promising direction to enrich LMs’ capabilities, with less dependence on scale.
101
Do Large Language Models Understand Chemistry? A Conversation with ChatGPT
TL;DR: This article addressed the question of how well ChatGPT understands chemistry by posing five simple tasks in different sub-areas of chemistry and found that it is not the best model for chemistry.
100