Journal Article, DOI: 10.48550/arXiv.2210.14986
Large language models are not zero-shot communicators
TL;DR: The authors design a simple task and evaluate widely used state-of-the-art models, finding that most perform close to random even though every utterance requires only a binary inference (yes or no).
Abstract: Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), most perform close to random. Models adapted to be "aligned with human intent" perform much better, but still show a significant gap with human performance. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.
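The binary evaluation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' code: the prompt templates, the second example, and the `always_no` stand-in model are hypothetical, and a real model call would replace the `predict` function. The sketch scores each template separately and then reports the mean and standard deviation over templates.

```python
from statistics import mean, pstdev
from typing import Callable

# Hypothetical prompt templates; the paper's own templates are not reproduced here.
TEMPLATES = [
    'Question: "{question}"\nResponse: "{response}"\nDoes the response mean yes or no?',
    'Someone asked "{question}" and got the reply "{response}". Is that a yes or a no?',
]

# The first example comes from the abstract; the second is invented for illustration.
EXAMPLES = [
    {"question": "Did you leave fingerprints?", "response": "I wore gloves.", "label": "no"},
    {"question": "Is it cold outside?", "response": "I would take a coat.", "label": "yes"},
]


def accuracy_per_template(
    predict: Callable[[str], str],
    examples: list[dict],
    templates: list[str],
) -> list[float]:
    """Return one accuracy value per prompt template."""
    scores = []
    for template in templates:
        correct = 0
        for ex in examples:
            prompt = template.format(question=ex["question"], response=ex["response"])
            if predict(prompt) == ex["label"]:
                correct += 1
        scores.append(correct / len(examples))
    return scores


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and (population) standard deviation over prompt templates."""
    return mean(scores), pstdev(scores)


def always_no(prompt: str) -> str:
    """Stand-in for a model: a constant baseline that always answers 'no'."""
    return "no"


if __name__ == "__main__":
    scores = accuracy_per_template(always_no, EXAMPLES, TEMPLATES)
    print(scores, summarize(scores))
```

On a balanced yes/no set, a constant baseline like `always_no` scores 50%, which is the "close to random" reference point the abstract mentions.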
Figures
Figure 2: Left: The zero-shot accuracy for different sizes of the model classes. The error bars show standard deviation over prompt templates. OpenAI’s instructable models perform better than most other models. For all models there is a significant gap between best accuracy and human accuracy. Right: The performance increase relative to zero-shot due to in-context examples, shown for the largest models of classes InstructGPT, Cohere, and OPT (note that they are of different sizes). The error bars show standard deviation over prompt templates. Performance increases strictly up to k = 5, and only slightly after. For OPT-175B there is a large variance over prompt templates.
Table 58: Accuracy per prompt template for text-davinci-002-unknown. 
Table 57: Accuracy per prompt template for InstructGPT-3-175B (text-davinci-001). 
Table 7: The performance of the human annotators on the subsets of the test set. Subsets 1 through 4 are non-overlapping and cover the whole test set. Annotator X for subset Y might be a different human than annotator X for subset Z. IAA is the average pairwise inter-annotator agreement (Cohen’s kappa coefficient) between annotators per subset.
Table 44: Accuracy per prompt template for GPT-3-6.7B (curie). 
Table 42: Accuracy per prompt template for GPT-3-350M (ada).
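The Table 7 caption reports IAA as the average pairwise Cohen's kappa between annotators. As a rough illustration of that statistic, here is a minimal sketch; the function names and the toy labels are assumptions for illustration, not the authors' data or code.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label; treat as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def average_pairwise_kappa(annotations: list[list[str]]) -> float:
    """Mean of Cohen's kappa over all unordered pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy yes/no labels from three hypothetical annotators on four items.
    ann_1 = ["yes", "yes", "no", "no"]
    ann_2 = ["yes", "no", "no", "no"]
    ann_3 = ["yes", "yes", "no", "no"]
    print(average_pairwise_kappa([ann_1, ann_2, ann_3]))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters on a binary task where two annotators guessing at random would already agree about half the time.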
Citations
Towards Reasoning in Large Language Models: A Survey
Jie Huang,Kevin C. Chung +1 more
- 20 Dec 2022
TL;DR: The authors provide a comprehensive overview of the current state of knowledge on reasoning in large language models, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions.
A fine-grained comparison of pragmatic language understanding in humans and language models
Jennifer Hu,Sammy Floyd,Olessia Jouravlev,Evelina Fedorenko,Edward Gibson +4 more
- 13 Dec 2022
TL;DR: This paper performed a fine-grained comparison of language models and humans on seven pragmatic phenomena, using zero-shot prompting on an expert-curated set of English materials, and found that the largest models achieve high accuracy and match human error patterns.
LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding
TL;DR: This paper explored the use of LLMs to reduce the time it takes for deductive coding while retaining the flexibility of a traditional content analysis, and found that GPT-3.5 can often perform deductive coding at levels of agreement comparable to human coders.
Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
Dingli Yu,Simran Kaur,Arushi Gupta,Jonah Brown-Cohen,Anirudh Goyal,Sanjeev Arora +5 more
TL;DR: This work develops a methodology for designing and administering a new evaluation to measure ability to combine skills using GPT-4 as well as the open LLaMA-2 70B model, and sketches how the methodology can lead to a Skill-Mix based eco-system of open evaluations for AI capabilities of future models.
Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs
Xuhui Zhou,Zhe Su,Tiwalayo Eisape,Hyunwoo Kim,Maarten Sap +4 more
TL;DR: An evaluation framework to simulate social interactions with LLM-based agents in various settings (omniscient, non-omniscient) indicates that addressing information asymmetry remains a fundamental challenge for LLM-based agents.