Large language models are not zero-shot communicators

doi:10.48550/arXiv.2210.14986

Journal Article

Laura Ruis, +5 more

- 26 Oct 2022

- arXiv.org

- Vol. abs/2210.14986

25

TL;DR: The authors design a simple task and evaluate widely used state-of-the-art models on it, finding that most perform close to random even though the utterances require only a binary inference (yes or no).

Figures

Figure 2: Left: Zero-shot accuracy for different sizes of each model class. Error bars show the standard deviation over prompt templates. OpenAI’s instructable models perform better than most other models. For all models there is a significant gap between the best accuracy and human accuracy. Right: Performance increase relative to zero-shot due to in-context examples, shown for the largest models of the classes InstructGPT, Cohere, and OPT (note that they are of different sizes). Error bars show the standard deviation over prompt templates. Performance increases strictly up to k = 5, and only slightly after that. For OPT-175B there is a large variance over prompt templates.

Table 58: Accuracy per prompt template for text-davinci-002-unknown.

Table 57: Accuracy per prompt template for InstructGPT-3-175B (text-davinci-001).

Table 7: The performance of the human annotators on the subsets of the test set. Subset 1 through 4 are non-overlapping and cover the whole test set. Annotator X for subset Y might be a different human than annotator X for subset Z. IAA is the average pairwise inter-annotator agreement (Cohen’s kappa coefficient) between annotators per subset.

Table 44: Accuracy per prompt template for GPT-3-6.7B (curie).

Table 42: Accuracy per prompt template for GPT-3-350M (ada).

Citations

Proceedings Article•10.48550/arXiv.2212.10403

Towards Reasoning in Large Language Models: A Survey

Jie Huang, +1 more

- 20 Dec 2022

TL;DR: The authors provide a comprehensive overview of the current state of knowledge on reasoning in large language models, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions.

310

Proceedings Article•10.48550/arXiv.2212.06801

A fine-grained comparison of pragmatic language understanding in humans and language models

Jennifer Hu, +4 more

- 13 Dec 2022

TL;DR: This paper performed a fine-grained comparison of language models and humans on seven pragmatic phenomena, using zero-shot prompting on an expert-curated set of English materials, and found that the largest models achieve high accuracy and match human error patterns.

33

Journal Article•10.48550/arXiv.2306.14924

LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding

Rob Chew, +3 more

- 23 Jun 2023

- arXiv.org

TL;DR: This paper explored the use of LLMs to reduce the time it takes for deductive coding while retaining the flexibility of a traditional content analysis, and found that GPT-3.5 can often perform deductive coding at levels of agreement comparable to human coders.

25

Journal Article•10.48550/arxiv.2310.17567

Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models

Dingli Yu, +5 more

- 26 Oct 2023

- arXiv.org

TL;DR: This work develops a methodology for designing and administering a new evaluation to measure ability to combine skills using GPT-4 as well as the open LLaMA-2 70B model, and sketches how the methodology can lead to a Skill-Mix based eco-system of open evaluations for AI capabilities of future models.

15

Journal Article•10.48550/arxiv.2403.05020

Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs

Xuhui Zhou, +4 more

- 08 Mar 2024

- arXiv.org

TL;DR: An evaluation framework to simulate social interactions with LLM-based agents in various settings (omniscient, non-omniscient) indicates that addressing information asymmetry remains a fundamental challenge for LLM-based agents.

14

...

References

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

•Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019

- arXiv: Computation and Language

TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

26.2K

•Proceedings Article

Language Models are Few-Shot Learners

Tom B. Brown, +30 more

- 28 May 2020

TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

25.2K

•Journal Article

Diagnostic and Statistical Manual of Mental Disorders (DSM-5)

Goran Arbanas

- 01 Jan 2015

- Alcoholism and psychiatry research

TL;DR: Diagnostic and statistical manual of mental disorders (DSM-5) was translated by psychiatrists and psychologists, mainly from the University psychiatric hospital Vrapce and published by the Naklada Slap publisher.

15.8K

Book Chapter•10.1057/9780230005853_5

Logic and Conversation

Siobhan Chapman

- 01 Jan 2005

TL;DR: For instance, Grice was interested in Quine's logical approach to language, although he differed from Quine over certain specific questions, such as the viability of the distinction between analytic and synthetic statements.

8.9K

...
