It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

doi:10.18653/V1/2021.NAACL-MAIN.185

Open Access • Proceedings Article

Timo Schick, +1 more

- 01 Jun 2021

- pp 2339-2352

1.1K

TL;DR: This work shows that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller, and identifies key factors required for successful natural language understanding with small language models.

Citations

Proceedings Article•10.48550/arXiv.2210.02952

Improving the Sample Efficiency of Prompt Tuning with Domain Adaptation

Xu Guo, +2 more

- 06 Oct 2022

Proceedings Article•10.48550/arXiv.2204.07241

The Art of Prompting: Event Detection based on Type Specific Prompts

Sijia Wang, +2 more

- 14 Apr 2022

TL;DR: A unified framework that incorporates event-type-specific prompts for supervised, few-shot, and zero-shot event detection, showing up to 24.3% F-score gain over the previous state-of-the-art baselines.

Proceedings Article

Answering Ambiguous Questions via Iterative Prompting

Weiwei Sun, +6 more

- 08 Jul 2023

TL;DR: AmbigPrompt integrates an answering model with a prompting model in an iterative manner; the prompting model adaptively tracks the reading process and progressively triggers the answering model to compose distinct and relevant answers.

Proceedings Article•10.48550/arXiv.2210.03304

Knowledge Injected Prompt Based Fine-tuning for Multi-label Few-shot ICD Coding

Zhijun Yang, +4 more

- 07 Oct 2022

TL;DR: This paper proposes a knowledge-enhanced Longformer that injects three kinds of domain-specific knowledge (hierarchy, synonym, and abbreviation) through additional pretraining with contrastive learning, to address the long-tail challenge of automatic ICD coding.

10.48550/arxiv.2108.13161

Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

Ningyu Zhang, +7 more

TL;DR: This study proposes DART, a novel approach that reformulates NLP tasks as pre-trained language model tasks, differentially optimizing prompts and labels with backpropagation, achieving better few-shot performance on standard NLP tasks without prompt engineering.

References

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

Preprint•10.48550/arxiv.1706.03762

Attention Is All You Need

Ashish Vaswani, +7 more

- 01 Jan 2017

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

51.8K

•Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019

- arXiv: Computation and Language

TL;DR: This work finds that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

26.2K

•Proceedings Article

Language Models are Few-Shot Learners

Tom B. Brown, +30 more

- 28 May 2020

TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

25.2K