It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

doi:10.18653/V1/2021.NAACL-MAIN.185

Open Access•Proceedings Article•10.18653/V1/2021.NAACL-MAIN.185

Timo Schick, +1 more

- 01 Jun 2021

- pp 2339-2352

1.1K

TL;DR: This work shows that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller, and identifies key factors required for successful natural language understanding with small language models.

Citations

Journal Article•10.48550/arXiv.2211.04337

Prompt-Based Metric Learning for Few-Shot NER

Yanru Chen, +2 more

- 08 Nov 2022

- arXiv.org

TL;DR: This article proposes a simple method that substantially improves metric learning for NER: multiple prompt schemas are designed to enhance label semantics, and a novel architecture is introduced to effectively combine multiple prompt-based representations.

10

Journal Article•10.18653/v1/2023.findings-acl.510

Global and Local Hierarchy-aware Contrastive Framework for Implicit Discourse Relation Recognition

Yuxin Jiang, +2 more

- 01 Jan 2023

TL;DR: A novel contrastive framework for IDRR that effectively incorporates global and local hierarchies of senses to learn better discourse relation representations.

10

Proceedings Article•10.48550/arXiv.2210.10693

Robustness of Demonstration-based Learning Under Limited Data Scenario

Hongxin Zhang, +3 more

- 19 Oct 2022

TL;DR: This paper designs pathological demonstrations by gradually removing intuitively useful information from the standard ones, in order to examine the robustness of demonstration-based sequence labeling in depth, and shows that demonstrations composed of random tokens still make the model a better few-shot learner.

10

•Posted Content

Data-Efficient Pretraining via Contrastive Self-Supervision

Nils Rethmeier, +1 more

- 02 Oct 2020

- arXiv: Computation and Language

TL;DR: This paper proposes a data- and compute-efficient self-supervised contrastive text encoder, pretrained on 60MB of "task-internal" text data, and compares it to RoBERTa, which was pretrained on 160GB of "task-external" text.

10

•Proceedings Article•10.18653/v1/2022.findings-naacl.168

On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

01 Jan 2022

TL;DR: In this paper, the authors highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.

10

References

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

Preprint•10.48550/arxiv.1706.03762

Attention Is All You Need

Ashish Vaswani, +7 more

- 01 Jan 2017

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

51.8K

•Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019

- arXiv: Computation and Language

TL;DR: It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

26.2K

•Proceedings Article

Language Models are Few-Shot Learners

Tom B. Brown, +30 more

- 28 May 2020

TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

25.2K