It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

Open Access • Proceedings Article • doi:10.18653/V1/2021.NAACL-MAIN.185

Timo Schick, +1 more

- 01 Jun 2021

- pp 2339-2352

1.1K

TL;DR: This work shows that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller, and identifies key factors required for successful natural language understanding with small language models.

Citations

•Posted Content

Moving on from OntoNotes: Coreference Resolution Model Transfer

Patrick Xia, +1 more

- 17 Apr 2021

- arXiv: Computation and Language

TL;DR: The authors quantified transferability of coreference resolution models based on the number of annotated documents available in the target dataset and found that continued training is consistently effective and especially beneficial when there are few target documents.

Journal Article•10.3390/electronics13101944

Semantic Augmentation in Chinese Adversarial Corpus for Discourse Relation Recognition Based on Internal Semantic Elements

Zheng Hua, +1 more

- 15 May 2024

- Electronics

TL;DR: The SACA corpus incorporates linguistic semantic information into discourse relation recognition and includes 9546 adversative complex sentences annotated with internal semantic elements. The corpus follows the Penn Discourse Treebank (PDTB) annotation scheme, except for sense classification, which is based on the Chinese Discourse Treebank (CDTB).

Journal Article•10.48550/arxiv.2310.05502

XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners

Yun Luo, +7 more

- 09 Oct 2023

- arXiv.org

TL;DR: This work proposes a novel Explainable Active Learning framework (XAL) for low-resource text classification, which aims to encourage classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.

Journal Article•10.48550/arxiv.2310.15773

BLESS: Benchmarking Large Language Models on Sentence Simplification

Tannon Kew, +77 more

- 24 Oct 2023

- arXiv.org

TL;DR: The evaluation indicates that the best LLMs, despite not being trained on text simplification (TS), perform comparably with state-of-the-art TS baselines, and that certain LLMs demonstrate a greater range and diversity of edit operations.

•Proceedings Article•10.18653/v1/2022.nlp4convai-1.10

Knowledge Distillation Meets Few-Shot Learning: An Approach for Few-Shot Intent Classification Within and Across Domains

Annie Sauer, +2 more

- 01 Jan 2022

TL;DR: This paper introduces an approach for distilling small models that generalize to new intent classes and domains using only a handful of labeled examples. Experiments on public intent classification benchmarks confirm the generalization ability of the small distilled models, which also have lower computational costs.

References

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes a simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

Preprint•10.48550/arxiv.1706.03762

Attention Is All You Need

Ashish Vaswani, +7 more

- 01 Jan 2017

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

51.8K

•Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019

- arXiv: Computation and Language

TL;DR: This work finds that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

26.2K

•Proceedings Article

Language Models are Few-Shot Learners

Tom B. Brown, +30 more

- 28 May 2020

TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

25.2K