It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
Timo Schick, Hinrich Schütze
- 01 Jun 2021
- pp 2339-2352
TL;DR: This work shows that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller, and identifies key factors required for successful natural language understanding with small language models.
Abstract: When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.
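The cloze reformulation mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the pattern wording, the label words, the mask string, and the toy logits are all assumptions made for the example. In a real setup the mask-position logits would come from a pretrained masked language model, and the loss would be minimized by gradient descent on a handful of labeled examples.

```python
import math

# Assumed mask placeholder; the actual mask string depends on the model used.
MASK = "[MASK]"

def pattern(review: str) -> str:
    """Hypothetical pattern: wrap the input in a cloze question that
    includes a short task description and one masked slot."""
    return (f"Review: {review} "
            f"Is this review positive or negative? It was {MASK}.")

# Hypothetical verbalizer: maps each task label to a single word that
# the language model can predict at the masked position.
VERBALIZER = {"positive": "great", "negative": "terrible"}

def label_distribution(mask_logits: dict) -> dict:
    """Softmax over the logits of the verbalizer words only."""
    scores = {label: mask_logits[word] for label, word in VERBALIZER.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def cross_entropy(dist: dict, gold_label: str) -> float:
    """Loss that gradient-based training would minimize."""
    return -math.log(dist[gold_label])

# Toy logits standing in for a masked language model's output at the mask.
toy_logits = {"great": 2.0, "terrible": 0.0, "okay": 1.0}

cloze = pattern("Best pizza ever!")
dist = label_distribution(toy_logits)
loss = cross_entropy(dist, "positive")
```

The point of the design is that the classification task is expressed in the same form as the model's pretraining objective, so no new output layer has to be learned from scratch.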
Citations
•Posted Content
Moving on from OntoNotes: Coreference Resolution Model Transfer
Patrick Xia, Benjamin Van Durme
TL;DR: The authors quantified transferability of coreference resolution models based on the number of annotated documents available in the target dataset and found that continued training is consistently effective and especially beneficial when there are few target documents.
Semantic Augmentation in Chinese Adversarial Corpus for Discourse Relation Recognition Based on Internal Semantic Elements
Zheng Hua, Yanbin Feng
TL;DR: The SACA corpus incorporates linguistic semantic information into discourse relation recognition and includes 9546 adversative complex sentences annotated with internal semantic elements. The corpus follows the Penn Discourse Treebank (PDTB) annotation scheme, except for sense classification, which is based on the Chinese Discourse Treebank (CDTB).
XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners
Yun Luo, Zhen Yang, Fandong Meng, Yingjie Li, Fang Guo, Qinglin Qi, Jie Zhou, Yue Zhang
TL;DR: This work proposes a novel Explainable Active Learning framework (XAL) for low-resource text classification, which aims to encourage classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
BLESS: Benchmarking Large Language Models on Sentence Simplification
Tannon Kew, Alison Chi, Laura Vásquez-Rodríguez, Sweta Agrawal, Dennis Aumiller, Fernando Emilio Alva Manchego, Matthew Shardlow
TL;DR: The evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines, and certain LLMs demonstrate a greater range and diversity of edit operations.
Knowledge Distillation Meets Few-Shot Learning: An Approach for Few-Shot Intent Classification Within and Across Domains
Annie Sauer, Shima Asaadi, Fabian Küch
- 01 Jan 2022
TL;DR: This paper introduces an approach for distilling small models that generalize to new intent classes and domains using only a handful of labeled examples, and conducts experiments on public intent classification benchmarks, confirming the generalization ability of the small distilled models while having lower computational costs.
References
•Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- 12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
81.7K
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- 01 Jan 2017
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
51.8K
•Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
•Proceedings Article
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Thomas Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Samuel McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
- 28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.