It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
Timo Schick,Hinrich Schütze +1 more
- 01 Jun 2021
- pp 2339-2352
TL;DR: This work shows that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller, and identifies key factors required for successful natural language understanding with small language models.
read more
Abstract: When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.
read more
Chat with Paper
AI Agents for this Paper
Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps
Citations
The Art of Prompting: Event Detection based on Type Specific Prompts
Sijia Wang,Mo Yu,Lifu Huang +2 more
- 14 Apr 2022
TL;DR: A unified framework to incorporate the event type specific prompts for supervised, few-shot, and zero-shot event detection and shows up to 24.3% F-score gain over the previous state- of-the-art baselines.
Proceedings Article
Answering Ambiguous Questions via Iterative Prompting
Weiwei Sun,Hengjin Cai,Hongshen Chen,Pengjie Ren,Zhumin Chen,Maarten de Rijke,Zhaochun Ren +6 more
- 08 Jul 2023
TL;DR: AmbigPrompt as discussed by the authors integrates an answering model with a prompting model in an iterative manner, which adaptively tracks the reading process and progressively triggers the answering model to compose distinct and relevant answers.
Knowledge Injected Prompt Based Fine-tuning for Multi-label Few-shot ICD Coding
Zhijun Yang,Shufan Wang,Bhanu Pratap Singh Rawat,Avijit Mitra,Hongfang Yu +4 more
- 07 Oct 2022
TL;DR: This paper proposes a knowledge-enhanced longformer by injecting three domain-specific knowledge: hierarchy, synonym, and abbreviation with additional pretraining using contrastive learning to address the long-tail challenge of automatic ICD coding.
Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners
Ningyu Zhang,Luoqiu Li,Xiang Chen,Shumin Deng,Zhen Bi,Chuanqi Tan,Fei Huang,Huajun Chen +7 more
TL;DR: This study proposes DART, a novel approach that reformulates NLP tasks as pre-trained language model tasks, differentially optimizing prompts and labels with backpropagation, achieving better few-shot performance on standard NLP tasks without prompt engineering.
References
•Proceedings Article
Attention is All you Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Lukasz Kaiser,Illia Polosukhin +7 more
- 12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
81.7K
Attention Is All You Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Łukasz Kaiser,Illia Polosukhin +7 more
- 01 Jan 2017
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
51.8K
•Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu,Myle Ott,Naman Goyal,Jingfei Du,Mandar Joshi,Danqi Chen,Omer Levy,Michael Lewis,Luke Zettlemoyer,Veselin Stoyanov +9 more
TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
•Proceedings Article
Language Models are Few-Shot Learners
Tom B. Brown,Benjamin Mann,Nick Ryder,Melanie Subbiah,Jared Kaplan,Prafulla Dhariwal,Arvind Neelakantan,Pranav Shyam,Girish Sastry,Amanda Askell,Sandhini Agarwal,Ariel Herbert-Voss,Gretchen Krueger,Thomas Henighan,Rewon Child,Aditya Ramesh,Daniel M. Ziegler,Jeffrey Wu,Clemens Winter,Christopher Hesse,Mark Chen,Eric Sigler,Mateusz Litwin,Scott Gray,Benjamin Chess,Jack Clark,Christopher Berner,Samuel McCandlish,Alec Radford,Ilya Sutskever,Dario Amodei +30 more
- 28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.