Journal Article
DOI: 10.48550/arXiv.2203.15556
Training Compute-Optimal Large Language Models
Jordan Hoffmann,Sebastian Borgeaud,Arthur Mensch,Elena Buchatskaya,Trevor Cai,Eliza Rutherford,Diego de Las Casas,Lisa Anne Hendricks,Johannes Welbl,Aidan Clark,Tom Hennigan,Eric Noland,Katie Millican,George van den Driessche,Bogdan Damoc,Aurelia Guy,Simon Osindero,Karen Simonyan,Erich Elsen,Jack W. Rae,Oriol Vinyals,Laurent Sifre +21 more
TL;DR: This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data, and reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark.
Abstract: We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
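The equal-scaling rule in the abstract can be illustrated with a short sketch. It rests on two assumptions that the abstract does not state: the common approximation that training compute is about C ≈ 6·N·D FLOPs for N parameters and D tokens, and a fixed tokens-per-parameter ratio (the default of 20 below is an illustrative value, not a figure quoted above). The function names are hypothetical.

```python
import math
from typing import Tuple


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, ASSUMING C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def compute_optimal_split(
    flops: float, tokens_per_param: float = 20.0
) -> Tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens).

    ASSUMES C ~= 6 * N * D and a fixed ratio D = r * N, so that
    N = sqrt(C / (6 * r)) and D = r * N. Both N and D then grow as
    sqrt(C), i.e. they are scaled equally.
    """
    if flops <= 0 or tokens_per_param <= 0:
        raise ValueError("flops and tokens_per_param must be positive")
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Illustrative budget (an assumed value, chosen for round numbers).
budget = 5.88e23
n, d = compute_optimal_split(budget)

# Quadrupling compute doubles both model size and token count.
n4, d4 = compute_optimal_split(4 * budget)
```

Under these assumptions, a doubling of model size always comes with a doubling of training tokens, and the two together cost four times the compute. A different fitted ratio would change the absolute numbers but not the equal-scaling behaviour.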
Citations
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron,Thibaut Lavril,Gautier Izacard,Xavier Martinet,Marie-Anne Lachaux,Timothée Lacroix,Baptiste Roziere,Naman Goyal,Eric Hambro,Faisal Azhar,Aurélien Rodriguez,Armand Joulin,Edouard Grave,Guillaume Lample +13 more
TL;DR: This article introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters and trained on trillions of tokens, and shows that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.
GPT-4 Technical Report
TL;DR: GPT-4 is a Transformer-based model, pre-trained to predict the next token in a document, that accepts image and text inputs and produces text outputs.
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron,Louis Martin,Kevin R. Stone,Amjad Almahairi,Soumya Batra,Prajjwal Bhargava,Shruti Bhosale,Daniel M. Bikel,Lukas Blecher,Cristian Canton-Ferrer,Moya Chen,Guillem Cucurull,David Esiobu,Jude Fernandes,Cynthia Gao,Vedanuj Goswami,Naman Goyal,Anthony S. Hartshorn,Saghar Hosseini,Rui Hou,Hakan Inan,Marcin Kardas,Viktor Kerkez,Madian Khabsa,Isabel M. Kloumann,A. V. Korenev,Punit Singh Koura,Marie-Anne Lachaux,Thibaut Lavril,Diana Liskovich,Yinghai Lu,Yuning Mao,Xavier Martinet,Todor Mihaylov,Pushkar Mishra,Igor Molybog,Yixin Nie,Andrew M. Poulton,Jeremy Reizenstein,Rashi Rungta,Kalyan Saladi,Alan Schelten,Eric A. Smith,R. Subramanian,Xia Tan,Binh Tang,Ross Taylor,Adina Williams,Zhengxu Yan,Iliyan Radev Zarov,Yuchen Zhang,Angela Fan,Melanie Rae Kambadur,Sharan Narang,Aurélien Rodriguez,Robert Stojnic,Sergey Edunov,Thomas Scialom +57 more
- 18 Jul 2023
TL;DR: This article developed and released Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Journal Article
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery,Sharan Narang,Jacob Devlin,Maarten Bosma,Gaurav Mishra,Adam Roberts,Paul Barham,Hyung Won Chung,Charles Sutton,Sebastian Gehrmann,Parker Schuh,Kensen Shi,Sasha Tsvyashchenko,Joshua Maynez,Abhishek Rao,Parker Barnes,Yi Tay,Noam Shazeer,Velu Prabhakaran,Emily Reif,Nan Du,B. C. Hutchinson,Reiner Pope,James Bradbury,Jacob Austin,Michael Isard,Guy Gur-Ari,Peng Yin,Toju Duke,Anselm Levskaya,Sanjay Ghemawat,Sunipa Dev,Henryk Michalewski,Xavier Garcia,Vedant Misra,Kevin Robinson,L Fedus,Denny Zhou,Daphne Ippolito,David Luan,Hyeontaek Lim,Barret Zoph,Alexander Spiridonov,Ryan Sepassi,David Dohan,Shivani Agrawal,Mark Omernick,Andrew M. Dai,Thanumalayan Sankaranarayana Pillai,Marie Pellat,Aitor Lewkowycz,Erica Oliveira Moreira,Rewon Child,Oleksandr Polozov,Katherine Lee,Zong Tuan Zhou,Xuezhi Wang,Brennan Saeta,Mark Díaz,Orhan Firat,M. Catasta,Jason Loh Seong Wei,Kathleen S. Meier-Hellstern,Douglas Eck,Jeffrey Dean,Slav Petrov,Noah Fiedel +66 more
TL;DR: PaLM, a 540-billion parameter, densely activated Transformer language model, achieves breakthrough performance, outperforming the state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark.
Segment Anything
Alexander Kirillov,Eric Mintun,Nikhila Ravi,Hanzi Mao,Laura Gustafson,Tete Xiao,Spencer Whitehead,Alexander C. Berg,Wan-Yen Lo,Piotr Dollár,Ross Girshick +10 more
TL;DR: The Segment Anything (SA) project introduces the largest dataset for image segmentation, with over 1 billion masks on 11M licensed and privacy-preserving images, together with a model that is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks.
References
•Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma,Jimmy Ba +1 more
- 01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
•Proceedings Article
Attention is All you Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Lukasz Kaiser,Illia Polosukhin +7 more
- 12 Jun 2017
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
•Proceedings Article
Language Models are Few-Shot Learners
Tom B. Brown,Benjamin Mann,Nick Ryder,Melanie Subbiah,Jared Kaplan,Prafulla Dhariwal,Arvind Neelakantan,Pranav Shyam,Girish Sastry,Amanda Askell,Sandhini Agarwal,Ariel Herbert-Voss,Gretchen Krueger,Thomas Henighan,Rewon Child,Aditya Ramesh,Daniel M. Ziegler,Jeffrey Wu,Clemens Winter,Christopher Hesse,Mark Chen,Eric Sigler,Mateusz Litwin,Scott Gray,Benjamin Chess,Jack Clark,Christopher Berner,Samuel McCandlish,Alec Radford,Ilya Sutskever,Dario Amodei +30 more
- 28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
•Posted Content
Decoupled Weight Decay Regularization
Ilya Loshchilov,Frank Hutter +1 more
TL;DR: This work proposes a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function, and provides empirical evidence that this modification substantially improves Adam's generalization performance.
•Posted Content
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel,Noam Shazeer,Adam Roberts,Katherine Lee,Sharan Narang,Michael Matena,Yanqi Zhou,Wei Li,Peter J. Liu +8 more
TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.