Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia
- 14 Nov 2021
TL;DR: The authors propose a novel interleaved pipelining schedule that improves throughput by 10+% with a memory footprint comparable to existing approaches, allowing training iterations on a model with 1 trillion parameters to run at 502 petaFLOP/s on 3072 GPUs.
Abstract: Large language models have led to state-of-the-art accuracies across several tasks. However, training these models efficiently is challenging because: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to scaling issues at thousands of GPUs. In this paper, we show how tensor, pipeline, and data parallelism can be composed to scale to thousands of GPUs. We propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak).
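The throughput figures in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: it uses just the three numbers quoted above (502 petaFLOP/s, 3072 GPUs, 52% of peak), and the function and constant names are hypothetical, not taken from the paper or from Megatron-LM. Because the 52% figure is rounded, the implied peak is approximate.

```python
# Figures quoted in the abstract.
AGGREGATE_PFLOPS = 502.0   # petaFLOP/s across the whole cluster
NUM_GPUS = 3072
FRACTION_OF_PEAK = 0.52    # "52% of theoretical peak" (rounded in the source)


def per_gpu_tflops(aggregate_pflops: float, num_gpus: int) -> float:
    """Average achieved throughput per GPU, in teraFLOP/s."""
    return aggregate_pflops * 1000.0 / num_gpus


def implied_peak_tflops(achieved_tflops: float, fraction_of_peak: float) -> float:
    """Per-GPU peak implied by an achieved throughput and a utilization fraction."""
    return achieved_tflops / fraction_of_peak


achieved = per_gpu_tflops(AGGREGATE_PFLOPS, NUM_GPUS)
peak = implied_peak_tflops(achieved, FRACTION_OF_PEAK)

print(f"achieved per GPU: {achieved:.1f} teraFLOP/s")   # 163.4
print(f"implied peak per GPU: {peak:.1f} teraFLOP/s")   # 314.3
```

So each GPU sustains roughly 163 teraFLOP/s on average, which at 52% utilization implies a per-GPU theoretical peak a little above 310 teraFLOP/s.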