Open Access, Posted Content
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia
TL;DR: This paper shows how tensor, pipeline, and data parallelism can be composed to scale training to thousands of GPUs and models with trillions of parameters, and proposes a novel interleaved pipeline parallelism schedule that improves throughput by 10+% with a memory footprint comparable to existing approaches.
Abstract: Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress.
In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by 10+% with a memory footprint comparable to existing approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs, with an achieved per-GPU throughput of 52% of theoretical peak. Our code is open sourced at this https URL.
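The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the three constants come from the abstract, while the function names and the example tensor-parallel and pipeline-parallel sizes (8 and 64) are assumptions chosen for illustration and are not stated on this page. It shows the per-GPU throughput implied by the aggregate number, the hardware peak that 52% utilization would imply, and how the three parallelism degrees multiply to give the total GPU count.

```python
# Sanity-check of the throughput figures quoted in the abstract.
AGGREGATE_FLOPS = 502e15   # 502 petaFLOP/s (from the abstract)
NUM_GPUS = 3072            # (from the abstract)
FRACTION_OF_PEAK = 0.52    # 52% of theoretical peak (from the abstract)


def per_gpu_flops(aggregate_flops: float, num_gpus: int) -> float:
    """Average achieved throughput of one GPU, in FLOP/s."""
    return aggregate_flops / num_gpus


def implied_peak_flops(achieved_flops: float, fraction_of_peak: float) -> float:
    """Theoretical per-GPU peak implied by an achieved rate and a utilization."""
    return achieved_flops / fraction_of_peak


def data_parallel_size(num_gpus: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Number of model replicas when each replica spans
    tensor_parallel * pipeline_parallel GPUs."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if num_gpus % gpus_per_replica != 0:
        raise ValueError("GPU count must be a multiple of tensor * pipeline size")
    return num_gpus // gpus_per_replica


if __name__ == "__main__":
    achieved = per_gpu_flops(AGGREGATE_FLOPS, NUM_GPUS)
    peak = implied_peak_flops(achieved, FRACTION_OF_PEAK)
    print(f"achieved per GPU: {achieved / 1e12:.1f} teraFLOP/s")
    print(f"implied peak per GPU: {peak / 1e12:.1f} teraFLOP/s")

    # Hypothetical split, chosen only to illustrate the product rule:
    # tensor-parallel size 8 and pipeline-parallel size 64.
    replicas = data_parallel_size(NUM_GPUS, tensor_parallel=8, pipeline_parallel=64)
    print(f"data-parallel replicas: {replicas}")
```

Under these assumptions each GPU sustains roughly 163 teraFLOP/s, which corresponds to a theoretical peak of about 314 teraFLOP/s per device. The product rule also makes the configuration trade-off concrete: for a fixed GPU count, increasing the tensor-parallel or pipeline-parallel size leaves fewer data-parallel replicas.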