Efficient large-scale language model training on GPU clusters using Megatron-LM

doi:10.1145/3458817.3476209

Open Access•Proceedings Article•10.1145/3458817.3476209

- 14 Nov 2021

515

TL;DR: In this paper, the authors propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches, allowing them to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs.
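
The aggregate figure in the TL;DR can be turned into a per-GPU number with simple arithmetic. The sketch below is only an illustration: the function name `per_gpu_teraflops` is hypothetical, and the 312 teraFLOP/s peak is an assumption (the dense half-precision peak of an NVIDIA A100), since the summary above does not name the GPU model.

```python
def per_gpu_teraflops(aggregate_petaflops: float, num_gpus: int) -> float:
    """Convert an aggregate petaFLOP/s figure into teraFLOP/s per GPU."""
    return aggregate_petaflops * 1e3 / num_gpus


# Figures taken from the TL;DR: 502 petaFLOP/s on 3072 GPUs.
per_gpu = per_gpu_teraflops(502, 3072)

# ASSUMPTION: 312 teraFLOP/s peak per GPU (A100, dense half precision).
ASSUMED_PEAK_TERAFLOPS = 312.0
utilization = per_gpu / ASSUMED_PEAK_TERAFLOPS

print(f"{per_gpu:.1f} teraFLOP/s per GPU")
print(f"{utilization:.0%} of the assumed peak")
```

Under these assumptions the result is roughly 163 teraFLOP/s per GPU, or about half of the assumed peak.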

Citations

Journal Article

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, +66 more

- 05 Apr 2022

- arXiv.org

TL;DR: PaLM, a 540-billion-parameter, densely activated Transformer language model, achieves breakthrough performance, outperforming the state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark.

4K

Journal Article•10.48550/arXiv.2211.05100

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, +386 more

- 09 Nov 2022

- arXiv.org

TL;DR: BLOOM is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).

1.4K

Journal Article•10.48550/arXiv.2305.18290

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, +5 more

- 29 May 2023

- arXiv.org

TL;DR: The authors leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data.

1.3K

Journal Article•10.48550/arXiv.2206.10789

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, +16 more

- 22 Jun 2022

TL;DR: The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge; the paper also explores and highlights limitations of the models.

685

References

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

•Posted Content

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, +8 more

- 23 Oct 2019

- arXiv: Learning

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

12.9K

•Posted Content

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang, +5 more

- 19 Jun 2019

- arXiv: Computation and Language

TL;DR: XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.

5.5K

•Posted Content

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, +8 more

- 08 Jun 2017

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization and enable training visual recognition models on internet-scale data with high efficiency.

4K