Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Open Access • Posted Content

- 09 Apr 2021

15

TL;DR: This paper shows how different types of parallelism (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters, and proposes a novel interleaved pipeline parallelism schedule that can improve throughput by 10+% with a memory footprint comparable to existing approaches.

Abstract: Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress. In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of theoretical peak. Our code is open sourced at this https URL.
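The headline numbers in the abstract can be sanity-checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions not stated in the abstract: a per-GPU peak of 312 teraFLOP/s (the half-precision peak usually quoted for an NVIDIA A100), a hypothetical split of the 3072 GPUs into tensor, pipeline, and data parallel sizes of 8, 64, and 6, a hypothetical count of 512 microbatches, and the commonly cited pipeline bubble estimate of (p - 1) / (v * m) for p stages, m microbatches, and v interleaved model chunks per device. The names `ParallelConfig`, `per_gpu_teraflops`, and `bubble_fraction` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    """Hypothetical container for the three parallelism degrees."""
    tensor: int    # GPUs that split each layer's matrices
    pipeline: int  # number of pipeline stages
    data: int      # number of model replicas

    @property
    def world_size(self) -> int:
        # The degrees compose multiplicatively into the total GPU count.
        return self.tensor * self.pipeline * self.data


def per_gpu_teraflops(aggregate_petaflops: float, num_gpus: int) -> float:
    """Convert aggregate petaFLOP/s into per-GPU teraFLOP/s."""
    return aggregate_petaflops * 1000.0 / num_gpus


def bubble_fraction(pipeline: int, microbatches: int, chunks: int = 1) -> float:
    """Assumed estimate of idle pipeline time relative to ideal compute time.

    chunks=1 models a non-interleaved schedule; chunks>1 models an
    interleaved schedule with that many model chunks per device.
    """
    return (pipeline - 1) / (chunks * microbatches)


# Figures from the abstract: 502 petaFLOP/s on 3072 GPUs.
achieved = per_gpu_teraflops(502.0, 3072)

# ASSUMPTION: 312 teraFLOP/s peak per GPU (not stated in the abstract).
ASSUMED_PEAK_TERAFLOPS = 312.0
fraction_of_peak = achieved / ASSUMED_PEAK_TERAFLOPS

# ASSUMPTION: an illustrative split of 3072 GPUs, not taken from the abstract.
config = ParallelConfig(tensor=8, pipeline=64, data=6)

# ASSUMPTION: 512 microbatches per pipeline is a hypothetical value.
plain = bubble_fraction(config.pipeline, microbatches=512)
interleaved = bubble_fraction(config.pipeline, microbatches=512, chunks=2)

print(f"per-GPU throughput: {achieved:.1f} teraFLOP/s")
print(f"fraction of assumed peak: {fraction_of_peak:.1%}")
print(f"world size: {config.world_size}")
print(f"bubble fraction: {plain:.3f} plain, {interleaved:.3f} interleaved")
```

Under these assumptions the per-GPU figure comes to roughly 163 teraFLOP/s, which is consistent with the 52% of theoretical peak reported in the abstract. The bubble estimate also suggests why interleaving can help: splitting each device's layers into more chunks shrinks the idle fraction proportionally, at the cost of extra communication.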

Citations

•Posted Content

DEMix Layers: Disentangling Domains for Modular Language Modeling

Suchin Gururangan, +4 more

- 11 Aug 2021

- arXiv: Computation and Language

TL;DR: The authors introduce a domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text, which makes the LM modular: experts can be mixed, added or removed after initial training.

42

Journal Article•10.18653/v1/2022.findings-emnlp.54

What Language Model to Train if You Have One Million GPU Hours?

Teven Le Scao, +17 more

- 01 Jan 2022

TL;DR: The authors look for the best architecture and training setup for a large language model given a budget of 1 million GPU hours, using an ablation study at the billion-parameter scale that compares different modeling practices and their impact on zero-shot generalization.

26

•Posted Content•10.48550/arxiv.2303.05759

An Overview on Language Models: Recent Developments and Outlook

10 Mar 2023

TL;DR: The authors provide an overview of the relationship between pre-trained language models and conventional language models, including linguistic units, architectures, training methods, evaluation methods, and applications, and shed light on the future directions of language modeling in the pre-training era.

14

Journal Article•10.1016/j.neucom.2024.128089

BC4LLM: A Perspective of Trusted Artificial Intelligence When Blockchain Meets Large Language Models

Haoxiang Luo, +2 more

- 01 Jun 2024

- Neurocomputing

9

Journal Article•10.18653/v1/2022.findings-emnlp.240

AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

Se Jung Kwon, +9 more

- 01 Jan 2022

TL;DR: AlphaTuning combines parameter-efficient adaptation and model compression for large-scale language models, achieving significant compression ratios and improved inference efficiency.

7

References

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

Proceedings Article•10.18653/V1/N19-1423

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

24.6K

•Proceedings Article

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, +20 more

- 01 Jan 2019

TL;DR: This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

10.3K

•Posted Content

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang, +5 more

- 19 Jun 2019

- arXiv: Computation and Language

TL;DR: XLNet is a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and overcomes the limitations of BERT thanks to its autoregressive formulation.

5.5K

•Posted Content

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, +8 more

- 08 Jun 2017

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization, enabling visual recognition models to be trained on internet-scale data with high efficiency.

4K