Galvatron

doi:10.14778/3570690.3570697

Open Access • Journal Article


Xupeng Miao, +6 more

- 01 Nov 2022

- Proceedings of the VLDB Endowment

- Vol. 16, Iss: 3, pp 470-479

1

TL;DR: Galvatron uses a decision tree to decompose and prune the search space of hybrid parallelism strategies, then applies a dynamic programming search algorithm to generate the optimal plan. It achieves higher system throughput than previous work that supports only limited parallelism.

Abstract: Transformer models have achieved state-of-the-art performance in various application domains and have gradually become the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging because of the large number of parallelism choices. Existing DL systems either rely on manual effort to make distributed training plans or apply parallelism combinations within a very limited search space. In this work, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such an exceptionally large search space, we 1) use a decision tree to decompose and prune the space based on some reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron can automatically perform distributed training under different GPU memory budgets. In all evaluated scenarios, Galvatron achieves superior system throughput compared to previous work with limited parallelism.
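The abstract's second step, a dynamic programming search under a GPU memory budget, can be illustrated with a small sketch. This is not Galvatron's actual algorithm or cost model. It is a minimal knapsack-style example that assumes each layer has a hypothetical list of candidate strategies with made-up time and memory costs, that memory is measured in integer units, and that switching strategies between layers costs nothing. The names `Strategy` and `search_plan` are illustrative only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-layer candidate: (strategy name, time cost, memory units).
Strategy = Tuple[str, float, int]


def search_plan(
    layers: List[List[Strategy]], budget: int
) -> Optional[Tuple[float, List[str]]]:
    """Pick one strategy per layer, minimizing total time within a memory budget.

    Returns (total_time, chosen_names), or None if no plan fits the budget.
    """
    # best[m] = cheapest (total_time, names) that uses exactly m memory units.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for candidates in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (time_so_far, names) in best.items():
            for name, time_cost, mem in candidates:
                total_mem = used + mem
                if total_mem > budget:
                    continue  # prune partial plans that exceed the budget
                plan = (time_so_far + time_cost, names + [name])
                if total_mem not in nxt or plan[0] < nxt[total_mem][0]:
                    nxt[total_mem] = plan
        best = nxt
        if not best:
            return None  # no strategy for this layer fits in the remaining memory
    return min(best.values(), key=lambda plan: plan[0])


# Two identical layers; "dp" is faster but uses more memory than "tp".
example_layers = [
    [("dp", 1.0, 4), ("tp", 2.0, 2)],
    [("dp", 1.0, 4), ("tp", 2.0, 2)],
]
print(search_plan(example_layers, budget=6))
```

With a tighter budget the search falls back to the slower, more memory-efficient strategy, and with a budget too small for any combination it returns `None`. A realistic search would also need to account for costs that depend on neighboring layers, which this sketch deliberately ignores.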


Citations

Journal Article • 10.1145/3588964

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

26 May 2023

TL;DR: FlexMoE proposes a dynamic expert management and device placement mechanism to solve the routing imbalance and fluctuation problems in sparsely-gated Mixture-of-Experts (MoE) models.


10

References

Proceedings Article • 10.1145/3357384.3357895

BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer

Fei Sun, +6 more

- 03 Nov 2019

TL;DR: BERT4Rec employs deep bidirectional self-attention to model user behavior sequences, predicting randomly masked items in the sequence by jointly conditioning on their left and right context.


2K

Proceedings Article • 10.1145/3341301.3359646

PipeDream: generalized pipeline parallelism for DNN training

Deepak Narayanan, +7 more

- 27 Oct 2019

TL;DR: PipeDream is a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and to reduce the amount of communication when possible.


844

Proceedings Article • 10.1145/3458817.3476209

Efficient large-scale language model training on GPU clusters using Megatron-LM

Deepak Narayanan, +11 more

- 14 Nov 2021

TL;DR: The authors propose a novel interleaved pipelining schedule that can improve throughput by 10+% with a memory footprint comparable to existing approaches, allowing them to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs.


515

Journal Article • 10.14778/3415478.3415530

PyTorch distributed: experiences on accelerating data parallel training

Shen Li, +10 more

- 01 Aug 2020

TL;DR: Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.


297

Proceedings Article • 10.1145/3302424.3303953

Supporting Very Large Models using Automatic Dataflow Graph Partitioning

Minjie Wang, +2 more

- 24 Jul 2018

- arXiv: Distributed, Parallel, and Cluste...

TL;DR: Tofu partitions a dataflow graph of fine-grained tensor operators so that it works transparently with a general-purpose deep learning platform such as MXNet.


150