Galvatron
TL;DR: Galvatron uses a decision tree to decompose and prune the space of hybrid parallelism strategies, then applies a dynamic programming search algorithm to generate the optimal plan. It achieves higher system throughput than previous work with limited parallelism.
Abstract: Transformer models have achieved state-of-the-art performance in various application domains and are gradually becoming the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging because of the large number of parallelism choices. Existing DL systems either rely on manual effort to make distributed training plans or apply parallelism combinations within a very limited search space. In this work, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such an exceptionally large search space, we 1) use a decision tree to decompose and prune the space based on some reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron can automatically perform distributed training under different GPU memory budgets. In all evaluated scenarios, Galvatron achieves superior system throughput compared to previous work with limited parallelism.
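The dynamic programming step can be illustrated with a minimal sketch. It is not Galvatron's actual algorithm: it assumes each layer independently picks one strategy with a known (time, memory) cost, that memory is measured in integer units, and that switching strategies between layers costs nothing. The function name `search_plan`, the strategy names, and all cost numbers are hypothetical.

```python
from math import inf


def search_plan(layer_costs, memory_budget):
    """Pick one strategy per layer to minimize total time within a memory budget.

    layer_costs: list with one dict per layer, mapping a (hypothetical)
        strategy name to a (time, memory) pair; memory is an integer.
    memory_budget: total integer memory units available.
    Returns (total_time, plan), or (inf, None) if no plan fits.
    """
    # best[m] = (min time, plan) for the layers seen so far using at most m memory.
    best = [(0.0, [])] * (memory_budget + 1)
    for costs in layer_costs:
        nxt = [(inf, None)] * (memory_budget + 1)
        for m in range(memory_budget + 1):
            for name, (time, mem) in costs.items():
                if mem > m:
                    continue  # this strategy alone exceeds the remaining budget
                prev_time, prev_plan = best[m - mem]
                if prev_plan is None:
                    continue  # earlier layers cannot fit in the leftover memory
                if prev_time + time < nxt[m][0]:
                    nxt[m] = (prev_time + time, prev_plan + [name])
        best = nxt
    return best[memory_budget]


# Hypothetical per-layer costs: faster strategies use more memory.
layer = {"fast_heavy": (1.0, 4), "balanced": (2.0, 2), "slow_light": (3.0, 1)}
total_time, plan = search_plan([layer, layer], memory_budget=6)
print(total_time, plan)
```

With a looser budget the search can afford the faster, memory-hungry strategy on every layer. With a tighter one it trades time for memory on some layers, which mirrors the abstract's point about training under different GPU memory budgets.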
Citations
FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
26 May 2023
TL;DR: FlexMoE proposes a dynamic expert management and device placement mechanism to solve the routing imbalance and fluctuation problems in sparsely gated Mixture-of-Experts (MoE) models.
References
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, Peng Jiang
- 03 Nov 2019
TL;DR: BERT4Rec employs deep bidirectional self-attention to model user behavior sequences, predicting randomly masked items in the sequence by jointly conditioning on their left and right context.
PipeDream: generalized pipeline parallelism for DNN training
Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, Matei Zaharia
- 27 Oct 2019
TL;DR: PipeDream is a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible.
Efficient large-scale language model training on GPU clusters using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia
- 14 Nov 2021
TL;DR: In this paper, the authors propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches, allowing them to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs.
PyTorch distributed: experiences on accelerating data parallel training
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeffrey Matthew Smith, Brian Vaughan, Pritam Damania, Soumith Chintala
- 01 Aug 2020
TL;DR: Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
Supporting Very Large Models using Automatic Dataflow Graph Partitioning
TL;DR: Tofu partitions a dataflow graph of fine-grained tensor operators in order to work transparently with a general-purpose deep learning platform such as MXNet.