Journal Article | DOI: 10.48550/arXiv.2307.02031
Improving Automatic Parallel Training via Balanced Memory Workload Optimization
TL;DR: Galvatron-BMW integrates multiple prevalent parallelism dimensions, automatically identifies the most efficient hybrid parallelism strategy, and uses a decision-tree approach, guided by intuitive insights, to decompose and prune the search space.
Abstract: Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design distributed training plans or limit parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To effectively navigate this vast search space, we employ a decision tree approach for decomposition and pruning based on intuitive insights. We further utilize a dynamic programming search algorithm to derive the optimal plan. Moreover, to improve resource utilization and enhance system efficiency, we propose a bi-objective optimization workflow that focuses on workload balance. Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints. Across all tested scenarios, Galvatron-BMW consistently achieves superior system throughput, surpassing previous approaches that rely on limited parallelism strategies.
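As a rough illustration of the dynamic programming search mentioned in the abstract, the sketch below picks one parallelism strategy per layer so that total time is minimized while total memory stays within a budget. It is a toy under stated assumptions, not Galvatron-BMW's actual algorithm: the function name `search_plan`, the strategy labels `"dp"` and `"tp"`, and all cost numbers are hypothetical, and memory is simplified to integer units that add up across layers.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-layer cost table: strategy name -> (time cost, memory units).
Strategy = Tuple[float, int]


def search_plan(
    layers: List[Dict[str, Strategy]], memory_budget: int
) -> Optional[Tuple[float, List[str]]]:
    """Return (total_time, per-layer strategies), or None if nothing fits."""
    # best[m] = cheapest (time, plan) found so far that uses exactly m memory units.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for options in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (time_so_far, plan) in best.items():
            for name, (t, mem) in options.items():
                new_used = used + mem
                if new_used > memory_budget:
                    continue  # this choice would exceed the memory budget
                cand = (time_so_far + t, plan + [name])
                if new_used not in nxt or cand[0] < nxt[new_used][0]:
                    nxt[new_used] = cand
        best = nxt
        if not best:
            return None  # no strategy fits for this layer (out of memory)
    return min(best.values(), key=lambda entry: entry[0])


# Two identical layers; "dp" is faster but uses more memory than "tp" (made-up numbers).
toy_layers = [
    {"dp": (1.0, 4), "tp": (2.0, 2)},
    {"dp": (1.0, 4), "tp": (2.0, 2)},
]
print(search_plan(toy_layers, memory_budget=8))
print(search_plan(toy_layers, memory_budget=6))
print(search_plan(toy_layers, memory_budget=3))
```

With a generous budget the search can afford the faster, memory-hungry option on every layer; as the budget shrinks it mixes in the slower, leaner option, and it reports `None` when no combination fits, which mirrors the OOM entries in the comparison tables.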
Figures

Fig. 4: Performance of 4-way 1F1B-Flush pipelines with different partition plans on A100 GPUs. The global batch size is 32 for BERT-Huge-48 and 64 for T5-512/4-48 (see more details of the models in Section VII-A), and the microbatch number is 8. Bars (from left to right) represent pipeline stages 1 through 4: bar height indicates memory consumption and bar width indicates normalized time cost. The number of layers, balance degrees, and throughput are also shown.
TABLE II: Comparison with 8 GPUs under different memory constraints. The maximum throughput (samples/s) of each strategy is given, along with the corresponding batch size in the bracket, and OOM denotes Out-Of-Memory. 
TABLE IV: Ablation study on bi-objective optimization of workload balance on high-performance cluster (16 A100). The performance of bi-objective optimization (1F1B+Bi-obj) is compared with that of memory-balanced partition (1F1B+Mem) and time-balanced partition (1F1B+Time). The pipeline partition p is shown next to the system throughput and training batch size. 
Fig. 7: Examples of the optimal parallelism plans given by Galvatron-BMW. Cases A and B are for BERT-Huge-32 and Swin-Huge-32 under 8 GB memory budgets on 8 low-performance GPUs, and case C is for T5-512/4-32 under 8 GB memory budgets on 16 low-performance GPUs and 16 high-performance GPUs. Each yellow rectangle denotes an encoder layer, and its height and width represent parameter size and activation size, respectively. The number ×N under a rectangle means that the strategy is applied to N consecutive layers.
TABLE V: Comparison with 64 GPUs. 
TABLE I: Statistics of Models
Citations
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun
- 29 Jul 2024
TL;DR: This survey explores recent advancements in training large language models on distributed infrastructures, covering innovations in AI accelerators, networking, storage, and scheduling, as well as parallelism strategies and optimizations for computation, communication, and memory.
Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences
Haoyang Li, Fangcheng Fu, Lin Sheng, Hao Ge, Xuanyu Wang, J.B. Niu, Jie Jiang, Bin Cui
- 10 Dec 2024
TL;DR: Hydraulis optimizes large Transformer model training by jointly addressing data sampling and packing imbalances through dynamic heterogeneous parallel strategies and a two-stage data assignment approach, achieving 1.32-2.66 times better performance than existing systems.
A survey on closed-loop intelligent frameworks for parallel training of deep neural networks
Zhiyuan Ren, Shijie Zhou, Dong Liu, Qihe Liu
FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan
- 02 Sep 2024
TL;DR: This paper proposes FlashFlex, a system that enables large language model training over heterogeneous GPUs by flexibly partitioning computations across data-, pipeline-, and tensor parallelism, achieving comparable performance to homogeneous settings with optimal resource utilization.
Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent
TL;DR: Angel-PTM is a productive deep learning system designed for pre-training and fine-tuning Transformer models; it can efficiently train extremely large-scale models using hierarchical memory.