Improving Automatic Parallel Training via Balanced Memory Workload Optimization

doi:10.48550/arXiv.2307.02031

Journal Article•10.48550/arXiv.2307.02031

Yujie Wang, +5 more

- 05 Jul 2023

- arXiv.org

- Vol. abs/2307.02031

5

TL;DR: Galvatron-BMW integrates multiple prevalent parallelism dimensions, automatically identifies the most efficient hybrid parallelism strategy, and uses a decision-tree approach to decompose and prune the search space based on intuitive insights.

Abstract: Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design distributed training plans or limit parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To effectively navigate this vast search space, we employ a decision tree approach for decomposition and pruning based on intuitive insights. We further utilize a dynamic programming search algorithm to derive the optimal plan. Moreover, to improve resource utilization and enhance system efficiency, we propose a bi-objective optimization workflow that focuses on workload balance. Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints. Across all tested scenarios, Galvatron-BMW consistently achieves superior system throughput, surpassing previous approaches that rely on limited parallelism strategies.
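To make the search described in the abstract more concrete, here is a minimal sketch of a dynamic programming search that picks one strategy per layer so that total time is minimized under a memory budget. It is an illustration under stated assumptions, not the authors' algorithm: per-layer costs are assumed to be additive, costs of switching strategies between layers are ignored, and the function name `search_plan`, the strategy names, and all cost values are hypothetical.

```python
from math import inf


def search_plan(layer_options, mem_budget):
    """Pick one strategy per layer to minimize total time within a memory budget.

    layer_options[i] is a list of (name, time_cost, mem_cost) tuples for layer i.
    mem_cost and mem_budget are non-negative integers (hypothetical units).
    Returns (total_time, plan), or (inf, None) if no plan fits the budget.
    """
    # best[m] = (min total time, plan) for the layers processed so far,
    # using at most m units of memory. With zero layers, the cost is 0.
    best = [(0.0, [])] * (mem_budget + 1)
    for options in layer_options:
        nxt = [(inf, None)] * (mem_budget + 1)
        for m in range(mem_budget + 1):
            for name, t, mem in options:
                if mem > m:
                    continue  # this strategy alone exceeds the allowance m
                prev_time, prev_plan = best[m - mem]
                if prev_plan is None:
                    continue  # earlier layers cannot fit in the remainder
                cand = prev_time + t
                if cand < nxt[m][0]:
                    nxt[m] = (cand, prev_plan + [name])
        best = nxt
    return best[mem_budget]


# Hypothetical candidates: faster strategies consume more memory.
options = [("fast_heavy", 1.0, 4), ("balanced", 2.0, 2), ("slow_light", 3.0, 1)]
layers = [options, options]  # two identical layers

total_time, plan = search_plan(layers, mem_budget=6)
print(total_time, plan)
```

Under a looser budget the search falls back to the fastest strategy for every layer, and under a budget that no combination fits it reports infeasibility, which loosely mirrors the out-of-memory cases that arise under tight memory constraints.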

Figures

Fig. 4: Performance of 4-way 1F1B-Flush pipelines with different partition plans on A100 GPUs. The global batch size is 32 for BERT-Huge-48 and 64 for T5-512/4-48 (see more details of the models in Section VII-A), and the microbatch number is 8. Bars (from left to right) represent pipeline stages 1 through 4: height indicates memory consumption and width indicates normalized time cost. The number of layers per stage, the balance degrees, and the throughput are also shown.

TABLE II: Comparison with 8 GPUs under different memory constraints. The maximum throughput (samples/s) of each strategy is given, along with the corresponding batch size in the bracket, and OOM denotes Out-Of-Memory.

TABLE IV: Ablation study on bi-objective optimization of workload balance on a high-performance cluster (16 A100 GPUs). The performance of bi-objective optimization (1F1B+Bi-obj) is compared with that of memory-balanced partition (1F1B+Mem) and time-balanced partition (1F1B+Time). The pipeline partition p is shown next to the system throughput and training batch size.
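As a rough illustration of why a time-balanced and a memory-balanced partition can differ, the sketch below finds the contiguous split of layers into pipeline stages that minimizes the largest per-stage cost, and applies it once to time costs and once to memory costs. It covers only the two single-objective baselines named in the caption, not the paper's bi-objective workflow. The function name `balanced_partition` and all cost values are hypothetical, and the brute-force enumeration is only practical for small layer counts.

```python
from itertools import combinations


def balanced_partition(costs, stages):
    """Split layers into contiguous stages, minimizing the largest stage cost.

    costs[i] is the (hypothetical) cost of layer i, either time or memory.
    Returns (layers_per_stage, peak_stage_cost). Brute force over cut points.
    """
    n = len(costs)
    best_plan, best_peak = None, float("inf")
    # Choose stages-1 cut positions between layers 1..n-1.
    for cuts in combinations(range(1, n), stages - 1):
        bounds = (0, *cuts, n)
        peak = max(sum(costs[a:b]) for a, b in zip(bounds, bounds[1:]))
        if peak < best_peak:
            best_plan = [b - a for a, b in zip(bounds, bounds[1:])]
            best_peak = peak
    return best_plan, best_peak


# Hypothetical per-layer costs for an 8-layer model on a 4-stage pipeline.
time_costs = [1, 1, 1, 1, 1, 1, 1, 1]  # uniform compute per layer
mem_costs = [4, 4, 3, 3, 2, 2, 1, 1]   # earlier layers assumed heavier

print(balanced_partition(time_costs, 4))
print(balanced_partition(mem_costs, 4))
```

With these made-up numbers the two objectives select different partitions, so balancing only one of them leaves the other skewed; that tension is what motivates comparing both against a bi-objective approach.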

Fig. 7: Examples of the optimal parallelism plans given by Galvatron-BMW. Cases A and B are for BERT-Huge-32 and Swin-Huge-32 under 8 GB memory budgets on 8 low-performance GPUs, and case C is for T5-512/4-32 under 8 GB memory budgets on 16 low-performance GPUs and 16 high-performance GPUs. Each yellow rectangle denotes an encoder layer, and its height and width represent parameter size and activation size respectively. The number ×N under a rectangle means applying a strategy to N consecutive layers.

Citations

Journal Article•10.48550/arxiv.2407.20018

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey

Jiangfei Duan, +15 more

- 29 Jul 2024

TL;DR: This survey explores recent advancements in training large language models on distributed infrastructures, covering innovations in AI accelerators, networking, storage, and scheduling, as well as parallelism strategies and optimizations for computation, communication, and memory.

1

Journal Article•10.48550/arxiv.2412.07894

Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences

Haoyang Li, +7 more

- 10 Dec 2024

TL;DR: Hydraulis optimizes large Transformer model training by jointly addressing data sampling and packing imbalances through dynamic heterogeneous parallel strategies and a two-stage data assignment approach, achieving 1.32-2.66 times better performance than existing systems.

Journal Article•10.1016/j.engappai.2025.112734

A survey on closed-loop intelligent frameworks for parallel training of deep neural networks

Zhiyuan Ren, +3 more

- 15 Oct 2025

- Engineering Applications of Artificial I...

Journal Article•10.48550/arxiv.2409.01143

FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment

Ran Yan, +5 more

- 02 Sep 2024

TL;DR: This paper proposes FlashFlex, a system that enables large language model training over heterogeneous GPUs by flexibly partitioning computations across data-, pipeline-, and tensor parallelism, achieving comparable performance to homogeneous settings with optimal resource utilization.

Journal Article•10.48550/arXiv.2303.02868

Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent

Xiaonan Nie, +7 more

- 06 Mar 2023

- arXiv.org

TL;DR: Angel-PTM is a productive deep learning system designed for pre-training and fine-tuning Transformer models; it can efficiently train extremely large-scale models using hierarchical memory.

References

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.

94.2K

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

•Posted Content

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, +11 more

- 22 Oct 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

36.9K

•Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019

- arXiv: Computation and Language

TL;DR: BERT is found to have been significantly undertrained; it can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

26.2K

•Posted Content

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, +20 more

- 03 Dec 2019

- arXiv: Learning

TL;DR: PyTorch is a machine learning library that provides an imperative and Pythonic programming style, which makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.

25.9K
