Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

doi:10.14778/3570690.3570697

Open Access • Journal Article • 10.14778/3570690.3570697

Xupeng Miao, +6 more

- 25 Nov 2022

- Proceedings of the VLDB Endowment

- Vol. 16, Iss: 3, pp 470-479

40

TL;DR: Galvatron uses a decision tree to decompose and prune the search space based on some reasonable intuitions, then applies a dynamic programming search algorithm to generate the optimal plan, achieving higher system throughput than previous work with limited parallelism.

Citations

Journal Article•10.14778/3611479.3611527

How Large Language Models Will Disrupt Data Management

Raul Fernandez, +4 more

- 01 Jul 2023

- Proceedings of the VLDB Endowment

TL;DR: It is argued that the disruptive influence that LLMs will have on data management will come from two angles, one of which is that hard database problems, namely entity resolution, schema matching, data discovery, and query synthesis, hit a ceiling of automation because the system does not fully understand the semantics of the underlying data.

47

Book•10.1145/3600006.3613175

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling

Suhas Jayaram Subramanya, +5 more

- 23 Oct 2023

TL;DR: The Sia scheduler efficiently assigns heterogeneous deep learning cluster resources to elastic resource-adaptive jobs, and introduces a new scheduling formulation that can scale to the search-space sizes and intentionally match jobs and their configurations to GPU types and counts, while adapting to changes in cluster load and job mix over time.

18

Journal Article•10.14778/3598581.3598604

SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training

Xupeng Miao, +4 more

- 01 May 2023

- Proceedings of the VLDB Endowment

TL;DR: SDPipe is a semi-decentralized framework for heterogeneity-aware pipeline-parallel training that decentralizes the communication for model synchronization, which accounts for the largest proportion of synchronization overhead.

14

Journal Article•10.14201/adcaij.31704

Generative Artificial Intelligence: Fundamentals

Juan M. Corchado, +4 more

- 01 Dec 2023

- Advances in distributed computing and ar...

TL;DR: Generative language models have gained substantial traction, notably with the introduction of refined models aimed at more coherent user-AI interactions, principally conversational models. Their capabilities in data analytics, intrusion detection, and combating misinformation are laudable, yet the ethical and security implications concerning data privacy, surveillance, and potential misuse warrant judicious scrutiny.

10

Journal Article•10.1145/3588964

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

- 26 May 2023

TL;DR: FlexMoE proposes a dynamic expert management and device placement mechanism to address the routing imbalance and fluctuation problems in sparsely gated Mixture-of-Experts (MoE) models.

10

References

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

•Posted Content

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, +11 more

- 22 Oct 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

36.9K

•Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019

- arXiv: Computation and Language

TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

26.2K

•Posted Content

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, +20 more

- 03 Dec 2019

- arXiv: Learning

TL;DR: PyTorch is a machine learning library that provides an imperative, Pythonic programming style, which makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.

25.9K