Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
TL;DR: Galvatron uses a decision tree to decompose and prune the search space of hybrid parallelism strategies, then applies a dynamic programming search algorithm to generate the optimal plan, achieving higher system throughput than previous work with limited parallelism.
Abstract: Transformer models have achieved state-of-the-art performance in various application domains and have gradually become the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging because of the large number of parallelism choices. Existing DL systems either rely on manual effort to make distributed training plans or apply parallelism combinations within a very limited search space. In this work, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such an extremely large search space, we 1) use a decision tree to decompose and prune it based on some reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron can automatically perform distributed training under different GPU memory budgets. Across all evaluated scenarios, Galvatron always achieves superior system throughput compared to previous work with limited parallelism.
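To make the dynamic programming step in the abstract more concrete, here is a minimal sketch of a layer-wise search under a memory budget. It is an illustration, not Galvatron's actual algorithm: the function name `search_plan`, the strategy labels, and the per-layer time and memory numbers are all hypothetical, and the sketch ignores any cost of switching strategies between adjacent layers.

```python
from math import inf


def search_plan(layers, budget):
    """Pick one strategy per layer to minimize total time within a memory budget.

    layers: list of dicts mapping a (hypothetical) strategy name to a
            (time_cost, memory_cost) pair; memory_cost is an integer.
    budget: integer memory budget in the same unit as memory_cost.
    Returns (total_time, plan), or (inf, None) if no plan fits the budget.
    """
    # best[m] = (time, plan): fastest way to place the layers seen so far
    # while using exactly m units of memory.
    best = {0: (0.0, [])}
    for options in layers:
        nxt = {}
        for used, (time_so_far, plan) in best.items():
            for name, (dt, dm) in options.items():
                mem = used + dm
                if mem > budget:
                    continue  # prune states that exceed the budget
                cand = (time_so_far + dt, plan + [name])
                if mem not in nxt or cand[0] < nxt[mem][0]:
                    nxt[mem] = cand
        best = nxt
    if not best:
        return inf, None
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical per-layer costs: "data" is faster but uses more memory,
# "tensor" is slower but uses less.
LAYER = {"data": (1.0, 4), "tensor": (1.5, 2)}
```

With three identical layers, `search_plan([LAYER] * 3, budget=10)` selects "data" for two layers and "tensor" for one, for a total time of 3.5, because using "data" everywhere would need 12 units of memory. Lowering the budget to 5 makes every combination infeasible, and the function returns `(inf, None)`.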
Citations
How Large Language Models Will Disrupt Data Management
TL;DR: It is argued that LLMs will disrupt data management in part by addressing hard problems, namely entity resolution, schema matching, data discovery, and query synthesis, which have hit a ceiling of automation because the system does not fully understand the semantics of the underlying data.
47
Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
Suhas Jayaram Subramanya,Daiyaan Arfeen,Shouxu Lin,Aurick Qiao,Zhihao Jia,Gregory R. Ganger +5 more
- 23 Oct 2023
TL;DR: The Sia scheduler efficiently assigns heterogeneous deep learning cluster resources to elastic resource-adaptive jobs, and introduces a new scheduling formulation that can scale to the search-space sizes and intentionally match jobs and their configurations to GPU types and counts, while adapting to changes in cluster load and job mix over time.
18
SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training
TL;DR: SDPipe is a semi-decentralized framework for pipeline-parallel training that decentralizes the communication for model synchronization, which accounts for the largest proportion of synchronization overhead.
14
Generative Artificial Intelligence: Fundamentals
Juan M. Corchado,Sebastian López F.,J. Núñez V.,Raul Garcia S.,Pablo Chamoso +4 more
TL;DR: Generative language models have gained substantial traction, notably with the introduction of refined models aimed at more coherent user-AI interactions, principally conversational models. Their capabilities in data analytics, intrusion detection, and combating misinformation are laudable, yet the ethical and security implications concerning data privacy, surveillance, and potential misuse warrant judicious scrutiny.
10
FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
26 May 2023
TL;DR: FlexMoE proposes a dynamic expert management and device placement mechanism to solve the routing imbalance and fluctuation problems in sparsely-gated Mixture-of-Experts (MoE) models.
References
•Proceedings Article
Attention is All you Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Lukasz Kaiser,Illia Polosukhin +7 more
- 12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
81.7K
•Posted Content
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy,Lucas Beyer,Alexander Kolesnikov,Dirk Weissenborn,Xiaohua Zhai,Thomas Unterthiner,Mostafa Dehghani,Matthias Minderer,Georg Heigold,Sylvain Gelly,Jakob Uszkoreit,Neil Houlsby +11 more
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
•Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu,Myle Ott,Naman Goyal,Jingfei Du,Mandar Joshi,Danqi Chen,Omer Levy,Michael Lewis,Luke Zettlemoyer,Veselin Stoyanov +9 more
TL;DR: It is found that BERT was significantly undertrained and, with improved training, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
•Posted Content
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke,Sam Gross,Francisco Massa,Adam Lerer,James Bradbury,Gregory Chanan,Trevor Killeen,Zeming Lin,Natalia Gimelshein,Luca Antiga,Alban Desmaison,Andreas Kopf,Edward Z. Yang,Zachary DeVito,Martin Raison,Alykhan Tejani,Sasank Chilamkurthy,Benoit Steiner,Lu Fang,Junjie Bai,Soumith Chintala +20 more
TL;DR: PyTorch is a machine learning library that provides an imperative and Pythonic programming style that makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.
25.9K