Supporting Very Large Models using Automatic Dataflow Graph Partitioning
Minjie Wang,Chien-Chin Huang,Jinyang Li +2 more
- 25 Mar 2019
- pp 3303953
TL;DR: Tofu as discussed by the authors uses a recursive search algorithm that minimizes the total communication cost to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow.
read more
Abstract: This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow. In order to automatically partition each operator, we propose to describe the semantics of an operator in a simple language inspired by Halide. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves 25% - 400% speedup over alternative approaches to train very large models.
read more
Chat with Paper
AI Agents for this Paper
Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps
Citations
Wireless Network Intelligence at the Edge
Jihong Park,Sumudu Samarakoon,Mehdi Bennis,Merouane Debbah +3 more
- 11 Oct 2019
TL;DR: In this article, the key building blocks of edge ML, different neural network architectural splits and their inherent tradeoffs, as well as theoretical and technical enablers stemming from a wide range of mathematical disciplines are presented.
Pre-Trained Models: Past, Present and Future
Xu Han,Zhengyan Zhang,Ning Ding,Yuxian Gu,Xiao Liu,Yuqi Huo,Jiezhong Qiu,Liang Zhang,Wentao Han,Minlie Huang,Qin Jin,Yanyan Lan,Yang Liu,Zhiyuan Liu,Zhiwu Lu,Xipeng Qiu,Ruihua Song,Jie Tang,Ji-Rong Wen,Jinhui Yuan,Wayne Xin Zhao,Jun Zhu +21 more
- 14 Jun 2021
TL;DR: In this paper, the authors take a deep look into the history of pre-training, especially its special relation with transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI development spectrum.
581
ZeRO: Memory optimizations Toward Training Trillion Parameter Models
Samyam Rajbhandari,Jeff Rasley,Olatunji Ruwase,Yuxiong He +3 more
- 01 Nov 2020
TL;DR: The Zero Redundancy Optimizer (ZeRO) as mentioned in this paper eliminates memory redundancies in data and model-parallel training while retaining low communication volume and high computational granularity, allowing to scale the model size proportional to the number of devices.
447
•Posted Content
Wireless Network Intelligence at the Edge
TL;DR: In a first of its kind, this article explores the key building blocks of edge ML, different neural network architectural splits and their inherent tradeoffs, as well as theoretical and technical enablers stemming from a wide range of mathematical disciplines.
444
A generic communication scheduler for distributed DNN training acceleration
Yanghua Peng,Yibo Zhu,Yangrui Chen,Yixin Bao,Bairen Yi,Chang Lan,Chuan Wu,Chuanxiong Guo +7 more
- 27 Oct 2019
TL;DR: This work introduces a unified abstraction and a Dependency Proxy mechanism to enable communication scheduling without breaking the original dependencies in framework engines, and introduces a Bayesian Optimization approach to auto-tune tensor partition size and other parameters for different training models under various networking conditions.
387
References
Deep Residual Learning for Image Recognition
Kaiming He,Xiangyu Zhang,Shaoqing Ren,Jian Sun +3 more
- 27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
99K
•Posted Content
Adam: A Method for Stochastic Optimization
Diederik P. Kingma,Jimmy Ba +1 more
TL;DR: In this article, the adaptive estimates of lower-order moments are used for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimate of lowerorder moments.
82.5K
MapReduce: simplified data processing on large clusters
Jeffrey Dean,Sanjay Ghemawat +1 more
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
•Journal Article
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TL;DR: This work describes and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.
Related Papers (5)
Kaiming He,Xiangyu Zhang,Shaoqing Ren,Jian Sun +3 more
- 27 Jun 2016
Karen Simonyan,Andrew Zisserman +1 more
- 04 Sep 2014
[...]
Yonghui Wu,Mike Schuster,Zhifeng Chen,Quoc V. Le,Mohammad Norouzi,Wolfgang Macherey,Maxim Krikun,Yuan Cao,Qin Gao,Klaus Macherey,Jeff Klingner,Apurva Shah,Melvin Johnson,Xiaobing Liu,Łukasz Kaiser,Stephan Gouws,Yoshikiyo Kato,Taku Kudo,Hideto Kazawa,Keith Stevens,George Kurian,Nishant Patil,Wei Wang,Cliff Young,Jason A. Smith,Jason Riesa,Alex Rudnick,Oriol Vinyals,Greg S. Corrado,Macduff Hughes,Jeffrey Dean +30 more