Supporting Very Large Models using Automatic Dataflow Graph Partitioning

doi:10.1145/3302424.3303953

Open Access•Proceedings Article•10.1145/3302424.3303953

Minjie Wang, +2 more

- 25 Mar 2019

- pp 3303953

71

TL;DR: Tofu partitions a dataflow graph of fine-grained tensor operators, as used by platforms like MXNet and TensorFlow, with a recursive search algorithm that minimizes the total communication cost.

Citations

•Journal Article•10.1109/JPROC.2019.2941458

Wireless Network Intelligence at the Edge

Jihong Park, +3 more

- 11 Oct 2019

TL;DR: This article presents the key building blocks of edge ML, different neural network architectural splits and their inherent tradeoffs, and theoretical and technical enablers stemming from a wide range of mathematical disciplines.

623

•Journal Article•10.1016/J.AIOPEN.2021.08.002

Pre-Trained Models: Past, Present and Future

Xu Han, +21 more

- 14 Jun 2021

TL;DR: In this paper, the authors take a deep look into the history of pre-training, especially its special relation with transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI development spectrum.

581

•Proceedings Article•10.1109/SC41405.2020.00024

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, +3 more

- 01 Nov 2020

TL;DR: The Zero Redundancy Optimizer (ZeRO) eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing the model size to scale in proportion to the number of devices.

447

•Posted Content

Wireless Network Intelligence at the Edge

Jihong Park, +3 more

- 07 Dec 2018

- arXiv: Information Theory

TL;DR: In a first of its kind, this article explores the key building blocks of edge ML, different neural network architectural splits and their inherent tradeoffs, as well as theoretical and technical enablers stemming from a wide range of mathematical disciplines.

444

•Proceedings Article•10.1145/3341301.3359642

A generic communication scheduler for distributed DNN training acceleration

Yanghua Peng, +7 more

- 27 Oct 2019

TL;DR: This work introduces a unified abstraction and a Dependency Proxy mechanism to enable communication scheduling without breaking the original dependencies in framework engines, and introduces a Bayesian Optimization approach to auto-tune tensor partition size and other parameters for different training models under various networking conditions.

387

References

•Proceedings Article•10.1109/CVPR.2016.90

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 27 Jun 2016

TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place on the ILSVRC 2015 classification task.

198.7K

•Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

99K

•Posted Content

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 22 Dec 2014

- arXiv: Learning

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.

82.5K

•Journal Article•10.1145/1327452.1327492

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008

- Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

18.6K

•Journal Article

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

John C. Duchi, +2 more

- 01 Feb 2011

- Journal of Machine Learning Research

TL;DR: This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.

8.9K