GRACE: A Compressed Communication Framework for Distributed Machine Learning

doi:10.1109/ICDCS51616.2021.00060

Open Access, Proceedings Article

Hang Xu, +7 more

- 07 Jul 2021

- pp 561-572

80

TL;DR: In this article, the authors present a comprehensive survey of the most influential compressed communication methods for DNN training, together with an intuitive classification (i.e., quantization, sparsification, hybrid and low-rank).

Abstract: Powerful computer clusters are now used to train complex deep neural networks (DNNs) on large datasets, and distributed training is increasingly communication-bound. For this reason, many lossy compression techniques have been proposed to reduce the volume of transferred data. Unfortunately, it is difficult to reason about the behavior of compression methods, because existing work relies on inconsistent evaluation testbeds and largely ignores the performance impact of practical system configurations. In this paper, we present a comprehensive survey of the most influential compressed communication methods for DNN training, together with an intuitive classification (i.e., quantization, sparsification, hybrid and low-rank). Next, we propose GRACE, a unified framework and API that allows for consistent and easy implementation of compressed communication on popular machine learning toolkits. We instantiate GRACE on TensorFlow and PyTorch, and implement 16 such methods. Finally, we present a thorough quantitative evaluation with a variety of DNNs (convolutional and recurrent), datasets and system configurations. We show that the DNN architecture affects the relative performance among methods. Interestingly, we demonstrate that some methods may be impractical, depending on the underlying communication library and the computational cost of compression and decompression. GRACE and the entire benchmarking suite are available as open source.
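To make the sparsification category concrete, the following is a minimal sketch of Top-k gradient sparsification combined with error feedback, written in plain Python. It is an illustration only and is not GRACE's actual API: the names `topk_compress`, `topk_decompress`, `ErrorFeedback` and `compressed_step` are hypothetical, and gradients are modeled as flat lists of floats rather than framework tensors.

```python
from typing import List, Tuple


def topk_compress(grad: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Keep the k largest-magnitude entries; return (indices, values)."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    indices = sorted(order[:k])  # ascending index order for a stable wire format
    return indices, [grad[i] for i in indices]


def topk_decompress(indices: List[int], values: List[float], size: int) -> List[float]:
    """Rebuild a dense gradient, filling unsent positions with zeros."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


class ErrorFeedback:
    """Hypothetical residual memory: stores what compression dropped."""

    def __init__(self, size: int) -> None:
        self.residual = [0.0] * size

    def compensate(self, grad: List[float]) -> List[float]:
        # Add the previously dropped mass back before compressing.
        return [g + r for g, r in zip(grad, self.residual)]

    def update(self, compensated: List[float], sent: List[float]) -> None:
        # Remember the part of the compensated gradient that was not sent.
        self.residual = [c - s for c, s in zip(compensated, sent)]


def compressed_step(grad: List[float], memory: ErrorFeedback, k: int):
    """One worker-side step: compensate, compress, update the residual."""
    compensated = memory.compensate(grad)
    indices, values = topk_compress(compensated, k)
    memory.update(compensated, topk_decompress(indices, values, len(grad)))
    return indices, values  # this pair is what would be communicated
```

In this sketch only `k` index/value pairs leave the worker per step, and entries that were dropped accumulate in the residual until they become large enough to be selected. Whether such a scheme pays off in practice depends on the cost of the selection step and the communication library, which is the kind of trade-off the paper evaluates.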

Figures

TABLE II: Summary of the benchmarks and quality metrics used in this work.

Fig. 1: Top-1 accuracy for VGG16 on CIFAR-10 with TensorFlow on 8 workers via 25 Gbps network links. In (b) Randk converges in 450s, but 8-bit quantization needs 1200s.

Fig. 8: Latency of compress and decompress for different compressors with a range of input sizes.

Fig. 10: Performance of compressors for ResNet-50 on ImageNet via 1 Gbps network. Legend in Figure 6.

Fig. 9: Throughput for ResNet-9 on CIFAR-10 contrasting TCP vs. RDMA performance in PyTorch.

TABLE I: Classification of surveyed gradient compression methods. Note that ‖g̃‖0 and ‖g‖0 are the number of elements in the compressed and uncompressed gradient, respectively; nature of operator Q is random or deterministic; EF-On indicates if error feedback is used in our experiments. We implement 16 methods on TensorFlow and PyTorch.
