Coded convolution for parallel and distributed computing within a deadline
Sanghamitra Dutta, Viveck R. Cadambe, Pulkit Grover
- 01 Jun 2017
- pp 2403-2407
TL;DR: In this paper, the authors consider the problem of computing the convolution of two long vectors using parallel processors in the presence of stragglers and demonstrate that coding can dramatically improve the probability of finishing the computation within a target deadline.
Abstract: We consider the problem of computing the convolution of two long vectors using parallel processors in the presence of “stragglers”. Stragglers are the small fraction of faulty or slow processors that delay the entire computation in time-critical distributed systems. We first show that splitting the vectors into smaller pieces and encoding these pieces with a linear code provides better resilience against stragglers than replication-based schemes under a simple, worst-case straggler analysis. We then demonstrate that, under commonly used models of computation time, coding can dramatically improve the probability of finishing the computation within a target “deadline” time. As opposed to the more commonly used technique of expected computation time analysis, we quantify the exponents of the probability of failure in the limit of large deadlines. Our exponent metric captures the probability of failing to finish before a specified deadline time, i.e., the behavior of the “tail”. Moreover, our technique also allows for simple closed-form expressions for more general models of computation time, e.g., shifted Weibull models instead of only shifted exponentials. Thus, through this problem of coded convolution, we establish the utility of a novel asymptotic failure exponent analysis for distributed systems.
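The splitting-and-encoding idea in the abstract can be illustrated with a minimal sketch. It is not the paper's exact construction: it splits only one vector, uses a Vandermonde generator over exact rationals as a stand-in for the linear code, and all function names and parameters (`coded_convolution`, `k`, `n`, `finished`) are hypothetical. It relies only on the linearity of convolution: a worker that convolves a coded piece with `x` returns the same linear combination of the per-piece convolutions, so any `k` finished workers are enough to decode.

```python
from fractions import Fraction


def convolve(u, v):
    """Direct linear convolution of two sequences (exact arithmetic)."""
    out = [Fraction(0)] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out


def encode(pieces, n):
    """Map k equal-length pieces to n coded pieces:
    coded_i = sum_j alpha_i**j * piece_j, with alpha_i = i + 1 (assumed choice)."""
    k = len(pieces)
    return [[sum(Fraction(i + 1) ** j * pieces[j][t] for j in range(k))
             for t in range(len(pieces[0]))]
            for i in range(n)]


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination; B is a list of row vectors."""
    k = len(A)
    A = [row[:] for row in A]
    B = [row[:] for row in B]
    for c in range(k):
        p = next(r for r in range(c, k) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        B[c], B[p] = B[p], B[c]
        inv = 1 / A[c][c]
        A[c] = [a * inv for a in A[c]]
        B[c] = [b * inv for b in B[c]]
        for r in range(k):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]
                B[r] = [b - f * bc for b, bc in zip(B[r], B[c])]
    return B


def coded_convolution(a, x, k, n, finished):
    """Convolve a with x using n simulated workers; decode from any k finished ones.
    Assumes len(a) is divisible by k."""
    if len(finished) < k:
        raise ValueError("need at least k finished workers to decode")
    L = len(a) // k
    pieces = [a[j * L:(j + 1) * L] for j in range(k)]
    coded = encode(pieces, n)
    # Each worker i would compute convolve(coded[i], x); stragglers never return.
    results = {i: convolve(coded[i], x) for i in finished}
    rows = sorted(finished)[:k]
    A = [[Fraction(i + 1) ** j for j in range(k)] for i in rows]
    parts = solve(A, [results[i] for i in rows])  # parts[j] == convolve(pieces[j], x)
    # Overlap-add the per-piece convolutions at offsets j * L.
    out = [Fraction(0)] * (len(a) + len(x) - 1)
    for j, part in enumerate(parts):
        for t, val in enumerate(part):
            out[j * L + t] += val
    return out


# Hypothetical example: 6 workers, any 4 suffice; workers 1 and 4 straggle.
a = list(range(1, 9))
x = [1, -2, 3]
result = coded_convolution(a, x, k=4, n=6, finished={0, 2, 3, 5})
assert result == convolve(a, x)
```

Exact rational arithmetic is used here only to sidestep the numerical conditioning of real-valued Vandermonde systems; a practical implementation would need to address that separately.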
Citations
“Short-Dot”: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products
TL;DR: The key novelty in this work is that in the particular regime where the number of available processing nodes is greater than the total number of dot products, Short-Dot has lower expected computation time under straggling under an exponential model compared to existing strategies.
On the Optimal Recovery Threshold of Coded Matrix Multiplication
Sanghamitra Dutta, Mohammad Fahim, Farzin Haddadpour, Haewon Jeong, Viveck R. Cadambe, Pulkit Grover
TL;DR: This work provides novel coded computation strategies for distributed matrix–matrix products that outperform the recent “Polynomial code” constructions in recovery threshold, i.e., the required number of successful workers.
302
Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding
TL;DR: While evaluating bilinear complexity is a well-known challenging problem, it is shown that optimal recovery threshold for linear coding strategies can be approximated within a factor of 2 of this fundamental quantity.
295
Coded Computation Over Heterogeneous Clusters
TL;DR: This paper proposes the heterogeneous coded matrix multiplication (HCMM) algorithm for performing distributed matrix multiplication over heterogeneous clusters, which is provably asymptotically optimal for a broad class of processing time distributions, and develops a heuristic algorithm for HCMM load allocation for the distributed implementation of budget-limited computation tasks.
240
•Posted Content
"Short-Dot": Computing Large Linear Transforms Distributedly Using Coded Short Dot Products
TL;DR: In this paper, the authors propose a technique called Short-Dot to reduce the number of redundant computations in a coding theory inspired fashion for computing linear transforms of long vectors.
202
References
•Book
Mathematical Methods for Physicists
George B. Arfken
- 01 Jan 1966
TL;DR: In this article, the authors present a model for vector analysis based on the Calculus of Variations and the Sturm-Liouville theory, which includes the following: Curved Coordinates, Tensors.
8.2K
The tail at scale
Jeffrey Dean, Luiz Andre Barroso
TL;DR: Software techniques that tolerate latency variability are vital to building responsive large-scale Web services.
1.9K
Algorithm-Based Fault Tolerance for Matrix Operations
Kuang-Hua Huang, Abraham
TL;DR: Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems.
1.4K
The Fast Fourier Transform and Its Applications
TL;DR: A description of the algorithm and its programming is given here, followed by a theorem relating its operands, the finite sample sequences, to the continuous functions they are often intended to approximate.
1.2K
![Fig. 4. Simulation results: the plot shows the log of the complement of the CDF, i.e., log(1 − CDF), of the computation time for an uncoded strategy, a (2^12, 2^11, 8, 2^11) Coded Convolution strategy, and an (8, 4) Repetition strategy, based on 10^6 Monte Carlo simulations from their respective shifted exponential distributions in MATLAB (code available in [20]). For the uncoded strategy the failure probability starts to decay first, but for large deadlines it is outperformed by both repetition and coded convolution because their tails decay at a steeper rate. Coded convolution has the steepest decay for large deadlines.](/figures/figure4-1-73abprhfcfhx.png)
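A tail comparison of the kind described in the caption can be sketched with a small Monte Carlo experiment. This is an illustrative sketch under stated assumptions, not the MATLAB code referenced in [20]: total work is normalized to 1, a processor given `work` units is assumed to finish at time `work * (1 + X)` with `X` exponential of rate `mu`, and the strategy parameters (8 processors, replication factor 2, decoding from any 4) are hypothetical rather than those of the figure.

```python
import random


def shifted_exp_time(work, mu, rng):
    """Assumed model: a shift equal to the work, plus an exponential tail scaled by the work."""
    return work * (1.0 + rng.expovariate(mu))


def uncoded(P, mu, rng):
    """Work split evenly over P processors; all P must finish."""
    return max(shifted_exp_time(1.0 / P, mu, rng) for _ in range(P))


def repetition(P, r, mu, rng):
    """P // r distinct tasks, each replicated r times; the fastest replica of every task is needed."""
    tasks = P // r
    return max(min(shifted_exp_time(1.0 / tasks, mu, rng) for _ in range(r))
               for _ in range(tasks))


def mds_coded(P, k, mu, rng):
    """P coded tasks of size 1/k; the computation ends when any k of them finish."""
    times = sorted(shifted_exp_time(1.0 / k, mu, rng) for _ in range(P))
    return times[k - 1]


def failure_probability(samples, deadline):
    """Empirical probability of missing the deadline, i.e. 1 - CDF(deadline)."""
    return sum(t > deadline for t in samples) / len(samples)


def simulate(trials=20000, P=8, r=2, k=4, mu=1.0, seed=0):
    """Draw completion-time samples for the three strategies."""
    rng = random.Random(seed)
    return {
        "uncoded": [uncoded(P, mu, rng) for _ in range(trials)],
        "repetition": [repetition(P, r, mu, rng) for _ in range(trials)],
        "coded": [mds_coded(P, k, mu, rng) for _ in range(trials)],
    }


if __name__ == "__main__":
    samples = simulate()
    for deadline in (0.5, 1.0, 1.5, 2.0):
        row = {name: failure_probability(s, deadline) for name, s in samples.items()}
        print(deadline, row)
```

Plotting the logarithm of these empirical failure probabilities against the deadline gives curves whose slopes for large deadlines correspond to the failure exponents discussed in the abstract; far more trials are needed to resolve the deep tail.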