Skew in parallel query processing

doi:10.1145/2594538.2594558

Open Access, Proceedings Article

Paul Beame, +2 more

- 18 Jun 2014

- pp 212-223

144

TL;DR: A tight connection is established between the fractional edge packings of the query and the amount of communication in two cases: when the data is skew-free and only the cardinalities of the relations are known, and when the data is skewed and the heavy hitters and their frequencies are known.

Abstract: We study the problem of computing a conjunctive query q in parallel, using p servers, on a large database. We consider algorithms with one round of communication, and study the complexity of the communication. We are especially interested in the case where the data is skewed, which is a major challenge for scalable parallel query processing. We establish a tight connection between the fractional edge packing of the query and the amount of communication in two cases. First, in the case when the only statistics on the database are the cardinalities of the input relations, and the data is skew-free, we provide matching upper and lower bounds (up to a polylogarithmic factor of p) expressed in terms of fractional edge packings of the query q. Second, in the case when the relations are skewed and the heavy hitters and their frequencies are known, we provide upper and lower bounds expressed in terms of packings of residual queries obtained by specializing the query to a heavy hitter. All our lower bounds are expressed in the strongest form, as the number of bits that must be communicated between processors with unlimited computational power. Our results generalize prior results on uniform databases (where each relation is a matching) [4], and lower bounds for the MapReduce model [1].
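To make the central notion concrete, here is a minimal sketch of what a fractional edge packing is, under the standard definition: each atom of the query receives a non-negative weight, and for every variable the weights of the atoms containing it sum to at most 1. The triangle query, the relation names R, S, T, and the helper `is_fractional_edge_packing` are illustrative assumptions and are not taken from the paper.

```python
from fractions import Fraction


def is_fractional_edge_packing(atoms, weights):
    """Check the packing constraints for a conjunctive query.

    atoms:   dict mapping an atom name to the tuple of variables it contains
    weights: dict mapping an atom name to a non-negative weight
    (Hypothetical helper, for illustration only.)
    """
    # Every weight must be non-negative.
    if any(w < 0 for w in weights.values()):
        return False
    # For each variable, the total weight of atoms containing it must be <= 1.
    variables = {v for vars_ in atoms.values() for v in vars_}
    for v in variables:
        load = sum(weights[name] for name, vars_ in atoms.items() if v in vars_)
        if load > 1:
            return False
    return True


# Assumed example query: the triangle R(x, y), S(y, z), T(z, x).
triangle = {"R": ("x", "y"), "S": ("y", "z"), "T": ("z", "x")}

# Weight 1/2 on every atom: each variable occurs in two atoms, so its load is 1.
half = {name: Fraction(1, 2) for name in triangle}

print(is_fractional_edge_packing(triangle, half))  # True
print(sum(half.values()))                          # 3/2
```

Exact rationals from `fractions` are used so that the constraint check is not affected by floating-point rounding. Putting weight 1 on both R and S, by contrast, violates the constraint at y, so that assignment is not a packing.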

Citations

Journal Article, doi:10.1145/3125644

Communication Steps for Parallel Query Processing

Paul Beame, +2 more

- 14 Oct 2017

- Journal of the ACM

TL;DR: The problem of computing conjunctive queries over large databases on parallel architectures without shared storage is studied; essentially tight upper and lower bounds are obtained for one-round algorithms, and the degradation of these bounds under data skew is shown.

113

Posted Content

Sparsifying Distributed Algorithms with Ramifications in Massively Parallel Computation and Centralized Local Computation

Mohsen Ghaffari, +1 more

- 17 Jul 2018

- arXiv: Data Structures and Algorithms

TL;DR: A method for sparsifying distributed algorithms is introduced, and it is shown how this leads to improvements that go past known barriers in two algorithmic settings of large-scale graph processing: Massively Parallel Computation (MPC) and Local Computation Algorithms (LCA).

109

Journal Article, doi:10.14778/3184470.3184473

Distributed evaluation of subgraph queries using worst-case optimal low-memory dataflows

Khaled Ammar, +3 more

- 01 Feb 2018

TL;DR: This work presents the first approach that performs worst-case optimal computation for monitoring fixed-size subgraphs in a continually changing large-scale graph.

93

Journal Article, doi:10.14778/2983200.2983203

Distributed data deduplication

Xu Chu, +2 more

- 01 Jul 2016

TL;DR: This paper shows how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment and proposes a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees.

82

Proceedings Article, doi:10.1109/FOCS.2019.00097

Conditional Hardness Results for Massively Parallel Computation from Distributed Lower Bounds

Mohsen Ghaffari, +2 more

- 01 Nov 2019

TL;DR: The first conditional hardness results for massively parallel algorithms for some central graph problems including (approximating) maximum matching, vertex cover, maximal independent set, and coloring are presented.

72

References

Journal Article, doi:10.21276/IJRE.2018.5.5.4

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 06 Dec 2004

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.

22.7K

Proceedings Article

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Matei Zaharia, +8 more

- 25 Apr 2012

TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.

4.6K

Proceedings Article, doi:10.1145/1963405.1963491

Counting triangles and the curse of the last reducer

Siddharth Suri, +1 more

- 28 Mar 2011

TL;DR: This work describes a sequential triangle counting algorithm and shows how to adapt it to the MapReduce setting, and presents a new algorithm designed specifically for the MapReduce framework that achieves a factor of 10-100 speedup over the naive approach.

502

Proceedings Article, doi:10.1145/1739041.1739056

Optimizing joins in a map-reduce environment

Foto N. Afrati, +1 more

- 22 Mar 2010

TL;DR: The problem of optimizing the shares, given a fixed number of Reduce processes, is studied, and an algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given.

401

Proceedings Article, doi:10.1145/2463676.2465288

Shark: SQL and rich analytics at scale

Reynold Xin, +5 more

- 22 Jun 2013

TL;DR: Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions at scale, and it efficiently recovers from failures mid-query.

394

Related Papers (5)

Communication steps for parallel query processing

MapReduce: simplified data processing on large clusters

Optimizing joins in a map-reduce environment

Spark: cluster computing with working sets

Sorting, searching, and simulation in the mapreduce framework