Skew in parallel query processing
Paul Beame, Paraschos Koutris, Dan Suciu
- 18 Jun 2014
- pp 212-223
TL;DR: A tight connection is established between the fractional edge packing of the query and the amount of communication in two cases: when the data is skew-free and only the cardinalities of the relations are known, and when the data is skewed and the heavy hitters and their frequencies are known.
Abstract: We study the problem of computing a conjunctive query q in parallel, using p servers, on a large database. We consider algorithms with one round of communication, and study the complexity of the communication. We are especially interested in the case where the data is skewed, which is a major challenge for scalable parallel query processing. We establish a tight connection between the fractional edge packing of the query and the amount of communication in two cases. First, in the case when the only statistics on the database are the cardinalities of the input relations, and the data is skew-free, we provide matching upper and lower bounds (up to a polylogarithmic factor of p) expressed in terms of fractional edge packings of the query q. Second, in the case when the relations are skewed and the heavy hitters and their frequencies are known, we provide upper and lower bounds expressed in terms of packings of residual queries obtained by specializing the query to a heavy hitter. All our lower bounds are expressed in the strongest form, as the number of bits that must be communicated between processors with unlimited computational power. Our results generalize prior results on uniform databases (where each relation is a matching) [4], and lower bounds for the MapReduce model [1].
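As a minimal sketch of the central notion in the abstract, a fractional edge packing assigns a non-negative weight to each atom of the query so that, for every variable, the weights of the atoms containing it sum to at most 1. The helper name `is_fractional_edge_packing`, the dictionary encoding of the query, and the triangle query used as input are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction


def is_fractional_edge_packing(atoms, weights):
    """Check the packing constraints for a conjunctive query.

    atoms:   dict mapping an atom name to the tuple of variables it uses
             (hypothetical encoding chosen for this sketch).
    weights: dict mapping an atom name to its non-negative weight.
    """
    # Weights must be non-negative.
    if any(w < 0 for w in weights.values()):
        return False
    # Sum, per variable, the weights of all atoms that contain it.
    load = {}
    for name, variables in atoms.items():
        for v in set(variables):
            load[v] = load.get(v, 0) + weights.get(name, 0)
    # Every variable may carry a total weight of at most 1.
    return all(total <= 1 for total in load.values())


# Triangle query q(x, y, z) = R(x, y), S(y, z), T(z, x).
triangle = {"R": ("x", "y"), "S": ("y", "z"), "T": ("z", "x")}

# Weight 1/2 on every atom: each variable appears in two atoms, so its load is 1.
half = {name: Fraction(1, 2) for name in triangle}
packing_value = sum(half.values())  # total weight of this packing

# Weight 1 on both R and S overloads variable y, so this is not a packing.
overloaded = {"R": 1, "S": 1, "T": 0}
```

Here `is_fractional_edge_packing(triangle, half)` holds and the packing has total weight 3/2, while `overloaded` violates the constraint on `y`. The sketch only checks a given assignment; finding a maximum packing would require solving a linear program.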
Citations
Communication Steps for Parallel Query Processing
TL;DR: The problem of computing conjunctive queries over large databases on parallel architectures without shared storage is studied. Essentially tight upper and lower bounds are obtained for one-round algorithms, and it is shown how the bounds degrade when the data is skewed.
Sparsifying Distributed Algorithms with Ramifications in Massively Parallel Computation and Centralized Local Computation
Mohsen Ghaffari, Jara Uitto
TL;DR: A method for sparsifying distributed algorithms is introduced, and it is shown how this leads to improvements that go past known barriers in two algorithmic settings of large-scale graph processing: Massively Parallel Computation (MPC) and Local Computation Algorithms (LCA).
Distributed evaluation of subgraph queries using worst-case optimal low-memory dataflows
Khaled Ammar, Frank McSherry, Semih Salihoglu, Manas Joglekar
- 01 Feb 2018
TL;DR: This work presents the first approach that performs worst-case optimal computation and monitors fixed-size subgraphs in a continually changing large-scale graph.
Distributed data deduplication
Xu Chu, Ihab F. Ilyas, Paraschos Koutris
- 01 Jul 2016
TL;DR: This paper shows how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment and proposes a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees.
Conditional Hardness Results for Massively Parallel Computation from Distributed Lower Bounds
Mohsen Ghaffari, Fabian Kuhn, Jara Uitto
- 01 Nov 2019
TL;DR: The first conditional hardness results for massively parallel algorithms for some central graph problems including (approximating) maximum matching, vertex cover, maximal independent set, and coloring are presented.
References
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
- 06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets, which runs on a large cluster of commodity machines and is highly scalable.
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
- 25 Apr 2012
TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Counting triangles and the curse of the last reducer
Siddharth Suri, Sergei Vassilvitskii
- 28 Mar 2011
TL;DR: This work describes a sequential triangle counting algorithm and shows how to adapt it to the MapReduce setting, and presents a new algorithm designed specifically for the MapReduce framework that achieves a factor of 10-100 speedup over the naive approach.
Optimizing joins in a map-reduce environment
Foto N. Afrati, Jeffrey D. Ullman
- 22 Mar 2010
TL;DR: The problem of optimizing the shares, given a fixed number of Reduce processes, is studied, and an algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given.
Shark: SQL and rich analytics at scale
Reynold Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica
- 22 Jun 2013
TL;DR: Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions at scale, and efficiently recovers from failures mid-query.