Processing theta-joins using MapReduce

doi:10.1145/1989323.1989423

Proceedings Article•10.1145/1989323.1989423

Alper Okcan, +1 more

- 12 Jun 2011

- pp 949-960

291

TL;DR: This work derives a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job, and provides evidence that for a variety of join problems, it is either close to optimal or the best possible option.

Abstract: Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies creation of and reasoning about joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide evidence that for a variety of join problems, it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach.
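The single-job idea in the abstract can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract gives no algorithmic detail, so the grid layout below (the cross product S x T cut into `grid_rows` x `grid_cols` regions, one per reducer, with each tuple assigned a random row or column band) is a simplified reading of the approach. The function name `one_bucket_theta` and all parameter names are hypothetical.

```python
import random
from collections import defaultdict


def one_bucket_theta(S, T, theta, grid_rows, grid_cols, seed=0):
    """Simulate a 1-Bucket-Theta-style join in memory (illustrative sketch).

    Assumption: reducers form a grid_rows x grid_cols grid, and each
    reducer is responsible for one rectangular region of S x T.
    """
    rng = random.Random(seed)

    # "Map" phase: route tuples to reducers by key equality only.
    # Each reducer key (row, col) collects a list of S-tuples and T-tuples.
    buckets = defaultdict(lambda: ([], []))
    for s in S:
        row = rng.randrange(grid_rows)      # random row band for this S-tuple
        for col in range(grid_cols):        # replicate across that whole band
            buckets[(row, col)][0].append(s)
    for t in T:
        col = rng.randrange(grid_cols)      # random column band for this T-tuple
        for row in range(grid_rows):        # replicate down that whole band
            buckets[(row, col)][1].append(t)

    # "Reduce" phase: each reducer evaluates the arbitrary predicate locally.
    # A pair (s, t) meets only at reducer (row of s, col of t), so every
    # matching pair is produced exactly once.
    result = []
    for s_part, t_part in buckets.values():
        result.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return result
```

Calling `one_bucket_theta(S, T, lambda s, t: s < t, 3, 4)` on two small lists should return the same pairs as a nested-loop inequality join, in a different order. The random assignment needs no knowledge of the data beyond how many reducers to use, which is consistent with the abstract's remark that only minimal statistics are required; choosing the grid shape from the input cardinalities is left out of this sketch.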

Citations

Journal Article•10.1145/2094114.2094118

Parallel data processing with MapReduce: a survey

Kyong-Ha Lee, +4 more

- 11 Jan 2012

TL;DR: In this survey, the MapReduce framework is characterized and its inherent pros and cons are discussed, and its optimization strategies reported in the recent literature are introduced.

728

Proceedings Article•10.1145/2247596.2247602

Efficient parallel kNN joins for large data in MapReduce

Chi Zhang, +2 more

- 27 Mar 2012

TL;DR: This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data to meet many practical needs in data mining applications and spatial and multimedia databases.

304

Journal Article•10.14778/2367502.2367562

Efficient big data processing in Hadoop MapReduce

Jens Dittrich, +1 more

- 01 Aug 2012

TL;DR: This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently and highlights the similarities and differences between Hadoop MapReduce and Parallel DBMS.

294

Journal Article•10.14778/2336664.2336674

Efficient processing of k nearest neighbor joins using MapReduce

Wei Lu, +3 more

- 01 Jun 2012

TL;DR: This work investigates how to perform kNN joins using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers.

269

Journal Article•10.1007/S00778-013-0319-9

A survey of large-scale analytical query processing in MapReduce

Christos Doulkeridis, +1 more

- 01 Jun 2014

TL;DR: A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target, and interesting directions for future parallel data processing systems are outlined.

262

References

Journal Article•10.21276/IJRE.2018.5.5.4

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 06 Dec 2004

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

22.7K

Journal Article•10.1145/1327452.1327492

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008

- Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

18.6K

Proceedings Article•10.1145/1272996.1273005

Dryad: distributed data-parallel programs from sequential building blocks

Michael Isard, +4 more

- 21 Mar 2007

TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

3K

Journal Article•10.1145/152610.152611

Query evaluation techniques for large databases

Goetz Graefe

- 01 Jun 1993

- ACM Computing Surveys

TL;DR: This survey describes a wide array of practical query evaluation techniques for both relational and postrelational database systems, including iterative execution of complex query evaluation plans, the duality of sort- and hash-based set-matching algorithms, types of parallel query execution and their implementation, and special operators for emerging database application domains.

1.5K

Journal Article•10.1145/129888.129894

Parallel database systems: the future of high performance database systems

David J. DeWitt, +1 more

- 01 Jun 1992

- Communications of The ACM

TL;DR: Teradata, Tandem, and a host of startup companies have successfully developed and marketed highly parallel database machines.

1.4K

Related Papers (5)

MapReduce: simplified data processing on large clusters

A comparison of join algorithms for log processing in MapReduce

Efficient parallel set-similarity joins using MapReduce

Optimizing joins in a map-reduce environment

Map-reduce-merge: simplified relational data processing on large clusters