Proceedings Article, DOI: 10.1145/1989323.1989423
Processing theta-joins using MapReduce
Alper Okcan, Mirek Riedewald
- 12 Jun 2011
- pp 949-960
TL;DR: This work derives a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job, and provides evidence that for a variety of join problems, it is either close to optimal or the best possible option.
Abstract: Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies creation of and reasoning about joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide evidence that for a variety of join problems, it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach.
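The abstract states only that 1-Bucket-Theta is randomized, runs as a single MapReduce job, and needs no statistics beyond input cardinality. The Python sketch below illustrates that general idea and is not the authors' implementation. It treats the cross product of S and T as a matrix cut into a grid of reducer regions. The map step sends each S tuple to one random grid row and each T tuple to one random grid column. Each reducer then evaluates the join predicate locally. The function name, the fixed `grid_rows` by `grid_cols` layout, and the in-memory simulation of the shuffle are assumptions made for illustration. Choosing the grid shape from the input cardinalities and the number of reducers is left out.

```python
import random
from collections import defaultdict


def one_bucket_theta_sketch(S, T, theta, grid_rows, grid_cols, seed=0):
    """Simulate a single map/shuffle/reduce round for an arbitrary theta-join.

    Hypothetical helper: regions are keyed by (grid_row, grid_col), which
    plays the role of the MapReduce key, so routing uses key equality only.
    """
    rng = random.Random(seed)
    # region key -> (S tuples routed there, T tuples routed there)
    regions = defaultdict(lambda: ([], []))

    # Map: an S tuple picks one random grid row and is replicated to every
    # region in that row.
    for s in S:
        i = rng.randrange(grid_rows)
        for j in range(grid_cols):
            regions[(i, j)][0].append(s)

    # Map: a T tuple picks one random grid column and is replicated to every
    # region in that column.
    for t in T:
        j = rng.randrange(grid_cols)
        for i in range(grid_rows):
            regions[(i, j)][1].append(t)

    # Reduce: each region joins its local inputs with a nested loop. A pair
    # (s, t) meets only in the region (row of s, column of t), so every
    # matching pair is produced exactly once.
    output = []
    for s_part, t_part in regions.values():
        for s in s_part:
            for t in t_part:
                if theta(s, t):
                    output.append((s, t))
    return output


if __name__ == "__main__":
    S = list(range(20))
    T = list(range(15))
    result = one_bucket_theta_sketch(S, T, lambda s, t: s < t, 3, 4)
    print(len(result), "joined pairs")
```

Because the routing never inspects the join attribute, the same code works for inequality, band, or any other predicate. The price is replication: in this sketch every S tuple is copied `grid_cols` times and every T tuple `grid_rows` times.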
Citations
Parallel data processing with MapReduce: a survey
Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, Bongki Moon
- 11 Jan 2012
TL;DR: This survey characterizes the MapReduce framework, discusses its inherent pros and cons, and introduces optimization strategies reported in the recent literature.
Efficient parallel kNN joins for large data in MapReduce
Chi Zhang, Feifei Li, Jeffrey Jestes
- 27 Mar 2012
TL;DR: This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data to meet many practical needs in data mining applications and spatial and multimedia databases.
Efficient big data processing in Hadoop MapReduce
Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz
- 01 Aug 2012
TL;DR: This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently and highlights the similarities and differences between Hadoop MapReduce and Parallel DBMS.
Efficient processing of k nearest neighbor joins using MapReduce
Wei Lu, Yanyan Shen, Su Chen, Beng Chin Ooi
- 01 Jun 2012
TL;DR: This work investigates how to perform kNN joins using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers.
A survey of large-scale analytical query processing in MapReduce
Christos Doulkeridis, Kjetil Nørvåg
- 01 Jun 2014
TL;DR: A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target, and interesting directions for future parallel data processing systems are outlined.