Proceedings Article, DOI: 10.1145/1620432.1620438
Efficiently support MapReduce-like computation models inside parallel DBMS
Qiming Chen, Andy Therber, Meichun Hsu, Hans Zeller, Bin Zhang, Ren Wu
- 16 Sep 2009
- pp 43-53
TL;DR: Relation Valued Functions (RVFs), system support based on invocation patterns, and the split of an RVF into an RVF-Shell and a user-function address the essential problems in supporting MapReduce and other analytics computation models inside a parallel database engine: modeling complex applications, integrating them into query processing, and shielding analytics developers from DBMS internal details.
Abstract: While parallel DBMSs do support large-scale parallel query processing on partitioned data, their reach into more general applications relies on User Defined Functions (UDFs). However, existing UDF technology is insufficient both conceptually and practically. A UDF is not a relation-in, relation-out operator, which restricts its ability to model complex applications defined on a set of tuples rather than on a single one, and to be composed with other relational operators in a query. Further, to interact with query execution efficiently, a UDF must be coded with complex interactions with DBMS internal data structures and system calls, which is often beyond the expertise of an analytics application developer. To solve these problems, we start by wrapping general applications with Relation Valued Functions (RVFs); then, based on the notion of invocation patterns, we provide focused system support for efficiently integrating RVF execution into the query processing pipeline. We further distinguish the system responsibility from the user responsibility in RVF development by separating an RVF into the RVF-Shell, which deals with system interaction, and the user-function, which holds pure application logic, such that the RVF-Shell can be constructed in terms of high-level APIs. These mechanisms enable us to solve the essential problems in supporting MapReduce and other analytics computation models inside a parallel database engine: modeling complex applications, integrating them into query processing, and shielding analytics developers from DBMS internal details. Our experience prototyping these approaches on a commercial, proprietary parallel database engine reveals their practical value.
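The separation described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the names `rvf_shell`, `repartition`, `map_words`, `reduce_counts`, and `word_count` are hypothetical, relations are modeled as plain Python lists of tuples, and the "system interaction" handled by the shell is reduced to iterating over in-memory partitions. The point is only that the user-functions are relation-in, relation-out and contain no system logic, so they can be composed into a MapReduce-like pipeline.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

Row = Tuple            # one tuple of a relation
Relation = List[Row]   # a relation is modeled as a list of tuples


def rvf_shell(user_fn: Callable[[Relation], Relation]):
    """Hypothetical shell: runs a relation-in, relation-out user function
    once per partition. In a real engine this layer would deal with DBMS
    internals; here it only iterates over in-memory partitions."""
    def run(partitions: List[Relation]) -> List[Relation]:
        return [user_fn(list(part)) for part in partitions]
    return run


def repartition(partitions: List[Relation], n: int) -> List[Relation]:
    """Stand-in for the engine's data redistribution: rows with the same
    first column (the key) end up in the same output partition."""
    out: List[Relation] = [[] for _ in range(n)]
    for part in partitions:
        for row in part:
            out[hash(row[0]) % n].append(row)
    return out


def map_words(rel: Relation) -> Relation:
    """User-function (pure application logic): (doc_id, text) -> (word, 1)."""
    return [(word, 1) for _, text in rel for word in text.split()]


def reduce_counts(rel: Relation) -> Relation:
    """User-function (pure application logic): (word, n) -> (word, total)."""
    totals = defaultdict(int)
    for word, n in rel:
        totals[word] += n
    return sorted(totals.items())


def word_count(partitions: List[Relation], n_reducers: int = 2) -> Relation:
    """Compose the two wrapped functions into a map / shuffle / reduce plan."""
    mapped = rvf_shell(map_words)(partitions)
    shuffled = repartition(mapped, n_reducers)
    reduced = rvf_shell(reduce_counts)(shuffled)
    # Flatten and sort so the result does not depend on partition placement.
    return sorted(row for part in reduced for row in part)


docs = [[(1, "a b a")], [(2, "b c")]]
result = word_count(docs)
```

In this sketch `map_words` and `reduce_counts` never see partitions or redistribution, which mirrors the abstract's division between system responsibility and user responsibility.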
Citations
Accelerating SQL database operations on a GPU with CUDA
Peter Bakkum, Kevin Skadron
- 14 Mar 2010
TL;DR: This paper implements a subset of the SQLite command processor directly on the GPU, reducing the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.
Patent
Query Execution Systems and Methods
Daniel J. Abadi, Jiewen Huang
- 29 Jun 2012
TL;DR: A system, method, and computer program product for processing a query are disclosed; processing includes partitioning the stored data into a plurality of partitions based on at least one vertex in the plurality of vertexes.
Column-oriented storage techniques for MapReduce
Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, Sandeep Tata
- 01 Apr 2011
TL;DR: This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs and introduces a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost.
Patent
System and method for data stream processing
Meichun Hsu, Qiming Chen
- 19 Oct 2010
TL;DR: A method and system for processing a data stream are described. Until a cut condition occurs, the system executes a map function from a set of query processing steps to generate map results for a first portion of the data stream, and executes a reduce function from the set of query processing steps to generate history-sensitive data from the map results.
Patent
Parallel processing of data
Craig D. Chambers, Ashish Raniwala, Frances J. Perry, Stephen R. Adams, Robert R. Henry, Robert Bradshaw, Nathan Weizenbaum
- 04 May 2011
TL;DR: A data-parallel pipeline specifies multiple parallel data objects, each containing multiple elements, and multiple parallel operations that operate on those objects. A dataflow graph is generated from the pipeline, and one or more graph transformations may be applied to it to produce a revised dataflow graph.
References
Dryad: distributed data-parallel programs from sequential building blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
- 21 Mar 2007
TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
PNUTS: Yahoo!'s hosted data serving platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni
- 01 Aug 2008
TL;DR: PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees; it uses automated load balancing and failover to reduce operational complexity.
SCOPE: easy and efficient parallel processing of massive data sets
Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren A. Shakib, Simon Weaver, Jingren Zhou
- 01 Aug 2008
TL;DR: This paper presents SCOPE (Structured Computations Optimized for Parallel Execution), a new declarative and extensible scripting language targeted at massive data analysis. It is designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters.
Map-reduce-merge: simplified relational data processing on large clusters
Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker
- 11 Jun 2007
TL;DR: A Merge phase is added to Map-Reduce that can efficiently merge data already partitioned and sorted by map and reduce modules, and it is demonstrated that this new model can express relational algebra operators as well as implement several join algorithms.
Proceedings Article
MonetDB/X100: Hyper-Pipelining Query Execution
Peter Boncz, Marcin Zukowski, Niels Nes
- 01 Jan 2005
TL;DR: This paper investigates in depth why database systems tend to achieve only low IPC on modern CPUs in compute-intensive application areas, proposes a new set of guidelines for designing a query processor, and describes the X100 query engine for the MonetDB system, which follows these guidelines.