Efficiently support MapReduce-like computation models inside parallel DBMS

doi:10.1145/1620432.1620438

Proceedings Article

Qiming Chen, +5 more

- 16 Sep 2009

- pp 43-53

25

TL;DR: Relation Valued Functions (RVFs), invocation patterns, and the separation of an RVF into an RVF-Shell and a user-function solve the essential problems in supporting MapReduce and other analytics computation models inside a parallel database engine: modeling complex applications, integrating them into query processing, and shielding analytics developers from DBMS internal details.

Abstract: While parallel DBMSs do support large scale parallel query processing on partitioned data, the reach of more general applications relies on User Defined Functions (UDFs). However, the existing UDF technology is insufficient both conceptually and practically. A UDF is not a relation-in, relation-out operator, which restricts its ability to model complex applications defined on a set of tuples rather than on a single one, and to be composed with other relational operators in a query. Further, to interact with the query execution efficiently, a UDF must be coded with complex interactions with DBMS internal data structures and system calls, which is often beyond the expertise of an analytics application developer. To solve these problems, we start with wrapping general applications with Relation Valued Functions (RVFs); then based on the notion of invocation patterns, we provide focused system support for efficiently integrating RVF execution into the query processing pipeline. We further distinguish the system responsibility and the user responsibility in RVF development, by separating an RVF into the RVF-Shell for dealing with system interaction, and the user-function for pure application logic, such that the RVF-Shell can be constructed in terms of high-level APIs. These mechanisms enable us to solve the essential problems in supporting MapReduce and other analytics computation models inside a parallel database engine: modeling complex applications, integrating them into query processing, and shielding analytics developers from DBMS internal details. Prototyped on a commercial and proprietary parallel database engine, our experience reveals the practical value of the proposed approaches.
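The split between the RVF-Shell and the user-function described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's API: the names `rvf_shell`, `map_words`, `reduce_counts`, and `word_count` are hypothetical, and plain iterables of tuples stand in for relations. The shell here only buffers its input and streams its output, whereas a real engine would handle DBMS internal data structures and invocation patterns.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple                 # one tuple of column values
Relation = Iterable[Row]    # stand-in for a relation (assumption)


def rvf_shell(
    user_function: Callable[[List[Row]], List[Row]],
) -> Callable[[Relation], Iterator[Row]]:
    """Hypothetical RVF-Shell: owns all 'system' interaction.

    It materializes the input relation, calls the user-function once on
    the whole set of tuples, and streams the result rows back out.
    """
    def rvf(relation: Relation) -> Iterator[Row]:
        rows = list(relation)            # system side: collect input tuples
        for out in user_function(rows):  # user side: pure application logic
            yield out                    # system side: emit result tuples
    return rvf


def map_words(rows: List[Row]) -> List[Row]:
    """User-function for the map step: one (word, 1) row per word."""
    return [(word, 1) for (line,) in rows for word in line.split()]


def reduce_counts(rows: List[Row]) -> List[Row]:
    """User-function for the reduce step: sum the counts per word."""
    totals = defaultdict(int)
    for word, n in rows:
        totals[word] += n
    return sorted(totals.items())


# Relation-in, relation-out functions compose like relational operators.
map_rvf = rvf_shell(map_words)
reduce_rvf = rvf_shell(reduce_counts)


def word_count(lines: Relation) -> List[Row]:
    return list(reduce_rvf(map_rvf(lines)))


if __name__ == "__main__":
    print(word_count([("a b a",), ("b c",)]))
```

Because both wrapped functions take a relation and return a relation, they chain directly, which is the composability property that the abstract says ordinary UDFs lack.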

Citations

•Proceedings Article•10.1145/1735688.1735706

Accelerating SQL database operations on a GPU with CUDA

Peter Bakkum, +1 more

- 14 Mar 2010

TL;DR: This paper implements a subset of the SQLite command processor directly on the GPU, reducing the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.

307

Patent

Query Execution Systems and Methods

Daniel J. Abadi, +1 more

- 29 Jun 2012

TL;DR: A system, method, and computer program product for processing a query are disclosed, which include partitioning the stored data into a plurality of partitions based on at least one vertex in a plurality of vertices.

240

•Journal Article•10.14778/1988776.1988778

Column-oriented storage techniques for MapReduce

Avrilia Floratou, +3 more

- 01 Apr 2011

TL;DR: This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs and introduces a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost.

140

Patent

System and method for data stream processing

Meichun Hsu, +1 more

- 19 Oct 2010

TL;DR: A method and system for processing a data stream are described. Until the occurrence of a cut condition, a map function from a set of query processing steps is executed to generate map results for a first portion of the data stream, and a reduce function from the same set is executed to generate history-sensitive data from the map results.

128

Patent

Parallel processing of data

Craig D. Chambers, +6 more

- 04 May 2011

TL;DR: A data-parallel pipeline is defined that specifies multiple parallel data objects, each containing multiple elements, and multiple parallel operations that operate on those objects. Based on the pipeline, a dataflow graph is generated, and one or more graph transformations may be applied to it to produce a revised dataflow graph.

120

References

Proceedings Article•10.1145/1272996.1273005

Dryad: distributed data-parallel programs from sequential building blocks

Michael Isard, +4 more

- 21 Mar 2007

TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

3K

Journal Article•10.14778/1454159.1454167

PNUTS: Yahoo!'s hosted data serving platform

Brian F. Cooper, +8 more

- 01 Aug 2008

TL;DR: PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees and utilizes automated load-balancing and failover to reduce operational complexity.

1.1K

Journal Article•10.14778/1454159.1454166

SCOPE: easy and efficient parallel processing of massive data sets

Ronnie Chaiken, +6 more

- 01 Aug 2008

TL;DR: SCOPE (Structured Computations Optimized for Parallel Execution) is a new declarative and extensible scripting language targeted at massive data analysis, designed for ease of use with no explicit parallelism while being amenable to efficient parallel execution on large clusters.

908

Proceedings Article•10.1145/1247480.1247602

Map-reduce-merge: simplified relational data processing on large clusters

Hung-chih Yang, +3 more

- 11 Jun 2007

TL;DR: A Merge phase is added to Map-Reduce that can efficiently merge data already partitioned and sorted by map and reduce modules, and it is demonstrated that this new model can express relational algebra operators as well as implement several join algorithms.

851

•Proceedings Article

MonetDB/X100: Hyper-Pipelining Query Execution

Peter Boncz, +2 more

- 01 Jan 2005

TL;DR: An in-depth investigation into why database systems tend to achieve only low IPC on modern CPUs in compute-intensive application areas, leading to a new set of guidelines for designing a query processor and to a query engine for the MonetDB system that follows these guidelines.

595