Compile-Time Query Optimization for Big Data Analytics

Open Access

- 01 Jan 2019

- Vol. 5, Iss: 1, pp 35-61

TL;DR: This paper introduces DIQL, a query language for data-intensive scalable computing that is deeply embedded in Scala, together with a query optimization framework that optimizes and translates DIQL queries to byte code at compile time.

Abstract: Many emerging programming environments for large-scale data analysis, such as Map-Reduce, Spark, and Flink, provide Scala-based APIs that consist of powerful higher-order operations that ease the development of complex data analysis applications. However, despite the simplicity of these APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, most current data analysis query languages are based on the relational model and cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well-integrated with the host programming language, as they are based on an incompatible data model. To address these shortcomings, we introduce a new query language for data-intensive scalable computing that is deeply embedded in Scala, called DIQL, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time. In contrast to other query languages, our query embedding eliminates impedance mismatch as any Scala code can be seamlessly mixed with SQL-like syntax, without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer uses algebraic transformations to derive all possible joins in a query, including those hidden across deeply nested queries, thus unnesting nested queries of any form and any number of nesting levels. The optimizer also uses general transformations to push down predicates before joins and to prune unneeded data across operations. 
DIQL has been implemented on three Big Data platforms, Apache Spark, Apache Flink, and Twitter's Cascading/Scalding, and has been shown to have competitive performance relative to Spark DataFrames and Spark SQL for some complex queries. This paper extends our previous work on embedded data-intensive query languages by describing the complete details of the formal framework and the query translation and optimization processes, and by providing more experimental results that give further evidence of the performance of our system.
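
The unnesting and predicate push-down mentioned in the abstract can be illustrated with a small sketch. This is not DIQL syntax and not the paper's optimizer: it is plain Python over in-memory lists, and every name in it (`customers`, `orders`, `nested_query`, `unnested_query`) and all sample data are assumptions made for illustration. It only shows that a nested query, which rescans the inner collection for each outer row, yields the same result as a join in which the filter is applied before the join.

```python
from collections import defaultdict

# Hypothetical sample data; not taken from the paper.
customers = [
    {"id": 1, "name": "Ann", "country": "US"},
    {"id": 2, "name": "Bob", "country": "DE"},
    {"id": 3, "name": "Cy", "country": "US"},
]
orders = [
    {"cid": 1, "price": 30.0},
    {"cid": 1, "price": 12.5},
    {"cid": 2, "price": 99.0},
    {"cid": 3, "price": 7.5},
]


def nested_query(customers, orders):
    """Naive evaluation: the inner query rescans orders for every customer."""
    return [
        (c["name"], sum(o["price"] for o in orders if o["cid"] == c["id"]))
        for c in customers
        if c["country"] == "US"
    ]


def unnested_query(customers, orders):
    """Same result via a hash join, with the country filter applied first."""
    # Predicate push-down: filter customers before the join.
    us_customers = [c for c in customers if c["country"] == "US"]
    # Build side: aggregate order prices per join key in a single pass.
    totals = defaultdict(float)
    for o in orders:
        totals[o["cid"]] += o["price"]
    # Probe side: a customer with no orders gets 0.0, as in the nested form.
    return [(c["name"], totals.get(c["id"], 0.0)) for c in us_customers]


if __name__ == "__main__":
    print(nested_query(customers, orders))
    print(unnested_query(customers, orders))
```

The nested form performs on the order of |customers| × |orders| comparisons, whereas the join form scans each collection once. An optimizer that performs this kind of rewrite automatically spares the programmer from restructuring the query by hand.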

Citations

Posted Content

Scalable Querying of Nested Data

Jaclyn Smith, +3 more

- 12 Nov 2020

- arXiv: Databases

TL;DR: This work proposes a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated, and provides an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.

Journal Article•10.1016/j.scico.2021.102764

A two-level formal model for Big Data processing programs

Graham Curry

- 01 Mar 2022

- Science of Computer Programming

TL;DR: The authors propose a model for specifying data-flow-based parallel data processing programs that is agnostic of the target Big Data processing framework, focusing on the formal abstract specification of non-iterative and iterative programs.

Journal Article•10.14445/23488549/ijece-v11i5p109

Emerging Trends in Data Science and Big Data Analytics: A Bibliometric Analysis

Abdulaziz Yasin Nageye, +3 more

- 31 May 2024

- SSRG international journal of electronic...

TL;DR: A bibliometric analysis of trends in Data Science and Big Data Analytics research from 2010 to March 2024, identifying key trends, patterns, and dynamics within the field.

Posted Content

An Abstract View of Big Data Processing Programs.

João Batista de Souza Neto, +3 more

- 06 Aug 2021

- arXiv: Software Engineering

TL;DR: The authors propose a model for specifying data-flow-based parallel data processing programs that is agnostic of the target Big Data processing framework. The model focuses on the formal abstract specification of non-iterative and iterative programs and generalizes the strategies adopted by data-flow Big Data processing frameworks.

Book Chapter•10.1007/978-3-030-63882-5_7

Modeling Big Data Processing Programs

João Batista de Souza Neto, +3 more

- 25 Nov 2020

TL;DR: This model generalizes the data flow programming style implemented by systems such as Apache Spark, DryadLINQ, Apache Beam and Apache Flink and uses Monoid Algebra to model operations over distributed, partitioned datasets and Petri Nets to represent the data/control flow.

Related Papers (5)

Compile-Time Code Generation for Embedded Data-Intensive Query Languages

Integrating Big Data and Relational Data with a Functional SQL-like Query Language

Versatile XQuery Processing in MapReduce

MeshSQL: the query language for simulation mesh data

Efficient Support for Time Series Queries in Data Stream Management Systems