Open Access
Compile-Time Query Optimization for Big Data Analytics
Leonidas Fegaras
- 01 Jan 2019
- Vol. 5, Iss: 1, pp 35-61
TL;DR: This paper introduces DIQL, a new query language for data-intensive scalable computing that is deeply embedded in Scala, together with a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time.
Abstract: Many emerging programming environments for large-scale data analysis, such as Map-Reduce, Spark, and Flink, provide Scala-based APIs that consist of powerful higher-order operations that ease the development of complex data analysis applications. However, despite the simplicity of these APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, most current data analysis query languages are based on the relational model and cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well-integrated with the host programming language, as they are based on an incompatible data model. To address these shortcomings, we introduce a new query language for data-intensive scalable computing that is deeply embedded in Scala, called DIQL, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time. In contrast to other query languages, our query embedding eliminates impedance mismatch as any Scala code can be seamlessly mixed with SQL-like syntax, without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer uses algebraic transformations to derive all possible joins in a query, including those hidden across deeply nested queries, thus unnesting nested queries of any form and any number of nesting levels. The optimizer also uses general transformations to push down predicates before joins and to prune unneeded data across operations. 
DIQL has been implemented on three Big Data platforms, Apache Spark, Apache Flink, and Twitter's Cascading/Scalding, and has been shown to have competitive performance relative to Spark DataFrames and Spark SQL for some complex queries. This paper extends our previous work on embedded data-intensive query languages by describing the complete details of the formal framework and the query translation and optimization processes, and by providing more experimental results that give further evidence of the performance of our system.
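The unnesting and predicate push-down described in the abstract can be illustrated with a small sketch on in-memory collections. This is not DIQL syntax and not the paper's algorithm; it only shows, under assumed toy data, how a query with a nested subquery can be evaluated by an equivalent plan that filters first, groups once, and then performs an outer join. The names `customers`, `orders`, `nested_plan`, and `unnested_plan` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data; the field names are illustrative assumptions.
customers = [
    {"cid": 1, "name": "Ann"},
    {"cid": 2, "name": "Bob"},
    {"cid": 3, "name": "Cy"},
]
orders = [
    {"cid": 1, "price": 150.0},
    {"cid": 1, "price": 40.0},
    {"cid": 2, "price": 300.0},
    {"cid": 1, "price": 200.0},
]


def nested_plan(customers, orders, min_price):
    """Evaluate the inner query once per customer (a nested loop)."""
    return [
        (
            c["name"],
            sum(
                o["price"]
                for o in orders
                if o["cid"] == c["cid"] and o["price"] > min_price
            ),
        )
        for c in customers
    ]


def unnested_plan(customers, orders, min_price):
    """Equivalent plan: filter early, group once, then outer-join."""
    totals = defaultdict(float)
    for o in orders:
        if o["price"] > min_price:  # predicate applied before the join
            totals[o["cid"]] += o["price"]
    # Left outer join: customers without qualifying orders keep a total of 0.
    return [(c["name"], totals.get(c["cid"], 0.0)) for c in customers]


if __name__ == "__main__":
    print(nested_plan(customers, orders, 100.0))
    print(unnested_plan(customers, orders, 100.0))
```

Both functions return the same per-customer totals. The second scans `orders` once instead of once per customer, which is the kind of saving a join-based plan aims for on distributed data.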
Citations
•Posted Content
Scalable Querying of Nested Data
TL;DR: This work proposes a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated, and provides an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.
A two-level formal model for Big Data processing programs
TL;DR: In this article, the authors propose a model for specifying data flow-based parallel data processing programs agnostic of target Big Data processing frameworks, focusing on the formal abstract specification of non-iterative and iterative programs.
Emerging Trends in Data Science and Big Data Analytics: A Bibliometric Analysis
Abdulaziz Yasin Nageye, Abdukadir Dahir Jimale, Mohamed Omar Abdullahi, Yahye Abukar Ahmed +3 more
TL;DR: A bibliometric analysis exploring trends in Data Science and Big Data Analytics research from 2010 to March 2024, identifying key trends, patterns, and dynamics within the field.
•Posted Content
An Abstract View of Big Data Processing Programs.
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +3 more
TL;DR: In this paper, the authors propose a model for specifying data flow based parallel data processing programs agnostic of target Big Data processing frameworks, focusing on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow big data processing frameworks.
Modeling Big Data Processing Programs
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +3 more
- 25 Nov 2020
TL;DR: This model generalizes the data flow programming style implemented by systems such as Apache Spark, DryadLINQ, Apache Beam and Apache Flink and uses Monoid Algebra to model operations over distributed, partitioned datasets and Petri Nets to represent the data/control flow.