Implicit Parallelism through Deep Language Embedding

doi:10.1145/2723372.2750543

Proceedings Article•10.1145/2723372.2750543

Alexander Alexandrov, +7 more

- 27 May 2015

- Vol. 45, Iss: 1, pp 47-61

66

TL;DR: This paper proposes a language for complex data analysis embedded in Scala, which allows for declarative specification of dataflows and hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation.

Abstract: The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmers' productivity. In this paper we show that designing data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and reduces programmers' productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.
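
The contrast the abstract draws between declarative specification and library-based dataflow APIs can be illustrated with a small sketch. It is not the paper's language or API: a Python comprehension stands in for the Scala embedding as an assumption, and the data and the function names `declarative_join` and `explicit_dataflow_join` are hypothetical. The first function states only which result is wanted; the second spells out the filter, hash-build, and probe steps that a programmer would otherwise choose by hand.

```python
from collections import defaultdict

# Hypothetical toy data: (user_id, name) and (user_id, amount).
users = [(1, "ada"), (2, "bob"), (3, "cy")]
orders = [(1, 30.0), (1, 12.5), (3, 7.0), (4, 99.0)]


def declarative_join(users, orders, min_amount):
    """Comprehension form: describes the result, not the execution strategy."""
    return [
        (name, amount)
        for (uid, name) in users
        for (oid, amount) in orders
        if uid == oid and amount >= min_amount
    ]


def explicit_dataflow_join(users, orders, min_amount):
    """Operator form: filter, hash-build, and probe are chosen by the programmer."""
    # Filter early so that fewer records enter the join.
    filtered = filter(lambda order: order[1] >= min_amount, orders)
    # Build a hash table on the join key.
    amounts_by_user = defaultdict(list)
    for oid, amount in filtered:
        amounts_by_user[oid].append(amount)
    # Probe the hash table once per user.
    return [
        (name, amount)
        for uid, name in users
        for amount in amounts_by_user.get(uid, [])
    ]


if __name__ == "__main__":
    print(declarative_join(users, orders, 10.0))
    print(explicit_dataflow_join(users, orders, 10.0))
```

Both functions return the same pairs on this data. The point of the sketch is only that the second form is one of several execution plans that could, in principle, be derived from the first.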

Citations

Proceedings Article•10.1145/3035918.3054775

Data Management in Machine Learning: Challenges, Techniques, and Systems

Arun Kumar, +2 more

- 09 May 2017

TL;DR: This tutorial provides a comprehensive review of systems for advanced analytics, integrating ML algorithms and languages with existing data systems such as RDBMSs, and adapting data management-inspired techniques to new systems that target ML workloads.

149

•Journal Article•10.1007/S00778-018-0514-9

A survey of state management in big data processing systems

Quoc-Cuong To, +2 more

- 01 Dec 2018

TL;DR: This survey presents examples of state as an enabler, discusses the alternative approaches used to handle and implement state, captures the many facets of state management, and highlights new research directions.

68

•Journal Article•10.14778/3229863.3229865

On optimizing operator fusion plans for large-scale machine learning in SystemML

Matthias Boehm, +5 more

- 01 Aug 2018

TL;DR: This paper proposes a cost-based optimization framework for fusion plans and integrates it into Apache SystemML, covering candidate exploration and selection of fusion plans as well as code generation for local and distributed operations over dense, sparse, and compressed data.

66

•Journal Article•10.14778/3342263.3342633

An intermediate representation for optimizing machine learning pipelines

Andreas Kunft, +5 more

- 01 Jul 2019

TL;DR: Lara is presented, a declarative domain-specific language for collections and matrices with an intermediate representation (IR) that reflects on the complete program, i.e., UDFs, control flow, and both data types, to enable holistic optimization of ML training pipelines.

56

•Book

Data Management in Machine Learning Systems

Matthias Boehm, +2 more

- 25 Feb 2019

TL;DR: This work states that large-scale data analytics using machine learning (ML) underpins many modern data-driven applications and provides means of specifying and executing these ML workloads in an efficacious manner.

46

References

•Proceedings Article

Spark: cluster computing with working sets

Matei Zaharia, +4 more

- 22 Jun 2010

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

5.3K

•Proceedings Article

Measuring User Influence in Twitter: The Million Follower Fallacy

Meeyoung Cha, +3 more

- 16 May 2010

TL;DR: An in-depth comparison of three measures of influence, using a large amount of data collected from Twitter, is presented, suggesting that topological measures such as indegree alone reveal very little about the influence of a user.

3.5K

Proceedings Article•10.1145/1272996.1273005

Dryad: distributed data-parallel programs from sequential building blocks

Michael Isard, +4 more

- 21 Mar 2007

TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

3K

Proceedings Article•10.1145/582095.582099

Access path selection in a relational database management system

P. Griffiths Selinger, +4 more

- 30 May 1979

TL;DR: System R, an experimental database management system developed to carry out research on the relational model of data, chooses access paths for both simple (single-relation) queries and complex queries (such as joins), given a user specification of the desired data as a Boolean expression of predicates.

2.3K

Journal Article•10.14778/1687553.1687609

Hive: a warehousing solution over a map-reduce framework

Ashish Thusoo, +8 more

- 01 Aug 2009

TL;DR: Hadoop is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware.

1.8K