Proceedings Article, DOI: 10.1145/2723372.2750543
Implicit Parallelism through Deep Language Embedding
Alexander Alexandrov, Andreas Kunft, Asterios Katsifodimos, Felix Schüler, Lauritz Thamsen, Odej Kao, Tobias Herb, Volker Markl
- 27 May 2015
- Vol. 45, Iss: 1, pp 47-61
TL;DR: This paper proposes a language for complex data analysis embedded in Scala, which allows for declarative specification of dataflows and hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation.
Abstract: The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer's productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer's productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.
Citations
Data Management in Machine Learning: Challenges, Techniques, and Systems
Arun Kumar, Matthias Boehm, Jun Yang
- 09 May 2017
TL;DR: This tutorial provides a comprehensive review of systems for advanced analytics, integrating ML algorithms and languages with existing data systems such as RDBMSs, and adapting data management-inspired techniques to new systems that target ML workloads.
149
A survey of state management in big data processing systems
Quoc-Cuong To, Juan Soto, Volker Markl
- 01 Dec 2018
TL;DR: This survey presents examples of state as an enabler, discusses the alternative approaches used to handle and implement state, captures the many facets of state management, and highlights new research directions.
68
On optimizing operator fusion plans for large-scale machine learning in SystemML
Matthias Boehm, Berthold Reinwald, Dylan Hutchison, Prithviraj Sen, Alexandre V. Evfimievski, Niketan Pansare
- 01 Aug 2018
TL;DR: This paper proposes a cost-based optimization framework for fusion plans, integrated into Apache SystemML, and presents candidate exploration and selection of fusion plans as well as code generation of local and distributed operations over dense, sparse, and compressed data.
An intermediate representation for optimizing machine learning pipelines
Andreas Kunft, Asterios Katsifodimos, Sebastian Schelter, Sebastian Breß, Tilmann Rabl, Volker Markl
- 01 Jul 2019
TL;DR: Lara is presented, a declarative domain-specific language for collections and matrices with an intermediate representation (IR) that reflects on the complete program, i.e., UDFs, control flow, and both data types, to enable holistic optimization of ML training pipelines.
•Book
Data Management in Machine Learning Systems
Matthias Boehm, Arun Kumar, Jun Yang
- 25 Feb 2019
TL;DR: This work states that large-scale data analytics using machine learning (ML) underpins many modern data-driven applications and provides means of specifying and executing these ML workloads in an efficacious manner.
46
References
•Proceedings Article
Spark: cluster computing with working sets
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
- 22 Jun 2010
TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
•Proceedings Article
Measuring User Influence in Twitter: The Million Follower Fallacy
Meeyoung Cha, Hamed Haddadi, Fabrício Benevenuto, Krishna P. Gummadi
- 16 May 2010
TL;DR: An in-depth comparison of three measures of influence, using a large amount of data collected from Twitter, is presented, suggesting that topological measures such as indegree alone reveal very little about the influence of a user.
Dryad: distributed data-parallel programs from sequential building blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
- 21 Mar 2007
TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Access path selection in a relational database management system
P. Griffiths Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, T. G. Price
- 30 May 1979
TL;DR: System R is an experimental database management system developed to carry out research on the relational model of data. It chooses access paths for both simple (single-relation) queries and complex queries (such as joins), given a user specification of the desired data as a boolean expression of predicates.
Hive: a warehousing solution over a map-reduce framework
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy
- 01 Aug 2009
TL;DR: Hadoop is a popular open-source map-reduce implementation that is used as an alternative for storing and processing extremely large data sets on commodity hardware.