An intermediate representation for optimizing machine learning pipelines

doi:10.14778/3342263.3342633

Open Access•Journal Article•10.14778/3342263.3342633

Andreas Kunft, +5 more

- 01 Jul 2019

- Vol. 12, Iss: 11, pp 1553-1567

56

TL;DR: This paper presents Lara, a declarative domain-specific language for collections and matrices whose intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types, to enable holistic optimization of ML training pipelines.

Abstract: Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
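
To make the idea of operator fusion mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not Lara's API or IR: the functions `clean`, `featurize`, `fuse`, `run_unfused`, and `run_fused` and the row layout with fields `x` and `y` are hypothetical names chosen for illustration. It only shows that two consecutive per-row steps (a cleaning UDF followed by a feature-construction UDF) can be composed into one pass that avoids materializing the intermediate collection.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]


def clean(row: Row) -> Dict[str, float]:
    # Hypothetical cleaning UDF: replace missing values with 0.0.
    return {k: (0.0 if v is None else v) for k, v in row.items()}


def featurize(row: Dict[str, float]) -> List[float]:
    # Hypothetical feature UDF: build a small feature vector from two fields.
    return [row["x"], row["x"] * row["y"]]


def run_unfused(rows: List[Row]) -> List[List[float]]:
    # Two passes: the cleaned collection is fully materialized in between.
    cleaned = [clean(r) for r in rows]
    return [featurize(r) for r in cleaned]


def fuse(f: Callable, g: Callable) -> Callable:
    # Compose two per-row functions into a single per-row function.
    def fused(x):
        return g(f(x))
    return fused


def run_fused(rows: List[Row]) -> List[List[float]]:
    # One pass: each row is cleaned and featurized before moving on.
    step = fuse(clean, featurize)
    return [step(r) for r in rows]


rows: List[Row] = [{"x": 1.0, "y": 2.0}, {"x": None, "y": 3.0}]
print(run_unfused(rows) == run_fused(rows))
```

Both variants compute the same result. The fused version simply skips the intermediate list, which is the kind of saving a pipeline optimizer can look for when it sees the whole program rather than isolated steps.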

Citations

•Posted Content

The Pushshift Reddit Dataset

Jason Baumgartner, +4 more

- 23 Jan 2020

- arXiv: Social and Information Networks

TL;DR: The Pushshift Reddit dataset makes it possible for social media researchers to reduce time spent in the data collection, cleaning, and storage phases of their projects.

950

Digital Design And Computer Architecture

Daniela Fischer

- 01 Jan 2016

TL;DR: The digital design and computer architecture is universally compatible with any devices to read and is available in the digital library an online access to it is set as public so you can download it instantly.

346

•Proceedings Article•10.1145/3468264.3468536

Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline

Sumon Biswas, +1 more

- 20 Aug 2021

TL;DR: In this paper, the authors introduce a causal method of fairness to reason about the fairness impact of data preprocessing stages in an ML pipeline and leverage existing metrics to define fairness measures for the stages.

90

•Proceedings Article

Extending Relational Query Processing with ML Inference.

Konstantinos Karanasos, +12 more

- 01 Jan 2019

TL;DR: Raven is a system that leverages native integration of ML runtimes (i.e., ONNX Runtime) deep within SQL Server, together with a unified intermediate representation (IR), to enable advanced cross-optimizations between ML and DB operators.

57

•Proceedings Article•10.1145/3468264.3468536

Fair Preprocessing: Towards Understanding Compositional Fairness of Data Transformers in Machine Learning Pipeline

Sumon Biswas, +1 more

- 02 Jun 2021

- arXiv: Learning

TL;DR: In this paper, the authors introduce a causal method of fairness to reason about the fairness impact of data preprocessing stages in an ML pipeline and leverage existing metrics to define fairness measures for the stages.

49

References

•Proceedings Article

A study of cross-validation and bootstrap for accuracy estimation and model selection

Ron Kohavi

- 20 Aug 1995

TL;DR: The results indicate that for real-world datasets similar to the authors', the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.

12.7K

•Proceedings Article

Spark: cluster computing with working sets

Matei Zaharia, +4 more

- 22 Jun 2010

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

5.3K

Related Papers (5)

Implicit Parallelism through Deep Language Embedding

Alexander Alexandrov, +7 more

- 27 May 2015

LaraDB: A Minimalist Kernel for Linear and Relational Algebra Computation

Dylan Hutchison, +2 more

- 14 May 2017

Froid: optimization of imperative programs in a relational database

Karthik Ramachandra, +5 more

- 01 Dec 2017

KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics

Evan R. Sparks, +4 more

- 19 Apr 2017
