An intermediate representation for optimizing machine learning pipelines
Andreas Kunft,Asterios Katsifodimos,Sebastian Schelter,Sebastian Breß,Tilmann Rabl,Volker Markl +5 more
- 01 Jul 2019
- Vol. 12, Iss: 11, pp 1553-1567
TL;DR: This paper presents Lara, a declarative domain-specific language for collections and matrices whose intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types, to enable holistic optimization of ML training pipelines.
Abstract: Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
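As a rough illustration of what fusion across a type boundary can mean, the sketch below contrasts a pipeline that materializes each intermediate result with one that filters, applies a UDF, and aggregates in a single pass. It is plain Python, not Lara's DSL or IR; the names `unfused`, `fused`, `keep`, and `featurize`, as well as the sample records, are assumptions made up for this example.

```python
def unfused(rows, keep, featurize):
    """Three separate passes, each materializing an intermediate result."""
    filtered = [r for r in rows if keep(r)]        # relational filter on the collection
    matrix = [featurize(r) for r in filtered]      # UDF map: collection -> matrix rows
    n_cols = len(matrix[0]) if matrix else 0
    # Linear-algebra style aggregation: column sums over the materialized matrix.
    return [sum(row[j] for row in matrix) for j in range(n_cols)]


def fused(rows, keep, featurize):
    """Single pass: filter, UDF, and aggregation fused; no intermediate matrix."""
    sums = []
    for r in rows:
        if not keep(r):
            continue
        features = featurize(r)
        if not sums:
            sums = [0.0] * len(features)
        for j, x in enumerate(features):
            sums[j] += x
    return sums


# Hypothetical input records and UDFs, invented for this sketch.
rows = [
    {"age": 31.0, "income": 52.0},
    {"age": 17.0, "income": 0.0},
    {"age": 45.0, "income": 80.0},
]
keep = lambda r: r["age"] >= 18
featurize = lambda r: [r["age"] / 100.0, r["income"] / 100.0]

# Both variants compute the same column sums; only the number of passes differs.
assert unfused(rows, keep, featurize) == fused(rows, keep, featurize)
```

The fused variant never builds the filtered collection or the feature matrix. An optimizer can only derive such a rewrite when a single representation covers the filter, the UDF, and the matrix aggregation together, which is the kind of holistic view the abstract describes.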
Citations
•Posted Content
The Pushshift Reddit Dataset
TL;DR: The Pushshift Reddit dataset makes it possible for social media researchers to reduce time spent in the data collection, cleaning, and storage phases of their projects.
950
Digital Design And Computer Architecture
Daniela Fischer
- 01 Jan 2016
TL;DR: Digital Design and Computer Architecture is available in the digital library with public online access, so it can be downloaded instantly and read on any device.
346
Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline
Sumon Biswas,Hridesh Rajan +1 more
- 20 Aug 2021
TL;DR: The authors introduce a causal method of fairness to reason about the fairness impact of data preprocessing stages in an ML pipeline and leverage existing metrics to define fairness measures for those stages.
•Proceedings Article
Extending Relational Query Processing with ML Inference.
Konstantinos Karanasos,Matteo Interlandi,Doris Xin,Fotis Psallidas,Rathijit Sen,Kwanghyun Park,Ivan Popivanov,Supun Nakandala,Subru Krishnan,Markus Weimer,Yuan Yu,Raghu Ramakrishnan,Carlo Curino +12 more
- 01 Jan 2019
TL;DR: The authors build Raven, a system that leverages native integration of ML runtimes (i.e., ONNX Runtime) deep within SQL Server, and a unified intermediate representation (IR) to enable advanced cross-optimizations between ML and DB operators.
References
•Journal Article
Scikit-learn: Machine Learning in Python
Fabian Pedregosa,Gaël Varoquaux,Alexandre Gramfort,Vincent Michel,Bertrand Thirion,Olivier Grisel,Mathieu Blondel,Peter Prettenhofer,Ron Weiss,Vincent Dubourg,Jake Vanderplas,Alexandre Passos,David Cournapeau,Matthieu Brucher,Matthieu Perrot,Edouard Duchesnay +15 more
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
•Posted Content
Scikit-learn: Machine Learning in Python
Fabian Pedregosa,Gaël Varoquaux,Alexandre Gramfort,Vincent Michel,Bertrand Thirion,Olivier Grisel,Mathieu Blondel,Andreas Müller,Joel Nothman,Gilles Louppe,Peter Prettenhofer,Ron Weiss,Vincent Dubourg,Jake Vanderplas,Alexandre Passos,David Cournapeau,Matthieu Brucher,Matthieu Perrot,Edouard Duchesnay +18 more
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.
28.9K
•Proceedings Article
A study of cross-validation and bootstrap for accuracy estimation and model selection
Ron Kohavi
- 20 Aug 1995
TL;DR: The results indicate that for real-world datasets similar to the authors', the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
TensorFlow: a system for large-scale machine learning
Martín Abadi,Paul Barham,Jianmin Chen,Zhifeng Chen,Andy Davis,Jeffrey Dean,Matthieu Devin,Sanjay Ghemawat,Geoffrey Irving,Michael Isard,Manjunath Kudlur,Josh Levenberg,Rajat Monga,Sherry Moore,Derek G. Murray,Benoit Steiner,Paul A. Tucker,Vijay K. Vasudevan,Pete Warden,Martin Wicke,Yuan Yu,Xiaoqiang Zheng +21 more
- 02 Nov 2016
TL;DR: TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments, using dataflow graphs to represent computation, shared state, and the operations that mutate that state.
•Proceedings Article
Spark: cluster computing with working sets
Matei Zaharia,Mosharaf Chowdhury,Michael J. Franklin,Scott Shenker,Ion Stoica +4 more
- 22 Jun 2010
TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.