Journal Article, DOI: 10.1109/cluster59578.2024.00027
ML-Based Dynamic Operator-Level Query Mapping for Stream Processing Systems in Heterogeneous Computing Environments
Seung-Hwan Oh, Gordon Euhyun Moon, Sung Yong Park +2 more
- 24 Sep 2024
pp 226-237
TL;DR: DynO, a stream processing system, uses a tree-based machine learning algorithm to dynamically map queries to devices at the operator-level, optimizing performance by predicting execution times and leveraging GPU idle periods and prefetching.
Abstract: Mapping queries to optimal computing devices at the operator-level presents a significant challenge in stream processing systems (SPS) with heterogeneous computing resources. Inefficient query mapping can degrade the performance of the SPS. To address this issue, existing approaches employ static methods, such as mapping all queries to either CPUs or GPUs, or maintaining static mapping tables for queries or operators based on their predetermined device preferences. However, the static mapping scheme fails to provide an optimal solution, as the device preference for different query operators changes dynamically at runtime. In this paper, we propose DynO, a high-performance SPS that dynamically maps queries to devices at the operator-level using a tree-based machine learning algorithm. To effectively determine an optimized device mapping plan for query operators, DynO employs a tree-based gradient boosting model to accurately predict the execution time for all potential mapping plan combinations. DynO also introduces a novel turn-based updating scheme to maximize performance in stream processing while training a tree-based gradient boosting model. Additionally, we devise an efficient device mapping scheme to expedite the process of determining the optimal device mapping plan by leveraging a directed acyclic graph (DAG) shortest path algorithm. DynO completely hides any overhead caused by the extra computation needed to find the optimal plan by utilizing prefetching and GPU idle periods. Experimental results using a variety of queries and traffic patterns show that DynO outperforms existing state-of-the-art approaches by ensuring high throughput, low latency, and high efficiency.
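To make the DAG shortest-path idea concrete, the following is a minimal sketch and not DynO's actual implementation. It assumes a linear chain of operators, two devices, and a fixed penalty whenever consecutive operators run on different devices; the function name `best_mapping`, the cost values, and the transfer penalty are all hypothetical. In a real system the per-operator costs would come from a trained predictor such as a gradient boosting model. Each (operator, device) pair is a node in a layered DAG, and the cheapest path through the layers gives the mapping plan.

```python
from typing import Dict, List, Tuple

# Hypothetical device set; the abstract only says "heterogeneous" devices.
DEVICES = ("cpu", "gpu")


def best_mapping(
    exec_cost: List[Dict[str, float]],
    transfer_cost: float,
) -> Tuple[float, List[str]]:
    """Return (total cost, device per operator) for a linear operator chain.

    exec_cost[i][d] is the (assumed, e.g. model-predicted) time of operator i
    on device d. transfer_cost is an assumed fixed penalty for switching
    devices between consecutive operators.
    """
    # best[d] = (cost of the cheapest path ending with the current operator
    # on device d, the device sequence that achieves it)
    best = {d: (exec_cost[0][d], [d]) for d in DEVICES}

    # Relax edges layer by layer, which is a topological order of the DAG.
    for costs in exec_cost[1:]:
        nxt = {}
        for d in DEVICES:
            candidates = []
            for prev, (cost_so_far, path) in best.items():
                hop = 0.0 if prev == d else transfer_cost
                candidates.append((cost_so_far + hop + costs[d], path + [d]))
            nxt[d] = min(candidates, key=lambda t: t[0])
        best = nxt

    return min(best.values(), key=lambda t: t[0])


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    predicted = [
        {"cpu": 1.0, "gpu": 4.0},
        {"cpu": 6.0, "gpu": 2.0},
        {"cpu": 5.0, "gpu": 1.0},
    ]
    total, plan = best_mapping(predicted, transfer_cost=1.5)
    print(total, plan)
```

With D devices and N operators, this evaluates N·D² edges instead of enumerating all D^N mapping plans, which is the reason a shortest-path formulation can shorten the search.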