A technique for parallel query optimization using MapReduce framework and a semantic-based clustering method.
TL;DR: In this article, the authors apply and test a model for clustering large query datasets of varying sizes in parallel using MapReduce, and show that the parallel implementation of query workload clustering achieves good scalability.
Abstract: Query optimization is the process of identifying the best Query Execution Plan (QEP). The query optimizer produces a close-to-optimal QEP for a given query based on minimum resource usage. The difficulty is that a given query has many equivalent execution plans, each with its own execution cost, so producing an effective plan requires examining a large number of alternatives. Access plan recommendation is an alternative technique to database query optimization that reuses previously generated QEPs to execute new queries. In this technique, the query optimizer uses clustering methods to identify groups of similar queries. However, clustering large query datasets is challenging for traditional clustering algorithms because of the long processing time. Numerous cloud-based platforms, such as Hadoop, Hive, and Pig, have been introduced that offer low-cost solutions for processing distributed queries. This paper applies and tests a model for clustering large query datasets of varying sizes in parallel using MapReduce. The results demonstrate that the parallel implementation of query workload clustering achieves good scalability.
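The idea in the abstract, clustering queries in parallel and reusing a stored plan for the nearest cluster, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's method: k-means stands in for the semantic-based clustering algorithm, queries are assumed to be already encoded as numeric feature vectors, the map, shuffle, and reduce steps are simulated in a single Python process instead of running on Hadoop, and the plan labels are hypothetical.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for each query vector."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def shuffle(pairs):
    """Shuffle: group the emitted points by centroid index."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, centroids):
    """Reduce: recompute each centroid as the mean of its assigned points."""
    updated = list(centroids)  # centroids with no members stay where they are
    for key, members in groups.items():
        updated[key] = tuple(sum(col) / len(members) for col in zip(*members))
    return updated


def kmeans_mapreduce(points, centroids, iterations=10):
    """Repeat map -> shuffle -> reduce until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(iterations):
        updated = reduce_phase(shuffle(map_phase(points, centroids)), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


def recommend_plan(query_vec, centroids, plans):
    """Reuse the stored plan of the cluster closest to the new query."""
    idx = min(range(len(centroids)), key=lambda i: dist(query_vec, centroids[i]))
    return plans[idx]


# Hypothetical 2-D query feature vectors and one stored plan label per cluster.
queries = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids = kmeans_mapreduce(queries, [(0, 0), (10, 10)])
plan = recommend_plan((1, 1), centroids, ["plan_A", "plan_B"])
```

In a real deployment the map and reduce functions would run as distributed tasks over partitions of the query log, with one MapReduce job per k-means iteration; the single-process version above only shows how the work divides between the two phases.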
Citations
Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark
TL;DR: The results of the experiments demonstrated the effectiveness of parallel query clustering in achieving high scalability, and Apache Spark achieved better performance than Apache Hadoop, reaching an average speedup of 2x.
Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark
17 Sep 2022
TL;DR: In this paper, a MapReduce-based access plan recommendation method is proposed that clusters query datasets of different sizes in the query space based on their query execution plans (QEPs); performance is evaluated in terms of execution time.
Visual Dynamic Simulation Model of Unstructured Data in Social Networks
TL;DR: The experimental results show that the proposed design, which uses a Hadoop cluster, persists data in HDFS, uses MapReduce to extract data clusters for distributed computing, and builds a visual dynamic simulation model of unstructured data in social networks, has a good visualization effect and can effectively improve the stability and efficiency of unstructured data visualization in social networks.
Database Optimization Techniques with Logic Execution Optimization on Microservices Architecture
TL;DR: In this paper, the authors apply database optimization techniques together with logic execution optimization on a microservices architecture to improve query response time for accounting applications. The data are taken from the Accounting Harmony Accounting Module, which has an API (get-list-attachment) whose data come from Service Accounting (581253 records) and Service Users (2182 records).