Open Access Dissertation
Parallel Techniques for Big Data Analytics
Reza Akbarinia
- 24 May 2019
TL;DR: Experimental results show that, with parallel techniques, very large volumes of time series can be indexed and mined in execution times that traditional centralized approaches cannot achieve.
Abstract: Large volumes of data are now produced in many application domains, such as social networks, medical monitoring, weather forecasting, biology, agronomy, and earth monitoring. Analyzing this data can reveal hidden knowledge about events that have happened or may happen in the future. However, traditional data analytics techniques are not efficient at such data volumes. A promising way to improve the performance of data analytics is to exploit the computing power of distributed systems and parallel frameworks such as Spark. In this HDR manuscript, I describe my research on parallel and distributed techniques for two main data analytics problems: 1) similarity search over time series; 2) maximally informative k-itemsets mining. The first problem, similarity search over time series, is important for many applications, such as fraud detection in finance, earthquake prediction, and plant monitoring. Index construction is one of the most popular techniques for improving the performance of similarity queries, and it has been used successfully in a variety of settings and applications. In our research, we took advantage of parallel and distributed frameworks such as Spark and developed efficient solutions for the parallel construction of tree-based and grid-based indexes over large time series datasets. We also developed efficient algorithms for parallel, index-based similarity search over distributed time series datasets. The second problem, maximally informative k-itemsets mining (miki for short), is one of the fundamental building blocks for exploring informative patterns in databases. Efficient miki mining has a high impact on tasks such as supervised learning, unsupervised learning, and information retrieval, to cite a few.
A typical application is the discovery of discriminative sets of features, based on joint entropy, which makes it possible to distinguish between different categories of objects. With massive amounts of data, miki mining is very challenging because of the high number of entropy computations. An efficient miki mining solution should scale with the size of the itemsets, which calls for cutting-edge parallel algorithms and high-performance computation of miki. We developed such a parallel solution, which makes the discovery of miki from very large databases (up to terabytes of data) simple and effective.
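To make the role of joint entropy concrete, here is a minimal Python sketch of miki mining on a toy database. It is a centralized brute-force baseline under stated assumptions, not the parallel algorithm described in the manuscript. The function names (`joint_entropy`, `brute_force_miki`) and the toy data are illustrative and do not come from the source. Each item is treated as a binary feature (present or absent in a transaction), and the joint entropy of an itemset is computed over the observed presence patterns.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(transactions, itemset):
    """Joint entropy (in bits) of the presence/absence pattern of `itemset`."""
    n = len(transactions)
    # Project every transaction onto the itemset as a tuple of booleans,
    # then count how often each projected pattern occurs.
    patterns = Counter(
        tuple(item in t for item in itemset) for t in transactions
    )
    return -sum((c / n) * log2(c / n) for c in patterns.values())


def brute_force_miki(transactions, k):
    """Return one size-k itemset with maximal joint entropy, and its entropy.

    Every candidate of size k is evaluated, so the number of entropy
    computations grows combinatorially with the number of items and k.
    """
    items = sorted(set().union(*transactions))
    best = max(
        combinations(items, k),
        key=lambda s: joint_entropy(transactions, s),
    )
    return best, joint_entropy(transactions, best)


# Hypothetical toy database: each transaction is a set of items.
toy_db = [
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
    {"c"},
    {"a"},
    {"b"},
    set(),
]

if __name__ == "__main__":
    itemset, entropy = brute_force_miki(toy_db, 2)
    print(itemset, entropy)
```

The exhaustive loop over `combinations(items, k)` illustrates why the number of entropy computations becomes the bottleneck on large databases, which is the cost a parallel solution has to distribute.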