Parallel Techniques for Big Data Analytics

Open Access Dissertation

- 24 May 2019

2

TL;DR: The experimental results show that, with the parallel techniques, very large volumes of time series can be indexed and mined in very short execution times that are impossible to achieve with traditional centralized approaches.

Abstract: We are witnessing the production of large volumes of data in many application domains, such as social networks, medical monitoring, weather forecasting, biology, agronomy, and earth monitoring. Analyzing this data would help us extract hidden knowledge about events that have happened or will happen in the future. However, traditional data analytics techniques are not efficient for analyzing such data volumes. A promising solution for improving the performance of data analytics is to take advantage of the computing power of distributed systems and parallel frameworks such as Spark. In this HDR manuscript, I describe my research activities for developing parallel and distributed techniques to deal with two main data analytics problems: 1) similarity search over time series; 2) maximally informative k-itemsets mining. The first problem, similarity search over time series, is very important for many applications such as fraud detection in finance, earthquake prediction, and plant monitoring. Index construction is one of the most popular techniques for improving the performance of similarity queries, and it has been used successfully in a variety of settings and applications. In our research activities, we took advantage of parallel and distributed frameworks such as Spark and developed efficient solutions for the parallel construction of tree-based and grid-based indexes over large time series datasets. We also developed efficient algorithms for parallel similarity search over distributed time series datasets using indexes. The second problem, maximally informative k-itemsets mining (miki for short), is one of the fundamental building blocks for exploring informative patterns in databases. Efficient miki mining has a high impact on various tasks such as supervised learning, unsupervised learning, and information retrieval, to cite a few.
A typical application is the discovery of discriminative sets of features, based on joint entropy, which allows distinguishing between different categories of objects. With massive amounts of data, miki mining is very challenging because of the high number of entropy computations. An efficient miki mining solution should scale up with the increase in the size of the itemsets, calling for cutting-edge parallel algorithms and high-performance computation of miki. We developed such a parallel solution that makes the discovery of miki from a very large database (up to terabytes of data) simple and effective.
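
To make the joint-entropy criterion concrete, here is a minimal single-machine sketch of miki selection. It is not the parallel algorithm described in the manuscript: it enumerates every k-itemset by brute force, which also shows why the number of entropy computations becomes the bottleneck. The function names (`joint_entropy`, `brute_force_miki`) and the toy transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(transactions, itemset):
    """Joint entropy (in bits) of the presence/absence pattern of `itemset`.

    Each transaction is projected onto the itemset as a tuple of booleans,
    and the entropy of the resulting empirical distribution is returned.
    """
    n = len(transactions)
    patterns = Counter(
        tuple(item in t for item in itemset) for t in transactions
    )
    return -sum((c / n) * log2(c / n) for c in patterns.values())


def brute_force_miki(transactions, k):
    """Return the k-itemset with the highest joint entropy (illustrative only).

    The search evaluates every combination of k items, so its cost grows
    combinatorially with the number of items and with k.
    """
    items = sorted(set().union(*transactions))
    return max(
        combinations(items, k),
        key=lambda s: joint_entropy(transactions, s),
    )


# Toy database (hypothetical): "c" occurs everywhere, so it carries no information.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c"}]
best = brute_force_miki(transactions, 2)
print(best, joint_entropy(transactions, best))
```

In this toy database the pair of items "a" and "b" splits the four transactions into four distinct presence/absence patterns, while any pair containing "c" yields only two, so the former is the more informative itemset.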

Citations

Data Mining - Concepts and Techniques.

Petra Perner

- 01 Jan 2002

14.6K

Journal Article•10.62411/jcta.10106

Enhancing Lung Cancer Classification Effectiveness Through Hyperparameter-Tuned Support Vector Machine

Fita Sheila Gomiasti, +4 more

- 25 Mar 2024

TL;DR: This study enhances lung cancer classification effectiveness using hyperparameter-tuned Support Vector Machines (SVMs) with Radial Basis Function (RBF) kernels, achieving improved accuracy, precision, specificity, and F1 score, with optimal parameters C=10, Gamma=10, and Probability=True.

6
