Real-Time Big Data Stream Processing Using GPU with Spark Over Hadoop Ecosystem

doi:10.1007/S10766-017-0513-2

Journal Article•10.1007/S10766-017-0513-2

M. Mazhar Rathore, +4 more

- 01 Jun 2018

- International Journal of Parallel Progra...

- Vol. 46, Iss: 3, pp 630-646

83

TL;DR: Results show that the proposed system, which runs Spark on top of GPUs within the parallel and distributed environment of the Hadoop ecosystem, is more efficient and closer to real time than the existing standalone CPU-based MapReduce implementation.

Abstract: In this technological era, people, authorities, entrepreneurs, businesses, and many of the things around us are connected to the internet, forming the Internet of Things (IoT). This generates a massive amount of diverse data at very high speed, termed Big Data. This data is very useful: businesses, organizations, and authorities can use it as an asset to predict the future in various respects. However, efficiently processing Big Data while making real-time decisions is quite a challenging task. Tools such as Hadoop are used for processing big datasets, but they do not perform well for real-time, high-speed stream processing. Therefore, in this paper, we propose an efficient, real-time Big Data stream processing approach that maps a Hadoop MapReduce equivalent mechanism onto graphics processing units (GPUs). We integrate the parallel and distributed environment of the Hadoop ecosystem and a real-time stream processing tool, Spark, with GPUs to make the system powerful enough to handle an overwhelming amount of high-speed streaming data. We design a MapReduce equivalent algorithm for GPUs that calculates statistical parameters by dividing the overall Big Data files into fixed-size blocks. Finally, the system is evaluated for efficiency (processing time and throughput) using (1) large city traffic video data, captured by the cameras of static as well as moving vehicles, in which vehicles are identified, and (2) large text-based files, such as Twitter data files, structural data, etc. Results show that the proposed system, which runs Spark on top of GPUs within the parallel and distributed environment of the Hadoop ecosystem, is more efficient and closer to real time than the existing standalone CPU-based MapReduce implementation.
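
The block-wise, MapReduce-style calculation of a statistical parameter mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' GPU algorithm: it runs on the CPU in plain Python, and the function names (split_into_blocks, map_block, combine, block_statistics), the choice of mean and variance as the statistics, and the block size are all assumptions made for illustration. The point it shows is that each fixed-size block can be summarized independently (the map step, which is the part a GPU could run in parallel) and the partial summaries can then be merged in any order (the reduce step).

```python
from functools import reduce
from typing import Iterator, List, Tuple

# Partial summary of one block: (count, sum, sum of squares).
Partial = Tuple[int, float, float]


def split_into_blocks(values: List[float], block_size: int) -> Iterator[List[float]]:
    """Divide the input into fixed-size blocks (the last block may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(values), block_size):
        yield values[start:start + block_size]


def map_block(block: List[float]) -> Partial:
    """Map step: summarize one block without looking at any other block."""
    return (len(block), float(sum(block)), float(sum(x * x for x in block)))


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial summaries (associative and commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def block_statistics(values: List[float], block_size: int) -> Tuple[float, float]:
    """Return (mean, population variance) computed from per-block summaries."""
    partials = map(map_block, split_into_blocks(values, block_size))
    count, total, total_sq = reduce(combine, partials, (0, 0.0, 0.0))
    if count == 0:
        raise ValueError("no data to summarize")
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance


if __name__ == "__main__":
    # Hypothetical input; block size 3 leaves a shorter final block.
    mean, variance = block_statistics([1, 2, 3, 4, 5, 6, 7, 8], block_size=3)
    print(mean, variance)
```

Because combine is associative and commutative, the partial summaries could be produced by different workers and merged in any order. The sum-of-squares formula for variance is chosen here for brevity and can lose precision on large values, so a production version would likely use a numerically stabler merge.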

Citations

Journal Article•10.1016/J.SCS.2017.12.022

Exploiting IoT and big data analytics: Defining Smart Digital City using real-time urban data

M. Mazhar Rathore, +5 more

- 01 Jul 2018

- Sustainable Cities and Society

TL;DR: This paper has established an IoT-based Smart City by using Big Data analytics while harvesting real-time data from the city by using existing smart systems and IoT devices as city data sources to develop the Smart Digital City.

339

Journal Article•10.1016/J.SCS.2020.102231

Smart campus-A sketch.

Nasro Min-Allah, +1 more

- 08 May 2020

- Sustainable Cities and Society

TL;DR: A list of smart campus initiatives that can be prioritized according to a university's needs and geographical location is created, and the generic model established in this work for a smart campus remains valid.

126

Journal Article•10.1002/WIDM.1232

A survey on graphic processing unit computing for large‐scale data mining

Alberto Cano

- 01 Jan 2018

- Wiley Interdisciplinary Reviews-Data Min...

TL;DR: This survey analyzes current trends in the use of GPU computing for large‐scale data mining, discusses GPU architecture advantages for handling volume and velocity of data, identifies limitation factors hampering the scalability of the problems, and discusses open issues and future directions.

98

Journal Article•10.1186/S40537-020-00388-5

A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

Nasim Ahmed, +3 more

- 17 Aug 2020

- Journal of Big Data

TL;DR: Spark has better performance as compared to Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.

82

Journal Article•10.1016/J.SCS.2020.102264

Key performance indicators for Smart Campus and Microgrid

Saleh Alrashed

- 01 Sep 2020

- Sustainable Cities and Society

TL;DR: The aim of this work is to establish a mechanism that allows campus management to monitor the smartness of their university campus in general, and microgrid in particular.

60

References

Journal Article•10.21276/IJRE.2018.5.5.4

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 06 Dec 2004

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.

22.7K

Journal Article•10.1145/1327452.1327492

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008

- Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

18.6K

Book

LAPACK Users' Guide

Ed Anderson

- 01 Feb 1995

TL;DR: The third edition of the LAPACK Users' Guide provides a guide to troubleshooting and installation of routines, as well as examples of how to convert from LINPACK or EISPACK to LAPACK.

3.2K

Journal Article•10.1006/JPDC.2000.1714

A Comparison of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems

Tracy D. Braun, +10 more

- 01 Jun 2001

- Journal of Parallel and Distributed Comp...

TL;DR: It is shown that, for the cases studied here, the relatively simple Min-min heuristic performs well in comparison to the other techniques; the study also provides an even basis for comparison and insights into circumstances where one technique will outperform another.

1.9K

GPU Computing

John D. Owens, +5 more

- 01 May 2008

TL;DR: The background, hardware, and programming model for GPU computing are described, the state of the art in tools and techniques is summarized, and four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications are presented.

1.7K