Scheduling algorithm based on prefetching in MapReduce clusters

doi:10.1016/J.ASOC.2015.04.039

Open Access•Journal Article•10.1016/J.ASOC.2015.04.039

Mingming Sun, +4 more

- 01 Jan 2016

- Vol. 38, pp 1109-1118

31

TL;DR: Experimental results show that HPSO (High Performance Scheduling Optimizer), a prefetching-based task scheduler that improves data locality for MapReduce jobs, reduces the number of map tasks that incur remote data access delay and improves the performance of Hadoop clusters.

Highlights:

- We explain the architecture of the prefetching module in detail in Section 4.4.
- We detail the framework of HPSO by example in Section 4.1.
- We modify the scheduling algorithm based on prefetching to fully exploit potential map tasks with data locality in Section 4.3.1. This method has the advantage of reducing network transmission. Furthermore, to avoid invalid data prefetching, we consider only the nodes whose remaining time is less than the threshold Tunder.
- We conduct a series of experiments to evaluate the performance of the proposed system using five different applications (Section 5).
- A survey of state-of-the-art methods for improving data locality is conducted in Section 6.

Abstract: Due to cluster resource competition and task scheduling policy, some map tasks are assigned to nodes that do not hold their input data, which causes significant data access delay. Data locality is becoming one of the most critical factors affecting the performance of MapReduce clusters. Because machines in MapReduce clusters have large memory capacities that are often underutilized, prefetching input data into memory is an effective way to improve data locality. However, deciding what and when to prefetch still poses serious challenges to cluster designers. To use prefetching effectively, we have built HPSO (High Performance Scheduling Optimizer), a prefetching-based task scheduler that improves data locality for MapReduce jobs. The basic idea is to predict the most appropriate nodes for future map tasks based on the current pending tasks and then preload the needed data into memory without delaying the launch of new tasks. To this end, we have implemented HPSO in Hadoop-1.1.2. Experimental results show that the method reduces the number of map tasks that incur remote data access delay and improves the performance of Hadoop clusters.
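The prefetching idea described in the abstract can be illustrated with a minimal sketch. This is not the HPSO implementation: the `Node` and `MapTask` classes, the `plan_prefetch` function, and the policy of handing the oldest pending task to a node with no local data are all hypothetical names and assumptions introduced here for illustration. The only details taken from the abstract are that candidate nodes are chosen from the current pending tasks and that only nodes whose remaining time is below a threshold (Tunder, here `t_under`) are considered.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    """Hypothetical view of a worker node as seen by the scheduler."""
    name: str
    remaining_time: float  # estimated seconds until a map slot frees up
    local_blocks: Set[str] = field(default_factory=set)   # blocks on local disk
    memory_blocks: Set[str] = field(default_factory=set)  # blocks already in memory


@dataclass
class MapTask:
    """Hypothetical pending map task reading one input block."""
    task_id: str
    block: str


def plan_prefetch(nodes: List[Node], pending: List[MapTask],
                  t_under: float) -> List[Tuple[str, str, str]]:
    """Return (node name, task id, block) prefetch requests.

    Assumed policy: only nodes about to free a slot (remaining time below
    t_under) are considered, soonest first. A node that already holds the
    data of some pending task is matched to that task and needs no prefetch;
    otherwise the oldest pending task's block is scheduled for prefetching.
    """
    plan = []
    queue = list(pending)  # work on a copy so the caller's list is untouched
    candidates = sorted(
        (n for n in nodes if n.remaining_time < t_under),
        key=lambda n: n.remaining_time,
    )
    for node in candidates:
        if not queue:
            break
        available = node.local_blocks | node.memory_blocks
        local_task = next((t for t in queue if t.block in available), None)
        if local_task is not None:
            queue.remove(local_task)  # data-local already, nothing to prefetch
            continue
        task = queue.pop(0)
        plan.append((node.name, task.task_id, task.block))
    return plan


nodes = [
    Node("n1", remaining_time=5.0, local_blocks={"b1"}),
    Node("n2", remaining_time=8.0),
    Node("n3", remaining_time=60.0, local_blocks={"b3"}),
]
pending = [MapTask("t1", "b1"), MapTask("t2", "b2"), MapTask("t3", "b3")]
print(plan_prefetch(nodes, pending, t_under=10.0))
```

In this toy run, `n1` already holds the block for `t1`, `n3` is excluded because its remaining time exceeds the threshold, and only `n2` receives a prefetch request. Filtering by the threshold reflects the abstract's point about avoiding invalid prefetching: data loaded for a node that will not free a slot soon may never be used.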

Citations

Journal Article•10.1016/J.JBUSRES.2016.08.001

Critical analysis of Big Data challenges and analytical methods

Uthayasankar Sivarajah, +3 more

- 01 Jan 2017

- Journal of Business Research

TL;DR: In this article, the authors present a state-of-the-art review offering a holistic view of Big Data (BD) challenges and Big Data Analytics (BDA) methods theorized, proposed, or employed by organizations, to help others understand this landscape and make robust investment decisions.

1.8K

Journal Article•10.1016/J.ASOC.2018.04.030

A novel adaptive e-learning model based on Big Data by using competence-based knowledge and social learner activities

Marouane Birjali, +2 more

- 01 Aug 2018

- Applied Soft Computing

TL;DR: An adaptive e-learning model based on MapReduce-based Social Networks Analysis for determining the learner motivation and social productivity in order to assign a specific learning rhythm to each learner is proposed.

68

Journal Article•10.1109/COMST.2021.3094993

Big Data Resource Management & Networks: Taxonomy, Survey, and Future Directions

Feras M. Awaysheh, +4 more

- 09 Jul 2021

- IEEE Communications Surveys and Tutorial...

TL;DR: This study offers a comprehensive review of the architectural elements of Big Data (BD) batch query deployment models and environments, and provides an extensive survey of modern BD deployment architectures, categorizing them by their underlying infrastructure.

58

Journal Article•10.1016/J.FUTURE.2018.07.043

A data locality based scheduler to enhance MapReduce performance in heterogeneous environments

Nenavath Srinivas Naik, +3 more

- 01 Jan 2019

- Future Generation Computer Systems

TL;DR: The experimental results show that the proposed scheduler enhances MapReduce performance in heterogeneous environments and improves data locality for different parameters compared with the Hadoop default scheduler, the Matchmaking scheduler, and the Delay scheduler.

57

Journal Article•10.1007/S11227-019-03136-6

Detecting Straggler MapReduce Tasks in Big Data Processing Infrastructure by Neural Network

Amir Javadpour, +3 more

- 13 Apr 2020

- arXiv: Distributed, Parallel, and Cluste...

TL;DR: The proposed method applies a backpropagation neural network on Hadoop to detect straggling tasks by estimating the remaining execution time of tasks, which is very important in straggler task detection.

29

References

Journal Article•10.21276/IJRE.2018.5.5.4

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 06 Dec 2004

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

22.7K

Journal Article•10.1145/1327452.1327492

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008

- Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

18.6K

Proceedings Article•10.1109/MSST.2010.5496972

The Hadoop Distributed File System

Konstantin Shvachko, +3 more

- 03 May 2010

TL;DR: The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.

5.8K

Proceedings Article

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Matei Zaharia, +8 more

- 25 Apr 2012

TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.

4.6K

Proceedings Article•10.5555/1855741.1855744

Improving MapReduce performance in heterogeneous environments

Matei Zaharia, +4 more

- 08 Dec 2008

TL;DR: A new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.

1.9K