Scheduling algorithm based on prefetching in MapReduce clusters
Mingming Sun, Hang Zhuang, Changlong Li, Kun Lu, Xuehai Zhou
- 01 Jan 2016
- Vol. 38, pp 1109-1118
TL;DR: Experimental results show that HPSO (High Performance Scheduling Optimizer), a prefetching-service-based task scheduler that improves data locality for MapReduce jobs, reduces the number of map tasks that incur remote data access delay and improves the performance of Hadoop clusters.
Highlights:
- We explain the architecture of the prefetching module in detail in Section 4.4.
- We detail the HPSO framework by example in Section 4.1.
- We modify the scheduling algorithm, based on prefetching, to fully exploit potential map tasks with data locality (Section 4.3.1). This method reduces network transmission. Furthermore, to avoid invalid data prefetching, we consider only the subset of nodes whose remaining time is less than the threshold T_under.
- We conduct a series of experiments to evaluate the performance of the proposed system using five different applications (Section 5).
- We survey state-of-the-art methods for improving data locality in Section 6.
Abstract: Due to cluster resource competition and the task scheduling policy, some map tasks are assigned to nodes that do not hold their input data, which causes significant data access delay. Data locality is becoming one of the most critical factors affecting the performance of MapReduce clusters. Because machines in MapReduce clusters have large memory capacities that are often underutilized, prefetching input data into memory is an effective way to improve data locality. However, deciding what to prefetch and when still poses serious challenges to cluster designers. To use prefetching effectively, we have built HPSO (High Performance Scheduling Optimizer), a prefetching-service-based task scheduler that improves data locality for MapReduce jobs. The basic idea is to predict the most appropriate nodes for future map tasks based on the current pending tasks and then preload the needed data into memory without delaying the launch of new tasks. To this end, we have implemented HPSO in Hadoop-1.1.2. The experimental results show that the method reduces the number of map tasks that incur remote data delay and improves the performance of Hadoop clusters.
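The prefetching idea described in the abstract can be illustrated with a small sketch. This is not HPSO's actual implementation, which the abstract does not spell out; it is one possible reading of "predict the most appropriate nodes for pending map tasks and preload their data, considering only nodes whose remaining time is below a threshold". The names `Node`, `MapTask`, `plan_prefetch`, and `t_under`, as well as the greedy matching order, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    remaining_time: float  # assumed: estimated seconds until a map slot frees up
    local_blocks: set = field(default_factory=set)  # blocks stored on this node
    prefetched: set = field(default_factory=set)    # blocks preloaded into memory


@dataclass
class MapTask:
    task_id: str
    block: str  # input block the task reads


def plan_prefetch(nodes, pending_tasks, t_under):
    """Return (node name, block) prefetch decisions for pending map tasks.

    Hypothetical greedy policy: only nodes that will free a slot within
    t_under are considered, so data is not preloaded too far ahead of use.
    """
    plan = []
    claimed = set()  # ids of pending tasks already matched to a node
    candidates = sorted(
        (n for n in nodes if n.remaining_time < t_under),
        key=lambda n: n.remaining_time,
    )
    for node in candidates:
        unclaimed = [t for t in pending_tasks if t.task_id not in claimed]
        if not unclaimed:
            break
        # A pending task whose input is already local needs no prefetch.
        local = next((t for t in unclaimed if t.block in node.local_blocks), None)
        if local is not None:
            claimed.add(local.task_id)
            continue
        # Otherwise preload the input of the next pending task into memory.
        task = unclaimed[0]
        claimed.add(task.task_id)
        node.prefetched.add(task.block)
        plan.append((node.name, task.block))
    return plan


nodes = [
    Node("n1", 5.0, {"b1"}),
    Node("n2", 8.0),
    Node("n3", 60.0),
]
tasks = [MapTask("t1", "b1"), MapTask("t2", "b2"), MapTask("t3", "b3")]
plan = plan_prefetch(nodes, tasks, t_under=10.0)
print(plan)
```

In this example, n1 already holds the input of t1, so nothing is prefetched for it; n2 has no local input for any pending task, so the input of t2 is preloaded there; n3 is skipped because its remaining time exceeds the threshold.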