Benchmarking MapReduce Implementations for Application Usage Scenarios
Zacharia Fadika, Elif Dede, Madhusudhan Govindaraju, Lavanya Ramakrishnan
- 21 Sep 2011
- pp 90-97
TL;DR: This paper presents a standard benchmark suite for quantifying, comparing, and contrasting the performance of MapReduce platforms under a wide range of representative use cases, allowing Grid researchers to choose the MapReduce implementation that best suits their application's needs.
Abstract: The MapReduce paradigm provides a scalable model for large-scale data-intensive computing and associated fault tolerance. With data production increasing daily due to ever-growing application needs, scientific endeavors, and consumption, the MapReduce model and its implementations need to be further evaluated, improved, and strengthened. Several MapReduce frameworks with various degrees of conformance to the key tenets of the model are available today, each optimized for specific features. HPC application and middleware developers must therefore understand the complex dependencies between MapReduce features and their applications. We present a standard benchmark suite for quantifying, comparing, and contrasting the performance of MapReduce platforms under a wide range of representative use cases. We report the performance of three different MapReduce implementations on the benchmarks and draw conclusions about their current performance characteristics. The three platforms we chose for evaluation are the widely used Apache Hadoop implementation; Twister, which has been discussed in the literature; and LEMO-MR, our own implementation. Our performance analysis also sheds light on the design decisions available to future implementations, and allows Grid researchers to choose the MapReduce implementation that best suits their application's needs.