1. What are the contributions mentioned in the paper "A comparison of approaches to large-scale data analysis"?
Although the basic control flow of the MapReduce (MR) framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, the authors describe and compare both paradigms. Furthermore, the authors evaluate both kinds of systems in terms of performance and development complexity. For each benchmark task, the authors measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes. The authors speculate about the causes of the dramatic performance difference between the two kinds of systems and consider implementation concepts that future systems should take from both kinds of architectures.
2. How many records are processed in the original MapReduce paper?
The measurements in the original MapReduce paper are based on processing 1TB of data on approximately 1800 nodes, which works out to about 5.6 million records, or roughly 535MB of data, per node.
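The per-node figures can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the 100-byte record size and the decimal reading of "1TB" are assumptions that are not stated in the answer above, and the variable names are illustrative. Under those assumptions the record count rounds to 5.6 million per node, while the per-node data volume lands in the same ballpark as the quoted "roughly 535MB" without matching it exactly (the result depends on whether decimal or binary units are used).

```python
# Rough sanity check of the per-node workload figures.
# ASSUMPTIONS (not stated in the answer above):
#   - "1TB" is read as 10**12 bytes (decimal terabyte)
#   - each record is 100 bytes
TOTAL_BYTES = 10**12
RECORD_BYTES = 100
NODES = 1800  # "approximately 1800 nodes"

total_records = TOTAL_BYTES // RECORD_BYTES
records_per_node = total_records / NODES
bytes_per_node = TOTAL_BYTES / NODES

# Report the data volume in both decimal (MB) and binary (MiB) units,
# since the source does not say which one it uses.
mb_per_node = bytes_per_node / 10**6
mib_per_node = bytes_per_node / 2**20

print(f"records per node: {records_per_node / 1e6:.1f} million")
print(f"data per node:    {mb_per_node:.0f} MB ({mib_per_node:.0f} MiB)")
```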
3. How long does it take to read the UserVisits and Rankings tables off disk?
It takes approximately 600 seconds of raw I/O to read the UserVisits and Rankings tables off disk, and then another 300 seconds to split, parse, and deserialize the various attributes.
4. What is the attractive aspect of the MapReduce programming model?
One of the attractive qualities about the MapReduce programming model is its simplicity: an MR program consists only of two functions, called Map and Reduce, that are written by a user to process key/value data pairs.
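To make the two-function model concrete, here is a minimal single-process sketch of a word count written as a Map and a Reduce function over key/value pairs. It is not how Hadoop or any real MR system is implemented: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the in-memory grouping step only stands in for the distributed shuffle a real framework would perform.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_fn(key: str, value: str) -> Iterator[tuple[str, int]]:
    """Map: for one (document id, text) pair, emit (word, 1) per word."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[tuple[str, int]]:
    """Reduce: for one word and all of its counts, emit the total."""
    yield key, sum(values)


def run_mapreduce(
    records: Iterable[tuple[str, str]],
    mapper: Callable[[str, str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Iterator[tuple[str, int]]],
) -> list[tuple[str, int]]:
    """Hypothetical single-process driver: map, group by key, reduce."""
    # Stand-in for the shuffle phase: collect all values per intermediate key.
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    # Call the reducer once per distinct key, in sorted key order.
    output: list[tuple[str, int]] = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


records = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
counts = run_mapreduce(records, map_fn, reduce_fn)
print(counts)
```

The user supplies only `map_fn` and `reduce_fn`; everything in `run_mapreduce` is the part a framework would take care of, which is the simplicity the answer above refers to.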