A comparison of approaches to large-scale data analysis

Question

1. What are the contributions mentioned in the paper "A comparison of approaches to large-scale data analysis"?

2. How many records are processed in the original MapReduce paper?

3. How long does it take to read the UserVisits and Rankings tables off disk?

4. What is the attractive aspect of the MapReduce programming model?

Accepted Answer

Although the basic control flow of the MapReduce (MR) framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, the authors describe and compare both paradigms. Furthermore, the authors evaluate both kinds of systems in terms of performance and development complexity. For each task, the authors measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes. The authors speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

Accepted Answer

The measurements in the original MapReduce paper are based on processing 1TB of data on approximately 1800 nodes, which works out to 5.6 million records, or roughly 535MB of data, per node.

Accepted Answer

It takes approximately 600 seconds of raw I/O to read the UserVisits and Rankings tables off disk, and then another 300 seconds to split, parse, and deserialize the various attributes.

Accepted Answer

One of the attractive qualities of the MapReduce programming model is its simplicity: an MR program consists of only two functions, called Map and Reduce, that are written by a user to process key/value data pairs.

Accepted Answer

The authors also used MR’s Combine feature to pre-aggregate data before it is transmitted to the Reduce instances, improving the first query’s execution time by a factor of two [8].
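
The effect of a combiner can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' Hadoop code: the record layout (a key and a numeric value), the sum aggregate, and the names map_fn, combine, reduce_fn, and run are all hypothetical. It counts how many key/value pairs would cross the network with and without local pre-aggregation.

```python
from collections import defaultdict


def map_fn(record):
    # Assumed record layout: (key, numeric_value). Emits one pair per record.
    key, value = record
    yield key, value


def combine(pairs):
    """Pre-aggregate pairs locally, on the node that ran the Map task."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_fn(key, values):
    # Final aggregation over every value that reached this key.
    return key, sum(values)


def run(splits, use_combiner):
    """Simulate one MR job; return (results, number of pairs transmitted)."""
    shuffled = defaultdict(list)
    transmitted = 0
    for split in splits:  # each split stands in for one Map task's input
        pairs = [kv for record in split for kv in map_fn(record)]
        if use_combiner:
            pairs = combine(pairs)
        transmitted += len(pairs)  # pairs that would be sent to Reduce
        for key, value in pairs:
            shuffled[key].append(value)
    results = dict(reduce_fn(k, v) for k, v in shuffled.items())
    return results, transmitted


splits = [[("a", 1), ("a", 2), ("b", 3)], [("a", 4), ("b", 5)]]
plain_results, plain_sent = run(splits, use_combiner=False)
combined_results, combined_sent = run(splits, use_combiner=True)
```

Both runs produce the same totals, but the combined run sends at most one pair per key per Map task. This only works because summation is associative and commutative; an aggregate without that property could not be combined this way.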

Accepted Answer

Most programmers are more familiar with object-oriented, imperative programming than with other language technologies, such as SQL.

Accepted Answer

In addition, if an MR system needs 1,000 nodes to match the performance of a 100-node parallel database system, it is ten times more likely that a node will fail while a query is executing.
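
The "ten times" figure can be checked with a short calculation. The sketch below assumes, beyond what the answer states, that nodes fail independently and that each node has the same small probability of failing during the query; the function name p_any_failure and the value 1e-5 are illustrative only.

```python
def p_any_failure(nodes, p_node):
    """Probability that at least one of `nodes` machines fails during a query,
    assuming independent failures with per-node probability `p_node`."""
    return 1.0 - (1.0 - p_node) ** nodes


# Hypothetical per-node failure probability over the duration of one query.
P_NODE = 1e-5

small_cluster = p_any_failure(100, P_NODE)
large_cluster = p_any_failure(1000, P_NODE)
ratio = large_cluster / small_cluster
print(f"100 nodes: {small_cluster:.6f}, 1000 nodes: {large_cluster:.6f}, "
      f"ratio: {ratio:.2f}")
```

Under these assumptions the ratio is slightly below ten and approaches ten as the per-node probability shrinks, since 1 - (1 - p)^n is approximately n * p for small p.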

Accepted Answer

Since parallel DBMSs will be deployed on larger clusters over time, the probability of mid-query hardware failures will increase.

Accepted Answer

Because programmers only need to specify their goal in a high level language, they are not burdened by the underlying storage details, such as indexing options and join strategies.

Accepted Answer

By again separating such constraints from the application and enforcing them automatically by the run time system, as is done by all SQL DBMSs, the integrity of the data is enforced without additional work on the programmer’s behalf.
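
As a small illustration of constraints being enforced by the run time system rather than by application code, the sketch below uses Python's built-in sqlite3 module. SQLite is not a parallel DBMS and is used here only for convenience; the reduced Rankings schema and the non-negative pageRank rule are assumptions made for the example.

```python
import sqlite3


def try_inserts():
    """Declare a constraint once, then let the database reject bad rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema; the CHECK rule is an assumed constraint.
    conn.execute(
        """
        CREATE TABLE Rankings (
            pageURL  TEXT PRIMARY KEY,
            pageRank INTEGER NOT NULL CHECK (pageRank >= 0)
        )
        """
    )
    conn.execute("INSERT INTO Rankings VALUES (?, ?)", ("example.org/a", 10))
    try:
        # Violates the CHECK constraint; no validation code runs in the app.
        conn.execute("INSERT INTO Rankings VALUES (?, ?)", ("example.org/b", -1))
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    row_count = conn.execute("SELECT COUNT(*) FROM Rankings").fetchone()[0]
    conn.close()
    return rejected, row_count


rejected, row_count = try_inserts()
```

The application never checks pageRank itself; the invalid row is refused by the database, and only the valid row is stored.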

Accepted Answer

Given these records, the Reduce function then simply counts the number of values for a given key and outputs the URL and the calculated inlink count as the program’s final output.
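
A minimal single-process sketch of this inlink count follows. Only the Reduce behavior (count the values for a key, output the URL and its count) comes from the answer above; the Map side, the URL regular expression, and the names map_fn, reduce_fn, and inlink_counts are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed, simplified URL pattern; real HTML parsing would be more involved.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def map_fn(doc_url, contents):
    # Assumed Map step: emit (target_url, 1) once per distinct link in a page.
    for target in set(URL_PATTERN.findall(contents)):
        yield target, 1


def reduce_fn(url, values):
    # Count the number of values for the key; output the URL and the count.
    return url, len(values)


def inlink_counts(documents):
    """Run Map over every document, group by key, then Reduce each group."""
    grouped = defaultdict(list)
    for doc_url, contents in documents.items():
        for target, one in map_fn(doc_url, contents):
            grouped[target].append(one)
    return dict(reduce_fn(url, values) for url, values in grouped.items())


documents = {
    "http://a.example/": "see http://b.example/ and http://c.example/",
    "http://c.example/": "link to http://b.example/",
}
counts = inlink_counts(documents)
```

Because the assumed Map step emits each target at most once per page, the count for a URL equals the number of distinct pages that link to it.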

Accepted Answer

The authors initially believed that this would improve CPU-bound tasks, because the Map and Reduce tasks no longer needed to split the fields by the delimiter.

Accepted Answer

As the total number of allocated Map tasks increases, there is additional overhead required for the central job tracker to coordinate node activities.

Accepted Answer

The authors found that enabling compression reduced the execution times for almost all the benchmark tasks by 50%, and thus the authors only report results with compression enabled.

Accepted Answer

The authors found that other data format options, such as SequenceFileInputFormat or custom Writable tuples, resulted in both slower load and execution times.

Accepted Answer

Because all of their benchmarks are read-only, the authors did not enable replication features in DBMS-X, since this would not have improved performance and would have complicated the installation process.

Accepted Answer

To measure the basic performance without the overhead of coordinating parallel tasks, the authors first execute each task on a single node.

Most frequently asked questions

1. What are the contributions mentioned in the paper "A comparison of approaches to large-scale data analysis"?

2. How many records are processed in the original MapReduce paper?

3. How long does it take to read the UserVisits and Rankings tables off disk?

4. What is the attractive aspect of the MapReduce programming model?

5. How did the authors pre-aggregate data before it was transmitted to the Reduce instances?

6. What language is more familiar to programmers than SQL?

7. How many nodes does an MR system need to perform a query?

8. What is the probability of mid-query hardware failures in parallel DBMSs?

9. Why do programmers need to specify their goal in a high level language?

10. What is the approach to enforce the integrity of data?

11. What is the function that calculates the inlink count for a given key?

12. Why did the authors initially think that block-level compression would improve the performance of the Map and Reduce?

13. What is the main reason why the central job tracker is required to coordinate node activities?

14. How did the authors find that compression reduced the execution times for almost all the benchmark tasks?

15. What other data format options resulted in slower load and execution times?

16. Why did the authors not enable replication in DBMS-X?

17. How do the authors measure the basic performance without the overhead of coordinating parallel tasks?

Citations

Can the Elephants Handle the NoSQL Onslaught

Benchmarking Big Data Systems: A Review

HSim: A MapReduce simulator in enabling Cloud Computing

epiC: an extensible and scalable system for processing big data

Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help?

References

MapReduce: simplified data processing on large clusters

The Google file system

Dryad: distributed data-parallel programs from sequential building blocks

SCOPE: easy and efficient parallel processing of massive data sets
