Proceedings Article10.1145/1807128.1807139
Comet: batched stream processing for data intensive distributed computing
Bingsheng He,Mao Yang,Zhenyu Guo,Rishan Chen,Bing Su,Wei Lin,Lidong Zhou +6 more
- 10 Jun 2010
- pp 63-74
TL;DR: A query processing system called Comet is developed that embraces batched stream processing and integrates with DryadLINQ, and when applied to a real production trace covering over 19 million machine-hours shows an estimated I/O saving of over 50%.
read more
Abstract: Batched stream processing is a new distributed data processing paradigm that models recurring batch computations on incrementally bulk-appended data streams. The model is inspired by our empirical study on a trace from a large-scale production data-processing cluster; it allows a set of effective query optimizations that are not possible in a traditional batch processing model.We have developed a query processing system called Comet that embraces batched stream processing and integrates with DryadLINQ. We used two complementary methods to evaluate the effectiveness of optimizations that Comet enables. First, a prototype system deployed on a 40-node cluster shows an I/O reduction of over 40% using our benchmark. Second, when applied to a real production trace covering over 19 million machine-hours, our simulator shows an estimated I/O saving of over 50%.
read more
Chat with Paper
AI Agents for this Paper
Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps
Citations
•Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia,Mosharaf Chowdhury,Tathagata Das,Ankur Dave,Justin Ma,Murphy McCauley,Michael J. Franklin,Scott Shenker,Ion Stoica +8 more
- 25 Apr 2012
TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Mesos: a platform for fine-grained resource sharing in the data center
Benjamin Hindman,Andy Konwinski,Matei Zaharia,Ali Ghodsi,Anthony D. Joseph,Randy H. Katz,Scott Shenker,Ion Stoica +7 more
- 30 Mar 2011
TL;DR: The results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
Discretized streams: fault-tolerant streaming computation at scale
Matei Zaharia,Tathagata Das,Haoyuan Li,Timothy Hunter,Scott Shenker,Ion Stoica +5 more
- 03 Nov 2013
TL;DR: D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers, and can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes.
•Proceedings Article
Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters
Matei Zaharia,Tathagata Das,Haoyuan Li,Scott Shenker,Ion Stoica +4 more
- 12 Jun 2012
TL;DR: D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster.
Distributed data stream processing and edge computing
TL;DR: This work describes how existing solutions exploit resource elasticity features of cloud computing in stream processing and presents a gap analysis and future directions on stream processing on heterogeneous environments.
323
References
MapReduce: simplified data processing on large clusters
Jeffrey Dean,Sanjay Ghemawat +1 more
- 06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
MapReduce: simplified data processing on large clusters
Jeffrey Dean,Sanjay Ghemawat +1 more
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
The Google file system
Sanjay Ghemawat,Howard Gobioff,Shun-Tak Albert Leung +2 more
- 19 Oct 2003
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Models and issues in data stream systems
Brian Babcock,Shivnath Babu,Mayur Datar,Rajeev Motwani,Jennifer Widom +4 more
- 03 Jun 2002
TL;DR: The need for and research issues arising from a new model of data processing, where data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams are motivated.
Dryad: distributed data-parallel programs from sequential building blocks
Michael Isard,Mihai Budiu,Yuan Yu,Andrew Birrell,Dennis Fetterly +4 more
- 21 Mar 2007
TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.