Stream processing

Papers published on a yearly basis

Papers

Proceedings Article

The Design of the Borealis Stream Processing Engine

Daniel J. Abadi¹, Yanif Ahmad², Magdalena Balazinska¹, Mitch Cherniack³, Jeong-Hyon Hwang², Wolfgang Lindner¹, Anurag S. Maskey³, Alexander Rasin², Esther Ryvkina³, Nesime Tatbul², Ying Xing², Stan Zdonik² • Institutions (3)

Massachusetts Institute of Technology¹, Brown University², Brandeis University³

1 Jan 2005

TL;DR: This paper outlines the basic design and functionality of Borealis, and presents a highly flexible and scalable QoS-based optimization model that operates across server and sensor networks and a new fault-tolerance model with flexible consistency-availability trade-offs.

Abstract: Borealis is a second-generation distributed stream processing engine that is being developed at Brandeis University, Brown University, and MIT. Borealis inherits core stream processing functionality from Aurora [14] and distribution functionality from Medusa [51]. Borealis modifies and extends both systems in non-trivial and critical ways to provide advanced capabilities that are commonly required by newly-emerging stream processing applications. In this paper, we outline the basic design and functionality of Borealis. Through sample real-world applications, we motivate the need for dynamically revising query results and modifying query specifications. We then describe how Borealis addresses these challenges through an innovative set of features, including revision records, time travel, and control lines. Finally, we present a highly flexible and scalable QoS-based optimization model that operates across server and sensor networks and a new fault-tolerance model with flexible consistency-availability trade-offs.

1,612 citations

Proceedings Article • 10.1145/2517349.2522737

Discretized streams: fault-tolerant streaming computation at scale

Matei Zaharia¹, Tathagata Das¹, Haoyuan Li¹, Timothy Hunter¹, Scott Shenker¹, Ion Stoica¹ • Institutions (1)

University of California, Berkeley¹

3 Nov 2013

TL;DR: D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes and tolerates stragglers. They can also be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes.

Abstract: Many "big data" applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-Streams), that overcomes these challenges. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. We show that they support a rich set of operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. We implement D-Streams in a system called Spark Streaming.
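
The discretized-stream idea in this abstract can be illustrated with a small sketch. It is not Spark Streaming's API: the function names `discretize` and `running_counts`, the `(timestamp, value)` event shape, and the one-second interval are assumptions made for illustration. The sketch cuts a stream into fixed-interval batches and applies a deterministic batch computation to each one, so any lost result could be recomputed from the same input batches.

```python
from collections import Counter, defaultdict


def discretize(events, interval):
    """Group (timestamp, value) events into consecutive fixed-interval batches.

    Hypothetical helper: intervals with no events become empty batches.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    return [batches.get(i, []) for i in range(first, last + 1)]


def running_counts(batches):
    """Apply a deterministic per-batch update and keep one state snapshot per interval."""
    state = Counter()
    snapshots = []
    for batch in batches:
        # Each new state depends only on the previous state and this batch,
        # so it can be rebuilt by replaying the same batches.
        state = state + Counter(batch)
        snapshots.append(dict(state))
    return snapshots


events = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (3.0, "c")]
batches = discretize(events, interval=1.0)
snapshots = running_counts(batches)
```

Because each snapshot is a pure function of earlier batches, rerunning `running_counts(batches)` yields the same snapshots, which is the property that recomputation-based recovery relies on in this simplified picture.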

1,142 citations

Journal Article • 10.1145/2187671.2187677

Processing flows of information: From data stream to complex event processing

Gianpaolo Cugola¹, Alessandro Margara¹•Institutions (1)

Polytechnic University of Milan¹

14 Jun 2012 - ACM Computing Surveys

TL;DR: This survey proposes a general, unifying model that captures the different aspects of an information flow processing (IFP) system and uses it to provide a complete and precise classification of the systems and mechanisms proposed so far.

Abstract: A large number of distributed applications require continuous and timely processing of information as it flows from the periphery to the center of the system. Examples include intrusion detection systems which analyze network traffic in real-time to identify possible attacks; environmental monitoring applications which process raw data coming from sensor networks to identify critical situations; or applications performing online analysis of stock prices to identify trends and forecast future values. Traditional DBMSs, which need to store and index data before processing it, can hardly fulfill the requirements of timeliness coming from such domains. Accordingly, during the last decade, different research communities developed a number of tools, which we collectively call Information flow processing (IFP) systems, to support these scenarios. They differ in their system architecture, data model, rule model, and rule language. In this article, we survey these systems to help researchers, who often come from different backgrounds, in understanding how the various approaches they adopt may complement each other. In particular, we propose a general, unifying model to capture the different aspects of an IFP system and use it to provide a complete and precise classification of the systems and mechanisms proposed so far.

1,000 citations

Proceedings Article • 10.1145/1142473.1142520

High-performance complex event processing over streams

Eugene Wu¹, Yanlei Diao², Shariq Rizvi³•Institutions (3)

University of California, Berkeley¹, University of Massachusetts Amherst², Google³

27 Jun 2006

TL;DR: This paper proposes a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications and describes a query plan-based approach to efficiently implementing this language.

Abstract: In this paper, we present the design, implementation, and evaluation of a system that executes complex event queries over real-time streams of RFID readings encoded as events. These complex event queries filter and correlate events to match specific patterns, and transform the relevant events into new composite events for the use of external monitoring applications. Stream-based execution of these queries enables time-critical actions to be taken in environments such as supply chain management, surveillance and facility management, healthcare, etc. We first propose a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications. We then describe a query plan-based approach to efficiently implementing this language. Our approach uses native operators to efficiently handle query-defined sequences, which are a key component of complex event processing, and pipeline such sequences to subsequent operators that are built by leveraging relational techniques. We also develop a large suite of optimization techniques to address challenges such as large sliding windows and intermediate result sizes. We demonstrate the effectiveness of our approach through a detailed performance analysis of our prototype implementation under a range of data and query workloads as well as through a comparison to a state-of-the-art stream processor.
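
As a rough illustration of the sequence queries described in this abstract, the sketch below matches one event kind followed by another for the same RFID tag within a sliding time window. It is a simplified stand-in, not the paper's event language or query plan: the `Event` fields, the `match_sequence` function, and the `SHELF`/`EXIT` event kinds are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque, namedtuple

# Hypothetical event shape: kind of reading, RFID tag identifier, timestamp in seconds.
Event = namedtuple("Event", ["kind", "tag_id", "ts"])


def match_sequence(events, first_kind, second_kind, window):
    """Return (a, b) pairs where a.kind == first_kind arrives before
    b.kind == second_kind, both share a tag_id, and b.ts - a.ts <= window.
    """
    pending = deque()  # first_kind events still inside the window
    matches = []
    for event in events:
        # Slide the window: drop candidates that are too old to match anything.
        while pending and event.ts - pending[0].ts > window:
            pending.popleft()
        if event.kind == second_kind:
            matches.extend((a, event) for a in pending if a.tag_id == event.tag_id)
        if event.kind == first_kind:
            pending.append(event)
    return matches


readings = [
    Event("SHELF", "t1", 0),
    Event("SHELF", "t2", 1),
    Event("EXIT", "t1", 3),
    Event("EXIT", "t2", 20),
]
pairs = match_sequence(readings, "SHELF", "EXIT", window=5)
```

Here only tag `t1` produces a match, because the `EXIT` reading for `t2` arrives after its `SHELF` reading has left the five-second window. Evicting expired candidates as events arrive keeps the pending state bounded by the window size rather than by the stream length.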

954 citations

Book Chapter • 10.1016/B978-155860869-6/50038-X

Chapter 31 – Approximate Frequency Counts over Data Streams

Gurmeet Singh Manku¹, Rajeev Motwani¹•Institutions (1)

Stanford University¹

1 Jan 2002

TL;DR: This chapter presents simple algorithms, with provably small memory footprints, for computing frequency counts exceeding a user-specified threshold over data streams.

Abstract: This chapter presents algorithms for computing frequency counts exceeding a user-specified threshold over data streams. The algorithms are simple and have provably small memory footprints. Although the output is approximate, the error is guaranteed not to exceed a user-specified parameter. The algorithms can easily be deployed for streams of singleton items like those found in IP network monitoring. In several emerging applications, data takes the form of continuous data streams, as opposed to finite stored datasets. Examples include stock tickers, network traffic measurements, Web-server logs, click streams, data feeds from sensor networks, and telecom call records. Stream processing differs from computation over traditional stored datasets in two important aspects: (a) the sheer volume of a stream over its lifetime could be huge, and (b) queries require timely answers; response times should be small. Therefore, it is not possible to store the stream in its entirety on secondary storage and scan it when a query arrives.
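
One way to picture an approximate frequency counter of this kind is the Lossy Counting scheme, sketched below under stated assumptions. The class name `LossyCounter` and its methods are hypothetical, `epsilon` stands for the user-specified error parameter and `support` for the user-specified threshold; this is a minimal single-item version, not the chapter's full treatment.

```python
import math


class LossyCounter:
    """Minimal sketch of Lossy Counting for single items (names are illustrative)."""

    def __init__(self, epsilon):
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be in (0, 1)")
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # items per bucket
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max possible undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        entry = self.entries.get(item)
        if entry is not None:
            entry[0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # At each bucket boundary, drop items that cannot be frequent.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


counter = LossyCounter(epsilon=0.1)
for i in range(100):
    counter.add("hot" if i % 2 == 0 else f"cold-{i}")
result = counter.frequent(support=0.3)
```

In this run the item `hot` makes up half the stream and survives every pruning step, while each one-off `cold-*` item is discarded at the end of its bucket, so memory stays small even though the stream itself is never stored. Stored counts can undercount the true frequency by at most `epsilon * n`.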

936 citations

Year	Papers
2026	1
2025	52
2024	63
2023	91
2022	195
2021	142
