Random Sampling for Continuous Streams with Arbitrary Updates
TL;DR: This work develops several fully dynamic algorithms for obtaining random samples from individual relations, and from the join result of two tables, that can handle any update pattern with small space and computational overhead.
Abstract: The existing random sampling methods have at least one of the following disadvantages: they 1) are applicable only to certain update patterns, 2) entail large space overhead, or 3) incur prohibitive maintenance cost. These drawbacks prevent their effective application in stream environments (where a relation is updated by a large volume of insertions and deletions that may arrive in any order), despite the considerable success of random sampling in conventional databases. Motivated by this, we develop several fully dynamic algorithms for obtaining random samples from individual relations, and from the join result of two tables. Our solutions can handle any update pattern with small space and computational overhead. We also present an in-depth analysis that provides valuable insight into the characteristics of alternative sampling strategies and leads to precision guarantees. Extensive experiments validate our theoretical findings and demonstrate the efficiency of our techniques in practice.
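For context on the first disadvantage, the sketch below shows classic reservoir sampling, a well-known baseline that covers only one update pattern: insertions. It is not one of the paper's algorithms, which the abstract does not spell out. The function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an insert-only stream.

    Baseline illustration only (assumed names); it has no way to process
    deletions, which is the gap that fully dynamic sampling addresses.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the n-th item with probability k/n by drawing a slot
            # uniformly from [0, n) and replacing only if it falls in [0, k).
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample
```

A call such as `reservoir_sample(range(1000), 10)` returns ten distinct values from the stream. Because the reservoir only ever grows or swaps items, a deletion arriving in the stream cannot be reflected without extra bookkeeping.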
Citations
Fast classification for large data sets via random selection clustering and Support
Xiaoou Li,Jair Cervantes,Wen Yu +2 more
- 01 Jan 2012
TL;DR: Experimental results demonstrate that the proposed SVM classification approach has good classification accuracy while the training is significantly faster than other SVM classifiers.
Sharable file searching in unstructured Peer-to-peer systems
TL;DR: This work develops several fully dynamic algorithms for searching for shared files in unstructured peer-to-peer systems that can handle any topology pattern with small search time and computational overhead.
Efficient sampling of non-strict turnstile data streams
Neta Barkay,Ely Porat,Bar Shalem +2 more
TL;DR: The sampling method improves by an order of magnitude the known processing time of each stream element, a crucial factor in data stream applications, thereby providing a feasible solution to the sampling problem.
Efficient sampling of non-strict turnstile data streams
Neta Barkay,Ely Porat,Bar Shalem +2 more
- 19 Aug 2013
TL;DR: The sample is useful for approximating both forward and inverse distribution statistics, within an additive error ε and provable success probability 1−δ, and the structure enables the sampling algorithm to run on distributed systems and extract statistics on the difference between streams.
Modified histogram: a spatio-temporal aggregate index for moving objects in road networks
Jun Feng,Zhonghua Zhu +1 more
TL;DR: Evaluation shows that this new histogram method outperforms the traditional sketch method not only in sketch generation time, query time, and memory requirements, but also in the approximation error of aggregate queries, which stays within an appropriate range.
References
Random sampling with a reservoir
TL;DR: Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin, and an efficient Pascal-like implementation is given that incorporates these modifications and that is suitable for general use.
New sampling-based summary statistics for improving approximate query answers
Phillip B. Gibbons,Yossi Matias +1 more
- 01 Jun 1998
TL;DR: This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution, and considers their application to providing fast approximate answers to hot list queries.
Sampling from a moving window over streaming data
Brian Babcock,Mayur Datar,Rajeev Motwani +2 more
- 06 Jan 2002
TL;DR: This work introduces the problem of sampling from a moving window of recent items from a data stream and develops two algorithms, the first of which, "chain-sample", extends reservoir sampling to deal with the expiration of data elements from the sample and the second, "priority-sample", works even when the number of elements in the window can vary dynamically over time.
On random sampling over joins
Surajit Chaudhuri,Rajeev Motwani,Vivek Narasayya +2 more
- 01 Jun 1999
TL;DR: A detailed study of the inefficiency of sampling the output of a query, based on new insights into the interaction between join and sampling, and develops join sampling techniques for the settings where negative results do not apply.
Dynamic sample selection for approximate query processing
Brian Babcock,Surajit Chaudhuri,Gautam Das +2 more
- 09 Jun 2003
TL;DR: In this article, an approximate query processing technique that dynamically constructs an appropriately biased sample for each query by combining samples selected from a family of non-uniform samples that are constructed during a pre-processing phase is described.