Reservoir sampling

Papers

Journal Article•10.1145/3147.3165•

Random sampling with a reservoir

Jeffrey Scott Vitter¹•Institutions (1)

Brown University¹

01 Mar 1985-ACM Transactions on Mathematical Software

TL;DR: Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin, and an efficient Pascal-like implementation is given that incorporates these modifications and that is suitable for general use.

Abstract: We introduce fast algorithms for selecting a random sample of n records without replacement from a pool of N records, where the value of N is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in O(n(1 + log(N/n))) expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.

2,046 citations
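
For orientation, the one-pass, fixed-memory setting described in the abstract above can be sketched with the simplest reservoir scheme (often called Algorithm R). This is not Algorithm Z from the paper: it draws one random number per record and so runs in O(N) time rather than the O(n(1 + log(N/n))) expected time quoted above. The function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], n: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform sample of up to n items from a stream of unknown length.

    Basic reservoir scheme (Algorithm R), NOT the faster Algorithm Z.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for t, item in enumerate(stream):
        if t < n:
            # The first n records fill the reservoir directly.
            reservoir.append(item)
        else:
            # Record t (0-based) replaces a random slot with probability n / (t + 1).
            j = rng.randrange(t + 1)
            if j < n:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5))
```

The reservoir never holds more than n records, and the stream is read exactly once, so the total length N does not need to be known in advance.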

Proceedings Article•10.5555/545381.545465•

Sampling from a moving window over streaming data

Brian Babcock¹, Mayur Datar¹, Rajeev Motwani¹•Institutions (1)

Stanford University¹

6 Jan 2002

TL;DR: This work introduces the problem of sampling from a moving window of recent items from a data stream and develops two algorithms: "chain-sample", which extends reservoir sampling to deal with the expiration of data elements from the sample, and "priority-sample", which works even when the number of elements in the window can vary dynamically over time.

Abstract: We introduce the problem of sampling from a moving window of recent items from a data stream and develop two algorithms for this problem. The first algorithm, "chain-sample", extends reservoir sampling to deal with the expiration of data elements from the sample. The expected memory usage of our algorithm is O(k) when maintaining a sample of size k over a window of the n most recent elements from the data stream, and with high probability the algorithm requires no more than O(k log n) memory. When the number of elements in the window is variable, as is the case when the size of the window is defined as a time duration rather than as a fixed number of data elements, the sampling problem becomes harder. Our second algorithm, "priority-sample", works even when the number of elements in the window can vary dynamically over time. With high probability, the "priority-sample" algorithm uses no more than O(k log n) memory.

425 citations
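
The priority idea mentioned in the abstract above can be illustrated, for a sample of size 1 over a time-based window, roughly as follows. This is a minimal sketch and not the authors' code: each arriving element gets a random priority, and only elements that could still become the highest-priority unexpired element are stored. The class name `PriorityWindowSampler`, its methods, and the assumption that timestamps arrive in non-decreasing order are all illustrative.

```python
import random
from collections import deque
from typing import Deque, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class PriorityWindowSampler(Generic[T]):
    """Keep one uniform sample over the elements of the last `window` time units."""

    def __init__(self, window: float, rng: Optional[random.Random] = None) -> None:
        self.window = window
        self.rng = rng or random.Random()
        # (timestamp, priority, item); priorities decrease from front to back.
        self._candidates: Deque[Tuple[float, float, T]] = deque()

    def _expire(self, now: float) -> None:
        # Drop candidates that have fallen out of the window.
        while self._candidates and self._candidates[0][0] <= now - self.window:
            self._candidates.popleft()

    def add(self, timestamp: float, item: T) -> None:
        priority = self.rng.random()
        # An older element with a lower priority can never be the sample again,
        # because the new element outlives it and outranks it.
        while self._candidates and self._candidates[-1][1] < priority:
            self._candidates.pop()
        self._candidates.append((timestamp, priority, item))
        self._expire(timestamp)

    def sample(self, now: float) -> Optional[T]:
        """Return the highest-priority unexpired item, or None if the window is empty."""
        self._expire(now)
        return self._candidates[0][2] if self._candidates else None
```

Because priorities are independent and identically distributed, the highest-priority element in the window is uniformly distributed over the window's elements. A sample of size k could be obtained by running k independent copies, at the cost of sampling with replacement.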

Journal Article•10.1016/J.IPL.2005.11.003•

Weighted random sampling with a reservoir

Pavlos S. Efraimidis¹, Paul G. Spirakis²•Institutions (2)

Democritus University of Thrace¹, Research Academic Computer Technology Institute²

16 Mar 2006-Information Processing Letters

TL;DR: A new algorithm is presented for drawing a weighted random sample of size m from a population of n weighted items, where m ≤ n; it can generate a weighted random sample in one pass over unknown populations.

422 citations
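
A one-pass weighted sample of size m can be sketched with random keys. The summary above does not spell out the method, so treat the key rule used here (key = u^(1/w) for a uniform u in [0, 1) and weight w, keeping the m largest keys) as an assumption based on how this scheme is commonly described, not as a transcription of the paper. The function name `weighted_reservoir_sample` and the choice to skip non-positive weights are illustrative.

```python
import heapq
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def weighted_reservoir_sample(
    items: Iterable[Tuple[T, float]], m: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Draw up to m items without replacement in one pass over (item, weight) pairs."""
    if m <= 0:
        return []
    rng = rng or random.Random()
    # Min-heap of (key, arrival index, item); the index breaks key ties so that
    # items themselves never need to be comparable.
    heap: List[Tuple[float, int, T]] = []
    for i, (item, weight) in enumerate(items):
        if weight <= 0:
            continue  # assumption: items with non-positive weight are never sampled
        key = rng.random() ** (1.0 / weight)
        if len(heap) < m:
            heapq.heappush(heap, (key, i, item))
        elif key > heap[0][0]:
            # The new key beats the smallest kept key, so it takes that slot.
            heapq.heapreplace(heap, (key, i, item))
    return [item for _, _, item in heap]


if __name__ == "__main__":
    data = [("a", 1.0), ("b", 2.0), ("c", 10.0), ("d", 0.5)]
    print(weighted_reservoir_sample(data, 2))
```

Only m keyed items are held at any time, so the population size n does not need to be known before the pass starts.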

Journal Article•10.26599/BDMA.2019.9020015•

A survey of data partitioning and sampling methods to support big data analysis

Mohammad Sultan Mahmud¹, Joshua Zhexue Huang¹, Salman Salloum¹, Tamer Z. Emara¹, Kuanishbay Sadatdiynov¹•Institutions (1)

Shenzhen University¹

27 Feb 2020

TL;DR: It is believed that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.

Abstract: Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.

256 citations

Journal Article•10.1007/BF00140664•

Random sampling from databases: a survey

Frank Olken¹, Doron Rotem¹,²•Institutions (2)

Lawrence Berkeley National Laboratory¹, San Jose State University²

01 Mar 1995-Statistics and Computing

TL;DR: This paper reviews recent literature on techniques for obtaining random samples from databases, and describes sampling for estimation of aggregates (e.g. the size of query results).

Abstract: This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g. acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g. single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g. the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are reviewed.

194 citations

Performance Metrics

No. of papers in the topic in previous years
Year	Papers
2021	7
2020	11
2019	19
2018	10
2017	2
2016	8