Exact PPS sampling with bounded sample size

Q: What is the cost of processing items in DOWNSAMPLE?

The total cost of processing items in DOWNSAMPLE is O(D_t), where D_t is the number of elements discarded from the latent sample when processing items x_1, ..., x_t. The number of elements discarded cannot exceed the number of elements inserted into the latent sample, so this cost is O(t) in total. The maximum cost to process a single item is Θ(n), which occurs when the sample has reached its maximum size of n items and a new heavy item forces many of them out. This cost is not prohibitive, because producing an output sample also costs Θ(n).
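The counting argument behind this bound can be written out as a short derivation. It is only a sketch and rests on two assumptions not stated above: each arriving item is inserted into the latent sample at most once, and discarding one element takes constant time. The symbol d_s (the number of elements discarded while processing x_s) is introduced here purely for illustration.

```latex
\begin{align*}
D_t &= \sum_{s=1}^{t} d_s
     \;\le\; \#\{\text{elements inserted while processing } x_1,\dots,x_t\}
     \;\le\; t, \\
\text{total DOWNSAMPLE cost} &= O(D_t) = O(t), \\
\text{amortized cost per item} &= O(D_t / t) = O(1), \\
\text{worst case for a single item} &= \Theta(n).
\end{align*}
```

Under these assumptions, an expensive step that discards many elements is paid for by the earlier cheap steps that inserted them.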

Q: What is the main advantage of the new weighted random sampling scheme?

The main advantage of the new weighted random sampling scheme is that it prioritizes the PPS property over maintaining a fixed sample size, which expands the set of known unequal-probability sampling schemes. The scheme enforces an upper bound on the sample size while keeping the size as large and stable as possible. It is also simple to implement, works in one pass over data streams, and has the best known amortized execution cost per round. Future work will consider extremely large samples that do not fit in memory, combining ideas from this paper with the notion of a 'geometric file' introduced by Pol et al. (2008).

Question

1. What is the goal of EB-PPS sampling?

2. What is the PPS property of Algorithm 1 (EB-PPS) and how is it established?

3. What is the cost of processing items in DOWNSAMPLE?

4. What is the main advantage of the new weighted random sampling scheme?

Accepted Answer

The goal of EB-PPS sampling is to ensure that the appearance probability of each item is proportional to its weight at all times, and to control the sample size to have maximal expected value and minimal variance relative to all possible bounded PPS samples. The sample size may take on fractional values, and a randomized procedure is used to deliver an integer-sized sample to the user. EB-PPS maintains a data structure called a 'latent sample' to extract actual samples on demand. The algorithm allows latent samples to decrease in size, enabling exact PPS sampling.
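To make the idea of a fractional sample size concrete, here is a minimal Python sketch of randomized rounding: a target size c is turned into an integer that is either floor(c) or ceil(c), with probabilities chosen so that the expected value equals c. This illustrates only the size distribution, not the paper's latent-sample data structure; the function name `randomized_size` and the example value 1.875 are assumptions made for illustration.

```python
import math
import random


def randomized_size(c: float, rng: random.Random) -> int:
    """Round a fractional target size c to floor(c) or ceil(c) with mean c."""
    if c < 0:
        raise ValueError("target size must be non-negative")
    base = math.floor(c)
    frac = c - base  # probability of rounding up
    # rng.random() is uniform on [0, 1), so this is true with probability frac.
    return base + (1 if rng.random() < frac else 0)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility
    draws = [randomized_size(1.875, rng) for _ in range(20000)]
    print(sorted(set(draws)))       # only the values 1 and 2 can occur
    print(sum(draws) / len(draws))  # empirical mean, close to 1.875
```

Among integer-valued distributions with mean c, placing all the mass on the two neighbouring integers gives the smallest variance, which matches the minimal-variance goal described above.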

Accepted Answer

The PPS (probability proportional to size) property of Algorithm 1 (EB-PPS) is established by a theorem stating that, for all $t \ge 1$ and every item $x_i$ with $1 \le i \le t$, we have $\Pr(x_i \in S_t) = r_t w_i$, where $r_t = \min\left(1 / \max_{1 \le i \le t} w_i,\; n / \sum_{i=1}^{t} w_i\right)$.

The proof is by induction. For $t = 1$, the algorithm sets $r' = r_1 = 1/w_1$. It is shown that $\Pr(x_1 \in S_1) = \Pr(x_1 \in S'')$, so item $x_1$ is included in $S_1$ with probability $r_1 w_1 = 1$. For $t > 1$, the induction hypothesis is applied and it is shown that $\Pr(x_t \in S'') = r_t w_t$, where $S''$ is a sample extracted from $T'$; the properties of the UNION function then give the desired result. The downsampling operation on Line 7 is allowed since $r_t / r_{t-1} \le 1$ by positivity of the weights, and the downsampling operation on Line 9 is allowed since $r_t w_t \le w_t / \max_{1 \le i \le t} w_i \le 1$. This proves the first assertion of the theorem.

To prove the second assertion, observe that $E|S_t| = \sum_{i=1}^{t} \Pr(x_i \in S_t) = \sum_{i=1}^{t} r_t w_i \le n$ by the first assertion and the definition of $r_t$. This implies that the corresponding latent sample $L_t$ contains at most $n$ items, and thus any sample extracted from $L_t$ can contain at most $n$ items.

The following two theorems show that when EB-PPS sampling produces a sample of size less than $n$, the expected sample size is the maximum possible under the PPS constraint in (1), and the sample-size variance is the minimum possible given maximal expected size. Theorem 2 states that if $E|S_t| < n$, then $E|S_t^H| \le E|S_t|$ (here $S_t^H$ presumably denotes the sample produced by another bounded PPS scheme $H$). The variance result is proved by considering all possible distributions over the sample size with mean value $C_t$ and showing that the variance is minimized by concentrating all of the probability mass onto $\lfloor C_t \rfloor$ and $\lceil C_t \rceil$. This is the sample-size distribution attained by EB-PPS.
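The formula for r_t can be checked numerically. The sketch below computes r_t = min(1 / max w_i, n / sum w_i) and the resulting inclusion probabilities r_t w_i for a small list of weights. It is not the EB-PPS algorithm itself, only the closed-form quantities from the theorem; the function names and the example weights are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def pps_rate(weights: Sequence[float], n: int) -> float:
    """Return r_t = min(1 / max(w), n / sum(w)) for positive weights."""
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and strictly positive")
    return min(1.0 / max(weights), n / sum(weights))


def inclusion_probabilities(weights: Sequence[float], n: int) -> List[float]:
    """Return Pr(x_i in S_t) = r_t * w_i for every item seen so far."""
    r = pps_rate(weights, n)
    return [r * w for w in weights]


def prefix_rates(weights: Sequence[float], n: int) -> List[float]:
    """Return r_1, ..., r_t, which should never increase as items arrive."""
    return [pps_rate(weights[:t], n) for t in range(1, len(weights) + 1)]


if __name__ == "__main__":
    w = [1, 2, 4, 8]  # hypothetical weights
    probs = inclusion_probabilities(w, n=2)
    print(probs)                # [0.125, 0.25, 0.5, 1.0]
    print(sum(probs))           # 1.875, the expected sample size, below n = 2
    print(prefix_rates(w, 2))   # [1.0, 0.5, 0.25, 0.125]
    # With n = 1 the bound n / sum(w) is the binding term and E|S_t| = n.
    print(math.isclose(sum(inclusion_probabilities(w, 1)), 1.0))  # True
```

In this example every probability is at most 1 and the probabilities sum to at most n, as the two assertions of the theorem require. With n = 2 the heaviest item caps the rate, so the expected sample size stays strictly below n.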


