1. What is the goal of EB-PPS sampling?
The goal of EB-PPS sampling is to ensure that the appearance probability of each item is proportional to its weight at all times, and to control the sample size to have maximal expected value and minimal variance relative to all possible bounded PPS samples. The sample size may take on fractional values, and a randomized procedure is used to deliver an integer-sized sample to the user. EB-PPS maintains a data structure called a 'latent sample' to extract actual samples on demand. The algorithm allows latent samples to decrease in size, enabling exact PPS sampling.
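To illustrate how a fractional sample size can be turned into an integer-sized sample, the sketch below assumes a latent sample is stored as a list of fully included items, at most one partially included item, and a real-valued size `c`. That representation and the names `extract_sample`, `full_items`, and `partial_item` are assumptions made for illustration; the answer above does not specify how the latent sample is laid out.

```python
import math
import random


def extract_sample(full_items, partial_item, c, rng=random):
    """Draw an integer-sized sample from a hypothetical latent sample.

    full_items   -- items that are always part of the extracted sample
    partial_item -- at most one item included only some of the time (or None)
    c            -- real-valued sample size; its fractional part is the
                    probability of including partial_item (an assumption)
    """
    frac = c - math.floor(c)
    sample = list(full_items)
    # Include the partial item with probability frac, so the extracted
    # sample has size floor(c) or floor(c) + 1 and expected size c.
    if partial_item is not None and rng.random() < frac:
        sample.append(partial_item)
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [extract_sample(["a", "b"], "c", 2.4, rng) for _ in range(10000)]
    mean_size = sum(len(s) for s in draws) / len(draws)
    print("average extracted size:", mean_size)  # close to 2.4
```

Under these assumptions the extracted size only ever takes the two integers adjacent to `c`, which is the low-variance behaviour the answer describes.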
2. What is the PPS property of Algorithm 1 (EB-PPS) and how is it established?
The PPS (probability proportional to size) property of Algorithm 1 (EB-PPS) is established by a theorem stating that, for all $t \ge 1$ and every item $x_i$ with $1 \le i \le t$, we have $\Pr(x_i \in S_t) = \rho_t w_i$, where $\rho_t = \min\left(1/\max_{1 \le i \le t} w_i,\; n/\sum_{i=1}^{t} w_i\right)$.

The proof is by induction on $t$. For $t = 1$, the algorithm sets $\rho' = \rho_1 = 1/w_1$. It is shown that $\Pr(x_1 \in S_1) = \Pr(x_1 \in S'')$, and thus item $x_1$ is included in $S_1$ with probability $\rho_1 w_1 = 1$. For $t > 1$, the induction hypothesis is applied, and it is shown that $\Pr(x_t \in S'') = \rho_t w_t$, where $S''$ is a sample extracted from $T'$. The properties of the UNION function then give the desired result. The downsampling operation on Line 7 is allowed since $\rho_t/\rho_{t-1} \le 1$ by positivity of the weights, and the downsampling operation on Line 9 is allowed since $\rho_t w_t \le w_t/\max_{1 \le i \le t} w_i \le 1$. This proves the first assertion of the theorem.

To prove the second assertion, observe that $\mathrm{E}|S_t| = \sum_{i=1}^{t} \Pr(x_i \in S_t) = \sum_{i=1}^{t} \rho_t w_i \le n$ by the first assertion and the definition of $\rho_t$. This implies that the corresponding latent sample $L_t$ contains at most $n$ items, and thus any sample extracted from $L_t$ can contain at most $n$ items.

Two further theorems show that when EB-PPS sampling produces a sample of size less than $n$, the expected sample size is the maximum possible under the PPS constraint in (1), and the sample-size variance is the minimum possible given maximal expected size. Theorem 2 states that if $\mathrm{E}|S_t| < n$, then $\mathrm{E}|S_t^H| \le \mathrm{E}|S_t|$, where $S_t^H$ is the sample produced by any other bounded PPS scheme $H$. The minimum-variance proof considers all possible distributions over the sample size with mean equal to $C_t$ and shows that the variance is minimized by concentrating all of the probability mass on $\lfloor C_t \rfloor$ and $\lceil C_t \rceil$. This is the sample-size distribution attained by EB-PPS.
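The formula for $\rho_t$ can be checked numerically. The sketch below computes $\rho_t$ and the resulting inclusion probabilities $\rho_t w_i$ for a list of weights, and also the variance of a sample size that takes only the values $\lfloor C \rfloor$ and $\lceil C \rceil$ with mean $C$. The function names `rho`, `inclusion_probabilities`, and `two_point_variance` are hypothetical helpers introduced here. This is not the streaming algorithm itself, only the closed-form quantities from the theorem.

```python
import math


def rho(weights, n):
    """rho_t = min(1 / max_i w_i, n / sum_i w_i) for positive weights."""
    if n <= 0 or not weights or min(weights) <= 0:
        raise ValueError("need n > 0 and a non-empty list of positive weights")
    return min(1.0 / max(weights), n / sum(weights))


def inclusion_probabilities(weights, n):
    """Per-item appearance probabilities rho_t * w_i."""
    r = rho(weights, n)
    return [r * w for w in weights]


def two_point_variance(c):
    """Variance of a size equal to floor(c) or ceil(c) with mean c."""
    frac = c - math.floor(c)
    return frac * (1.0 - frac)


if __name__ == "__main__":
    n = 2
    for weights in ([1, 2, 3, 4], [1, 1, 10]):
        probs = inclusion_probabilities(weights, n)
        expected_size = sum(probs)
        # Every probability is at most 1 and the expected size is at most n.
        print(weights, probs, expected_size, two_point_variance(expected_size))
```

With weights `[1, 2, 3, 4]` the bound $n / \sum w_i$ is the smaller term, so the expected size equals $n$. With `[1, 1, 10]` the heavy item makes $1 / \max w_i$ the smaller term, so the expected size falls below $n$, which is the case Theorem 2 addresses.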
3. What is the cost of processing items in DOWNSAMPLE?
The cost of processing items in DOWNSAMPLE is $O(D_t)$, where $D_t$ is the number of elements discarded from the latent sample while processing items $x_1, \ldots, x_t$. Because the number of elements discarded is bounded by the number of elements inserted into the latent sample, this total is $O(t)$, that is, $O(1)$ amortized per item. The maximum cost to process a single item is $\Theta(n)$, which occurs when the sample has reached its maximum size of $n$ items and loses many of them because of a new heavy item. This cost is not prohibitive, however, because producing an output sample also costs $\Theta(n)$.
4. What is the main advantage of the new weighted random sampling scheme?
The main advantage of the new weighted random sampling scheme is that it prioritizes the PPS property over maintaining a fixed sample size, which expands the set of known unequal-probability sampling schemes. The scheme enforces an upper bound on the sample size while keeping the size as large and stable as possible. It is also simple to implement, runs in one pass so it can operate over data streams, and has the best known amortized execution cost per round. Future work will consider extremely large samples that do not fit in memory, combining ideas from this paper with the notion of a 'geometric file' as introduced by Pol et al. (2008).