1. What is the alternative approach to full sample analysis?
The alternative to full sample analysis is subsampling, which bases the analysis on a selected fraction of the whole data. Subsampling has been studied for various models, using sampling with or without replacement. Optimal subsampling methods require calculating data-dependent nonuniform sampling probabilities for all data at once, which can be challenging because of the large memory requirement and computational burden. The subsample observations may be independent conditional on the full data, but they are correlated unconditionally. Current results quantify the difference between the subsample estimator and the full data estimator through the conditional distribution and the conditional variance. The perturbation subsampling method developed in this paper optimizes convex objective functions in the analysis of large-scale data by generating stochastic weights in two steps: Bernoulli sampling and random perturbation weights. The method is feasible for distributed computing and provides an empirical distribution for statistical inference.
2. What is the optimization objective function in perturbation subsampling for large-scale data?
The optimization objective in perturbation subsampling for large-scale data is built from a continuous, convex function $f(b, x, y)$ that characterizes the relationship between the response variable $Y$ and the covariates $X$. The unknown parameter is $b_0 = \arg\min_{b \in \mathbb{R}^d} E[f(b, X, Y)]$, where the expectation is assumed to exist and be finite. The estimator of $b_0$ is obtained by minimizing the empirical analog of this objective over the $n$ observations $(y_i, x_i)$, $i = 1, \ldots, n$: $b_n = \arg\min_{b \in \mathbb{R}^d} \sum_{i=1}^{n} f(b, x_i, y_i)$. The resulting estimator is an M-estimator, which can be a maximum likelihood estimator (MLE) or an $L_p$-norm estimator, depending on the choice of $f(b, x, y)$.
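To make the M-estimator idea concrete, here is a minimal sketch, not the paper's code. It assumes an intercept-only model (no covariates), so the parameter is one-dimensional, and the names `empirical_objective`, `argmin_convex_1d`, and `m_estimate` are hypothetical helpers introduced for illustration. Minimizing the empirical squared loss should recover the sample mean, and minimizing the empirical absolute loss (an L1-norm estimator) should recover the sample median.

```python
import statistics


def empirical_objective(f, b, ys):
    """Sum of f(b, y_i) over the sample: the empirical analog of E[f(b, Y)]."""
    return sum(f(b, y) for y in ys)


def squared_loss(b, y):
    return (y - b) ** 2


def absolute_loss(b, y):
    return abs(y - b)


def argmin_convex_1d(g, lo, hi, iters=200):
    """Ternary search for a minimizer of a convex function g on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) < g(m2):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    return (lo + hi) / 2


def m_estimate(f, ys):
    """Minimize the empirical objective; the minimizer lies within the data range."""
    return argmin_convex_1d(
        lambda b: empirical_objective(f, b, ys), min(ys), max(ys)
    )


if __name__ == "__main__":
    sample = [1.0, 2.0, 2.5, 4.0, 10.0]
    # Compare the numerical minimizers with the closed-form answers.
    print(m_estimate(squared_loss, sample), statistics.mean(sample))
    print(m_estimate(absolute_loss, sample), statistics.median(sample))
```

The same pattern carries over to a d-dimensional parameter with covariates; only the minimization routine would need to change.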
3. What is the perturbation subsampling method?
Perturbation subsampling is a method that reduces the sample size from $n$ to $r_n$ via Bernoulli sampling and approximates the full data objective function via perturbation with independently generated stochastic weights. It involves two steps: subsampling with Bernoulli random variables, a step based on survey sampling, and perturbation with nonnegative stochastic weights drawn from a known probability distribution. Various probability distributions can be used for the stochastic weights. The method has been used in subsampling algorithms and can be implemented repeatedly to estimate the variance of the perturbation subsampling estimator. It is a novel approach compared with the rescaling by fixed and data-dependent weights used in other subsampling algorithms.
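The two steps can be sketched as follows for a simple linear regression with squared loss. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the exponential weight distribution with mean 1/q (where q = r/n is the Bernoulli inclusion probability) is an assumed choice that gives each product weight expectation 1, and the function names and simulated data are hypothetical.

```python
import random


def perturbation_weights(n, r, rng):
    """Return n stochastic weights w_i = u_i * v_i."""
    q = r / n  # Bernoulli inclusion probability, so about r points are kept
    weights = []
    for _ in range(n):
        u = 1 if rng.random() < q else 0  # step 1: Bernoulli sampling
        # Step 2 (assumed weight law): exponential with mean 1/q, so E[u * v] = 1.
        # v is drawn only when u = 1; any other draw would be multiplied by 0.
        v = rng.expovariate(q) if u else 0.0
        weights.append(u * v)
    return weights


def weighted_least_squares(xs, ys, ws):
    """Minimize sum_i w_i * (y_i - a - b * x_i)**2 over (a, b) in closed form."""
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def simulate(n, seed):
    """Hypothetical data: y = 1 + 2 * x + standard normal noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate(20000, seed=0)
    ws = perturbation_weights(len(xs), r=2000, rng=random.Random(1))
    print(weighted_least_squares(xs, ys, ws))               # perturbation subsample fit
    print(weighted_least_squares(xs, ys, [1.0] * len(xs)))  # full data fit
```

Only the observations with nonzero weight enter the weighted fit, so the cost of the minimization scales with the subsample size rather than with n.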
4. What is the purpose of Algorithm 2 in perturbation subsampling?
Algorithm 2 is a perturbation subsampling algorithm used for variance estimation. It generates $n$ i.i.d. random variables, calculates the expected value and variance of $V_n$, and minimizes the perturbed objective function to obtain $b_{n,k}$. The algorithm then estimates the conditional variance of $b_n$ and the unconditional variance using formulas (2.5) and (2.6). Repeating the stochastic weighting in this way, as described in Jin, Ying, and Wei (2001), yields an estimate of the variance of the perturbation subsampling estimator, which is what statistical inference requires.
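The repeated-weighting idea can be illustrated with a small sketch. It is a simplification under stated assumptions, not a reproduction of Algorithm 2: the model is an intercept-only squared loss (so each replicate is a weighted mean), the weights are assumed exponential with mean 1/q, and the plain sample variance of the replicates stands in for formulas (2.5) and (2.6), which are not reproduced here and may involve additional scaling. All function names are hypothetical.

```python
import random
import statistics


def perturbed_estimate(ys, r, rng):
    """One perturbation subsampling replicate of the weighted-mean estimator."""
    q = r / len(ys)  # Bernoulli inclusion probability
    num = den = 0.0
    for y in ys:
        if rng.random() < q:        # subsampling step: u_i = 1
            v = rng.expovariate(q)  # assumed weight law with mean 1/q
            num += v * y
            den += v
    if den == 0.0:
        raise ValueError("no observation was sampled; increase r")
    # The minimizer of sum_i u_i * v_i * (y_i - b)**2 is the weighted mean.
    return num / den


def replicate(ys, r, m, seed=0):
    """Repeat the stochastic weighting m times to get replicates b_{n,1..m}."""
    rng = random.Random(seed)
    return [perturbed_estimate(ys, r, rng) for _ in range(m)]


def summarize(estimates):
    """Average of the replicates and their sample variance."""
    return statistics.mean(estimates), statistics.variance(estimates)


if __name__ == "__main__":
    data_rng = random.Random(42)
    data = [data_rng.gauss(5.0, 1.0) for _ in range(5000)]
    center, spread = summarize(replicate(data, r=500, m=50, seed=1))
    print(center, spread)
```

Each replicate touches only about r observations, so the m repetitions can run independently, for example on separate workers.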