External Sampling
Alexandr Andoni, Piotr Indyk, Krzysztof Onak, Ronitt Rubinfeld
- 06 Jul 2009
pp 83-94
TL;DR: This paper considers three well-studied problems: testing distinctness, uniformity, and identity of an empirical distribution induced by data. It gives random-sampling-based algorithms whose number of block accesses is smaller than the main memory complexity of those problems.
Abstract: We initiate the study of sublinear-time algorithms in the external memory model [1]. In this model, the data is stored in blocks of a certain size $B$, and the algorithm is charged a unit cost for each block access. This model is well-studied, since it reflects the computational issues occurring when the (massive) input is stored on a disk. Since each block access operates on $B$ data elements in parallel, many problems have external memory algorithms whose number of block accesses is only a small fraction (e.g. $1/B$) of their main memory complexity.
However, to the best of our knowledge, no such reduction in complexity is known for any sublinear-time algorithm. One plausible explanation is that the vast majority of sublinear-time algorithms use random sampling and thus exhibit no locality of reference. This state of affairs is quite unfortunate, since both sublinear-time algorithms and the external memory model are important approaches to dealing with massive data sets, and ideally they should be combined to achieve best performance.
In this paper we show that such combination is indeed possible. In particular, we consider three well-studied problems: testing of distinctness, uniformity and identity of an empirical distribution induced by data. For these problems we show random-sampling-based algorithms whose number of block accesses is up to a factor of $1/\sqrt{B}$ smaller than the main memory complexity of those problems. We also show that this improvement is optimal for those problems.
Since these problems are natural primitives for a number of sampling-based algorithms for other problems, our tools improve the external memory complexity of other problems as well.
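The cost model described in the abstract can be illustrated with a small sketch. The code below is not the paper's algorithm and makes no claim of matching its bounds; it only shows how a sampling-based distinctness check might be charged per block access rather than per element. The names `BlockStore`, `read_block`, and `find_collision_by_blocks` are hypothetical, introduced here for illustration.

```python
import random


class BlockStore:
    """Hypothetical disk model: data is split into blocks of size B,
    and every block read costs one unit."""

    def __init__(self, data, block_size):
        self.data = list(data)
        self.block_size = block_size
        # Number of blocks, rounding up for a partial last block.
        self.num_blocks = (len(self.data) + block_size - 1) // block_size
        self.block_accesses = 0

    def read_block(self, i):
        # Charge one unit per block access, regardless of block size.
        self.block_accesses += 1
        start = i * self.block_size
        return self.data[start:start + self.block_size]


def find_collision_by_blocks(store, num_blocks_to_read, rng):
    """Read randomly chosen distinct blocks and report whether any value
    repeats among the elements read (a naive, illustrative test).

    The check is one-sided: an input with all-distinct elements is
    never reported as containing a collision.
    """
    seen = set()
    count = min(num_blocks_to_read, store.num_blocks)
    for i in rng.sample(range(store.num_blocks), count):
        for x in store.read_block(i):
            if x in seen:
                return True
            seen.add(x)
    return False
```

Each call to `read_block` yields up to $B$ elements for a single unit of cost, which is the source of the savings the abstract discusses. After a run, `store.block_accesses` holds the number of block reads that were charged.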
Citations
•Posted Content
Optimal Algorithms for Testing Closeness of Discrete Distributions
TL;DR: The first sub-linear time algorithm for this problem was presented in this paper, which matched the lower bounds of Valiant up to a logarithmic factor in $n$, and a polynomial factor in $O(n^2)$.
Testing Properties of Collections of Distributions
TL;DR: In this paper, the authors propose a framework for studying property testing of collections of distributions, where the number of distributions in the collection is a parameter of the problem, and give almost tight upper and lower bounds for this testing problem, as well as study an extension to a clusterability property.
Range Partitioning Within Sublinear Time in the External Memory Model.
Baoling Ning, Jianzhong Li, Shouxu Jiang
- 10 Aug 2020
TL;DR: Two lower bounds of the sampling cost required by the external sublinear range partitioning algorithms are proved, which show that it needs to make a full scan of the input in the worst case.
Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight
TL;DR: In this paper, the authors propose an enhanced feedback model called "multiple observations in hindsight" where after each episode of interaction with the POMDP, the learner may collect multiple additional observations emitted from the encountered latent states, but may not observe the latent states themselves.
Range partitioning within sublinear time: Algorithms and lower bounds
TL;DR: In this paper, the authors studied the lower and upper bounds for the range partitioning problem based on the RAM and I/O model, and proposed a sublinear external partitioning algorithm with $O\left(\frac{k \log(N/\delta)}{w B \epsilon^2}\right)$ I/Os.
References
External memory algorithms and data structures: dealing with massive data
TL;DR: This work surveys the state of the art in the design and analysis of external memory algorithms and data structures, where the goal is to exploit locality in order to reduce I/O costs.
•Book
Bulletin of the European Association for Theoretical Computer Science
Vladimiro Sassone
- 01 Oct 2005
TL;DR: The Computational Power of Simple Dynamics, by E. Natale as mentioned in this paper, is a seminal work in the field of computational power of simple dynamics, and is the basis for the present paper.
Testing that distributions are close
Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, Patrick White
- 12 Nov 2000
TL;DR: A sublinear algorithm which uses $O(n^{2/3} \epsilon^{-4} \log n)$ independent samples from each distribution, runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases when the distance between the distributions is small or large.
Testing random variables for independence and identity
Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, Patrick White
- 14 Oct 2001
TL;DR: Given access to independent samples of a distribution $A$ over $[n] \times [m]$, this work shows how to test whether the distributions formed by projecting $A$ to each coordinate are independent, i.e., whether $A$ is $\epsilon$-close in the $L_1$ norm to the product distribution $A_1 \times A_2$.
On testing expansion in bounded-degree graphs
Oded Goldreich, Dana Ron
- 01 Jan 2011
TL;DR: It is believed that the algorithm rejects any graph that is $\epsilon$-far from having second eigenvalue at most $\lambda^{\alpha/O(1)}$, and the validity of this belief is proved under an appealing combinatorial conjecture.