Coreset

Papers published on a yearly basis

Papers

Proceedings Article•10.1145/1007352.1007400•

On coresets for k-means and k-median clustering

Sariel Har-Peled¹, Soham Mazumdar¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

13 Jun 2004

TL;DR: This paper shows the existence of small coresets for k-median and k-means clustering of points in low dimensions, and uses them to improve the fastest known algorithms for (1+ε)-approximate k-means and k-median.

Abstract: In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a point set P in ℝ^d, one can compute a weighted set S ⊆ P, of size O(k ε^{-d} log n), such that one can compute the k-median/means clustering on S instead of on P, and get a (1+ε)-approximation. As a result, we improve the fastest known algorithms for (1+ε)-approximate k-means and k-median. Our algorithms have linear running time for a fixed k and ε. In addition, we can maintain the (1+ε)-approximate k-median or k-means clustering of a stream when points are only being inserted, using polylogarithmic space and update time.

700 citations
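
A minimal sketch of the coreset idea in the Har-Peled and Mazumdar abstract above: a weighted subset S ⊆ P whose k-median cost stays close to the cost on P. The uniform grid, the `cell` parameter, and the function names are assumptions made for illustration; the paper's own construction is more refined and is not reproduced here.

```python
import math
from collections import defaultdict

def kmedian_cost(points, centers, weights=None):
    """Weighted sum of distances from each point to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def grid_coreset(points, cell):
    """Keep one input point per grid cell, weighted by the cell's point count."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell) for x in p)
        buckets[key].append(p)
    reps = [b[0] for b in buckets.values()]           # representatives come from P
    weights = [float(len(b)) for b in buckets.values()]
    return reps, weights

points = [(0.1, 0.2), (0.15, 0.22), (0.9, 0.8), (0.95, 0.85), (0.5, 0.5)]
centers = [(0.0, 0.0), (1.0, 1.0)]

S, w = grid_coreset(points, cell=0.25)
full = kmedian_cost(points, centers)        # cost on the original set P
approx = kmedian_cost(S, centers, w)        # cost on the weighted subset S
```

By the triangle inequality, each point lies within one cell diagonal of its representative, so for any choice of centers the two costs differ by at most n · cell · √d.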

Proceedings Article•10.5555/2627817.2627920•

Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering

Dan Feldman¹, Melanie Schmidt², Christian Sohler²•Institutions (2)

Massachusetts Institute of Technology¹, Technical University of Dortmund²

6 Jan 2013

TL;DR: Using coresets with the merge-and-reduce approach, the authors obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering, and suggest a simple recursive coreset construction for cost functions other than squared Euclidean distances.

Abstract: […] ℝ^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the O(k/ε²) first right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^{O(jk)} for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d ~ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that is polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size […]

481 citations
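
The merge-and-reduce approach named in the Feldman, Schmidt and Sohler abstract above can be sketched as a binary counter over summaries. This is an illustrative outline only: `snap_reduce` is a hypothetical stand-in for a real coreset construction, and the function names and 1-D data are assumptions, not the authors' algorithm.

```python
from collections import defaultdict

def snap_reduce(items, cell=1.0):
    """Hypothetical reducer for 1-D data: keep one representative per cell,
    carrying the total weight of everything that fell into that cell."""
    groups = defaultdict(list)
    for value, weight in items:
        groups[int(value // cell)].append((value, weight))
    return [(g[0][0], sum(w for _, w in g)) for g in groups.values()]

def merge_and_reduce(stream, block_size, reduce_fn):
    levels = []   # levels[i]: None, or a summary of block_size * 2**i stream items
    block = []    # raw items not yet summarized

    for value in stream:
        block.append((value, 1))
        if len(block) < block_size:
            continue
        summary, block = reduce_fn(block), []
        i = 0
        # Carry upward like incrementing a binary counter:
        # two summaries on the same level are merged, reduced, and promoted.
        while i < len(levels) and levels[i] is not None:
            summary = reduce_fn(levels[i] + summary)
            levels[i] = None
            i += 1
        if i == len(levels):
            levels.append(summary)
        else:
            levels[i] = summary

    # Final answer: leftover raw items plus every stored summary.
    result = list(block)
    for summary in levels:
        if summary is not None:
            result.extend(summary)
    return result, levels

stream = [(7 * i) % 10 + 0.5 for i in range(70)]
summary, levels = merge_and_reduce(stream, block_size=8, reduce_fn=snap_reduce)
```

Only one summary per level is kept, so a stream of n items needs roughly log₂(n / block_size) summaries in memory. Each reduce step can add approximation error, which is why constructions of this kind typically tighten the per-level error as the tree grows.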

Geometric Approximation via Coresets

Pankaj K. Agarwal¹, Sariel Har-Peled², Kasturi Varadarajan³•Institutions (3)

Duke University¹, University of Illinois at Urbana–Champaign², University of Iowa³

1 Jan 2007

TL;DR: The paradigm of coresets has recently emerged as a powerful tool for efficiently approximating various extent measures of a point set P and has been successfully applied to various optimization and extent measure problems.

Abstract: The paradigm of coresets has recently emerged as a powerful tool for efficiently approximating various extent measures of a point set P. Using this paradigm, one quickly computes a small subset Q of P, called a coreset, that approximates the original set P, and then solves the problem on Q using a relatively inefficient algorithm. The solution for Q is then translated to an approximate solution to the original point set P. This paper describes the ways in which this paradigm has been successfully applied to various optimization and extent measure problems.

472 citations
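
The workflow in the Agarwal, Har-Peled and Varadarajan abstract above (compute a small subset Q of P, then run a slow exact algorithm on Q) can be illustrated with the diameter of a planar point set. This is a simplified sketch under stated assumptions: keeping the extreme points along `m` evenly spaced directions is a choice made here for illustration, and it is not claimed to be the construction described in the paper.

```python
import math
import random
from itertools import combinations

def diameter(points):
    """Brute-force O(n^2) diameter: the 'relatively inefficient' exact solver."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))

def directional_extremes(points, m):
    """Keep the two extreme points of P along each of m evenly spaced directions."""
    subset = set()
    for j in range(m):
        theta = math.pi * j / m
        ux, uy = math.cos(theta), math.sin(theta)
        key = lambda p: p[0] * ux + p[1] * uy   # projection onto direction theta
        subset.add(min(points, key=key))
        subset.add(max(points, key=key))
    return list(subset)

rng = random.Random(1)  # fixed seed so the demo is repeatable
P = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
m = 8

Q = directional_extremes(P, m)   # at most 2 * m points
exact = diameter(P)              # slow solver on all of P
approx = diameter(Q)             # same solver on the small subset
```

Some sampled direction lies within an angle of π/(2m) of the true diameter direction, so in this planar setting the diameter of Q is at least cos(π/(2m)) times the diameter of P, and it can never exceed it because Q ⊆ P.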

Proceedings Article•10.1145/1993636.1993712•

A unified framework for approximating and clustering data

Dan Feldman¹, Michael Langberg²•Institutions (2)

California Institute of Technology¹, Open University of Israel²

6 Jun 2011

TL;DR: This paper presents a unified framework for constructing coresets and approximate clustering for general sets of functions, and shows how to generalize its results to squared distances, distances to the qth power, and deterministic constructions.

Abstract: Given a set F of n positive functions over a ground set X, we consider the problem of computing x* that minimizes the expression ∑_{f ∈ F} f(x), over x ∈ X. A typical application is shape fitting, where we wish to approximate a set P of n elements (say, points) by a shape x from a (possibly infinite) family X of shapes. Here, each point p ∈ P corresponds to a function f such that f(x) is the distance from p to x, and we seek a shape x that minimizes the sum of distances from each point in P. In the k-clustering variant, each x ∈ X is a tuple of k shapes, and f(x) is the distance from p to its closest shape in x. Our main result is a unified framework for constructing coresets and approximate clustering for such general sets of functions. To achieve our results, we forge a link between the classic and well-defined notion of ε-approximations from the theory of PAC learning and VC dimension, and the relatively new (and not so consistent) paradigm of coresets, which are some kind of "compressed representation" of the input set F. Using traditional techniques, a coreset usually implies an LTAS (linear time approximation scheme) for the corresponding optimization problem, which can be computed in parallel, via one pass over the data, and using only polylogarithmic space (i.e., in the streaming model). For several function families F for which coresets are known not to exist, or the corresponding (approximate) optimization problems are hard, our framework yields bicriteria approximations, or coresets that are large, but contained in a low-dimensional space. We demonstrate our unified framework by applying it on projective clustering problems.
We obtain new coreset constructions and significantly smaller coresets, over the ones that appeared in the literature during the past years, for problems such as: k-Median [Har-Peled and Mazumdar, STOC'04], [Chen, SODA'06], [Langberg and Schulman, SODA'10]; k-Line median [Feldman, Fiat and Sharir, FOCS'06], [Deshpande and Varadarajan, STOC'07]; Projective clustering [Deshpande et al., SODA'06], [Deshpande and Varadarajan, STOC'07]; Linear ℓ_p regression [Clarkson and Woodruff, STOC'09]; Low-rank approximation [Sarlos, FOCS'06]; Subspace approximation [Shyamalkumar and Varadarajan, SODA'07], [Feldman, Monemizadeh, Sohler and Woodruff, SODA'10], [Deshpande, Tulsiani, and Vishnoi, SODA'11]. The running times of the corresponding optimization problems are also significantly improved. We show how to generalize the results of our framework for squared distances (as in k-means), distances to the qth power, and deterministic constructions.

468 citations
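
The functional view in the Feldman and Langberg abstract above (one function f per input point, minimize ∑ f(x) over candidate solutions x) can be made concrete with a tiny brute-force example. The 1-D data, the finite candidate family `X`, and the helper names are assumptions for illustration; the framework itself handles far more general function families.

```python
from itertools import combinations

def make_f(p):
    """f_p(x): distance from point p to the closest shape in the tuple x."""
    return lambda x: min(abs(p - c) for c in x)

P = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
F = [make_f(p) for p in P]          # one function per input point

# k-clustering variant with k = 2: each candidate x is a tuple of two centers.
# Restricting candidates to pairs of input points is an assumption of this demo.
X = list(combinations(P, 2))

def total(F, x):
    """The objective: the sum of f(x) over all f in F."""
    return sum(f(x) for f in F)

best = min(X, key=lambda x: total(F, x))
best_cost = total(F, best)
print(best, best_cost)   # (1.0, 11.0) 4.0
```

A coreset in this setting would be a small weighted subset of `F` whose weighted sum stays close to `total(F, x)` for every candidate x, so that the minimization could be run on the subset instead.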

Journal Article•10.1145/2133803.2184450•

StreamKM++: A clustering algorithm for data streams

Marcel R. Ackermann¹, Marcus Märtens¹, Christoph Raupach¹, Kamil Swierkot¹, Christiane Lammersen², Christian Sohler³•Institutions (3)

University of Paderborn¹, Simon Fraser University², Technical University of Dortmund³

22 May 2012-ACM Journal of Experimental Algorithms

TL;DR: This article proposes a new k-means clustering algorithm for data streams of points from a Euclidean space. It computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07).

Abstract: We develop a new k-means clustering algorithm for data streams of points from a Euclidean space. We call this algorithm StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07). To compute the small sample, we propose two new techniques. First, we use an adaptive, nonuniform sampling approach similar to the k-means++ seeding procedure to obtain small coresets from the data stream. This construction is rather easy to implement and, unlike other coreset constructions, its running time has only a small dependency on the dimensionality of the data. Second, we propose a new data structure, which we call coreset tree. The use of these coreset trees significantly speeds up the time necessary for the adaptive, nonuniform sampling during our coreset construction. We compare our algorithm experimentally with two well-known streaming implementations: BIRCH [Zhang et al. 1997] and StreamLS [Guha et al. 2003]. In terms of quality (sum of squared errors), our algorithm is comparable with StreamLS and significantly better than BIRCH (up to a factor of 2). Besides, BIRCH requires significant effort to tune its parameters. In terms of running time, our algorithm is slower than BIRCH. Comparing the running time with StreamLS, it turns out that our algorithm scales much better with increasing number of centers. We conclude that, if the first priority is the quality of the clustering, then our algorithm provides a good alternative to BIRCH and StreamLS, in particular, if the number of cluster centers is large. We also give a theoretical justification of our approach by proving that our sample set is a small coreset in low-dimensional spaces.

360 citations
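
The "adaptive, nonuniform sampling approach similar to the k-means++ seeding procedure" in the StreamKM++ abstract above can be sketched as plain squared-distance sampling. This is a minimal illustration, not the StreamKM++ implementation: it omits the coreset tree entirely, and the weighting rule in `weigh` (count the input points closest to each sample) is an assumption made here to produce a weighted sample.

```python
import math
import random

def d2_sample(points, m, rng):
    """Pick m points: the first uniformly at random, each later one with
    probability proportional to its squared distance from the points so far."""
    sample = [rng.choice(points)]
    d2 = [math.dist(p, sample[0]) ** 2 for p in points]
    while len(sample) < m and sum(d2) > 0:
        nxt = rng.choices(points, weights=d2, k=1)[0]
        sample.append(nxt)
        # Each point remembers the squared distance to its nearest sampled point.
        d2 = [min(old, math.dist(p, nxt) ** 2) for old, p in zip(d2, points)]
    return sample

def weigh(points, sample):
    """Weight each sampled point by how many input points are closest to it."""
    weights = [0] * len(sample)
    for p in points:
        nearest = min(range(len(sample)), key=lambda i: math.dist(p, sample[i]))
        weights[nearest] += 1
    return weights

rng = random.Random(0)  # fixed seed so the demo is repeatable
P = [(float(i % 5), float(i // 5)) for i in range(25)]   # 5 x 5 grid of points

S = d2_sample(P, 4, rng)
w = weigh(P, S)
```

Points far from everything sampled so far are more likely to be picked next, which spreads the sample across the data. The weighted pair `(S, w)` could then be passed to any weighted k-means solver in place of `P`.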

Year	Papers
2022	1
2021	100
2020	149
2019	126
2018	52
2017	31
