Fast and Accurate k-means For Large Datasets

Open Access•Proceedings Article

- 12 Dec 2011

- Vol. 24, pp 2375-2383

175

TL;DR: This work considers the k-means problem when the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where as little memory as possible must be used.
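
The summary above states only the problem setting, not the paper's method. As an illustration of that setting, here is a minimal sketch of a classic one-pass baseline (a MacQueen-style sequential update), which is not the algorithm proposed in the paper. It reads points one at a time and keeps only k centers and k counters in memory. The names `sequential_kmeans` and `read_points`, and the choice to seed with the first k points, are assumptions made for this example.

```python
from typing import Iterable, List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sequential_kmeans(stream: Iterable[Sequence[float]], k: int) -> List[List[float]]:
    """Single pass over `stream`, storing only k centers and k counts.

    Illustrative baseline only; NOT the method of the paper listed here.
    """
    centers: List[List[float]] = []
    counts: List[int] = []
    for point in stream:
        if len(centers) < k:
            # Assumption: seed with the first k points seen.
            centers.append(list(point))
            counts.append(1)
            continue
        # Assign the point to its nearest current center.
        j = min(range(k), key=lambda i: sq_dist(point, centers[i]))
        counts[j] += 1
        # Running-mean update: move the center a 1/count step toward the point.
        step = 1.0 / counts[j]
        centers[j] = [c + step * (x - c) for c, x in zip(centers[j], point)]
    return centers


def read_points():
    """Hypothetical data source standing in for sequential reads from disk."""
    for i in range(1000):
        base = 0.0 if i % 2 == 0 else 10.0
        yield (base + (i % 5) * 0.01, base - (i % 3) * 0.01)


centers = sequential_kmeans(read_points(), k=2)
```

Memory use here is independent of the number of points, which is the constraint the summary describes. The quality of the result depends heavily on the arrival order and on the seeding, which is one reason more careful streaming algorithms are studied.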

Citations

•Journal Article•10.1155/2013/704504

A Review of Data Fusion Techniques

Federico Castanedo

- 27 Oct 2013

- The Scientific World Journal

TL;DR: This paper summarizes the state of the data fusion field and describes the most relevant studies, enumerates and explains different classification schemes for data fusion, and reviews the most common algorithms.

940

Journal Article•10.1007/S10462-016-9477-7

A review of moving object trajectory clustering algorithms

Guan Yuan, +4 more

- 01 Jan 2017

- Artificial Intelligence Review

TL;DR: The strategies and implementation processes of classical moving object clustering algorithms are discussed, along with the measures that determine the similarity/dissimilarity between two trajectories, and some application scenarios are pointed out.

327

Journal Article•10.1016/J.KNOSYS.2019.06.032

Fast density peak clustering for large scale data based on kNN

Yewang Chen, +10 more

- 01 Jan 2020

- Knowledge Based Systems

TL;DR: A simple but fast DPeak, namely FastDPeak, is proposed, which runs in about $O(n\log(n))$ expected time in the intrinsic dimensionality and replaces density with kNN-density, which is computed by a fast kNN algorithm such as cover tree, yielding a huge improvement for density computations.

228

•Proceedings Article

Approximate k-means++ in sublinear time

Olivier Bachem, +3 more

- 12 Feb 2016

TL;DR: This work proposes a simple and efficient seeding algorithm for k-means clustering that retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points, and can thus obtain a provably good clustering in sublinear time.

155

References

Proceedings Article•10.1145/380752.380755

Local search heuristic for k-median and facility location problems

Vijay Arya, +5 more

- 06 Jul 2001

TL;DR: This paper analyzes local search heuristics for the k-median and facility location problems and proves that without this stretch, the problem becomes NP-Hard to approximate.

421

Journal Article•10.1145/2133803.2184450

StreamKM++: A clustering algorithm for data streams

Marcel R. Ackermann, +5 more

- 22 May 2012

- ACM Journal of Experimental Algorithms

TL;DR: In this article, a new k-means clustering algorithm for data streams of points from a Euclidean space is proposed, which computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07).

360

•Proceedings Article•10.1109/SFCS.2001.959917

Online facility location

Adam Meyerson

- 14 Oct 2001

TL;DR: This work considers the online variant of facility location, in which demand points arrive one at a time and the operator must maintain a set of facilities to service these points, and provides a randomized online O(1)-competitive algorithm in the case where points arrive in random order.

346

Proceedings Article•10.1145/780542.780548

Better streaming algorithms for clustering problems

Moses Charikar, +2 more

- 09 Jun 2003

TL;DR: This work gives a randomized algorithm for the k-median problem that produces a constant-factor approximation in one pass using storage space O(k poly log n), and gives bicriterion guarantees, producing constant-factor approximations by slightly increasing the allowed fraction of outliers.

306

Journal Article•10.1137/070699007

On Coresets for $k$-Median and $k$-Means Clustering in Metric and Euclidean Spaces and Their Applications

Ke Chen

- 01 Sep 2009

- SIAM Journal on Computing

TL;DR: These are the first streaming algorithms, for those problems, that have space complexity with polynomial dependency on the dimension, using $O(d^2k^2\varepsilon^{-2}\log^8n)$ space.

298

Citations

Elastic Machine Learning Algorithms in Amazon SageMaker

Related Papers (5)

Least squares quantization in PCM

k-means++: the advantages of careful seeding

BIRCH: an efficient data clustering method for very large databases

A framework for clustering evolving data streams

A density-based algorithm for discovering clusters in large spatial Databases with Noise