Open Access Proceedings Article
Fast and Accurate k-means For Large Datasets
Michael Shindler, Alexander Wong, Adam Meyerson
- 12 Dec 2011
- Vol. 24, pp 2375-2383
TL;DR: This work considers the k-means problem when the data is too large to fit in main memory and must be read sequentially, such as from a disk, while using as little memory as possible.
Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) time (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms, both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
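The Θ(nk) figure the abstract quotes for evaluating a given solution can be seen from a direct implementation: each of the n points is compared against each of the k centers exactly once. The sketch below is only an illustration of that cost computation in a single sequential pass, not the paper's algorithm; the function names `squared_distance` and `kmeans_cost` are hypothetical and do not come from the paper.

```python
from typing import Iterable, Sequence


def squared_distance(p: Sequence[float], c: Sequence[float]) -> float:
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans_cost(points: Iterable[Sequence[float]],
                centers: Sequence[Sequence[float]]) -> float:
    """Sum of squared distances from each point to its nearest center.

    `points` may be any iterable (for example, a generator reading
    records from disk), so only one point is held in memory at a time.
    The loop does k distance evaluations per point: Theta(n * k) overall.
    """
    total = 0.0
    for p in points:  # one sequential pass over the data
        total += min(squared_distance(p, c) for c in centers)
    return total


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
    solution = [(0.5, 0.0), (10.5, 0.0)]
    # iter(...) stands in for a stream that can be read only once.
    print(kmeans_cost(iter(data), solution))
```

Because the inner `min` scans all k centers, replacing it with an approximate nearest neighbor lookup is the natural place to look for a running time below nk, which is the direction the abstract describes.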
Citations
A Review of Data Fusion Techniques
TL;DR: This paper summarizes the state of the data fusion field and describes the most relevant studies, enumerates and explains different classification schemes for data fusion, and reviews the most common algorithms.
A review of moving object trajectory clustering algorithms
TL;DR: The strategies and implementation processes of classical moving object clustering algorithms are discussed, along with the measures that determine the similarity or dissimilarity between two trajectories, and some application scenarios are pointed out.
Fast density peak clustering for large scale data based on kNN
Yewang Chen, Xiaoliang Hu, Wentao Fan, Lianlian Shen, Zheng Zhang, Xin Liu, Ji-Xiang Du, Haibo Li, Yi Chen, Hailin Li
TL;DR: A simple but fast DPeak, namely FastDPeak, is proposed, which runs in about O(n log(n)) expected time in the intrinsic dimensionality and replaces density with kNN-density, which is computed by a fast kNN algorithm such as cover tree, yielding a huge improvement for density computations.
Proceedings Article
Approximate k-means++ in sublinear time
Olivier Bachem, Mario Lucic, S. Hamed Hassani, Andreas Krause
- 12 Feb 2016
TL;DR: This work proposes a simple and efficient seeding algorithm for K-Means clustering that retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points and can thus obtain a provably good clustering in sublinear time.
Elastic Machine Learning Algorithms in Amazon SageMaker
Edo Liberty, Zohar Karnin, Bing Xiang, Laurence Rouesnel, Baris Coskun, Ramesh Nallapati, Julio Delgado, Amir Sadoughi, Yury Astashonok, Piali Das, Can Balioglu, Saswata Chakravarty, Madhav Jha, Philip Gautier, David Arpin, Tim Januschowski, Valentin Flunkert, Yuyang Wang, Jan Gasthaus, Lorenzo Stella, Syama Sundar Rangapuram, David Salinas, Sebastian Schelter, Alexander J. Smola
- 11 Jun 2020
TL;DR: This work describes the computational model behind Amazon SageMaker, an ML platform provided as part of Amazon Web Services, which supports incremental training, resumable and elastic learning, and automatic hyperparameter optimization.
References
Local search heuristic for k-median and facility location problems
Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, Vinayaka Pandit
- 06 Jul 2001
TL;DR: This paper analyzes local search heuristics for the k-median and facility location problems and proves that without this stretch, the problem becomes NP-Hard to approximate.
StreamKM++: A clustering algorithm for data streams
Marcel R. Ackermann, Marcus Märtens, Christoph Raupach, Kamil Swierkot, Christiane Lammersen, Christian Sohler
TL;DR: In this article, a new k-means clustering algorithm for data streams of points from a Euclidean space is proposed, which computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07).
Online facility location
Adam Meyerson
- 14 Oct 2001
TL;DR: This work considers the online variant of facility location, in which demand points arrive one at a time and the operator must maintain a set of facilities to service these points, and provides a randomized online O(1)-competitive algorithm in the case where points arrive in random order.
Better streaming algorithms for clustering problems
Moses Charikar, Liadan O'Callaghan, Rina Panigrahy
- 09 Jun 2003
TL;DR: This work gives a randomized algorithm for the k-median problem that produces a constant-factor approximation in one pass using O(k poly log n) storage space, and gives bicriterion guarantees, producing constant-factor approximations by slightly increasing the allowed fraction of outliers.
On Coresets for $k$-Median and $k$-Means Clustering in Metric and Euclidean Spaces and Their Applications
TL;DR: These are the first streaming algorithms, for those problems, that have space complexity with polynomial dependency on the dimension, using $O(d^2k^2\varepsilon^{-2}\log^8n)$ space.