StreamKM++: A clustering algorithm for data streams

doi:10.1145/2133803.2184450

Journal Article

Marcel R. Ackermann, +5 more

- 22 May 2012

- ACM Journal of Experimental Algorithms

- Vol. 17

354

TL;DR: The article proposes a new k-means clustering algorithm for data streams of points from a Euclidean space; it computes a small weighted sample of the data stream and solves the k-means problem on that sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07).

Abstract: We develop a new k-means clustering algorithm for data streams of points from a Euclidean space. We call this algorithm StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07). To compute the small sample, we propose two new techniques. First, we use an adaptive, nonuniform sampling approach similar to the k-means++ seeding procedure to obtain small coresets from the data stream. This construction is rather easy to implement and, unlike other coreset constructions, its running time has only a small dependency on the dimensionality of the data. Second, we propose a new data structure, which we call coreset tree. The use of these coreset trees significantly speeds up the time necessary for the adaptive, nonuniform sampling during our coreset construction.

We compare our algorithm experimentally with two well-known streaming implementations: BIRCH [Zhang et al. 1997] and StreamLS [Guha et al. 2003]. In terms of quality (sum of squared errors), our algorithm is comparable with StreamLS and significantly better than BIRCH (up to a factor of 2). Besides, BIRCH requires significant effort to tune its parameters. In terms of running time, our algorithm is slower than BIRCH. Comparing the running time with StreamLS, it turns out that our algorithm scales much better with an increasing number of centers. We conclude that, if the first priority is the quality of the clustering, then our algorithm provides a good alternative to BIRCH and StreamLS, in particular if the number of cluster centers is large. We also give a theoretical justification of our approach by proving that our sample set is a small coreset in low-dimensional spaces.
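The sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`sq_dist`, `d2_sample_coreset`, `weighted_kmeanspp_seed`, `weighted_cost`) are hypothetical, the coreset tree and all streaming bookkeeping are omitted, and weighting each representative by the number of input points nearest to it is an assumption made for illustration. The sketch draws representatives with probability proportional to their squared distance from the representatives chosen so far (the k-means++-style adaptive, nonuniform sampling), then seeds centers on the resulting weighted sample.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample_coreset(points, m, rng=None):
    """Pick m representatives by D^2 sampling and weight them (assumption:
    weight = number of input points closest to that representative).

    Plain O(n * m) version; no coreset tree is used here.
    """
    rng = rng or random.Random(0)
    n = len(points)
    if m >= n:
        return [(tuple(p), 1.0) for p in points]
    first = rng.randrange(n)  # first representative: uniform at random
    reps = [first]
    dist = [sq_dist(p, points[first]) for p in points]  # D^2 to nearest rep
    nearest = [0] * n  # position in reps of each point's nearest rep
    while len(reps) < m and sum(dist) > 0.0:
        # Next representative: probability proportional to current D^2.
        chosen = rng.choices(range(n), weights=dist, k=1)[0]
        reps.append(chosen)
        j = len(reps) - 1
        for i, p in enumerate(points):
            d = sq_dist(p, points[chosen])
            if d < dist[i]:
                dist[i], nearest[i] = d, j
    weights = [0.0] * len(reps)
    for j in nearest:
        weights[j] += 1.0
    return [(tuple(points[idx]), w) for idx, w in zip(reps, weights)]


def weighted_kmeanspp_seed(sample, k, rng=None):
    """k-means++-style seeding on a weighted sample: each candidate is
    drawn with probability proportional to weight * D^2."""
    rng = rng or random.Random(1)
    pts = [p for p, _ in sample]
    centers = [rng.choices(pts, weights=[w for _, w in sample], k=1)[0]]
    while len(centers) < k:
        scores = [w * min(sq_dist(p, c) for c in centers) for p, w in sample]
        if sum(scores) == 0.0:
            break  # every sample point already coincides with a center
        centers.append(rng.choices(pts, weights=scores, k=1)[0])
    return centers


def weighted_cost(sample, centers):
    """Weighted sum of squared errors of the sample w.r.t. the centers."""
    return sum(w * min(sq_dist(p, c) for c in centers) for p, w in sample)


if __name__ == "__main__":
    data_rng = random.Random(42)
    data = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
            for cx, cy in [(0, 0), (10, 10), (0, 10)] for _ in range(200)]
    sample = d2_sample_coreset(data, m=30)
    centers = weighted_kmeanspp_seed(sample, k=3)
    print(len(sample), weighted_cost(sample, centers))
```

In this sketch the weights always sum to the number of input points, so the weighted cost on the sample is on the same scale as the sum of squared errors on the full data. A complete pipeline would typically follow the seeding with Lloyd-style refinement iterations on the weighted sample.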

Citations

Journal Article•10.1109/MCI.2015.2471196

Learning in Nonstationary Environments: A Survey

Gregory Ditzler, +3 more

- 12 Oct 2015

- IEEE Computational Intelligence Magazine

TL;DR: In such nonstationary environments, where the probabilistic properties of the data change over time, a non-adaptive model trained under the false stationarity assumption is bound to become obsolete in time, and perform sub-optimally at best, or fail catastrophically at worst.

848

Journal Article•10.14778/2180912.2180915

Scalable k-means++

Bahman Bahmani, +4 more

- 01 Mar 2012

TL;DR: In this article, the authors show how to reduce the number of passes needed to obtain, in parallel, a good initialization of k-means++ in both sequential and parallel settings.

718

Journal Article•10.1016/j.engappai.2022.104743

A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects

Ezugwu E. Absalom, +6 more

- 01 Apr 2022

- Engineering Applications of Artificial I...

TL;DR: Clustering is an essential tool in data mining research and applications, and it is the subject of active research in many fields of study, such as computer science, data science, statistics, pattern recognition, artificial intelligence, and machine learning.

573

Journal Article•10.1145/2522968.2522981

Data stream clustering: A survey

Jonathan de Andrade Silva, +5 more

- 11 Jul 2013

- ACM Computing Surveys

TL;DR: A survey of data stream clustering algorithms is presented, providing a thorough discussion of the main design components of state-of-the-art algorithms and an overview of the usually employed experimental methodologies.

564

Proceedings Article•10.1145/3041021.3055135

The Paradigm-Shift of Social Spambots: Evidence, Theories, and Tools for the Arms Race

Stefano Cresci, +4 more

- 03 Apr 2017

TL;DR: In this article, the authors extensively study the social spambots on Twitter and provide quantitative evidence that a paradigm shift exists in spambot design and propose new approaches capable of turning the tide in the fight against this raising phenomenon.

433

References

Some methods for classification and analysis of multivariate observations

James B. MacQueen

- 01 Jan 1967

TL;DR: This paper describes k-means, a process for partitioning an N-dimensional population into k sets on the basis of a sample. The k-means concept generalizes the ordinary sample mean, and the process is shown to give partitions that are reasonably efficient in the sense of within-class variance.

28.1K

UCI Machine Learning Repository

A. Asuncion

- 01 Jan 2007

24.3K

Journal Article•10.1109/TIT.1982.1056489

Least squares quantization in PCM

S. P. Lloyd

- 01 Mar 1982

- IEEE Transactions on Information Theory

TL;DR: In this article, the authors derived necessary conditions for any finite number of quanta and associated quantization intervals of an optimum finite quantization scheme to achieve minimum average quantization noise power.

16K

Proceedings Article•10.5555/1283383.1283494

k-means++: the advantages of careful seeding

David Arthur, +1 more

- 07 Jan 2007

TL;DR: By augmenting k-means with a very simple, randomized seeding technique, this work obtains an algorithm that is Θ(log k)-competitive with the optimal clustering.

9.5K

Journal Article•10.1145/272991.272995

Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator

Makoto Matsumoto, +1 more

- 01 Jan 1998

- ACM Transactions on Modeling and Compute...

TL;DR: A new algorithm called Mersenne Twister (MT) is proposed for generating uniform pseudorandom numbers, which provides a super astronomical period of 2^19937 − 1 and 623-dimensional equidistribution up to 32-bit accuracy, while using a working area of only 624 words.

6.4K

Related Papers (5)

A framework for clustering evolving data streams

Density-Based Clustering over an Evolving Data Stream with Noise.

k-means++: the advantages of careful seeding

BIRCH: an efficient data clustering method for very large databases

A density-based algorithm for discovering clusters in large spatial Databases with Noise