Journal Article. DOI: 10.1145/2133803.2184450
StreamKM++: A clustering algorithm for data streams
Marcel R. Ackermann, Marcus Märtens, Christoph Raupach, Kamil Swierkot, Christiane Lammersen, Christian Sohler
TL;DR: This article proposes a new k-means clustering algorithm for data streams of points from a Euclidean space. It computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07).
Abstract: We develop a new k-means clustering algorithm for data streams of points from a Euclidean space. We call this algorithm StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07). To compute the small sample, we propose two new techniques. First, we use an adaptive, nonuniform sampling approach similar to the k-means++ seeding procedure to obtain small coresets from the data stream. This construction is rather easy to implement and, unlike other coreset constructions, its running time has only a small dependency on the dimensionality of the data. Second, we propose a new data structure, which we call coreset tree. The use of these coreset trees significantly speeds up the time necessary for the adaptive, nonuniform sampling during our coreset construction.

We compare our algorithm experimentally with two well-known streaming implementations: BIRCH [Zhang et al. 1997] and StreamLS [Guha et al. 2003]. In terms of quality (sum of squared errors), our algorithm is comparable with StreamLS and significantly better than BIRCH (up to a factor of 2). Besides, BIRCH requires significant effort to tune its parameters. In terms of running time, our algorithm is slower than BIRCH. Comparing the running time with StreamLS, it turns out that our algorithm scales much better with increasing number of centers. We conclude that, if the first priority is the quality of the clustering, then our algorithm provides a good alternative to BIRCH and StreamLS, in particular, if the number of cluster centers is large. We also give a theoretical justification of our approach by proving that our sample set is a small coreset in low-dimensional spaces.
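The adaptive, nonuniform sampling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it omits the coreset tree entirely (each draw below rescans all points), and the function names `sq_dist`, `d2_weighted_sample`, and `weighted_sse` are hypothetical. Weighting each representative by the number of points closest to it is an assumption made here for illustration. The sketch picks representatives with probability proportional to squared distance from those already chosen, as in k-means++ seeding, and evaluates quality as a (weighted) sum of squared errors.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_weighted_sample(points, m, seed=0):
    """Pick up to m representatives by k-means++-style D^2 sampling.

    Returns a list of (point, weight) pairs. The weight of a representative
    is the number of input points closest to it (an illustrative assumption).
    """
    rng = random.Random(seed)
    # First representative: chosen uniformly at random.
    chosen = [rng.randrange(len(points))]
    # dist[i] = squared distance from points[i] to its nearest representative.
    dist = [sq_dist(p, points[chosen[0]]) for p in points]
    # Stop early if every point coincides with some representative.
    while len(chosen) < m and sum(dist) > 0:
        # Draw the next representative with probability proportional to dist.
        idx = rng.choices(range(len(points)), weights=dist, k=1)[0]
        chosen.append(idx)
        dist = [min(d, sq_dist(p, points[idx])) for d, p in zip(dist, points)]
    # Assign every input point to its nearest representative and count.
    weights = [0] * len(chosen)
    for p in points:
        nearest = min(range(len(chosen)),
                      key=lambda j: sq_dist(p, points[chosen[j]]))
        weights[nearest] += 1
    return [(points[i], w) for i, w in zip(chosen, weights)]


def weighted_sse(weighted_points, centers):
    """Sum of squared errors of weighted points against a set of centers."""
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in weighted_points)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
    sample = d2_weighted_sample(data, m=3, seed=1)
    print(sample)
    print(weighted_sse(sample, centers=[(0.0, 0.0), (10.0, 10.0)]))
```

In a streaming setting, a weighted sample like this would stand in for the full data when running k-means++; the rescan per draw in this sketch is the cost the abstract says the coreset tree is designed to reduce.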
Citations
Learning in Nonstationary Environments: A Survey
TL;DR: In such nonstationary environments, where the probabilistic properties of the data change over time, a non-adaptive model trained under the false stationarity assumption is bound to become obsolete in time, and perform sub-optimally at best, or fail catastrophically at worst.
Scalable k-means++
Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii
- 01 Mar 2012
TL;DR: In this article, the authors show how to reduce the number of passes needed to obtain, in parallel, a good initialization of k-means++ in both sequential and parallel settings.
A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects
Ezugwu E. Absalom, Abiodun Motunrayo Ikotun, Olaide Nathaniel Oyelade, Laith Abualigah, Jeffrey O. Agushaka, Christopher Ifeanyi Eke, Andronicus Ayobami Akinyelu
TL;DR: Clustering is an essential tool in data mining research and applications, and it is the subject of active research in many fields of study, such as computer science, data science, statistics, pattern recognition, artificial intelligence, and machine learning.
Data stream clustering: A survey
Jonathan de Andrade Silva, Elaine R. Faria, Rodrigo C. Barros, Eduardo R. Hruschka, André C. P. L. F. de Carvalho, João Gama
TL;DR: A survey of data stream clustering algorithms is presented, providing a thorough discussion of the main design components of state-of-the-art algorithms and an overview of the usually employed experimental methodologies.
The Paradigm-Shift of Social Spambots: Evidence, Theories, and Tools for the Arms Race
Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, Maurizio Tesconi
- 03 Apr 2017
TL;DR: In this article, the authors extensively study the social spambots on Twitter and provide quantitative evidence that a paradigm shift exists in spambot design and propose new approaches capable of turning the tide in the fight against this raising phenomenon.
References
Some methods for classification and analysis of multivariate observations
James B. MacQueen
- 01 Jan 1967
TL;DR: The k-means procedure partitions an N-dimensional population into k sets on the basis of a sample. The k-means concept generalizes the ordinary sample mean, and the resulting partitions are shown to be reasonably efficient in the sense of within-class variance.
Least squares quantization in PCM
TL;DR: In this article, the authors derived necessary conditions for any finite number of quanta and associated quantization intervals of an optimum finite quantization scheme to achieve minimum average quantization noise power.
k-means++: the advantages of careful seeding
David Arthur, Sergei Vassilvitskii
- 07 Jan 2007
TL;DR: By augmenting k-means with a very simple, randomized seeding technique, this work obtains an algorithm that is Θ(log k)-competitive with the optimal clustering.
Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator
TL;DR: A new algorithm called Mersenne Twister (MT) is proposed for generating uniform pseudorandom numbers, which provides a super astronomical period of 2^19937 − 1 and 623-dimensional equidistribution up to 32-bit accuracy, while using a working area of only 624 words.