An expectation-maximization algorithm working on data summary

Open Access

- 01 Jan 2002

- pp 221-226

1

TL;DR: The proposed EMACF (Expectation-Maximization Algorithm on Clustering Features) algorithm explicitly uses the weight, mean, and variance of each data summary, and it is proved that EMACF converges to a local maximum likelihood value.

Abstract: Scalable cluster analysis addresses the problem of processing large data sets with limited resources, e.g., memory and computation time. A data summarization or sampling procedure is an essential step of most scalable algorithms: it forms a compact representation of the data, on which traditional clustering algorithms can then process large data sets efficiently. However, there is little work on how to perform cluster analysis effectively on data summaries. Starting from the principle of the general expectation-maximization algorithm, we propose a model-based clustering algorithm that makes better use of these data summaries. The proposed EMACF (Expectation-Maximization Algorithm on Clustering Features) algorithm explicitly employs data summary features including weight, mean, and variance. We prove that EMACF converges to a local maximum likelihood value. The computation time of EMACF is linear in the number of data summaries rather than the number of data items, so EMACF can be integrated with any efficient data summarization procedure to construct a scalable clustering algorithm.
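To make the idea in the abstract concrete, here is a minimal sketch of EM run on data summaries rather than on individual data items. It is an illustration under stated assumptions, not the paper's exact update equations: it assumes a Gaussian mixture with diagonal covariances, assumes each summary is a hypothetical `(weight, mean, variance)` triple, and scores a summary by the average per-item log density of its members. The function name `em_on_summaries` and its parameters are invented for this example.

```python
import math


def em_on_summaries(summaries, init_means, n_iter=50, var_floor=1e-6):
    """Fit a diagonal Gaussian mixture to (weight, mean, variance) summaries.

    summaries  : list of (weight, mean, var); mean and var are per-dimension lists
    init_means : initial component means, one list per component
    Returns (mixing proportions, means, variances).
    """
    k, dim = len(init_means), len(init_means[0])
    total = sum(w for w, _, _ in summaries)
    mix = [1.0 / k] * k
    means = [list(m) for m in init_means]
    variances = [[1.0] * dim for _ in range(k)]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each summary.
        # The summary's own variance (gamma) enters the quadratic term,
        # so spread inside a subcluster is not ignored.
        resp = []
        for _, nu, gamma in summaries:
            logs = []
            for j in range(k):
                lp = math.log(mix[j])
                for d in range(dim):
                    s2 = variances[j][d]
                    quad = (gamma[d] + (nu[d] - means[j][d]) ** 2) / s2
                    lp -= 0.5 * (math.log(2.0 * math.pi * s2) + quad)
                logs.append(lp)
            top = max(logs)  # subtract the max for numerical stability
            exps = [math.exp(v - top) for v in logs]
            z = sum(exps)
            resp.append([e / z for e in exps])

        # M-step: every sum runs over summaries, weighted by their item counts.
        for j in range(k):
            nj = sum(w * r[j] for (w, _, _), r in zip(summaries, resp))
            if nj <= 0.0:
                continue  # leave an empty component unchanged
            mix[j] = max(nj / total, 1e-12)  # floor keeps log(mix) finite
            for d in range(dim):
                mu = sum(w * r[j] * nu[d]
                         for (w, nu, _), r in zip(summaries, resp)) / nj
                var = sum(w * r[j] * (g[d] + (nu[d] - mu) ** 2)
                          for (w, nu, g), r in zip(summaries, resp)) / nj
                means[j][d] = mu
                variances[j][d] = max(var, var_floor)
    return mix, means, variances


# Four one-dimensional summaries standing in for 300 data items.
summaries = [
    (100, [0.0], [1.0]),
    (50, [0.5], [0.8]),
    (120, [10.0], [1.2]),
    (30, [9.5], [0.9]),
]
mix, means, variances = em_on_summaries(summaries, [[1.0], [8.0]])
```

Each iteration loops over summaries and components only, which is the sense in which the cost is linear in the number of summaries rather than the number of data items.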

Citations

Book Chapter•10.1007/3-540-45675-9_86

Scaling-Up Model-Based Clustering Algorithm by Working on Clustering Features

Huidong Jin, +2 more

- 12 Aug 2002

TL;DR: The experimental results show that gEMACF can generate more accurate results than other scalable clustering algorithms and can run two orders of magnitude faster than the traditional expectation-maximization algorithm with little loss of accuracy.

4

References

Journal Article•10.1111/J.2517-6161.1977.TB01600.X

Maximum likelihood from incomplete data via the EM algorithm

Arthur P. Dempster, +2 more

- 01 Sep 1977

- Journal of the royal statistical society...

55.2K

Book

The EM algorithm and extensions

Geoffrey J. McLachlan, +1 more

- 15 Nov 1996

TL;DR: The EM Algorithm and Extensions describes the formulation of the EM algorithm, details its methodology, discusses its implementation, and illustrates applications in many statistical contexts, opening the door to the tremendous potential of this remarkably versatile statistical tool.

6.7K

Proceedings Article•10.1145/233269.233324

BIRCH: an efficient data clustering method for very large databases

Tian Zhang, +2 more

- 01 Jun 1996

TL;DR: Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a data clustering method that is especially suitable for very large databases.

4.8K

Journal Article•10.1023/A:1017986506241

Accelerating EM for Large Databases

Bo Thiesson, +2 more

- 04 Dec 2001

- Machine Learning

TL;DR: Two approaches are presented that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality.

129

Proceedings Article•10.1145/347090.347151

Visualization of navigation patterns on a Web site using model-based clustering

Igor V. Cadez, +4 more

- 01 Aug 2000

TL;DR: A new methodology for visualizing navigation patterns on a Web site that clusters users according to the order in which they request Web pages, using a mixture of first-order Markov models trained with the Expectation-Maximization algorithm.

Related Papers (5)

Statistical considerations on the k-means algorithm

Random Projection Clustering on Streaming Data

Autonomous data-driven clustering for live data stream

Learning to Link

Dealing with overlapping clustering: a constraint-based approach to algorithm selection