Clustering algorithms for categorical data

Open Access

- 01 Jan 2006

17

TL;DR: This thesis presents MULIC, a faster simplification of HIERDENC for hierarchical density-based clustering of categorical data, and MULICsoft, an extension of MULIC that incorporates information on a software system's runtime execution into the clustering process.

Abstract: Categorical datasets in many domains, such as biology or software analysis, have a rich underlying cluster structure. Categorical clustering methods that are motivated by uncovering interesting local cluster structure could produce high clustering quality, and potentially help analysts to study hidden roles of objects in a dataset. This thesis presents several clustering algorithms for categorical data. First, we introduce the HIERDENC algorithm for hierarchical density-based clustering of categorical data. Then, we present the MULIC algorithm, which is a faster simplification of HIERDENC. MULIC is designed for categorical datasets with a multi-layered structure, such as protein interaction data. Our experimental evaluation of MULIC on such datasets shows that it can uncover their underlying structure better than other algorithms and has comparable runtimes. Next, we present the MULICsoft algorithm for clustering large software systems. MULICsoft is an extension of MULIC that incorporates information on a software system's runtime execution into the clustering process. We evaluate MULICsoft on a large open-source system, where it produces decompositions that are close to those created by system experts. We continue with the BILCOM algorithm, which is an extension of MULIC for clustering mixed categorical and numerical biomedical data. We apply BILCOM to datasets of mixed types, such as hepatitis, thyroid disease and yeast gene expression data with Gene Ontology annotations. The results show that BILCOM can partition these datasets significantly better than clustering on the categorical or the numerical attributes alone. Finally, we present the M-BILCOM algorithm, which is an extension of BILCOM for clustering mixed numerical and low-quality categorical data. M-BILCOM incorporates confidence in the correctness of the categorical values into the clustering process. We apply M-BILCOM to yeast gene expression data with Gene Ontology annotations and GO evidence codes, which represent evidence for the annotations' correctness.
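As a rough illustration of what "density" can mean for categorical objects, the sketch below uses the simple-matching (Hamming) dissimilarity and counts how many other objects lie within a given number of mismatches. This is a generic example, not the HIERDENC or MULIC procedure from the thesis; the function names, the toy data and the `radius` parameter are all assumptions made for illustration.

```python
from typing import Sequence


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of attributes on which two categorical objects differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def neighborhood_density(data: Sequence[Sequence[str]], index: int, radius: int) -> int:
    """Count objects, other than data[index], within `radius` mismatches of it."""
    center = data[index]
    return sum(
        1
        for j, other in enumerate(data)
        if j != index and hamming(center, other) <= radius
    )


# Hypothetical toy dataset: each object has three categorical attributes.
toy = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "large", "round"),
    ("blue", "large", "flat"),
]

# Density of every object for a radius of one mismatch.
densities = [neighborhood_density(toy, i, radius=1) for i in range(len(toy))]
print(densities)
```

Objects with a high count sit in dense regions of the categorical space and would be natural seeds for clusters; objects with a count of zero look like outliers at that radius.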

Citations

•Book

Computational Intelligence

Andries P. Engelbrecht

- 01 Jan 2002

TL;DR: A novel approach is given to enhance the performance of fuzzy logic controllers, in which each random combination of the optimized parameters is coded into a real-coded string and treated as a chromosome in a genetic algorithm.

1.3K

Computational Analysis of Microarray Data

Partha S. Vasisht

- 01 Jan 2003

954

•Journal Article

Bipartite graph partitioning and data clustering

Hongyuan Zha, +4 more

- 07 May 2001

- Lawrence Berkeley National Laboratory

TL;DR: In this article, a bipartite graph based data clustering method is proposed, where terms and documents are simultaneously grouped into semantically meaningful co-categories and subject descriptors.

295

Squeezer: An Efficient Algorithm for Clustering Categorical Data

Zengyou He (何增有), +2 more

- 01 Jan 2002

TL;DR: The proposed Squeezer algorithm is extremely suitable for clustering data streams, where given a sequence of points, the objective is to maintain consistently good clustering of the sequence so far, using a small amount of memory and time.

162

•10.14849/PSJPROC.2007.0_144_2

Large-scale temporal gene expression mapping of central nervous system development

Pascal Nsoh, +9 more

- 01 Jan 2007

102

References

•Journal Article•10.1109/TAC.1974.1100705

A new look at the statistical model identification

Hirotugu Akaike

- 01 Dec 1974

- IEEE Transactions on Automatic Control

TL;DR: In this article, a new estimate, the minimum information theoretical criterion estimate (MAICE), is introduced for the purpose of statistical identification; it is free from the ambiguities inherent in the application of conventional hypothesis testing procedures.

53.1K

•Book

The Nature of Statistical Learning Theory

Vladimir Vapnik

- 01 Jan 1995

TL;DR: Covers the setting of the learning problem, consistency of learning processes, bounds on the rate of convergence of learning processes, controlling the generalization ability of learning processes, constructing learning algorithms, and what is important in learning theory.

46K

•Book

Data Mining: Concepts and Techniques

Jiawei Han, +2 more

- 08 Sep 2000

TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

29.9K

Some methods for classification and analysis of multivariate observations

James B. MacQueen

- 01 Jan 1967

TL;DR: The k-means algorithm partitions an N-dimensional population into k sets on the basis of a sample. The k-means concept generalizes the ordinary sample mean, and the procedure is shown to give partitions that are reasonably efficient in the sense of within-class variance.

28.1K

•Journal Article•10.1103/REVMODPHYS.74.47

Statistical mechanics of complex networks

Réka Albert, +1 more

- 01 Jan 2001

- Reviews of Modern Physics

TL;DR: In this paper, a simple model based on the power-law degree distribution of real networks was proposed, which was able to reproduce the power law degree distribution in real networks and to capture the evolution of networks, not just their static topology.

21.7K

Related Papers (5)

Clustering Categorical Data

Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach

A Link Clustering Based Approach for Clustering Categorical Data

A density-based algorithm for discovering clusters in large spatial Databases with Noise

Space Structure and Clustering of Categorical Data