Open Access
Clustering algorithms for categorical data
William Andreopoulos
- 01 Jan 2006
TL;DR: This thesis presents MULIC, a faster simplification of HIERDENC for hierarchical density-based clustering of categorical data, and MULICsoft, an extension of MULIC that incorporates information on a software system's runtime execution into the clustering process.
Abstract: Categorical datasets in many domains, such as biology or software analysis, have a rich underlying cluster structure. Categorical clustering methods designed to uncover interesting local cluster structure can produce high-quality clusterings and potentially help analysts study the hidden roles of objects in a dataset.
This thesis presents several clustering algorithms for categorical data. First, we introduce the HIERDENC algorithm for hierarchical density-based clustering of categorical data. Then, we present the MULIC algorithm, which is a faster simplification of HIERDENC. MULIC is designed for categorical datasets with a multi-layered structure, such as protein interaction data. Our experimental evaluation of MULIC on such datasets shows that it can uncover their underlying structure better than other algorithms and has comparable runtimes.
Next, we present the MULICsoft algorithm for clustering large software systems. MULICsoft is an extension of MULIC that incorporates information on a software system's runtime execution into the clustering process. We evaluate MULICsoft on a large open-source system, where it produces decompositions that are close to those created by system experts.
We continue with the BILCOM algorithm, which is an extension of MULIC for clustering mixed categorical and numerical biomedical data. We apply BILCOM to datasets of mixed types, such as hepatitis, thyroid disease and yeast gene expression data with Gene Ontology annotations. The results show that BILCOM can partition these datasets significantly better than clustering on only the categorical or only the numerical attributes.
Finally, we present the M-BILCOM algorithm, which is an extension of BILCOM for clustering mixed numerical and low-quality categorical data. M-BILCOM incorporates into the clustering process the confidence in the correctness of the categorical values. We apply M-BILCOM to yeast gene expression data with Gene Ontology annotations and GO Evidence codes, which represent evidence for the annotations' correctness.
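The abstract does not say how HIERDENC or MULIC compare categorical objects, so the following is only a generic illustration of the kind of primitive that density-based clustering of categorical data can build on, not the thesis's method. It is a minimal sketch assuming objects are fixed-length tuples of category labels compared by simple matching (Hamming) dissimilarity. The function names `hamming_dissimilarity` and `neighbors_within` and the toy data are hypothetical.

```python
from typing import List, Sequence


def hamming_dissimilarity(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the attributes on which two categorical objects differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for x, y in zip(a, b) if x != y)


def neighbors_within(
    objects: Sequence[Sequence[str]], index: int, radius: int
) -> List[int]:
    """Return indices of the other objects within `radius` mismatches
    of objects[index]. The size of this list is a simple local density
    estimate (an assumption of this sketch, not taken from the thesis)."""
    center = objects[index]
    return [
        i
        for i, obj in enumerate(objects)
        if i != index and hamming_dissimilarity(center, obj) <= radius
    ]


# Hypothetical toy dataset: three objects with three categorical attributes.
objects = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "square"),
]
```

Under these assumptions, `neighbors_within(objects, 0, 1)` lists the objects that differ from the first one in at most one attribute. A density-based method could then grow clusters outward from objects that have many such neighbors.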
Citations
•Book
Computational Intelligence
Andries P. Engelbrecht
- 01 Jan 2002
TL;DR: A novel approach that each random combination of the optimized parameters is coded into a Real Coded string and treated as a chromosome in genetic algorithms is given to enhance the performance of fuzzy logic controllers.
1.3K
•Journal Article
Bipartite graph partitioning and data clustering
TL;DR: In this article, a bipartite graph based data clustering method is proposed, where terms and documents are simultaneously grouped into semantically meaningful co-categories and subject descriptors.
295
Squeezer: An Efficient Algorithm for Clustering Categorical Data
TL;DR: The proposed Squeezer algorithm is extremely suitable for clustering data streams, where given a sequence of points, the objective is to maintain consistently good clustering of the sequence so far, using a small amount of memory and time.
162