A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets

doi:10.1093/BIOINFORMATICS/BTP123

Open Access•Journal Article•10.1093/BIOINFORMATICS/BTP123

Ashok Sharma, +3 more

- 01 May 2009

- Bioinformatics

- Vol. 25, Iss: 9, pp 1152-1157

19

TL;DR: A new two-stage algorithm partitions the high-dimensional space associated with microarray data using hyperplanes; its data reduction lowers memory requirements, allowing 44 460 genes to be clustered without failure, and significantly decreases completion time compared with popular k-means programs.

Abstract: Motivation: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. Algorithms are needed that are fast and can cluster extremely large datasets without degrading cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and the large distance matrices calculated, most agglomerative clustering algorithms fail on large datasets (30 000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies algorithm, with the second stage being a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results of the two-stage hyperplane algorithm with those of the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44 460 genes without failure, and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1).

Availability: The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx.

Contact: rmcindoe@mail.mcg.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
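The two-stage idea in the abstract (a single-scan data reduction followed by conventional k-means) can be illustrated with a minimal Python sketch. This is not the HPCluster implementation, which the abstract says was written in C#. The function names, the way hyperplanes are chosen, and the flat dictionary of cells are all assumptions made for illustration; BIRCH itself maintains a tree of clustering features rather than a flat table. Stage 1 keeps only a count and a linear sum per hyperplane cell, and stage 2 runs a weighted k-means on the resulting cell centroids.

```python
import random


def hyperplane_code(point, normals, offsets):
    """Tuple of booleans: which side of each hyperplane the point lies on."""
    return tuple(
        sum(n * x for n, x in zip(normal, point)) >= offset
        for normal, offset in zip(normals, offsets)
    )


def reduce_single_scan(points, normals, offsets):
    """Stage 1 (assumed form): one pass, storing (count, linear sum) per cell."""
    cells = {}
    for p in points:
        key = hyperplane_code(p, normals, offsets)
        if key not in cells:
            cells[key] = [0, [0.0] * len(p)]
        cell = cells[key]
        cell[0] += 1
        for j, x in enumerate(p):
            cell[1][j] += x
    # Each summary is (weight, centroid); the raw points are no longer needed.
    return [(n, [s / n for s in total]) for n, total in cells.values()]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(summaries, k, iterations=20):
    """Stage 2: Lloyd's k-means on cell centroids, weighted by cell counts."""
    dim = len(summaries[0][1])
    # Deterministic farthest-point initialisation (an illustrative choice).
    centers = [list(max(summaries, key=lambda s: s[0])[1])]
    while len(centers) < k:
        far = max(summaries, key=lambda s: min(sq_dist(s[1], c) for c in centers))
        centers.append(list(far[1]))
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        weights = [0] * k
        for w, c in summaries:
            i = min(range(k), key=lambda idx: sq_dist(c, centers[idx]))
            weights[i] += w
            for j, x in enumerate(c):
                sums[i][j] += w * x
        for i in range(k):
            if weights[i]:  # leave a center in place if no cell was assigned
                centers[i] = [s / weights[i] for s in sums[i]]
    return centers


# Demo on synthetic data: two Gaussian blobs in 5 dimensions.
rng = random.Random(0)
points = [[rng.gauss(m, 1.0) for _ in range(5)]
          for m in (0.0, 8.0) for _ in range(500)]
# Random hyperplanes, each passing through a randomly chosen data point.
normals = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(6)]
offsets = [sum(n * x for n, x in zip(normal, rng.choice(points)))
           for normal in normals]
summaries = reduce_single_scan(points, normals, offsets)
print(len(points), "points reduced to", len(summaries), "summaries")
print(weighted_kmeans(summaries, k=2))
```

Because each summary carries its point count, the weighted mean of the cell centroids assigned to a cluster equals the mean of the underlying points, so the second stage only ever touches the reduced table. With h hyperplanes that table holds at most 2^h entries regardless of how many genes are scanned.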

Citations

•Journal Article•10.1016/J.COR.2012.03.008

Clustering of high throughput gene expression data

Harun Pirim, +3 more

- 01 Dec 2012

- Computers & Operations Research

TL;DR: This paper presents a review of the current clustering algorithms designed especially for analyzing gene expression data and is intended to introduce one of the main problems in bioinformatics, clustering gene expression data, to the operations research community.

130

•Journal Article•10.3390/EN10071003

Prediction in Photovoltaic Power by Neural Networks

Antonello Rosato, +3 more

- 15 Jul 2017

- Energies

TL;DR: This paper presents three techniques based on neural and fuzzy neural networks, namely the radial basis function, the adaptive neuro-fuzzy inference system and the higher-order neuro-fuzzy inference system, which are well suited to predict data sequences stemming from real-world applications.

57

•Journal Article•10.1186/AR2676

Differential gene expression in the salivary gland during development and onset of xerostomia in Sjögren's syndrome-like disease of the C57BL/6.NOD- Aec1Aec2 mouse

Cuong Q. Nguyen, +5 more

- 20 Apr 2009

- Arthritis Research & Therapy

TL;DR: Taking advantage of known functions of these genes, investigators can construct interactive gene pathways, leading to modeling of possible underlying events inducing salivary gland dysfunction, and identify multiple sets of genes of interest whose expressions and expression profiles may correlate with molecular mechanisms, signaling pathways, and/or immunological processes involved in the development and onset of SjS.

51

•Journal Article•10.1109/TGRS.2020.3032427

Sparsity-Based Clustering for Large Hyperspectral Remote Sensing Images

Han Zhai, +3 more

- 01 Dec 2021

- IEEE Transactions on Geoscience and Remo...

TL;DR: Two novel sparsity-based clustering algorithms are proposed for large HSIs, named sparse coding-based clustering (SCC) and joint SCC (JSCC), which are the first to use the sparse representation recovery residual to cluster HSIs and introduce the super-pixel neighborhood.

35

•Journal Article•10.1155/2012/491237

A hierarchical procedure for the synthesis of ANFIS networks

Massimo Panella

- 01 Jan 2012

- Advances in Fuzzy Systems

TL;DR: A computationally efficient optimization of ANFIS networks is proposed, based on a hierarchical constructive procedure, by which the number of rules is progressively increased and the optimal one is automatically determined on the basis of learning theory in order to maximize the generalization capability of the resulting ANFIS network.

33

References

•Journal Article•10.1073/PNAS.95.25.14863

Cluster analysis and display of genome-wide expression patterns

Michael B. Eisen, +3 more

- 08 Dec 1998

- Proceedings of the National Academy of S...

TL;DR: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression, finding in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function.

17.5K

•Journal Article•10.1080/01621459.1971.10482356

Objective Criteria for the Evaluation of Clustering Methods

William M. Rand

- 01 Dec 1971

- Journal of the American Statistical Asso...

TL;DR: This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data.

6.7K

•Proceedings Article•10.1145/233269.233324

BIRCH: an efficient data clustering method for very large databases

Tian Zhang, +2 more

- 01 Jun 1996

TL;DR: Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a data clustering method that is especially suitable for very large databases.

4.8K

•Journal Article•10.1093/BIOINFORMATICS/BTI517

Computational cluster validation in post-genomic data analysis

Julia Handl, +2 more

- 01 Aug 2005

- Bioinformatics

TL;DR: This article reviews cluster validation techniques, with a particular focus on their application to post-genomic data analysis.

993

•Journal Article•10.1093/BIOINFORMATICS/BTL406

Evaluation and comparison of gene clustering methods in microarray analysis

Anbupalam Thalamuthu, +3 more

- 15 Sep 2006

- Bioinformatics

TL;DR: The results show that tight clustering and model-based clustering consistently outperform other clustering methods in both simulated and real data, while hierarchical clustering and SOM perform among the worst.

327

Related Papers (5)

High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs

Competitive K-Means, a New Accurate and Distributed K-Means Algorithm for Large Datasets

Clustering in very large databases based on distance and density

Clustering Large Databases in Distributed Environment

A scalable parallel subspace clustering algorithm for massive data sets