TL;DR: The survey work and case studies will be useful for all those involved in developing software for data analysis using Ward’s hierarchical clustering method.
Abstract: The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. Two algorithms are found in the literature and software, both announcing that they implement the Ward clustering method. When applied to the same distance matrix, they produce different results. One algorithm preserves Ward's criterion, the other does not. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method.
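The abstract does not say where the two algorithms diverge. The sketch below assumes one commonly described cause: whether the input dissimilarities are squared before the Lance-Williams update for Ward's method is applied. The function name `ward_merge_heights` and the four 1-D points are hypothetical and chosen for illustration only. With squared Euclidean input, the merge heights sum to twice the total sum of squares, so the error sum of squares criterion is preserved. With unsquared input, the same update yields heights that do not match that criterion.

```python
def ward_merge_heights(points, square_input):
    """Agglomerate 1-D points with the Lance-Williams Ward update.

    square_input=True feeds squared Euclidean distances to the update;
    square_input=False feeds the raw distances (the assumed variant that
    does not preserve Ward's criterion). Returns the merge heights.
    """
    n = len(points)
    size = {i: 1 for i in range(n)}          # cluster id -> cluster size
    d = {}                                   # (id_a, id_b) with id_a < id_b
    for i in range(n):
        for j in range(i + 1, n):
            dist = abs(points[i] - points[j])
            d[(i, j)] = dist ** 2 if square_input else dist

    heights = []
    next_id = n
    while len(size) > 1:
        # Pick the closest pair of clusters under the current dissimilarities.
        (a, b), h = min(d.items(), key=lambda kv: kv[1])
        heights.append(h)
        updated = {}
        for k in size:
            if k in (a, b):
                continue
            dka = d[tuple(sorted((k, a)))]
            dkb = d[tuple(sorted((k, b)))]
            total = size[a] + size[b] + size[k]
            # Lance-Williams coefficients for Ward's method.
            updated[(k, next_id)] = (
                (size[a] + size[k]) * dka
                + (size[b] + size[k]) * dkb
                - size[k] * h
            ) / total
        # Drop every entry that involves the two merged clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(updated)
        size[next_id] = size.pop(a) + size.pop(b)
        next_id += 1
    return heights


points = [0, 1, 3, 7]                         # hypothetical toy data
mean = sum(points) / len(points)
total_ss = sum((x - mean) ** 2 for x in points)

heights_squared = ward_merge_heights(points, square_input=True)
heights_raw = ward_merge_heights(points, square_input=False)
```

On this toy data the two variants happen to merge the same clusters in the same order, but the heights differ. On other data the merge order may differ as well.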
TL;DR: Compares the K-Means, bisecting K-Means and Ward clustering algorithms on homogeneous and heterogeneous chemical data sets, with drug discovery applications including compound selection, virtual library generation, High-Throughput Screening (HTS), Quantitative Structure-Activity Relationship (QSAR) analysis and Absorption, Distribution, Metabolism, Elimination and Toxicity (ADMET) prediction.
Abstract: Clustering algorithms play an important role in chemoinformatics and the drug discovery process. Many clustering algorithms are available for analyzing large chemical data sets of medium and high dimensionality. The quality of these algorithms depends on the nature of the data sets and the accuracy needed by the application. The applications of clustering algorithms in the drug discovery process are compound selection, virtual library generation, High-Throughput Screening (HTS), Quantitative Structure-Activity Relationship (QSAR) analysis and Absorption, Distribution, Metabolism, Elimination and Toxicity (ADMET) prediction. According to the Structure-Activity Relationship (SAR) model, compounds with similar structures have similar biological activities, so clustering algorithms must group similar compounds in the same cluster. K-Means, bisecting K-Means and Ward are the most popular clustering algorithms, with a wide range of applications in chemoinformatics. In this paper, a comparative study of these algorithms is presented. The algorithms are applied to homogeneous and heterogeneous chemical data sets. The results are compared to determine which algorithms are more suitable depending on the nature of the data sets, computation time and accuracy of the produced clusters. Accuracy is evaluated using a standard deviation metric. Experimental results show that the K-Means algorithm is preferable for a small number of clusters, for both homogeneous and heterogeneous data sets, in terms of time and standard deviation. The bisecting K-Means and Ward algorithms are preferable for a large number of clusters, for both homogeneous and heterogeneous data sets, in terms of standard deviation, but the bisecting K-Means algorithm is preferable in terms of time.
TL;DR: This paper proposes a two-stage clustering method in which the first stage uses one-pass k-median++ and the second stage uses agglomerative hierarchical clustering, and compares it with a k-means++-based two-stage method to examine the effectiveness of the L1 distance.
Abstract: The aim of this paper is to propose a two-stage clustering method in which the first stage uses one-pass k-median++ and the second stage uses agglomerative hierarchical clustering. To handle medians in the second stage, we propose two calculation methods. One uses the L1 distance as the similarity measure. The other uses the error of the L1 distance, as in the Ward method. We compare the proposed method with a two-stage method from our study that uses k-means++ in the first stage, to examine the effectiveness of the L1 distance in two-stage methods. Numerical experiments were carried out using two criteria: objective function values and the Rand index.
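The Rand index used as an evaluation criterion can be illustrated with a minimal sketch. It is the fraction of point pairs on which two partitions agree, meaning the pair is either together in both or separated in both. The function name `rand_index` and the label lists are hypothetical, and this is the plain (unadjusted) Rand index, which may not be the exact variant the authors computed.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs treated the same way by both partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair agrees when it is together in both or apart in both.
        if same_a == same_b:
            agree += 1
        pairs += 1
    return agree / pairs


# Relabeling clusters does not change the partition, so the index is 1.0.
identical = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Two partitions that cut the same four points in different ways.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```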
TL;DR: This paper presents a new methodology for clustering assets in portfolio theory, based on DCA (Difference of Convex functions), an innovative approach in the nonconvex optimization framework that has been successfully used on various complex industrial systems.
Abstract: This paper presents a new methodology for clustering assets in portfolio theory. The new methodology is compared with the classical Ward clustering in SAS software. The method is based on DCA (Difference of Convex functions), an innovative approach in the nonconvex optimization framework that has been successfully used on various complex industrial systems. The clustering can be used in an empirical example in the context of multi-manager portfolio management, to identify the cluster that seems to best fit the portfolio management objectives of a fund of funds or funds. The clustering is useful for reducing the choice of asset classes and for facilitating the optimization of the Markowitz frontier.
TL;DR: This study acquired a 718-attribute dataset from Statistics Korea and selected the variables that best differentiate Gangnam from other districts using the genetic algorithm and Dunn’s index, validating the selection with Ward’s minimum variance and the K-means algorithm.
Abstract: The Korean government proposed a new initiative, ‘Government 3.0’, under which the administration opens its datasets to the public before they are requested. The City of Seoul is the front runner in disclosing government data. If we know which attributes are the governing factors for a given segmentation, the outcomes can be applied to real-world problems in marketing, business strategy, and administrative decision making. However, for the City of Seoul, selecting the optimal variables from an open dataset with up to several thousand attributes would require an enormous amount of computation time, because it may require combinatorial optimization while maximizing dissimilarity measures between clusters. In this study, we acquired a 718-attribute dataset from Statistics Korea and conducted an analysis to select the most suitable variables, which differentiate Gangnam from other districts, using the genetic algorithm and Dunn’s index. We also used the Microsoft Azure cloud computing system to reduce processing time. As a result, an optimal set of 28 variables was selected, and the validation result showed that those 28 variables effectively separate Gangnam from other districts using Ward’s minimum variance and the K-means algorithm.
Keywords: Clustering, Dunn’s Index, Ward’s Minimum Variance, K-means Algorithm, Genetic Algorithm
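As a rough illustration of the cluster-separation score mentioned above, the sketch below computes one common form of Dunn's index: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. Several variants of the index exist and the abstract does not say which one was used, so this form is an assumption. The function name `dunn_index` and the toy clusters are hypothetical.

```python
from itertools import combinations


def dunn_index(clusters, dist):
    """Dunn's index for a list of clusters (each a list of points).

    Assumes at least two clusters and at least one cluster with two or
    more distinct points, so that both extrema below are defined and the
    denominator is nonzero.
    """
    # Smallest distance between any two points in different clusters.
    min_between = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between any two points in the same cluster (diameter).
    max_within = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_between / max_within


def abs_diff(p, q):
    return abs(p - q)


well_separated = dunn_index([[0, 1], [5, 6]], abs_diff)
poorly_separated = dunn_index([[0, 1], [2, 3]], abs_diff)
```

A larger value indicates compact clusters that lie far apart, which is why a variable-selection search would aim to maximize it.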
TL;DR: This work was motivated by clustering software, such as the R function hclust, which accepts a distance matrix as input and applies Ward’s definition of inter-cluster distance to produce a clustering.
Abstract: In this paper, we consider several generalizations of the popular Ward's method for agglomerative hierarchical clustering. Our work was motivated by clustering software, such as the R function hclust, which accepts a distance matrix as input and applies Ward's definition of inter-cluster distance to produce a clustering. The standard version of Ward's method uses squared Euclidean distance to form the distance matrix. We explore the effect on the clustering of using other definitions of distance, such as the Minkowski distance.
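To make the choice of distance concrete, here is a minimal sketch of building a distance matrix with the Minkowski distance of order p, where p = 2 gives the Euclidean distance and p = 1 the Manhattan distance. The helper names `minkowski` and `distance_matrix` are hypothetical and are not part of hclust or any other library.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def distance_matrix(points, p):
    """Full symmetric matrix of pairwise Minkowski distances."""
    n = len(points)
    return [
        [minkowski(points[i], points[j], p) for j in range(n)]
        for i in range(n)
    ]


points = [(0, 0), (3, 4), (6, 8)]            # hypothetical toy data
manhattan = distance_matrix(points, p=1)
euclidean = distance_matrix(points, p=2)
```

A matrix like this could then be passed to a clustering routine that accepts precomputed dissimilarities. The standard version of Ward's method described in the abstract corresponds to squaring the p = 2 entries.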
TL;DR: The experimental results show that the use of the automatically labeled i-vectors to train supervised methods such as LDA, PLDA or linear logistic regression-based fusion, decreases the minimum decision cost function by up to 22%.
Abstract: The process of manually labeling data is very expensive and sometimes infeasible due to privacy and security issues. This paper investigates the use of two algorithms for clustering unlabeled training i-vectors. This aims at improving speaker recognition performance by using state-of-the-art supervised techniques in the context of the NIST i-vector Machine Learning Challenge 2014. The first algorithm is the well-known Ward clustering that aims at optimizing an objective function across all clusters. The second one is a cascade clustering, which benefits from the latest advances in speaker modeling and session compensation techniques, and relies on both the cosine similarity and probabilistic linear discriminant analysis (PLDA). Furthermore, this paper investigates the multi-clustering fusion that opens the door for further improvements. The experimental results show that the use of the automatically labeled i-vectors to train supervised methods such as LDA, PLDA or linear logistic regression-based fusion, decreases the minimum decision cost function by up to 22%.