TL;DR: Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams, covers the fundamentals needed to understand data streams, and describes important applications such as TCP/IP traffic, GPS data, sensor networks, and customer click streams.
Abstract: Since the beginning of the Internet age and the increased use of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving and time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams. The book covers the fundamentals that are imperative to understanding data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.
TL;DR: A method for developing algorithms that can adaptively learn from data streams that drift over time, based on using change detectors and estimator modules at the right places and choosing implementations with theoretical guarantees in order to extend such guarantees to the resulting adaptive learning algorithm.
Abstract: We propose and illustrate a method for developing algorithms that can adaptively learn from data streams that drift over time. As an example, we take Hoeffding Tree, an incremental decision tree inducer for data streams, and use it as a basis to build two new methods that can deal with distribution and concept drift: a sliding window-based algorithm, Hoeffding Window Tree, and an adaptive method, Hoeffding Adaptive Tree. Our methods are based on using change detectors and estimator modules at the right places; we choose implementations with theoretical guarantees in order to extend such guarantees to the resulting adaptive learning algorithm. A main advantage of our methods is that they require no guess about how fast or how often the stream will drift; other methods typically have several user-defined parameters to this effect.
In our experiments, the new methods never do worse, and in some cases do much better, than CVFDT, a well-known method for tree induction on data streams with drift.
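The abstract does not spell out the split test used by Hoeffding Tree. As background, here is a minimal sketch of the standard Hoeffding bound that such trees rely on to decide when enough examples have been seen. The function names and the example numbers are illustrative assumptions, not taken from the paper.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon for n observations of a variable with the given range.

    With probability at least 1 - delta, the true mean is no smaller than
    the observed mean minus epsilon.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon.

    This is the usual Hoeffding Tree rule; tie-breaking and grace periods
    used by real implementations are deliberately left out of this sketch.
    """
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Hypothetical numbers: gains measured on 1000 examples at one leaf.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.10, 1.0, 1e-7, 1000))
```

Because epsilon shrinks as n grows, a leaf that cannot yet separate two close attributes simply waits for more examples before splitting.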
TL;DR: This paper takes up the challenge to design a special data structure for feature quality evaluation, and to employ an information-theoretic feature ranking mechanism to efficiently handle feature interaction in subset selection.
Abstract: The evolving and adapting capabilities of robust intelligence are best manifested in its ability to learn. Machine learning enables computer systems to learn, and improve performance. Feature selection facilitates machine learning (e.g., classification) by aiming to remove irrelevant features. Feature (attribute) interaction presents a challenge to feature subset selection for classification. This is because a feature by itself might have little correlation with the target concept, but when it is combined with some other features, they can be strongly correlated with the target concept. Thus, the unintentional removal of these features may result in poor classification performance. It is computationally intractable to handle feature interactions in general. However, the presence of feature interaction in a wide range of real-world applications demands practical solutions that can reduce high-dimensional data while preserving feature interactions. In this paper, we take up the challenge to design a special data structure for feature quality evaluation, and to employ an information-theoretic feature ranking mechanism to efficiently handle feature interaction in subset selection. We conduct experiments to evaluate our approach by comparing with some representative methods, perform a lesion study to examine the critical components of the proposed algorithm to gain insights, and investigate related issues such as data structure, ranking, time complexity, and scalability in search of interacting features.
TL;DR: Intelligent Data Analysis invites submission of research and application articles that comply with the Aims and Scope of the journal; articles that discuss the development of new AI architectures, methodologies, and techniques and their applications to data analysis are preferred.
Abstract: Aims and Scope: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas. In particular, papers are preferred that discuss development of new AI related data analysis architectures, methodologies, and techniques and their applications to various domains. Papers published in this journal are geared heavily towards applications, with an anticipated split of 70% of the papers published being applications-oriented research and the remaining 30% containing more theoretical research. Submission of Papers: Authors are requested to submit their paper electronically as an email attachment to the Editor-in-Chief: submissions@ida-ij.com. Intelligent Data Analysis invites submission of research and application articles that comply with the Aims and Scope of the journal. In particular, articles that discuss development of new AI architectures, methodologies, and techniques and their applications to the field of data analysis are preferred. Manuscripts are received with the understanding that their content is unpublished material and is not being submitted for publication elsewhere. Further, it is understood that each co-author has made substantial contributions to the work described and that each accepts joint responsibility for publication. Subscription Information: Intelligent Data Analysis (ISSN 1088-467x) will be published in 1 volume of 6 issues in 2014 (Volume 18). Institutional subscription (online only): €1050 / US$1415. Institutional subscription (print only): €1110 / US$1499 (including postage and handling). Institutional subscription (print & online): €1320 / US$1782 (including postage and handling). Individual subscription (online only): €126 / US$170.
TL;DR: This paper formally defines gradual association rules, proposes an original lattice-based approach, and introduces the GRITE algorithm for efficiently extracting gradual itemsets from huge volumes of complex numerical data.
Abstract: Mining gradual rules plays a crucial role in many real world applications where huge volumes of complex numerical data must be handled, e.g., biological databases, survey databases, data streams or sensor readings. Gradual rules highlight complex order correlations of the form "The more/less X, then the more/less Y". Such rules have been studied since the early 1970s, mostly in the fuzzy logic domain, where the main efforts have been focused on how to model and use such rules. However, mining gradual rules remains challenging because of the exponential combination space to explore. In this paper, we tackle the particular problem of handling huge volumes by proposing scalable methods. First, we formally define gradual association rules and we propose an original lattice-based approach. The GRITE algorithm is proposed for extracting gradual itemsets in an efficient manner. An experimental study on large-scale synthetic and real datasets is performed, showing the efficiency and interest of our approach.
TL;DR: The key intuition of this work is that the distance between two values of a categorical attribute A_i can be determined by the way in which the values of the other attributes A_j are distributed in the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of A_i, a low distance is obtained.
Abstract: Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute A_i can be determined by the way in which the values of the other attributes A_j are distributed in the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of A_i, a low distance is obtained. We also propose a solution to the critical point of choosing the attributes A_j. We validate our approach on various real world and synthetic datasets, by embedding our distance learning method in both a partitional and a hierarchical clustering algorithm. Experimental results show that our method is competitive w.r.t. categorical data clustering approaches in the state of the art.
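The intuition in this abstract can be sketched in a few lines. The code below compares two values of a target attribute by the conditional distributions they induce on a set of context attributes. The Euclidean combination, the function names, and the toy data are assumptions made for illustration; the abstract gives neither the paper's exact formula nor its procedure for choosing the context attributes.

```python
from collections import Counter
from math import sqrt


def conditional_distribution(rows, target, context, value):
    """Distribution of the context attribute over rows where target == value."""
    counts = Counter(r[context] for r in rows if r[target] == value)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}


def context_distance(rows, target, context_attrs, a, b):
    """Distance between values a and b of the target attribute.

    Assumed form: Euclidean distance between the conditional distributions
    of every context attribute, accumulated over all context attributes.
    """
    squared = 0.0
    for ctx in context_attrs:
        pa = conditional_distribution(rows, target, ctx, a)
        pb = conditional_distribution(rows, target, ctx, b)
        for v in set(pa) | set(pb):
            squared += (pa.get(v, 0.0) - pb.get(v, 0.0)) ** 2
    return sqrt(squared)


if __name__ == "__main__":
    # Hypothetical toy dataset: "red" and "crimson" co-occur with the same sizes.
    rows = [
        {"colour": "red", "size": "S"}, {"colour": "red", "size": "L"},
        {"colour": "crimson", "size": "S"}, {"colour": "crimson", "size": "L"},
        {"colour": "blue", "size": "L"}, {"colour": "blue", "size": "L"},
    ]
    print(context_distance(rows, "colour", ["size"], "red", "crimson"))
    print(context_distance(rows, "colour", ["size"], "red", "blue"))
```

Values that appear in similar contexts get a distance near zero, even though the values themselves have no natural ordering.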
TL;DR: The main part of the paper presents two classes of reliability estimation approaches and summarizes the relevant terminology, which is often used in this and related research fields.
Abstract: In Machine Learning, estimation of the predictive accuracy for a given model is most commonly approached by analyzing the average accuracy of the model. In general, the predictive models do not provide accuracy estimates for their individual predictions. The reliability estimates of individual predictions require the analysis of various model and instance properties. In the paper we make an overview of the approaches for estimation of individual prediction reliability. We start by summarizing three research fields that provided ideas and motivation for our work: (a) approaches to perturbing learning data, (b) the usage of unlabeled data in supervised learning, and (c) the sensitivity analysis. The main part of the paper presents two classes of reliability estimation approaches and summarizes the relevant terminology, which is often used in this and related research fields.
TL;DR: Adaptive Learning from Evolving Data Streams and an Application of Intelligent Data Analysis Techniques to a Large Software Engineering Dataset are considered.
Abstract: This edited volume contains the 35 contributions to the 8th Symposium on Intelligent Data Analysis, organized in Lyon at the end of August and beginning of September.
TL;DR: Results show that the proposed approach to novelty detection is capable of identifying novel concepts that are pure and correspond to real classes, disregarding unrepresentative clusters and outliers.
Abstract: This paper presents and evaluates an approach to novelty detection that addresses it as the problem of identifying novel concepts in a continuous learning scenario, as an extension to a single-class classification problem. OLINDDA, an OnLIne Novelty and Drift Detection Algorithm that implements this approach, uses efficient standard clustering algorithms to continuously generate candidate clusters among examples that were not explained by the current known concepts. Clusters complying with a validation criterion that takes cohesiveness and representativeness into account are initially identified as concepts. By merging similar concepts, OLINDDA may enhance the representation of some concepts as it advances toward its final goal of describing novel emerging concepts in an unsupervised way. The proposed approach is experimentally evaluated by the use of several measures taken throughout the learning process. Results show that it is capable of identifying novel concepts that are pure and correspond to real classes, disregarding unrepresentative clusters and outliers.
TL;DR: The proposed algorithm, Touch, handles FCI mining and FG mining separately, is highly efficient, and outperforms its levelwise competitors.
Abstract: The effective construction of many association rule bases requires the computation of both frequent closed and frequent generator itemsets (FCIs/FGs). However, only a few miners address both concerns, typically by applying levelwise breadth-first traversal. As depth-first traversal is known to be superior, we examine here the depth-first FCI/FG-mining. The proposed algorithm, Touch, deals with both tasks separately, i.e., uses a well-known vertical method, Charm, to extract FCIs and a novel one, Talky-G, to extract FGs. The respective outputs are matched in a post-processing step. Experimental results indicate that Touch is highly efficient and outperforms its levelwise competitors.
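To make the two target sets concrete, the following brute-force sketch enumerates frequent closed itemsets and frequent generators from their definitions. It is not Touch, Charm, or Talky-G: it is exponential in the number of items and only illustrates what those algorithms compute. The function names and the convention of counting the empty itemset as a candidate are assumptions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def closed_and_generators(transactions, min_support):
    """Return (closed itemsets, generators) among the frequent itemsets.

    Closed: no superset with one extra item has the same support.
    Generator: no subset with one item removed has the same support.
    """
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                frequent[candidate] = count
    closed = [s for s, c in frequent.items()
              if all(support(s | {i}, transactions) < c
                     for i in items if i not in s)]
    generators = [s for s, c in frequent.items()
                  if all(support(s - {i}, transactions) > c for i in s)]
    return closed, generators


if __name__ == "__main__":
    # Hypothetical toy database of three transactions.
    db = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
    closed, generators = closed_and_generators(db, min_support=2)
    print(sorted(map(sorted, closed)))
    print(sorted(map(sorted, generators)))
```

Checking only one-item extensions and one-item removals is sufficient because support never increases when an item is added to an itemset.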
TL;DR: This paper introduces a goal-driven procedure for automatically composing algorithms, based on the exploitation of KDDONTO, an ontology formalizing the domain of KDD algorithms, which makes it possible to generate valid and non-trivial processes.
Abstract: One of the most interesting challenges in the Knowledge Discovery in Databases (KDD) field is giving support to users in the composition of tools for forming a valid and useful KDD process. Such an activity implies that users have both to choose tools suitable to their knowledge discovery problem, and to compose them for designing the KDD process. To this end, they need expertise and knowledge about functionalities and properties of all KDD algorithms implemented in available tools. In order to support users in this heavy activity, in this paper we introduce a goal-driven procedure for automatically composing algorithms. The proposed procedure is based on the exploitation of KDDONTO, an ontology formalizing the domain of KDD algorithms, allowing us to generate valid and non-trivial processes.
TL;DR: Spatio-Temporal Sensor Graphs (STSG) are proposed to model sensor data at the conceptual, logical, and physical levels; the model allows the properties of edges and nodes to be represented as a time series of measurement data.
Abstract: Developing a model that facilitates the representation and knowledge discovery on sensor data presents many challenges. With sensors reporting data at a very high frequency, resulting in large volumes of data, there is a need for a model that is memory efficient. Since sensor data is spatio-temporal in nature, the model must also support the time dependence of the data. Balancing the conflicting requirements of simplicity, expressiveness and storage efficiency is challenging. The model should also provide adequate support for the formulation of efficient algorithms for knowledge discovery. Though spatio-temporal data can be modeled using time expanded graphs, this model replicates the entire graph across time instants, resulting in high storage overhead and computationally expensive algorithms. In this paper, we propose Spatio-Temporal Sensor Graphs (STSG) to model sensor data at the conceptual, logical, and physical levels. This model allows the properties of edges and nodes to be modeled as a time series of measurement data. Data at each instant would consist of the measured value and the expected error. Also, we evaluate the model using methods to find interesting patterns such as growing hotspots in sensor data and present analytical comparison of the algorithms with methods based on existing models.
TL;DR: This paper introduces a novel informative generic basis of association rules, conveying two types of knowledge: "factual" and "implicative" and presents a valid and complete axiomatic system allowing one to infer the set of all association rules.
Abstract: The extremely large number of association rules that can be drawn from even reasonably sized datasets has spurred the development of sharper techniques and methods to reduce the size of the reported rule sets. In this context, the battery of results provided by Formal Concept Analysis (FCA) made it possible to define "irreducible" nuclei of association rule subsets, better known as generic bases. From such a condensed, reduced-size set of association rules, it is possible to infer all association rules, commonly via an adequate axiomatic system. In this paper, we introduce a novel informative generic basis of association rules, conveying two types of knowledge: "factual" and "implicative". We also present a valid and complete axiomatic system allowing one to infer the set of all association rules. Results of the experiments carried out on real-life datasets have shown important gains in terms of compactness of the introduced generic basis.
TL;DR: A method for ranking alternatives or objects from incomplete pairwise comparisons using random set theory is proposed, together with its extensions, including an extension to the case of independent groups of experts.
Abstract: A method for ranking alternatives or objects from incomplete pairwise comparisons using random set theory, together with its extensions, is proposed in the paper. The main feature of the method is that it allows us to deal with comparisons of arbitrary groups of alternatives. The method is extended to the case of independent groups of experts. The imprecise Dirichlet model is also used to make cautious decisions in several cases. Various numerical examples illustrate the proposed method and its extensions.
TL;DR: A new algorithm for learning isotonic classification trees that relabels non-monotone leaf nodes by performing the isotonic regression on the collection of leaf nodes is proposed.
Abstract: We propose a new algorithm for learning isotonic classification trees. It relabels non-monotone leaf nodes by performing the isotonic regression on the collection of leaf nodes. In case two leaf nodes with a common parent have the same class after relabeling, the tree is pruned in the parent node. Since we consider problems with ordered class labels, all results are evaluated on the basis of L1 prediction error. We experimentally compare the performance of the new algorithm with standard classification trees.
TL;DR: This paper presents a general approach for context-aware adaptive mining of data streams that aims to dynamically and autonomously adjust data stream mining parameters according to changes in context and situations.
Abstract: In resource-constrained devices, adaptation of data stream processing to variations of data rates and availability of resources is crucial for consistency and continuity of running applications. However, to enhance and maximize the benefits of adaptation, there is a need to go beyond mere computational and device capabilities to encompass the full spectrum of context-awareness. This paper presents a general approach for context-aware adaptive mining of data streams that aims to dynamically and autonomously adjust data stream mining parameters according to changes in context and situations. We perform intelligent, real-time analysis of sensor-generated data streams, underpinned by context-aware adaptation. A prototype of the proposed architecture is implemented and evaluated in the paper through a real-world scenario in the area of healthcare monitoring.
TL;DR: It is proved that this set of constraints, called primitive-based constraints, is a superclass not only of both kinds of monotone constraints and their boolean combinations but also of other classes such as convertible and succinct constraints.
Abstract: Constraint-based mining is an active field of research which is a necessary step to achieve interactive and successful KDD processes. The limitations of the task lie in the restricted languages available to describe the mined patterns and in the limited ability to express varied constraints. In practice, current approaches focus on a single language, and the most generic frameworks mine, individually or simultaneously, a monotone and an anti-monotone constraint. In this paper, we propose a generic framework dealing with any partially ordered language and a large set of constraints. We prove that this set of constraints, called primitive-based constraints, is a superclass not only of both kinds of monotone constraints and their boolean combinations but also of other classes such as convertible and succinct constraints. We show that the primitive-based constraints can be efficiently mined thanks to a relaxation method based on virtual patterns which summarize the specificities of the search space. Indeed, this approach automatically deduces pruning conditions having suitable monotone properties and thus these conditions can be pushed into usual constraint mining algorithms. We study the optimal relaxations. Finally, we illustrate the efficiency of our proposal with experiments on several contexts.
TL;DR: This paper proposes two efficient algorithms for mining weighted frequent itemsets in which the main approaches are to push weight constraints into the Apriori algorithm and the pattern growth algorithm respectively, and shows how to maintain the downward closure property in mining weighted frequent itemsets.
Abstract: There have been many studies on mining frequent itemsets (or patterns) in the data mining field because of its broad applications in mining association rules, correlations, graph patterns, constraint based frequent patterns, sequential patterns, and many other data mining tasks. One of the major challenges in frequent pattern mining is the huge number of result patterns. As the minimum threshold becomes lower, an exponentially large number of itemsets are generated. Therefore, pruning unimportant patterns effectively in the mining process is one of the main topics in frequent pattern mining. In weighted frequent pattern mining, not only support but also weight is used, so that important patterns can be detected. In this paper, we propose two efficient algorithms for mining weighted frequent itemsets in which the main approaches are to push weight constraints into the Apriori algorithm and the pattern growth algorithm respectively. Additionally, we show how to maintain the downward closure property in mining weighted frequent itemsets. In our approach, the normalized weights within the weight range are used according to the importance of items. A weight range is used to restrict weights of items and a minimum weight is utilized to balance between weight and support of items for pruning the search space. Our approach generates fewer but important weighted frequent itemsets in large databases, particularly dense databases with low minimum supports. An extensive performance study shows that our algorithm outperforms previous mining algorithms. In addition, it is efficient and scalable.
TL;DR: This work investigates the problem of extracting features from lightweight wireless acceleration sensors by selecting random sets of features and estimating probabilistic classifiers for damage detection purposes, and assesses the relevance of the features in a large population of classifiers.
Abstract: Structural Health Monitoring (SHM) aims at monitoring buildings or other structures and assessing their condition, alerting about new defects in the structure when necessary. For instance, vibration measurements can be used for monitoring the condition of a bridge. We investigate the problem of extracting features from lightweight wireless acceleration sensors. On-line algorithms for frequency domain monitoring are considered, and the resulting features are combined to form a large bank of candidate features. We explore the feature space by selecting random sets of features and estimating probabilistic classifiers for damage detection purposes. We assess the relevance of the features in a large population of classifiers. The methods are assessed with real-life data from a wooden bridge model, where structural problems are simulated with small added weights.
TL;DR: This paper presents a novel approach based on ART (Adaptive Resonance Theory) neural networks that improves the multi-label classification performance of Fuzzy ARTMAP and ARAM algorithms and shows the effectiveness of the proposed approach.
Abstract: Multi-label classification is an active and rapidly developing research area of data analysis. It becomes increasingly important in such fields as gene function prediction, text classification or web mining. This task corresponds to classification of instances labeled by multiple classes rather than just one. Traditionally, it was solved by learning independent binary classifiers for each class and combining their outputs to obtain multi-label predictions. Alternatively, a classifier can be directly trained to predict a label set of an unknown size for each unseen instance. Recently, several direct multi-label machine learning algorithms have been proposed. This paper presents a novel approach based on ART (Adaptive Resonance Theory) neural networks. The Fuzzy ARTMAP and ARAM algorithms were modified in order to improve their multi-label classification performance and were evaluated on benchmark datasets. Comparison of experimental results with the results of other multi-label classifiers shows the effectiveness of the proposed approach.
TL;DR: This paper defines an exact condensed representation according to the frequency-based measures and shows how to infer the best patterns according to these measures, i.e., the patterns which maximize them.
Abstract: Condensed representations of patterns are at the core of many data mining works and there are a lot of contributions handling data described by items. In this paper, we tackle sequential data and we define an exact condensed representation for sequential patterns according to the frequency-based measures. These measures are often used, typically in order to evaluate classification rules. Furthermore, we show how to infer the best patterns according to these measures, i.e., the patterns which maximize them. These patterns are immediately obtained from the condensed representation so that this approach is easily usable in practice. Experiments conducted on various datasets demonstrate the feasibility and the interest of our approach.
TL;DR: From pure location data, network analysis leads to a community structure that closely follows the commercial classification of the US Department of Labor, which makes it possible to build a 'quality' index of optimal location niches for stores, which has been empirically tested.
Abstract: Measuring the spatial distribution of locations of many entities (trees, atoms, economic activities, ...), and, more precisely, the deviations from purely random configurations, is a powerful method to unravel their underlying interactions. I study here the spatial organization of retail commercial activities. From pure location data, network analysis leads to a community structure that closely follows the commercial classification of the US Department of Labor. The interaction network makes it possible to build a 'quality' index of optimal location niches for stores, which has been empirically tested.
TL;DR: This work proposes a probabilistic framework for formalising and combining qualitative evidence based on explicitly defined term characteristics to produce a new termhood measure that demonstrates consistently better precision, recall and accuracy compared to three other existing ad-hoc measures.
Abstract: Term recognition identifies domain-relevant terms which are essential for discovering domain concepts and for the construction of terminologies required by a wide range of natural language applications. Many techniques have been developed in an attempt to numerically determine or quantify termhood based on term characteristics. Some of the apparent shortcomings of existing techniques are the ad-hoc combination of termhood evidence, mathematically-unfounded derivation of scores and implicit assumptions concerning term characteristics. We propose a probabilistic framework for formalising and combining qualitative evidence based on explicitly defined term characteristics to produce a new termhood measure. Our qualitative and quantitative evaluations demonstrate consistently better precision, recall and accuracy compared to three other existing ad-hoc measures.
TL;DR: In experiments on breast cancer diagnosis data, the proposed approach to test selection, based on the discovery of subgroups of patients sharing the same optimal test, was shown to be faster than the baseline algorithm APRIORI-SD while preserving its accuracy.
Abstract: We propose a new approach to test selection based on the discovery of subgroups of patients sharing the same optimal test, and present its application to breast cancer diagnosis. Subgroups are defined in terms of background information about the patient. We automatically determine the best t subgroups a patient belongs to, and decide for the test proposed by their majority. We introduce the concept of prediction quality to measure how accurate the test outcome is regarding the disease status. The quality of a subgroup is then the best mean prediction quality of its members (choosing the same test for all). Incorporating the quality computation in the search heuristic enables a significant reduction of the search space. In experiments on breast cancer diagnosis data we showed that it is faster than the baseline algorithm APRIORI-SD while preserving its accuracy.
TL;DR: The proposed metric learning method can improve the performance of semi-supervised clustering algorithms, and experimental results on real-world data sets show its effectiveness.
Abstract: Metric learning is a powerful approach for semi-supervised clustering. In this paper, a metric learning method considering both pairwise constraints and the geometrical structure of data is introduced for semi-supervised clustering. At first, a smooth metric is found (based on an optimization problem) using positive constraints as supervisory information. Then, an extension of this method employing both positive and negative constraints is introduced. As opposed to the existing methods, the extended method has the capability of considering both positive and negative constraints while considering the topological structure of data. The proposed metric learning method can improve performance of semi-supervised clustering algorithms. Experimental results on real-world data sets show the effectiveness of this method.
TL;DR: Experimental results show that the proposed techniques are indeed effective and efficient for mining periodic spatio-temporal patterns at different time granularities.
Abstract: With the advancement of technology, it is now easy to collect the location information of mobile users over time. Spatio-temporal data mining techniques were proposed in the literature for the extraction of patterns from spatio-temporal data. However, current techniques can only extract patterns of the finest time granularity, and therefore overlook potential patterns available at coarser time granularities. In this work, we propose two techniques to allow mining at different time granularities. Experimental results show that the proposed techniques are indeed effective and efficient for mining periodic spatio-temporal patterns at different time granularities.
TL;DR: This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation, and presents some commonly used indices, as well as indices recently proposed in the literature.
Abstract: Clustering quality or validation indices allow the evaluation of the quality of clustering in order to support the selection of a specific partition or clustering structure in its natural unsupervised environment, where the real solution is unknown or not available. In this paper, we investigate the use of quality indices, mostly based on the concepts of cluster compactness and separation, for the evaluation of clustering results (partitions in particular). This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation. After presenting some commonly used indices, as well as indices recently proposed in the literature, key issues regarding the practical use of quality indices are addressed. A general methodological approach is presented which considers the identification of appropriate index thresholds. This general approach is compared with the simple use of quality indices for evaluating a clustering solution.
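As an illustration of an index built on compactness and separation, here is a minimal sketch of the Dunn index: the smallest distance between points of different clusters divided by the largest within-cluster diameter. The abstract does not say which indices the paper covers, so the choice of the Dunn index, the function name, and the toy data are assumptions.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Dunn index of a partition given as a list of clusters of points.

    Separation: minimum distance between two points in different clusters.
    Compactness: maximum distance between two points in the same cluster.
    Higher values indicate compact, well-separated clusters.
    """
    separation = min(dist(p, q)
                     for a, b in combinations(clusters, 2)
                     for p in a for q in b)
    diameter = max((dist(p, q)
                    for c in clusters
                    for p, q in combinations(c, 2)), default=0.0)
    if diameter == 0.0:
        return float("inf")  # every cluster is a single point or duplicates
    return separation / diameter


if __name__ == "__main__":
    # Hypothetical partitions: the same two clusters, far apart and close together.
    far = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
    near = [[(0, 0), (0, 1)], [(2, 0), (2, 1)]]
    print(dunn_index(far), dunn_index(near))
```

Comparing the index across candidate partitions, rather than reading a single value in isolation, is the kind of practical usage question the abstract raises.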
TL;DR: This paper used self-organizing maps (SOM) to search for meaningful associations between speculative attacks' real effects and 28 variables that characterize the economic, financial, legal, and socio-political structure of the country at the onset of the attack.
Abstract: In some cases, currency crises are followed by strong recessions (e.g., recent Asian and Argentinean crises), but in other cases they are not. This paper uses Self-Organizing Maps (SOM) to search for meaningful associations between speculative attacks' real effects and 28 variables that characterize the economic, financial, legal, and socio-political structure of the country at the onset of the attack. SOM is a neural network-based generalization of Principal Component Analysis (PCA) that provides an efficient non-linear projection of the multidimensional data space on a curved surface. This paper finds a strong association of speculative attacks' real effects with fundamentals and the banking sector structure.
TL;DR: The clustering model estimates support accurately given a sufficiently large number of clusters, and it is more accurate than correlation except for sets of two items; itemset support can be bounded and approximated from both models.
Abstract: Association rules require models to understand their relationship to statistical properties of the data set. In this work, we study mathematical relationships between association rules and two fundamental techniques: clustering and correlation. Each cluster represents an important itemset. We show the sufficient statistics for clustering and correlation on binary data sets are the linear sum of points and the quadratic sum of points, respectively. We prove itemset support can be bounded and approximated from both models. Support bounds and support estimation obey the set downward closure property for fast bottom-up search for frequent itemsets. Both models can be efficiently computed with sparse matrix computations. Experiments with real and synthetic data sets evaluate model accuracy and speed. The clustering model estimates support accurately, given a sufficiently large number of clusters, and it is more accurate than correlation, except for sets of two items. Accuracy increases as the number of clusters grows, but decreases as the minimum support threshold decreases. Once built, the clustering model represents a faster alternative to the traditional A-priori algorithm and the correlation model for mining associations. The correlation model is faster to compute than clustering, but it is less accurate. Time complexity to compute both models is linear in data set size, whereas dimensionality marginally impacts time when analyzing large transaction data sets.
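Illustration: the abstract gives no formulas, so the following is only a sketch of how support could be bounded and estimated from per-cluster linear sums (item counts per cluster). The upper bound follows from elementary counting: within a cluster, no more transactions can contain every item of an itemset than contain its rarest item. The estimate additionally assumes items are independent within a cluster, which is an assumption made here and not a claim about the paper's method. All function names and the toy data are hypothetical.

```python
from math import prod


def cluster_stats(transactions, labels, n_items):
    """Per-cluster size and linear sum (item counts) for binary transactions."""
    sizes, sums = {}, {}
    for items, label in zip(transactions, labels):
        sizes[label] = sizes.get(label, 0) + 1
        row = sums.setdefault(label, [0] * n_items)
        for i in items:
            row[i] += 1
    return sizes, sums


def support_upper_bound(itemset, sizes, sums):
    """Guaranteed upper bound: per cluster, the count of the rarest item."""
    n = sum(sizes.values())
    return sum(min(sums[c][i] for i in itemset) for c in sizes) / n


def support_estimate(itemset, sizes, sums):
    """Estimate assuming item independence inside each cluster (assumption)."""
    n = sum(sizes.values())
    return sum(
        sizes[c] * prod(sums[c][i] / sizes[c] for i in itemset) for c in sizes
    ) / n


def true_support(itemset, transactions):
    """Exact support by scanning every transaction."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)


# Toy data: six transactions over items 0..3, already assigned to two clusters.
transactions = [{0, 1}, {0, 1}, {0, 1, 2}, {2}, {2, 3}, {3}]
labels = [0, 0, 0, 1, 1, 1]
sizes, sums = cluster_stats(transactions, labels, n_items=4)
```

Because adding an item to an itemset can only lower each per-cluster minimum, this bound never increases as the itemset grows. That monotonicity is the downward closure property that allows a bottom-up search to prune any itemset whose bound falls below the minimum support.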
TL;DR: An approximate clustering method in which a new Approximate MST is repeatedly built in at most (d+1) iterations from two sources: a new Hilbert curve created from carefully shifted N data points, and a previous AMST which holds cumulative vicinity information derived from earlier iterations.
Abstract: Minimum spanning tree (MST) clustering sequentially inserts the nearest points in the $\mathbb{R}^{d}$ space into a list which is then divided into clusters by using desired criteria. This insertion order, however, can be relaxed provided approximately nearby points in a condensed area are adjacently inserted into a list before distant points in other areas. Based on this observation, we propose an approximate clustering method in which a new Approximate MST (AMST) is repeatedly built in at most (d+1) iterations from two sources: a new Hilbert curve created from carefully shifted N data points, and a previous AMST which holds cumulative vicinity information derived from earlier iterations. Although the final AMST may not completely match a true MST built from an $O(N^{2})$ algorithm, most mismatches occur locally within individual data groups and are unimportant for clustering. Our experiments on synthetic datasets and animal motion vectors extracted from surveillance videos show that high-quality clusters can be efficiently obtained from this approximation method.
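Illustration: the sketch below shows the exact $O(N^{2})$ baseline that the abstract starts from, not the proposed Hilbert-curve AMST. Prim's algorithm inserts points into a list in nearest-first order, and the list is then divided wherever the attaching edge is longer than a threshold. The threshold criterion, function names, and sample points are assumptions chosen for illustration; the abstract only says "desired criteria".

```python
from math import dist, inf


def prim_order(points):
    """Return [(point_index, attach_weight)] in Prim insertion order.

    attach_weight is the length of the MST edge that connected the point
    to the tree built so far (0.0 for the starting point).
    """
    n = len(points)
    if n == 0:
        return []
    best = [inf] * n        # cheapest known edge from the tree to each point
    in_tree = [False] * n
    best[0] = 0.0           # start from point 0
    order = []
    for _ in range(n):
        # Pick the point outside the tree that is nearest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        order.append((u, best[u]))
        # Update the remaining points with their distance to the new point.
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(points[u], points[v]))
    return order


def split_clusters(order, threshold):
    """Start a new cluster whenever a point was attached by a long edge."""
    clusters = []
    for index, weight in order:
        if not clusters or weight > threshold:
            clusters.append([])
        clusters[-1].append(index)
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
order = prim_order(points)
clusters = split_clusters(order, threshold=2.0)
```

The nested loops make the quadratic cost explicit: every insertion scans all remaining points. An approximation such as the one the abstract proposes only needs nearby points to end up adjacent in the list, so an ordering that occasionally differs inside a dense group would yield the same split.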