TL;DR: In this study, a comprehensive analysis is carried out on hyper-heuristics and the best method is tested against genetic and memetic algorithms on fourteen benchmark functions.
Abstract: Meta-heuristics such as simulated annealing, genetic algorithms and tabu search have been successfully applied to many difficult optimization problems for which no satisfactory problem specific solution exists. However, expertise is required to adopt a meta-heuristic for solving a problem in a certain domain. Hyper-heuristics introduce a novel approach for search and optimization. A hyper-heuristic method operates on top of a set of heuristics. The most appropriate heuristic is determined and applied automatically by the technique at each step to solve a given problem. Hyper-heuristics are therefore assumed to be problem independent and can be easily utilized by non-experts as well. In this study, a comprehensive analysis is carried out on hyper-heuristics. The best method is tested against genetic and memetic algorithms on fourteen benchmark functions. Additionally, new hyper-heuristic frameworks are evaluated for questioning the notion of problem independence.
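The idea of a method that "operates on top of a set of heuristics" and picks one at each step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the sphere objective, the three low-level heuristics, the score-based selection rule and the improving-only acceptance rule are hypothetical and are not the hyper-heuristics analysed in the study.

```python
import random


def sphere(x):
    """Illustrative objective (assumed): sum of squares, minimum at the origin."""
    return sum(v * v for v in x)


# Hypothetical low-level heuristics; each returns a modified copy of x.
def small_step(x, rng):
    y = list(x)
    i = rng.randrange(len(y))
    y[i] += rng.uniform(-0.1, 0.1)
    return y


def large_step(x, rng):
    y = list(x)
    i = rng.randrange(len(y))
    y[i] += rng.uniform(-1.0, 1.0)
    return y


def shrink(x, rng):
    return [0.9 * v for v in x]


def hyper_heuristic(objective, x0, heuristics, steps=500, seed=0):
    """Pick a heuristic at each step in proportion to its score (assumed rule)."""
    rng = random.Random(seed)
    scores = [1.0] * len(heuristics)
    best, best_val = list(x0), objective(x0)
    for _ in range(steps):
        k = rng.choices(range(len(heuristics)), weights=scores)[0]
        candidate = heuristics[k](best, rng)
        value = objective(candidate)
        if value < best_val:
            # Accept only improving moves and reward the heuristic used.
            best, best_val = candidate, value
            scores[k] += 1.0
        else:
            # Penalise the heuristic, but keep it selectable.
            scores[k] = max(0.1, scores[k] * 0.95)
    return best, best_val, scores


if __name__ == "__main__":
    best, best_val, scores = hyper_heuristic(
        sphere, [3.0, -2.0, 1.5], [small_step, large_step, shrink]
    )
    print(best_val, scores)
```

The selection layer only sees heuristic indices and objective values, which is the sense in which such a layer can be reused across problems by swapping the objective and the heuristic set.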
TL;DR: String kernels are applied in a novel way to the problem of recognising famous pianists from their style of playing, and it is shown that, when using the string kernel on this data, both kernel partial least squares and Support Vector Machines outperform the current best results.
Abstract: In this paper we show a novel application of string kernels: that is to the problem of recognising famous pianists from their style of playing. The characteristics of performers playing the same piece are obtained from changes in beat-level tempo and beat-level loudness, which over the time of the piece form a performance worm. From such worms, general performance alphabets can be derived, and pianists' performances can then be represented as strings. We show that when using the string kernel on this data, both kernel partial least squares and Support Vector Machines outperform the current best results. Furthermore we suggest a new method of obtaining feature directions from the Kernel Partial Least Squares algorithm and show that this can deliver better performance than methods previously used in the literature when used in conjunction with a Support Vector Machine.
TL;DR: The results show that the fuzzy rule-base index (FRI) correlates more with the clinically accepted DOA index, CSI™ (CSM, Danmeter, Denmark), and the approach is intended to enhance both the interpretability of the results and the performance of the system.
Abstract: Estimating the depth of anesthesia (DOA) is still a challenging area in anesthesia research. The objective of this study was to design a fuzzy rule based system which integrates electroencephalogram (EEG) features to quantitatively estimate the DOA.
The proposed method is based on the analysis of single-channel EEG using frequency and time domain methods. A clinical study was conducted on 22 patients to construct subsets of reference data corresponding to four well-defined anesthetic states: awake, moderate anesthesia, surgical anesthesia and isoelectric.
Statistical analysis of features was used to design input membership functions (MFs). The input space was partitioned with respect to the derived MFs, and the training data were used to label the partitions and extract efficient fuzzy if-then rules. The fuzzy rule-base index (FRI) is then derived on a scale from 0 (isoelectric) to 100 (fully awake) using a fuzzy inference engine and the designed output MFs.
We also applied the same features to an adaptive network-based fuzzy inference system (ANFIS) derived without any prior knowledge. The results show that FRI correlates more with the clinically accepted DOA index, CSI™ (CSM, Danmeter, Denmark). Beyond this result, the main idea behind this study is to simplify the mutual knowledge exchange between the human expert and the machine, thereby enhancing both the interpretability of the results and the performance of the system.
TL;DR: The key design considerations that were addressed during the implementation of a hybrid data mining assistant, based on the case-based reasoning (CBR) paradigm and the use of a formal OWL-DL ontology are presented.
Abstract: Nowadays, decision makers invariably need to use decision support technology (DS) such as data mining (DM) methodologies and tools in order to tackle complex decision making problems. However, the successful application of DM technology requires that one possess specific DM decision-making skills. For instance, the effective application of a data mining process is littered with many difficult and technical decisions (e.g. data cleansing, feature transformations, algorithms, parameters, evaluation). In essence, this contentious problem and burden for decision makers clearly stems from poor DM-DS integration. As a result, we have strived to improve on this problem by proposing an intelligent DM assistant that can potentially empower decision makers to better leverage DM technology and achieve their intended business objectives. Nonetheless, as this paper will strive to demonstrate, the realization of an intelligent data mining assistant for the decision maker or non-specialist data miner is a challenging and complex endeavour. Hence, in what follows we present the key design considerations (e.g. knowledge representation and reasoning, knowledge elicitation and reuse efforts) that were addressed during the implementation of a hybrid data mining assistant, based on the case-based reasoning (CBR) paradigm and the use of a formal OWL-DL ontology.
TL;DR: Development of linked data and model ontologies, together with a DM-epistemology, and associated with full exploitation of search and sampling could lead to improved cohesion and efficacy of the DM discipline.
Abstract: Current Business Intelligence (BI) initiatives customize DM-KDD techniques into business analytics, which cannot be used in applications other than business. A review of current methodology at the strategic level of the KM/KDD domains indicates that there exists no general formal framework which can be adopted in new applications, or new application areas. There are no established procedures for the domain expert to express their prior knowledge, understanding and aims in a way which can be linked to KDD/DMM processes and subsequent deployment of discovered knowledge. It is suggested that the sequential life-cycle project-management approach of CRISP-DM needs to be complemented by a dynamic interactive view of a conceptual data/information/knowledge hierarchy in the KM context. It is also suggested that a graphical/visual knowledge representation framework needs to be developed as the basis of a knowledge discovery and communication framework (KDCF).
A review of the limitations in DM methodology at the technical/technological level leads to the conclusion that there is no coherent DM methodology to guide the choice of models and their evaluation, that the DM discipline is fractionated, and that the fundamental search and sampling paradigms have been insufficiently utilized in DM development. It is proposed that development of linked data and model ontologies, together with a DM-epistemology, and associated with full exploitation of search and sampling could lead to improved cohesion and efficacy of the DM discipline.
TL;DR: In this article, the problem of finding frequent items in a continuous stream of itemsets is studied, and a new frequency measure based on a flexible window length is introduced: for a given item, its current frequency is defined as the maximal frequency over all windows from any point in the past until the current state.
Abstract: We study the problem of finding frequent items in a continuous stream of itemsets. A new frequency measure is introduced, based on a flexible window length. For a given item, its current frequency in the stream is defined as the maximal frequency over all windows from any point in the past until the current state. We study the properties of the new measure, and propose an incremental algorithm that can produce the current frequency of an item immediately at any time. It is shown experimentally that the memory requirements of the algorithm are extremely small for many different realistic data distributions.
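The frequency measure described above can be made concrete with a naive sketch. This is not the incremental algorithm the abstract proposes: it simply rescans the stored stream and takes the maximum over every window that ends at the current itemset. The function name `max_frequency` and the `min_len` parameter (a minimum window length) are assumptions introduced for illustration and are not stated in the abstract.

```python
def max_frequency(stream, item, min_len=1):
    """Naive maximal frequency of `item` over all windows ending 'now'.

    stream  : list of itemsets (sets), oldest first
    min_len : hypothetical minimum window length; windows shorter than
              this are ignored (assumption, not from the abstract)
    """
    best = 0.0
    hits = 0
    # Grow the window backwards from the most recent itemset.
    for length, itemset in enumerate(reversed(stream), start=1):
        if item in itemset:
            hits += 1
        if length >= min_len:
            best = max(best, hits / length)
    return best


if __name__ == "__main__":
    stream = [{"a"}, {"b"}, {"a", "b"}, {"a"}]
    print(max_frequency(stream, "a"))
    print(max_frequency(stream, "b"))
```

For item "b" in this example, the windows of length 1 to 4 give frequencies 0, 1/2, 2/3 and 2/4, so the measure is 2/3. The sketch keeps the whole stream in memory, which is exactly the cost an incremental algorithm would aim to avoid.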
TL;DR: This paper proposes a combination of statistical and rough set methods to reduce important attributes in a simpler way while maintaining a lesser degree of information loss from the raw data and shows that the fitness-rough method (FsR) has performed comparatively well, with higher reduction strength and smaller rule sets than the benchmarking methods.
Abstract: Attribute reduction has become an important pre-processing task to reduce the complexity of the data mining task. Rough reducts, statistical methods and correlation-based methods have gradually contributed towards improving attribute reduction techniques to a certain extent. Statistical methods are generally lower in computational complexity compared to the rough reducts and the correlation-based methods, but many have proven that the rough reducts method is significant in reducing important attributes without causing too much information loss. Correlation-based methods, on the other hand, evaluate features as a subset instead of as individual attributes. In this paper, we propose a combination of statistical and rough set methods to reduce important attributes in a simpler way while maintaining a lesser degree of information loss from the raw data. The fitness-rough method (FsR) indicates important attributes from raw data and it is further simplified to a more compact information table. Besides that, we have also looked into the problem of information loss in this method. Ten UCI machine learning datasets were used as testing sets on the proposed method as compared to the classical rough reducts (RR) method, the statistical entropy (ENT) method and the correlation-based feature selection (CFS) method. Experimental results show that our method has performed comparatively well, with higher reduction strength and smaller rule sets than the benchmarking methods, especially on medium-size datasets. However, the FsR method is basically less efficient when used on mix-mode and nominal datasets as the non-quantitative attributes involved in these datasets are normally pre-categorised.
TL;DR: A new model based on cultural algorithm and fuzzy clustering procedure is proposed that promotes the formation and maintenance of subpopulations and implements the concept of cultural exchanges among them.
Abstract: Interest in multimodal function optimization is expanding rapidly since real-world optimization problems often require the location of multiple optima in the search space. In this context, a new model based on cultural algorithm and fuzzy clustering procedure is proposed. This model promotes the formation and maintenance of subpopulations and implements the concept of cultural exchanges among them. The validity of this model is confirmed using some well-known test functions. Moreover, an electromagnetic benchmark is solved to show the usefulness of the proposed approach in real-world optimization.
TL;DR: A modular IE system based on the DIAL language is presented, together with a detailed implementation of a system for extracting relations in the intelligence news domain and an evaluation of the system.
Abstract: In today's information age, the amount of text documents available electronically (on the Web, on corporate intranets, on news wires and elsewhere) is overwhelming. Search engines and information retrieval, while useful to find documents that satisfy a certain query, offer little help with analyzing the unstructured documents themselves. Text Mining is the automated process of analyzing unstructured, natural language text in order to discover information and knowledge that are difficult to retrieve. Information Extraction (IE) centers on finding entities and relations in free text and provides a solid foundation for text mining. In this paper we present a modular IE system, based on the DIAL language. DIAL allows users to implement IE solutions for various domains rapidly, based on a common Natural Language Processing (NLP) infrastructure. We demonstrate in detail an implementation of a system for extracting relations in the intelligence news domain. We present an evaluation of our system and discuss enhancements for other domains, such as emails.
TL;DR: The aim of this paper is to improve the performance of this algorithm in different ways: simplifying the complexity of the induced models, adding the ability to deal with continuous data, improving the detection of noise, selecting new criteria for evolving the model, including the use of more powerful prediction techniques, etc.
Abstract: Classification is a highly relevant task within the data analysis field. It is not a trivial task, and different difficulties can arise depending on the nature of the problem. All these difficulties can become worse when the datasets are too large or when new information can arrive at any time. Incremental learning is an approach that can be used to deal with the classification task in these cases. It must alleviate, or solve, the problem of limited time and memory resources. One emergent approach uses concentration bounds to ensure that decisions are made when enough information supports them. IADEM is one of the most recent algorithms that use this approach. The aim of this paper is to improve the performance of this algorithm in different ways: simplifying the complexity of the induced models, adding the ability to deal with continuous data, improving the detection of noise, selecting new criteria for evolving the model, including the use of more powerful prediction techniques, etc. Besides these new properties, the new system, IADEM-2, preserves the ability to obtain a performance similar to standard learning algorithms independently of the dataset size, and it can incorporate new information as the basic algorithm does: using a short time per example.
TL;DR: The UpDown Tree based approach can greatly improve the efficiency of CISP mining in terms of both time and memory compared to previous approaches.
Abstract: In this paper the problem of Contiguous Item Sequential Pattern (CISP) Mining is presented as a sequential pattern mining problem under two constraints. First, each element in a sequence consists of only one item. Second, items appearing in the sequences that contain a pattern must be adjacent with respect to the underlying order as they appear in the pattern. Even though the problem of CISP mining can be solved by using previous approaches on sequential pattern mining under a general constraint description framework, this may lead to poor performance due to the large search space. To efficiently solve this problem, a new data structure, UpDown Tree, is proposed for CISP mining. The UpDown Tree based approach can greatly improve the efficiency of CISP mining in terms of both time and memory compared to previous approaches. An extensive experimental study has shown promising results with our approach.
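The two constraints stated above (single-item elements, and pattern items adjacent in the containing sequence) can be illustrated with a brute-force containment check and support count. This sketch does not use or reproduce the UpDown Tree; the function names `contains_contiguous` and `support` are hypothetical and chosen only for illustration.

```python
def contains_contiguous(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a run of adjacent items."""
    n, m = len(sequence), len(pattern)
    # Compare every length-m slice of the sequence with the pattern.
    return any(sequence[i:i + m] == pattern for i in range(n - m + 1))


def support(database, pattern):
    """Number of sequences in `database` that contain `pattern` contiguously."""
    return sum(contains_contiguous(seq, pattern) for seq in database)


if __name__ == "__main__":
    database = [list("abcd"), list("acbd"), list("bcbc")]
    print(support(database, list("bc")))
    # "ab" is a subsequence of "acbd" but not a contiguous one.
    print(contains_contiguous(list("acbd"), list("ab")))
```

The second call shows the difference from general sequential pattern mining: "a" and "b" both occur in order in "acbd", yet the pattern is not counted because the items are not adjacent.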
TL;DR: It is shown that Relational Data Mining (RDM) can handle multiple constraints, initial rules and background knowledge very naturally to reduce the search space in contrast with attribute-based data mining.
Abstract: Currently statistical and artificial neural network methods dominate in data mining applications. Alternative relational (symbolic) data mining methods have shown their effectiveness in robotics, drug design, and other areas. Neural networks and decision tree methods have serious limitations in capturing relations that may have a variety of forms. Learning systems based on symbolic first-order logic (FOL) representations capture relations naturally. The learned regularities are understandable directly in domain terms that help to build a domain theory. This paper describes relational data mining methodology and develops it further for numeric data such as financial and spatial data. This includes (1) comparing the attribute-value representation with the relational representation, (2) defining a new concept of joint relational representations, (3) a process of their use, and the Discovery algorithm. This methodology handles uniformly the numerical and interval forecasting tasks as well as classification tasks. It is shown that Relational Data Mining (RDM) can handle multiple constraints, initial rules and background knowledge very naturally to reduce the search space in contrast with attribute-based data mining. Theoretical concepts are illustrated with examples from financial and image processing domains.
TL;DR: In this work, data clustering techniques are explored to make the solution process to multicriteria optimization problems efficient via Data Envelopment Analysis.
Abstract: In manufacturing it is common to be required to simultaneously meet several performance measures with varying degrees of conflict among them. Such a situation poses a multiple criteria optimization problem. Finding solutions to this kind of problem in an efficient manner is critical for industrial application. In this work, data clustering techniques are explored to make the solution process to multicriteria optimization problems efficient via Data Envelopment Analysis. The results of different clustering schemes are reported and conclusions are drawn from their evaluation.
TL;DR: The paper concludes with a review of the RDM approach and the 'Discovery' system built on this methodology, which can analyze any hypotheses represented in first-order logic and use any input by representing it in a many-sorted empirical system.
Abstract: Knowledge discovery and data mining methods have been successful in many domains. However, their abilities to build or discover a domain theory remain unclear. This is largely due to the fact that many fundamental KDD&DM methodological questions are still unexplored such as (1) the nature of the information contained in input data relative to the domain theory, and (2) the nature of the knowledge that these methods discover. The goal of this paper is to clarify methodological questions of KDD&DM methods. This is done by using the concept of Relational Data Mining (RDM), representative measurement theory, an ontology of a subject domain, a many-sorted empirical system (algebraic structure in the first-order logic), and an ontology of a KDD&DM method. The paper concludes with a review of our RDM approach and the 'Discovery' system built on this methodology, which can analyze any hypotheses represented in first-order logic and use any input by representing it in a many-sorted empirical system.
TL;DR: NPClu is proposed, an approach for clustering sets of objects taking into account their geometric and topological properties, based on three steps, that is, pre-processing, clustering and refinement.
Abstract: The majority of clustering algorithms deal with collections of data that can be represented as sets of points in the multidimensional Euclidean space. There is a large variety of application domains, such as spatiotemporal databases, medical applications and others, which produce datasets of non-point objects (i.e. objects that occupy a specific hyperspace). Traditional clustering algorithms are mainly based on statistical properties of data and therefore are not able to efficiently partition sets of spatially extended objects.
In this paper we propose NPClu, an approach for clustering sets of objects taking into account their geometric and topological properties. The spatial objects are approximated by their MBRs. Then our approach discovers the clusters in the set of the MBRs' vertices based on three steps, that is, pre-processing, clustering and refinement. We experimentally evaluated the performance of our approach to show its effectiveness.
TL;DR: A new incremental method, AD-Miner, is proposed to discover Approximate Dependencies (ADs); it is based on logical operations which aim to reduce the computational complexity, and its complexity is shown to be lower than that of the major incremental methods, namely the partitioning and pair-wise comparison methods.
Abstract: Discovery of possible relations between attribute values in a relational database (i.e., functional dependencies) is an important issue in the field of data mining and knowledge discovery. Many search techniques have been proposed to discover classical and extended functional dependencies, but even the most efficient solutions do not have an acceptable performance in the case of large relation instances. In addition, most of the proposed algorithms assume that the database is static and thus database updates require re-scanning of the entire data repeatedly. In this paper, we propose a new incremental method, AD-Miner, to discover Approximate Dependencies (ADs). The main part of our work is based on logical operations which aim to reduce the computational complexity. The method is incremental and thus avoids re-scans of the database when a set of tuples is added to the relation. Our experimental results indicate that our method is more efficient than FastFDs [22], which is one of the most efficient algorithms for mining of perfect dependencies. Furthermore, we have shown that the complexity of our method is lower than that of the major incremental methods, namely the partitioning and pair-wise comparison methods. In addition, our method has the extra advantage of marking the index of the tuples that violate a dependency. This feature can be used to find the exceptional cases that are inconsistent with the rest of the data. We have implemented AD-Miner and tested it on several benchmarks and synthetic data.
TL;DR: This paper uses a novel algorithm with Partial Least Squares (PLS) for selecting relevant variables and finds variables describing the banking sector, the international trade, the severity of the crisis, and foreign interest rates to be significant.
Abstract: The effects of a currency crisis on a country's economy depend on non-linear relations among several variables that characterize the economic, financial, legal, and socio-political structure of the country at the onset of the crisis. We seek to determine which variables are significant in explaining currency crises' real effects when they are all considered together. This paper uses a novel algorithm with Partial Least Squares (PLS) for selecting relevant variables. This algorithm works well with datasets characterized by few observations relative to the number of right-hand side variables and nonlinearity. Variables describing the banking sector, the international trade, the severity of the crisis, and foreign interest rates are found to be significant. On the other hand, socio-political variables, IMF's intervention, and legal variables are found to be less significant. Our algorithm's results are compared with all-best subsets variable selection and their predictive power is examined using neural networks.
TL;DR: An improved algorithm for mining frequent closed itemsets using the index array, which is used for discovering those items that always appear together, is proposed and it is proved that the reduced pre-set and reduced post-set not only retain the function of pre-set and post-set, but also have smaller sizes.
Abstract: The set of frequent closed itemsets determines exactly the complete set of all frequent itemsets and is usually much smaller than the latter. This paper proposes an improved algorithm for mining frequent closed itemsets. Firstly, the index array is proposed, which is used for discovering those items that always appear together. Then, using a bitmap, an algorithm for computing the index array is presented. Thirdly, based on the heuristic information provided by the index array, frequent items which co-occur and share the same support are merged together. Thus, initial generators are calculated. Finally, based on the index array, reduced pre-set and reduced post-set are proposed. It is proved that the reduced pre-set and reduced post-set not only retain the function of pre-set and post-set, but also have smaller sizes. Therefore, the redundant items in pre-set and post-set are deleted, thus making it possible to save a lot of work related to inclusion checks. The experimental results show that the proposed algorithm is efficient, especially on dense datasets.
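One way to picture how a bitmap can reveal "items that always appear together" is sketched below. The definition used here is an assumption, not the paper's: for each item, the index entry holds every other item that occurs in all transactions containing it, which is tested by checking that one transaction bitmap covers the other. The names `build_bitmaps` and `index_array` are hypothetical.

```python
def build_bitmaps(transactions):
    """Map each item to an integer whose bit `tid` is set if the item
    occurs in transaction number `tid`."""
    bitmaps = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def index_array(transactions):
    """For each item a, collect the items b present in every transaction
    that contains a (assumed reading of 'always appear together')."""
    bitmaps = build_bitmaps(transactions)
    index = {}
    for a, bits_a in bitmaps.items():
        index[a] = {
            b for b, bits_b in bitmaps.items()
            # b covers a when ANDing the bitmaps leaves a's bitmap unchanged.
            if b != a and (bits_a & bits_b) == bits_a
        }
    return index


if __name__ == "__main__":
    transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "d"}]
    for item, partners in sorted(index_array(transactions).items()):
        print(item, sorted(partners))
```

Under this reading, two items that appear in each other's entries have identical bitmaps and therefore the same support, which is the situation in which the abstract describes merging items.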
TL;DR: A fast matching algorithm is devised that uses only a small sample of records, and is yet guaranteed to find a matching that is a close approximation of the matching that would be obtained if the entire stream were processed.
Abstract: We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database records - which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, and is yet guaranteed to find a matching that is a close approximation of the matching that would be obtained if the entire stream were processed. The method can be applied to any given (combination of) similarity metrics that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.
TL;DR: This paper addresses the problem of detecting anomalies in horizontally distributed data, where only a limited ratio of the instances at each remote site are allowed to be shared, and no single entity is allowed to observe the whole dataset, either at once or incrementally.
Abstract: Anomaly detection is an important branch of the classification problem which has attracted much attention during the previous years. This, as well as the growing need for distributed data mining techniques, and concerns for privacy and security issues of gathering all distributed data in a central location, emphasizes the importance of the distributed anomaly detection problem, which has thus far received little attention. In this paper, we address the problem of detecting anomalies in horizontally distributed data, where only a limited ratio of the instances at each remote site are allowed to be shared, and no single entity is allowed to observe the whole dataset, either at once or incrementally. In our proposed method, local predictors are trained and association rules are extracted, using the difference between predicted and actual values on a context dataset. These association rules are used to represent normal and anomalous behaviors, while a final set of learners uses these representations to detect anomalies. The contributions of our work are: 1) distributed anomaly detection, where (a) both data and process are distributed, (b) only a limited form of sharing is allowed and (c) no single entity is allowed to observe the whole data in any way, 2) solving the problem in cases where concept drifts might occur, 3) providing a solution which is able to handle potential dishonesty from participating entities, and 4) using association rules for anomaly detection, while maintaining the speed requirement in anomaly detection which is necessary in various applications. We have conducted a set of experiments, comparing our proposed method to other typical anomaly detection methods (oversampling, undersampling, SMOTE), which indicate the superiority of the proposed method, while preserving the privacy of participating datasets by avoiding the communication of all local samples to other local datasets.
TL;DR: It was the above list of concerns about the continuing lack of coherence and integration of KDD/DM that led to the DEXA Workshops of 2005/2006 on “Philosophies and Methodologies for Knowledge Discovery” (PMKD), with the hope of addressing some of these core inadequacies in KDD.
Abstract: as a new cohesive discipline, formed from the confluence of statistics, machine-learning and information systems, with the aim being “to discover by automatic means useful new knowledge from large and complex data stored in databases”. Data mining (DM), often taken to be synonymous to KDD, or a technical component of KDD, is strictly speaking more general than KDD since the italicized part of the above definition of KDD is dropped. However, defining KDD/DM as a new discipline does not mean that such a discipline exists as a coherent entity, or will ever exist. The diverse range of developments under the KDD/DM banner in the last 10 years does not give the impression of KDD/DM being a coherent discipline. Quite the converse! The dichotomy in DM between those in the machine-learning camp and those in the statistics camp seems to persist. DM seems to have been pursued by the machine-learning community into ever more specialized specific applications, e.g. text, screen, image, . . . , mining, but with no general intellectual framework being constructed. The statistical community seems almost to have left the KDD/DM theatre of operations, possibly due to their historical antipathy to the multiple-comparison paradigm that is implicit within KDD/DM, and seems to have focused its communal efforts, such as they are, on the development of R, the open-source descendant of S and S-Plus. Similarly, there seems to have been a continuing dichotomy within KDD/DM over whether the database/warehouse structure is fundamental or not. Researchers from the IS discipline continue to regard the DBMS/warehouse structures as relevant to KDD/DM, and have generally limited their DM techniques to SQL-like analytical tools, mostly making use of the materialized data hypercube, or other simple DM-algorithms which scale linearly with the size of the database.
It is ironic that KDD, concerned as it is with knowledge, has no theory of knowledge as a part of it, and lacking such a feature KDD has remained essentially incoherent. Furthermore there has never been a general framework enunciated to guide the selection of appropriate DM-models, let alone one which would support the hope for DM to be an automatic process. It was the above list of concerns about the continuing lack of coherence and integration of KDD/DM that led to the DEXA Workshops of 2005/2006 on “Philosophies and Methodologies for Knowledge Discovery” (PMKD), with the hope of addressing some of these core inadequacies in KDD/DM. The five papers in this special issue derive from the papers presented at these PMKD Workshops but have been much extended, completely rewritten, or in some cases are the result of an integration of ideas from more than one workshop paper. We feel that these five papers do address, each in their own way, the
TL;DR: From the experimental results, the output projection accuracy by the hybrid intelligent approach was significantly better than that of some existing approaches.
Abstract: A hybrid intelligent approach is proposed which can be used to estimate the output of each product type in a semiconductor fabrication plant. This is a critical task for plant operation. First, the hybrid fuzzy-c-means (FCM) and fuzzy-back-propagation-neural-network (FBPN) approach is applied to estimate the output time for every job in the plant. Subsequently, the fuzzy output projection function (FOPF) is proposed to project the outputs into each future time period. To evaluate the advantages and/or disadvantages of the hybrid intelligent approach, a simulated semiconductor plant model is also used in this study to generate test data. From the experimental results, the output projection accuracy by the hybrid intelligent approach was significantly better than that of some existing approaches.
TL;DR: It is shown that traditional unmodified Apriori is not well suited to this task, and that the proposed algorithms are capable of producing interesting sporadic rules without having any expert domain knowledge.
Abstract: Most previous research into association rule data mining has focused on finding frequent rules, that is, rules with high support and high confidence. However, detecting rare or sporadic association rules, which have low support and high confidence, is a worthwhile task as well, as they represent rare but potentially interesting and important associations. Mining rare or sporadic rules is a difficult data mining problem, and most previous approaches use an Apriori-like [1] method [2-9]. However, for Apriori to find rare rules, minimum support must be set very low, which results in a large number of redundant rules and a long runtime. We previously proposed the Apriori-Inverse [10] and MIISR [11] algorithms to find sporadic rules quickly and efficiently. This paper provides an insight into the qualitative results produced by our proposed algorithms. To gain a qualitative understanding, we explore a specific real-world case study in more detail, namely the Dermatology dataset for the diagnosis of the Erythemato-Squamous diseases. We show that traditional unmodified Apriori is not well suited to this task, and that our proposed algorithms are capable of producing interesting sporadic rules without having any expert domain knowledge.
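The distinction between frequent and sporadic rules can be illustrated with a minimal sketch. This is not the Apriori-Inverse or MIISR algorithm; it is a brute-force enumeration of single-item rules that keeps those with low support and high confidence. The toy transactions, the thresholds `max_support` and `min_confidence`, and the function names are all illustrative assumptions.

```python
from itertools import permutations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def sporadic_rules(transactions, max_support, min_confidence):
    """Brute-force search for rules a -> b with LOW support and HIGH confidence.

    Illustrative only: a real miner would prune the search space
    instead of enumerating every ordered pair of items.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        sup_ab = support({a, b}, transactions)
        if sup_ab == 0:
            continue  # the two items never co-occur, so there is no rule
        confidence = sup_ab / support({a}, transactions)
        if sup_ab <= max_support and confidence >= min_confidence:
            rules.append((a, b, sup_ab, confidence))
    return rules

# Hypothetical toy data: bread/milk is frequent, caviar/champagne is rare.
toy = [
    {"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"},
    {"bread"}, {"milk"}, {"bread", "milk"}, {"bread", "milk"},
    {"caviar", "champagne"}, {"caviar", "champagne", "bread"},
]
rules = sporadic_rules(toy, max_support=0.2, min_confidence=0.9)
for a, b, sup, conf in rules:
    print(f"{a} -> {b}  support={sup:.2f}  confidence={conf:.2f}")
```

On this toy data, only the two caviar/champagne rules pass both thresholds. The bread/milk rules are excluded because their support is too high, which is the reverse of what a frequent-rule miner with a minimum support threshold would do.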
TL;DR: This paper proposes a novel framework, named HIgh-order Substate Chain (HISC) modeling, to capture the entire system dynamics underlying the transaction time series, where the transaction contains explosive states due to the combinatorics of massively observed inputs.
Abstract: This paper proposes a novel framework, named HIgh-order Substate Chain (HISC) modeling, to capture the entire system dynamics underlying the transaction time series, where the transaction contains explosive states due to the combinatorics of massively observed inputs. In a practical situation, the objective system consists of multiple subsystems where a state of each subsystem is represented by a subset of the transaction. Thus, a transaction observed from the entire objective system is considered to be a collection of such subsets, and each subset is called a "substate" of the objective system. The basic task of our HISC modeling is to efficiently and simultaneously identify the substates and their transitions embedded in the time series. For applications, methods for system dynamics simulation and substate prediction using the HISC model have been developed. Its significant performance has been confirmed through evaluation on synthetic data, comparisons with some state-of-the-art high-order Markov chain models, and application to practical data analysis.
TL;DR: A novel algorithm for large data set clustering is devised, which utilizes efficient image processing techniques to cluster the data set after mapping its points into a binary image map and avoids exhaustive search by using the mapped image.
Abstract: In this paper, we devise a novel algorithm for large data set clustering. Our algorithm utilizes efficient image processing techniques to cluster the data set after mapping its points into a binary image map. To this end, the algorithm avoids exhaustive search by using the mapped image, which contains the critical boundary information needed to detect clusters. Compared to available data clustering techniques, the proposed algorithm produces similar quality results and outperforms them in execution time and storage requirements.
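The general idea of clustering through a binary image can be sketched as follows. The abstract does not say which image processing techniques the algorithm uses, so connected-component labeling stands in here purely as an assumption. The function name, the `cell_size` parameter and the sample points are likewise hypothetical.

```python
from collections import deque

def cluster_via_binary_image(points, cell_size):
    """Cluster 2-D points by rasterizing them and labeling connected cells.

    Assumed stand-in technique: 8-connected component labeling on the
    set of occupied grid cells (the "binary image").
    """
    # Map each point to an integer grid cell; occupied cells are the 1-pixels.
    cells = [(int(x // cell_size), int(y // cell_size)) for x, y in points]
    occupied = set(cells)
    labels = {}
    next_label = 0
    for start in sorted(occupied):
        if start in labels:
            continue
        # Flood-fill one connected component of occupied cells.
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (cx + dx, cy + dy)
                    if neighbor in occupied and neighbor not in labels:
                        labels[neighbor] = next_label
                        queue.append(neighbor)
        next_label += 1
    # Each point inherits the label of the cell it falls into.
    return [labels[c] for c in cells]

pts = [(0.1, 0.2), (0.9, 0.4), (1.5, 1.1), (8.0, 8.2), (8.7, 9.1), (9.4, 8.8)]
point_labels = cluster_via_binary_image(pts, cell_size=1.0)
print(point_labels)  # [0, 0, 0, 1, 1, 1]
```

The sketch never computes pairwise distances between points: the work is proportional to the number of points plus the number of occupied cells, which is one way an image-based mapping can avoid exhaustive search.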
TL;DR: A new bounded index for cluster validity, called the score function (SF), is introduced: a double exponential expression based on a ratio of standard cluster parameters, which is shown to work well on multidimensional and noisy data sets.
Abstract: Cluster validity indices are used both for estimating the quality of a clustering algorithm and for determining the correct number of clusters in data. Even though several indices exist in the literature, most of them are only relevant for data sets that contain at least two clusters. This paper introduces a new bounded index for cluster validity called the score function (SF), a double exponential expression that is based on a ratio of standard cluster parameters. Several artificial and real-life data sets are used to evaluate the performance of the score function. These data sets contain a range of features and patterns such as unbalanced, overlapped and noisy clusters. In addition, cases involving sub-clusters and perfect clusters are tested. The score function is tested against six previously proposed validity indices. In the case of hyper-spheroidal clusters, the index proposed in this paper is found to be always as good as or better than these indices. In addition, it is shown to work well on multidimensional and noisy data sets. One of its advantages is the ability to handle the single-cluster case and sub-cluster hierarchies.
TL;DR: A successful attempt to assist ocean chemists in predicting the absorption rate of CO$_2$ in certain regions of the ocean is described; the system has found equations that outperform those suggested by field experts assisted by regression techniques.
Abstract: The idea of automating the discovery of numeric laws goes back to the early 1980s, when some authors showed how AI search techniques could be used to this end. Later, the community somewhat lost interest in this task, reasoning that it was unlikely that a computer program would ever "outperform" human intuition supported by background knowledge. Only recently did some authors manage to overcome this scepticism: their programs were able to discover new numeric laws in soft-science domains such as ecology and psychology. In the case study reported here, we describe a successful attempt to assist ocean chemists in their attempts to predict the absorption rate of CO$_2$ in certain regions of the ocean. The system we have developed has found equations that outperform those suggested by field experts assisted by regression techniques. Our experience indicates that more attention should be paid to the following: the impact of spatio-temporal aspects, the existence of "hidden causes" (not directly reflected in a given set of variables), and the need for a field expert to post-process the results.
TL;DR: It is suggested that the research area inside the DM community should be made broader than the current heavily technology-oriented one, and that researchers who take into account human and organisational aspects related to DM systems also need some understanding of DM.
Abstract: Data mining (DM) research has successfully developed advanced DM techniques and algorithms over the last few decades, and many organisations have great expectations of benefiting more from their data warehouses in decision making. Currently, the strong focus of most DM researchers is still only on technology-oriented topics. DM research commonly has several stakeholders, the major ones of which can be divided into internal and external groups, each having its own point of view, and these points of view are at least partly conflicting. The most important internal groups of stakeholders are the DM research community and academics in other disciplines. The most important external stakeholder groups are managers and domain experts, who have their own utility-based interests in DM and DM research results. In this paper we discuss these practice-oriented points of view towards DM research and suggest broader discussions inside the DM research community about who should do that kind of research. We bring into the discussion several topics developed in the information systems (IS) discipline and show some similarities between IS and DM systems. DM systems also have their own peculiarities, and we conclude that researchers who take into account human and organisational aspects related to DM systems also need some understanding of DM. This leads us to suggest that the research area inside the DM community should be made broader than the current heavily technology-oriented one.
TL;DR: A new method for combining hierarchical clustering is proposed and the results show that more accurate results are obtained using hierarchical combination than combination of partitional clusterings.
Abstract: In the field of pattern recognition, combining different classifiers into a robust classifier is a common approach for improving classification accuracy. Recently, this trend has also been used to improve clustering performance, especially in non-hierarchical clustering approaches. Generally, hierarchical clustering is preferred over partitional clustering for applications in which the exact number of clusters is not determined or in which we are interested in finding the relations between clusters. To the best of our knowledge, the clustering combination methods proposed so far are based on partitional clustering, and hierarchical clustering has been ignored.
In this paper, a new method for combining hierarchical clusterings is proposed. In the first step of this method, the primary hierarchical clustering dendrograms are converted to matrices. Then these matrices, which describe the dendrograms, are aggregated (using the matrix summation operator) into a final matrix with which the final clustering is formed. The effectiveness of different well-known dendrogram descriptors, and of the one proposed by us, for representing the dendrograms is evaluated and compared. The results show that all these descriptors work well and that more accurate results (hierarchies of clusters) are obtained using hierarchical combination than using a combination of partitional clusterings.
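The three steps described above (dendrograms to matrices, matrix summation, final clustering) can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' implementation: the descriptor used here is the cophenetic (merge-height) matrix, chosen as one familiar way to describe a dendrogram, and the final hierarchy is obtained by re-running single-linkage clustering on the summed matrix. The helper names and the four sample points are hypothetical.

```python
from itertools import combinations

def agglomerate(dist, linkage=min):
    """Naive agglomerative clustering on a square distance matrix.

    Returns the merges as (cluster_a, cluster_b, height) tuples in merge
    order. linkage=min gives single linkage, linkage=max complete linkage.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest linkage distance.
        height, a, b = min(
            ((linkage(dist[i][j] for i in a for j in b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        merges.append((a, b, height))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

def cophenetic_matrix(merges, n):
    """Describe a dendrogram by the height at which each pair is first joined."""
    coph = [[0.0] * n for _ in range(n)]
    for a, b, height in merges:
        for i in a:
            for j in b:
                coph[i][j] = coph[j][i] = height
    return coph

def combine(descriptors):
    """Aggregate descriptor matrices by element-wise summation."""
    n = len(descriptors[0])
    return [[sum(d[i][j] for d in descriptors) for j in range(n)]
            for i in range(n)]

# Hypothetical data: four points on a line, clustered two different ways.
xs = [0.0, 1.0, 5.0, 6.5]
dist = [[abs(p - q) for q in xs] for p in xs]
primary = [agglomerate(dist, linkage=min), agglomerate(dist, linkage=max)]

# Step 1: dendrograms -> matrices. Step 2: sum. Step 3: final hierarchy.
summed = combine([cophenetic_matrix(m, len(xs)) for m in primary])
final = agglomerate(summed, linkage=min)
for a, b, height in final:
    print(sorted(a), "+", sorted(b), "at height", height)
```

Both primary dendrograms agree on the groupings {0, 1} and {2, 3} but differ in the height of the top merge (4.0 for single linkage, 6.5 for complete linkage). The combined hierarchy keeps the shared structure and joins the two groups at the summed height 10.5.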