TL;DR: This survey identifies the future research areas in feature selection, introduces newcomers to this field, and paves the way for practitioners who search for suitable methods for solving domain-specific real-world applications.
Abstract: Feature selection has been the focus of interest for quite some time and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a comprehensive overview of many existing methods from the 1970's to the present. It identifies four steps of a typical feature selection method, and categorizes the different existing methods in terms of generation procedures and evaluation functions, and reveals hitherto unattempted combinations of generation procedures and evaluation functions. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for comparative study. The strengths and weaknesses of different methods are explained. Guidelines for applying feature selection methods are given based on data types and domain characteristics. This survey identifies the future research areas in feature selection, introduces newcomers to this field, and paves the way for practitioners who search for suitable methods for solving domain-specific real-world applications.
TL;DR: This paper first provides an overview of data preprocessing, focusing on problems of real-world data, and then details data preprocessing techniques that address problems with the data and prepare it for analysis.
Abstract: This paper first provides an overview of data preprocessing, focusing on problems of real-world data. These are primarily problems that have to be carefully understood and solved before any data analysis process can start. The paper discusses in detail two main reasons for performing data preprocessing: (i) problems with the data and (ii) preparation for data analysis. The paper continues with details of data preprocessing techniques achieving each of the above-mentioned objectives. A total of 14 techniques are discussed. Two examples of data preprocessing applications from two of the most data-rich domains are given at the end. The applications are related to the semiconductor manufacturing and aerospace domains, where large amounts of fairly reliable data are available. Future directions and some challenges are discussed at the end.
TL;DR: The technique of dynamic path generation is described in the context of tree-based classification methods and the waste of data which can result from casewise deletion of missing values in statistical algorithms is discussed and alternatives proposed.
Abstract: A brief overview of the history of the development of decision tree induction algorithms is followed by a review of techniques for dealing with missing attribute values in the operation of these methods. The technique of dynamic path generation is described in the context of tree-based classification methods. The waste of data which can result from casewise deletion of missing values in statistical algorithms is discussed and alternatives proposed.
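To make the cost of casewise deletion concrete, the following minimal Python sketch compares how many records survive when every record with a missing value is dropped against how many observed values each attribute still has. The toy records and the function names are assumptions chosen for illustration; they do not come from the paper.

```python
def complete_cases(rows):
    """Rows that survive casewise (listwise) deletion: no value is None."""
    return [row for row in rows if all(v is not None for v in row)]


def available_per_attribute(rows):
    """Number of observed (non-missing) values in each column."""
    n_cols = len(rows[0])
    return [sum(row[j] is not None for row in rows) for j in range(n_cols)]


# Hypothetical data: each of the first three records lacks one attribute.
rows = [
    (None, 2.0, 3.1),
    (1.2, None, 2.9),
    (0.7, 1.8, None),
    (1.1, 2.2, 3.0),
]
kept = complete_cases(rows)
counts = available_per_attribute(rows)
print(len(kept), counts)
```

In this toy case only one of the four records survives deletion, although each attribute is observed in three of them.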
TL;DR: The BANG-Clustering system presented in this paper is a novel approach to hierarchical data analysis and uses a multidimensional grid data structure to organize the value space surrounding the pattern values.
Abstract: For the analysis of large images the clustering of the data set is a common technique to identify correlation characteristics of the underlying value space. In this paper a new approach to hierarchical clustering of very large data sets is presented. The BANG-Clustering system presented in this paper is a novel approach to hierarchical data analysis. It is based on the BANG-Clustering method ([Sch96]) and uses a multidimensional grid data structure to organize the value space surrounding the pattern values. The patterns are grouped into blocks and clustered with respect to the blocks by a topological neighbor search algorithm.
TL;DR: A method — called Bound and Collapse (bc) — to learn Bayesian Belief Networks from incomplete databases which allows the analyst to efficiently integrate information provided by the database and exogenous knowledge about the pattern of missing data.
Abstract: Current methods to learn Bayesian Belief Networks (bbns) from incomplete databases share the common assumption that the unreported data are missing at random. This paper describes a method — called Bound and Collapse (bc) — to learn bbns from incomplete databases which allows the analyst to efficiently integrate information provided by the database and exogenous knowledge about the pattern of missing data. bc starts by bounding the set of estimates consistent with the information conveyed by the database and then collapses the resulting set to a point via a convex combination of the extreme points, with weights depending on the assumed pattern of missing data. Experiments comparing bc to Gibbs Sampling are provided.
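The bound-then-collapse idea can be sketched for the simplest possible case: estimating the probability of a single binary variable from counts with some entries missing. This is an illustrative simplification under stated assumptions, not the authors' algorithm; the function names and the parameter `p_missing_true` (the assumed probability that a missing entry is "true") are hypothetical.

```python
def bound(n_true, n_false, n_missing):
    """Extreme estimates of P(true) consistent with the observed counts.

    Lower bound: every missing entry is false.
    Upper bound: every missing entry is true.
    """
    total = n_true + n_false + n_missing
    lower = n_true / total
    upper = (n_true + n_missing) / total
    return lower, upper


def collapse(lower, upper, p_missing_true):
    """Convex combination of the extreme points.

    The weight p_missing_true encodes the assumed pattern of missing data.
    """
    return p_missing_true * upper + (1.0 - p_missing_true) * lower


lower, upper = bound(n_true=6, n_false=2, n_missing=2)
estimate = collapse(lower, upper, p_missing_true=0.5)
print(lower, upper, estimate)
```

Setting `p_missing_true` to 0 or 1 recovers the two extreme points, so the collapsed estimate always stays inside the bounded interval.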
TL;DR: A method to segment the electrocardiogram (ECG) using time-warping, a technique commonly used in speech recognition, to cut the ECG into distinct periods (R-R interval).
Abstract: We present a method to segment the electrocardiogram (ECG) using time-warping, a technique commonly used in speech recognition. First, the ECG is transformed to a piecewise linear approximation. Next, the slope amplitude is used to cut the ECG into distinct periods (R-R interval). These periods are then compared to each other using time-warping, and the pair which is most similar is selected. Finally, this pair is segmented into the different subpatterns usually encountered in the ECG, such as the QRS complex, the T wave, and the P wave.
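The comparison step can be illustrated with the textbook dynamic time warping (DTW) recurrence. The sketch below is a generic DTW distance with an absolute-difference local cost, plus a hypothetical helper `most_similar_pair` that picks the two periods with the smallest distance; the paper's actual cost function and piecewise linear representation are not specified here and are not reproduced.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def most_similar_pair(periods):
    """Indices (i, j) of the two periods with the smallest DTW distance."""
    return min(
        combinations(range(len(periods)), 2),
        key=lambda ij: dtw_distance(periods[ij[0]], periods[ij[1]]),
    )


# Hypothetical toy periods; real input would be R-R intervals of an ECG.
periods = [[0, 1, 0], [0, 1, 1, 0], [5, 5, 5]]
print(most_similar_pair(periods))
```

Because warping lets one sample align with several, the first two toy periods match exactly even though their lengths differ.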
TL;DR: An extension of a fuzzy query language called Summary SQL is introduced which can be used for knowledge discovery and data mining and it is shown how it could be used to search for fuzzy rules.
Abstract: The increasing use of computers for transactions and communication has created mountains of data that contain potentially valuable knowledge. To search for this knowledge we have to develop a new generation of tools that are capable of flexible querying and intelligent searching. In this paper we introduce an extension of a fuzzy query language called Summary SQL which can be used for knowledge discovery and data mining. We show how it can be used to search for fuzzy rules.
TL;DR: This work demonstrates that the correlations and resulting monitoring models can be improved greatly with the addition of pre-filtering the time signals using a median filter, and time-scale decomposition using a multi-resolution wavelet function.
Abstract: Producing a uniform product is important for several reasons such as maintenance of a competitive position, reduction in the number of shutdowns and startups, and the elimination of the sources of variability. Multivariate statistical methods can assist in the identification of process correlations and the development of process monitoring models. This work extends these concepts by demonstrating that the correlations and resulting monitoring models can be improved greatly with the addition of pre-filtering the time signals using a median filter, and time-scale decomposition using a multi-resolution wavelet function. After the data are filtered and decomposed, the multivariate statistical method of principal component analysis (PCA) is used to develop a process monitoring model. Data taken from a difficult-to-operate industrial process are used to demonstrate these ideas.
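The pre-filtering step can be sketched with a plain sliding-window median filter. This is a minimal illustration only: the window length, the edge handling (repeating the end values), and the function name are assumptions, and the wavelet decomposition and PCA stages are omitted.

```python
from statistics import median


def median_filter(signal, window=3):
    """Sliding-window median filter; `window` must be odd.

    Edges are handled by repeating the first and last samples
    (an assumption of this sketch).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(signal))]


# Hypothetical signal with a single spike at index 2.
smoothed = median_filter([1, 1, 9, 1, 1], window=3)
print(smoothed)
```

A median filter removes isolated spikes like the one above while leaving step changes in place, which is why it is a common choice ahead of correlation-based modelling.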
TL;DR: This paper addresses the specific problem of creating semantic term associations from a text database by using a hierarchical model made up of Fuzzy Adaptive Resonance Theory (ART) neural networks to cluster isolated words into semantic classes.
Abstract: The growing availability of databases on the information highways motivates the development of new processing tools able to deal with a heterogeneous and changing information environment. A highly desirable feature of data processing systems handling this type of information is the ability to automatically extract its own key words. In this paper we address the specific problem of creating semantic term associations from a text database. The proposed method uses a hierarchical model made up of Fuzzy Adaptive Resonance Theory (ART) neural networks. First, the system uses several Fuzzy ART modules to cluster isolated words into semantic classes, starting from the database raw text. Next, this knowledge is used together with co-occurrence information to extract semantically meaningful term associations. These associations are asymmetric and one-to-many due to the polysemy phenomenon. The strength of the associations between words can be measured numerically. Besides this, they implicitly define a hierarchy between descriptors. The underlying algorithm is appropriate for employment on large databases. The operation of the system is illustrated on several real databases.
TL;DR: Preliminary results on experiments in parallelising C4.5, a classification-rule learning system using decision-trees as a model representation, which has been used as a base model for investigating methods for parallelising induction algorithms are presented.
Abstract: In the last decade, there has been an explosive growth in the generation and collection of data. Nonetheless, the quality of information inferred from this voluminous data has not been proportional to its size. One of the reasons for this is that the computational complexities of the algorithms used to extract information from the data are normally proportional to the number of input data items resulting in prohibitive execution time on large data sets. Parallelism is one solution to this problem. In this paper we present preliminary results on experiments in parallelising C4.5, a classification-rule learning system using decision-trees as a model representation, which has been used as a base model for investigating methods for parallelising induction algorithms. The experiments assess the potential for improving the execution time by exploiting parallelism in the algorithm.
TL;DR: Ltree is able to define decision surfaces both orthogonal and oblique to the axes defined by the attributes of the input space by combining a decision tree with a linear discriminant by means of constructive induction.
Abstract: In this paper we present the Ltree system for propositional supervised learning. Ltree is able to define decision surfaces both orthogonal and oblique to the axes defined by the attributes of the input space. This is done by combining a decision tree with a linear discriminant by means of constructive induction. At each decision node Ltree defines a new instance space by insertion of new attributes that are projections of the examples that fall at this node over the hyper-planes given by a linear discriminant function. This new instance space is propagated down through the tree. Tests based on those new attributes are oblique with respect to the original input space. Ltree is a probabilistic tree in the sense that it outputs a class probability distribution for each query example. The class probability distribution is computed at learning time, taking into account the different class distributions on the path from the root to the actual node. We have carried out experiments on sixteen benchmark datasets and compared our system with other well known decision tree systems (orthogonal and oblique) like C4.5, OC1 and LMDT. On these datasets we have observed that our system has advantages with respect to accuracy and tree size at statistically significant confidence levels.
TL;DR: The techniques used for categorizing variables in Snout, an intelligent assistant for exploratory data analysis of survey and similar data sets that is currently under development, are described.
Abstract: This paper describes the techniques used for categorizing variables in Snout, an intelligent assistant for exploratory data analysis of survey and similar data sets that is currently under development. We begin by reviewing existing work on category formation in data mining, which has been mainly concerned with enabling decision tree programs to handle numeric variables. It is argued that there are other important but neglected aspects of category formation, notably the formation of new categorizations of nominal variables. We report the limited success achieved in categorizing variables from survey data using either endogenous methods or exogenous methods that maximise the association with only one dependent variable. We then describe the categorization technique used in Snout: a procedure that selects a partition that both maximises the number of variables associated with the partitioned variable and maximises the strength of those associations. We report on the success achieved using this procedure in exploring real survey data.
TL;DR: A new algorithm for constructing one type of overview model, state transition diagrams, is described; called State Transition Dependency Detection (STDD), it is the latest in a family of statistics-based algorithms for modeling event sequences called Dependency Detection.
Abstract: Discrete event sequences have been modeled with two types of representation: snapshots and overviews. Snapshot models describe the process as a collection of relatively short sequences. Overview models collect key relationships into a single structure, providing an integrated but abstract view. This paper describes a new algorithm for constructing one type of overview model: state transition diagrams. The algorithm, called State Transition Dependency Detection (STDD), is the latest in a family of statistics-based algorithms for modeling event sequences called Dependency Detection. We present accuracy results for the algorithm on synthetic data and data from the execution of two AI systems.
TL;DR: The goal of an intelligent analyser is to produce robust rules, stable in the presence of data changes, which allow easy rule maintenance as data changes, and provide rapid query reformulation, refutation or answering.
Abstract: Data analysis is needed in connection with query processing, to produce data summary information in the form of rules or assertions that allow semantic query optimisation or direct query answering without consulting the data itself. The goal of an intelligent analyser in this context is to produce robust rules, stable in the presence of data changes, which allow easy rule maintenance as data changes, and provide rapid query reformulation, refutation or answering. It must also limit the rule set to rules useful for query processing.
TL;DR: Experiments show that the modulated Parzen-windows approach is more efficient in probability density function estimation, without costly preprocessing or severe loss of accuracy.
Abstract: The Parzen-window approach is a well-known technique for estimating probability density functions. This paper introduces a modulated Parzen-windows approach. This approach uses kernels at equidistant samples to obtain a probability density function more efficiently. Experiments on both artificial and real data show that the modulated Parzen-windows approach is more efficient in probability density function estimation, without costly preprocessing or severe loss of accuracy.
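For reference, the standard Parzen-window estimator that the paper builds on can be written in a few lines. The sketch below uses a Gaussian kernel with bandwidth `h`; it is the classical estimator, not the modulated variant the paper introduces, and the kernel choice and function name are assumptions.

```python
import math


def parzen_density(x, samples, h):
    """Gaussian-kernel Parzen-window estimate of the density at x.

    p(x) = 1 / (n * h) * sum_i K((x - x_i) / h), with K the standard normal pdf.
    """
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


# Hypothetical one-dimensional sample.
samples = [-1.0, 0.0, 1.0]
print(parzen_density(0.0, samples, h=0.5))
```

Each evaluation sums over all samples, so the cost grows linearly with the sample size; this is the expense that more efficient variants aim to reduce.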
TL;DR: The problem of clustering multivariate random data is considered in the presence of outliers, and a new clustering algorithm with smoothing is presented.
Abstract: The problem of clustering multivariate random data is considered in the presence of outliers. The hypothetical model of the data is described by a mixture of regular m-parametric probability densities. Clustering is performed with a decision rule often used in practice, which is derived by substituting ML estimators of the parameters (computed on the unclassified sample) for their unknown true values in the Bayesian decision rule. The robustness of the probability of classification error is evaluated. A new clustering algorithm with smoothing is presented. An illustration is given for the case of the Gaussian hypothetical model and for Fisher's data with outliers.
TL;DR: Empirical results show that the composite learner strategy is capable of partially overcoming the problem of locally low predictive accuracy, and at the same time improving the overall performance of its constituent algorithms in most of the domains studied.
Abstract: In this article, we first explore an intrinsic problem that exists in the models induced by learning algorithms. Regardless of the selected algorithm, search methodology and hypothesis representation by which the model is induced, one would expect the model to make better predictions in some regions of the description space than others. We present the fact that an induced model will have some regions of relatively poor performance: the problem of locally low predictive accuracy. Holte, Acker, Porter [21] addressed this intrinsic problem in learning systems that describe the induced model as a disjunction of conjunctions of conditions. In this article, we investigate the characterisation of the problem in instance-based and Naive Bayesian classifiers. Having characterised the problem of locally low predictive accuracy, we propose to counter the problem in these two types of learning algorithms, using a composite learner framework. The strategy is to select an estimated better performing model to do the final prediction during classification. Empirical results from fifteen real-world domains show that the strategy is capable of partially overcoming the problem of locally low predictive accuracy, and at the same time improving the overall performance of its constituent algorithms in most of the domains studied. The composite learner is also found to outperform four methods of stacked generalisation, and also a model selection method based on cross-validation, in most of the experimental domains studied.
TL;DR: Intelligent data analysis techniques based on symbolic learning by examples have been explored in order to automatically devise and parametrize effective quantitative models in the computer-vision inspection of industrial workpieces.
Abstract: The paper describes the use of data analysis techniques in the computer-vision inspection of industrial workpieces. Computer-vision inspection aims at accomplishing quality verification of fabricated parts by means of automated visual procedures. Gathering the visual information into models proves a critical task, especially when subjective judgement is involved in quality verification. In this work, intelligent data analysis techniques based on symbolic learning by examples have been explored in order to automatically devise and parametrize effective quantitative models. The paper reports and discusses the experimental results achieved in an industrial application.
TL;DR: An algorithm for learning a classification procedure to minimize the cost of misclassified examples is explored, based on the generation of oblique decision trees, which seems very promising.
Abstract: We explore an algorithm for learning a classification procedure to minimize the cost of misclassified examples. The described approach is based on the generation of oblique decision trees. The various misclassification costs are defined by a cost matrix. A special splitting criterion is defined to determine the next node for splitting. Clustering techniques are used to process the splitting. The specific splitting criterion is based on cost histograms that count the misclassification costs per class. To avoid overfitting, cross-validation techniques are directly integrated into the training cycle to terminate the splitting process. Several successful tests with different data sets suggest that this method is very promising.
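How a cost matrix changes a classification decision can be shown with the standard minimum-expected-cost rule. This sketch is not the paper's splitting criterion or its cost histograms; it only illustrates the role of the cost matrix, and the convention `cost[i][j]` (cost of predicting class `j` when the true class is `i`) is an assumption.

```python
def expected_costs(class_probs, cost):
    """Expected cost of predicting each class j: sum_i p_i * cost[i][j]."""
    n_classes = len(class_probs)
    return [
        sum(class_probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]


def min_cost_class(class_probs, cost):
    """Index of the prediction with the smallest expected misclassification cost."""
    costs = expected_costs(class_probs, cost)
    return min(range(len(costs)), key=costs.__getitem__)


# Hypothetical two-class problem: missing class 1 is ten times as costly
# as a false alarm.
probs = [0.7, 0.3]
cost = [[0, 1],
        [10, 0]]
print(expected_costs(probs, cost), min_cost_class(probs, cost))
```

With these numbers the less probable class 1 is chosen, because the expected cost of predicting class 0 is dominated by the expensive error.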
TL;DR: Techniques developed for detecting patterns in time-varying data with the ultimate aim of generating textual descriptions of the data are described and preliminary experiments are described in which the visually significant features in weather data are extracted and compared against hand-written expert descriptions.
Abstract: Reasoning effectively about time-varying data requires sophisticated pattern detection mechanisms. This paper describes techniques developed for detecting patterns in time-varying data with the ultimate aim of generating textual descriptions of the data. Preliminary experiments are described in which the visually significant features in weather data are extracted and compared against hand-written expert descriptions.
TL;DR: Change detection algorithms are proposed that are based on the comparison of distribution functions and fuzzy concepts are used to combine partial evaluations to a measure that indicates the departure of a signal from its reference.
Abstract: Change detection algorithms are proposed that are based on the comparison of distribution functions. Estimated values of distributions are associated with a binomial distribution that is used to define fuzzy similarity classes. Fuzzy concepts are used to combine partial evaluations to a measure that indicates the departure of a signal from its reference.
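A simple, non-fuzzy way to compare distribution functions is the maximum gap between the empirical distribution functions of a reference window and a current window (the two-sample Kolmogorov-Smirnov statistic). The sketch below uses that swapped-in measure purely for illustration; the binomial association and fuzzy similarity classes of the paper are not reproduced, and the function names are hypothetical.

```python
def ecdf(sample, x):
    """Empirical distribution function of `sample` evaluated at x."""
    return sum(v <= x for v in sample) / len(sample)


def max_ecdf_gap(reference, current):
    """Largest absolute difference between the two empirical distribution functions."""
    points = sorted(set(reference) | set(current))
    return max(abs(ecdf(reference, x) - ecdf(current, x)) for x in points)


# Hypothetical windows of a signal.
reference = [1.0, 2.0, 3.0, 4.0]
shifted = [11.0, 12.0, 13.0, 14.0]
print(max_ecdf_gap(reference, reference), max_ecdf_gap(reference, shifted))
```

A value near 0 indicates that the current window resembles the reference, while a value near 1 indicates a strong departure.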
TL;DR: This paper distinguishes outliers that are measurement errors from outliers that represent phenomena of interest by modelling measurement errors, an approach better suited to applications where it is difficult to obtain relevant knowledge about real measurements.
Abstract: Outliers are difficult to handle because some of them can be measurement errors, while others may represent phenomena of interest, something “significant” from the viewpoint of the application domain. Statistical methods for managing outliers do not distinguish between these two possibilities. In our previous work, we suggested a method for distinguishing these two possibilities by modelling “real measurements” — how measurements should be distributed in a domain of interest. In this paper, we make this distinction by modelling measurement errors instead. The proposed method is better suited to those applications where it is difficult to obtain relevant knowledge about real measurements. The test data collected from a recent glaucoma case finding study in a general practice are used to evaluate the method.
TL;DR: The entropy measure is presented as a monotonically decreasing function, symmetrical to the measure of dissonance, which can lead to further advances in optimization in information theory, which in turn may have a wide impact on decision and control.
Abstract: This article addresses the issue of quantitative information measurement within the Dempster-Shafer belief function formalism. Entropy computation in Dempster-Shafer depends on the way uncertainty measures are conceptualized. However, freed of most probability constraints, uncertainty measures in Dempster-Shafer theory can lead to further advances in optimization in information theory, which in turn may have a wide impact on decision and control. This article examines one form of current development regarding the entropy measure induced from the measure of dissonance. For a significant period, the measure of dissonance has been taken as a measure of entropy. We present in this article the entropy measure as a monotonically decreasing function, symmetrical to the measure of dissonance.
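For orientation, the measure of dissonance is commonly defined as E(m) = -sum over focal elements A of m(A) * log2 Pl(A), where Pl is the plausibility function. The sketch below assumes that definition and represents a mass function as a dictionary from `frozenset` focal elements to masses; it does not reproduce the entropy measure proposed in the article.

```python
import math


def plausibility(mass, subset):
    """Pl(subset): total mass of focal elements that intersect `subset`."""
    return sum(m for focal, m in mass.items() if focal & subset)


def dissonance(mass):
    """Dissonance E(m) = -sum_A m(A) * log2(Pl(A)) over focal elements A."""
    return -sum(
        m * math.log2(plausibility(mass, focal))
        for focal, m in mass.items()
        if m > 0
    )


# Hypothetical mass functions on the frame {a, b}.
bayesian = {frozenset({"a"}): 0.5, frozenset({"b"}): 0.5}
vacuous = {frozenset({"a", "b"}): 1.0}
print(dissonance(bayesian), dissonance(vacuous))
```

When all focal elements are singletons, plausibility equals mass and the dissonance reduces to Shannon entropy; the vacuous mass function has no conflict and therefore zero dissonance.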
TL;DR: A new tree compression method is introduced in order to reduce complexity when a large set of samples has to be handled.
Abstract: In this paper, a new method of pattern recognition is presented, based on splitting images into a set of trees composed of fuzzy regions. First, either a gradient inverse function is applied to the raster image to define the supports of the fuzzy regions, or the basic grey-level image is used directly if the regions are easily separable topologically. Then, topological features are computed on these sets. A tree description of the image, consisting of fuzzy regions with associated topological features, is thus obtained. A set of sample trees is obtained by applying the fuzzy segmentation algorithm to small images of characteristic objects. Then a tree isomorphism is defined to recognize a particular object in an image. Finally, a new tree compression method is introduced in order to reduce complexity when a large set of samples has to be handled.
TL;DR: A new approach for the intelligent analysis of longitudinal data coming from diabetic patients' home monitoring is presented, exploiting temporal abstractions to pre-process the raw data and to obtain a new time series of abstract episodes, whose features are then interpreted through statistical and probabilistic techniques.
Abstract: In this paper we present a new approach for the intelligent analysis of longitudinal data coming from diabetic patients' home monitoring. This approach consists of exploiting temporal abstractions to pre-process the raw data and to obtain a new time series of abstract episodes, whose features are then interpreted through statistical and probabilistic techniques. We finally show the application of this methodology to the data of two diabetic patients monitored for six months.
TL;DR: Building correctly-sized models is a central challenge for induction algorithms, and under a broad range of circumstances, these approaches exhibit a nearly linear relationship between training set size and tree size, even after accuracy has ceased to increase.
Abstract: Building correctly-sized models is a central challenge for induction algorithms. Many approaches to decision tree induction fail this challenge. Under a broad range of circumstances, these approaches exhibit a nearly linear relationship between training set size and tree size, even after accuracy has ceased to increase. These algorithms fail to adjust for the statistical effects of comparing multiple subtrees. Adjusting for these effects produces trees with little or no excess structure.
TL;DR: A comparison of the results with those reported in the literature shows that the proposed technique outperforms the standard c-means FC technique in terms of the number of misclassifications.
Abstract: It has been observed that previous Genetic Algorithm (GA) based Fuzzy Clustering (FC) works develop only some of the parameters of an FC system. Here, a new approach is proposed that uses a GA to develop the membership functions of the clusters directly. This new technique is implemented and tested on common test data. A comparison of the results with those reported in the literature shows that the proposed technique outperforms the standard c-means FC technique in terms of the number of misclassifications.
TL;DR: A framework is presented that systematically describes problems that involve construction of decision trees or rules, optimising accuracy as well as measurement and misclassification costs, and it is shown how this framework can be used to configure greedy algorithms for constructing such trees or rules.
Abstract: This paper defines a class of problems involving combinations of induction and (cost) optimisation. A framework is presented that systematically describes problems that involve construction of decision trees or rules, optimising accuracy as well as measurement- and misclassification costs. It does not present any new algorithms but shows how this framework can be used to configure greedy algorithms for constructing such trees or rules. The framework covers a number of existing algorithms. Moreover, the framework can also be used to define algorithm configurations with new functionalities, as expressed in their evaluation functions.
TL;DR: Random Set Theory is used to build possibilistic uncertainty models from sampled data and Goodman's one-point coverage function of a class of random sets is estimated from data.
Abstract: When only incomplete information about the probability distribution of an experiment is available, we may have to admit imprecision in the formulation of an uncertainty model. In this paper Random Set Theory is used to build possibilistic uncertainty models from sampled data. In particular, Goodman's one-point coverage function of a class of random sets is estimated from data. Finally, we focus on an example to illustrate how possibility distributions induced from random sets may be used in the detection of changes in time-series data.
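The one-point coverage function of a random set S assigns to each point x the probability that x belongs to S. A minimal frequency-based estimate from observed set-valued samples can be sketched as follows; how the paper constructs its random sets from data is not specified here, so the sample sets and the function name are hypothetical.

```python
def one_point_coverage(random_sets, x):
    """Fraction of observed sets that contain x: an estimate of P(x in S)."""
    return sum(x in s for s in random_sets) / len(random_sets)


# Hypothetical realisations of a random set over the points 1..3.
observed = [{1, 2}, {2, 3}, {2}]
coverage = {x: one_point_coverage(observed, x) for x in (1, 2, 3)}
print(coverage)
```

In this toy sample the point 2 lies in every observed set and so receives coverage 1, which lets the estimated coverage values be read as a normalised possibility distribution.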