TL;DR: CMM, a meta-learner that seeks to retain most of the accuracy gains of multiple model approaches, while still producing a single comprehensible model, is proposed and evaluated.
Abstract: If it is to qualify as knowledge, a learner's output should be accurate, stable and comprehensible. Learning multiple models can improve significantly on the accuracy and stability of single models, but at the cost of losing their comprehensibility (when they possess it, as do, for example, simple decision trees and rule sets). This article proposes and evaluates CMM, a meta-learner that seeks to retain most of the accuracy gains of multiple model approaches, while still producing a single comprehensible model. CMM is based on reapplying the base learner to recover the frontiers implicit in the multiple model ensemble. This is done by giving the base learner a new training set, composed of a large number of examples generated and classified according to the ensemble, plus the original examples. CMM is evaluated using C4.5RULES as the base learner, and bagging as the multiple-model methodology. On 26 benchmark datasets, CMM retains on average 60% of the accuracy gains obtained by bagging relative to a single run of C4.5RULES, while producing a rule set whose complexity is typically a small multiple (2--6) of C4.5RULES's, and also improving stability. Further studies show that accuracy and complexity can be traded off by varying the number of artificial examples generated.
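The procedure described in this abstract (bag the base learner, label generated examples with the ensemble, then reapply the base learner to generated plus original examples) can be sketched as follows. This is only an illustration under stated assumptions: a one-feature threshold rule stands in for C4.5RULES, artificial examples are drawn uniformly within each feature's observed range (the abstract does not say how they are generated), and all function and parameter names are hypothetical.

```python
import random
from collections import Counter

def train_stump(examples):
    """Stand-in base learner: the one-feature threshold rule with fewest training errors."""
    best_rule, best_errors = None, None
    for feature in range(len(examples[0][0])):
        for x, _ in examples:
            threshold = x[feature]
            left = Counter(y for xs, y in examples if xs[feature] <= threshold)
            right = Counter(y for xs, y in examples if xs[feature] > threshold)
            left_label = left.most_common(1)[0][0]
            right_label = right.most_common(1)[0][0] if right else left_label
            errors = len(examples) - left[left_label] - right[right_label]
            if best_errors is None or errors < best_errors:
                best_rule = (feature, threshold, left_label, right_label)
                best_errors = errors
    return best_rule

def predict(rule, x):
    feature, threshold, left_label, right_label = rule
    return left_label if x[feature] <= threshold else right_label

def bagging(examples, n_models, rng):
    """Train each model on a bootstrap sample (drawn with replacement)."""
    return [train_stump([rng.choice(examples) for _ in examples])
            for _ in range(n_models)]

def ensemble_predict(models, x):
    """Unweighted majority vote of the bagged models."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]

def cmm(examples, n_models=11, n_artificial=200, seed=0):
    rng = random.Random(seed)
    models = bagging(examples, n_models, rng)
    # Assumption: artificial points are uniform within each feature's observed range.
    n_features = len(examples[0][0])
    lows = [min(x[f] for x, _ in examples) for f in range(n_features)]
    highs = [max(x[f] for x, _ in examples) for f in range(n_features)]
    artificial = []
    for _ in range(n_artificial):
        x = tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        artificial.append((x, ensemble_predict(models, x)))
    # Reapply the base learner to artificial + original examples: one model comes out.
    return train_stump(artificial + examples)

# Toy data: one numeric feature, class 1 when the value is at least 5.
data = [((float(i),), int(i >= 5)) for i in range(10)]
single_model = cmm(data)
```

Raising or lowering `n_artificial` is the knob the abstract refers to when it says accuracy and complexity can be traded off; with a threshold rule as base learner the complexity cannot grow, so that effect is not visible in this toy version.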
TL;DR: A new approach for the intelligent analysis of longitudinal data coming from chronic patients home monitoring is presented, which exploits temporal abstractions to pre-process the raw data and to obtain a new time series of abstract episodes, whose features are then interpreted through statistical and probabilistic techniques.
Abstract: In this article we present a new approach for the intelligent analysis of longitudinal data coming from chronic patients home monitoring. This approach exploits temporal abstractions to pre-process the raw data and to obtain a new time series of abstract episodes, whose features are then interpreted through statistical and probabilistic techniques. We describe in detail an application of the presented technique to the analysis of diabetic patients' data, showing some results obtained on a real case monitored for six months.
TL;DR: A new interactive interface is proposed to help the user interpret the result of a hierarchical clustering process according to the item characteristics; it has been applied successfully to a special case of items with natural graphical representations (electric load curves).
Abstract: Automatic clustering methods are part of data mining methods. They aim at building clusters of items so that similar items fall into the same cluster while dissimilar items fall into separate clusters. A particular class of clustering methods are hierarchical ones, where recursive clusters are formed to grow a binary tree representing an approximation of similarities between items. We propose a new interactive interface to help the user interpret the result of such a clustering process, according to the item characteristics. The prototype has been applied successfully to a special case of items providing nice graphical representations (electric load curves) but can also be used with other types of curves or with more standard items.
TL;DR: The research focuses on the importance of a suitable generalization hierarchy and representation for learning profiles which are predictively accurate and comprehensible, and on the promise of combining natural language processing techniques with machine learning (ML) to address an information retrieval (IR) problem.
Abstract: As more information becomes available electronically, tools for finding information of interest to users become increasingly important. The goal of the research described here is to build a system for generating comprehensible user profiles that accurately capture user interest with minimum user interaction. The research focuses on the importance of a suitable generalization hierarchy and representation for learning profiles which are predictively accurate and comprehensible. In our experiments we evaluated both traditional features based on weighted term vectors and subject features corresponding to categories which could be drawn from a thesaurus. Our experiments, conducted in the context of a content-based profiling system for on-line newspapers on the World Wide Web (the IDD News Browser), demonstrate the importance of a generalization hierarchy and the promise of combining natural language processing techniques with machine learning (ML) to address an information retrieval (IR) problem.
TL;DR: Estimators of predictive performance for voted decision trees induced from bootstrap (bagged) or adaptive (boosted) resampling are described, and it is shown that these estimates are usually quite accurate, with occasional weaker estimates.
Abstract: Decision tree induction is a prominent learning method, typically yielding quick results with competitive predictive performance. However, it is not unusual to find other automated learning methods that exceed the predictive performance of a decision tree on the same application. To achieve near-optimal classification results, resampling techniques can be employed to generate multiple decision-tree solutions. These decision trees are individually applied and their answers voted. The potential for exceptionally strong performance is counterbalanced by the substantial increase in computing time to induce many decision trees. We describe estimators of predictive performance for voted decision trees induced from bootstrap (bagged) or adaptive (boosted) resampling. The estimates are found by examining the performance of a single tree and its pruned subtrees over a single training set and a large test set. Using publicly available collections of data, we show that these estimates are usually quite accurate, with occasional weaker estimates. The great advantage of these estimates is that they reveal the predictive potential of voted decision trees prior to applying expensive computational procedures.
TL;DR: The paper shows the first results obtained, discusses difficulties found in the performed experiments and introduces an architecture based on qualitative models/causal relationships to ease the process of knowledge extraction from the historical data and the assessment of the extracted knowledge.
Abstract: This paper describes ongoing work on the application of machine learning techniques in the domain of water distribution networks. This research is performed in the framework of the European project WATERNET, whose aim is to develop a system to control and manage water distribution networks. WATERNET is composed of a supervision system, a distributed information management subsystem, an optimization subsystem, a water quality monitoring subsystem, and a simulation subsystem. In addition to these components, a machine learning subsystem is included to extract knowledge from historical data and improve the performance of the water management system. This paper is focused on the approach and methodology followed for the development of the machine learning subsystem. The raw material for this work is historical data from a Portuguese water distribution company that has 45 water stations, some of them with six years of data collected every five minutes. The paper also shows the first results obtained, discusses difficulties found in the performed experiments and introduces an architecture based on qualitative models/causal relationships to ease the process of knowledge extraction from the historical data and the assessment of the extracted knowledge.
TL;DR: It is argued, on theoretical grounds, that the accuracy of the system should be positively correlated to the product of the number of equivalence classes for all of the SNs; the network was applied to the classification of melodies presented as direct audio events (temporal sequences) played by a human and subject, therefore, to biological variations.
Abstract: In this paper, we investigate a form of modular neural network for classification with (a) pre-separated input vectors entering its specialist (expert) networks, (b) specialist networks which are self-organized (radial-basis function or self-targeted feedforward type) and (c) a single-layer net which fuses (or integrates) the specialists. When the modular architecture is applied to spatiotemporal sequences, the Specialist Nets are recurrent; specifically, we use the Input Recurrent type. The Specialist Networks (SNs) learn to divide their input space into a number of equivalence classes defined by self-organized clustering and learning using the statistical properties of the input domain. Once the specialists have settled in their training, the Fusion Network is trained by any supervised method to map to the semantic classes. We discuss the fact that this architecture and its training is quite distinct from the hierarchical mixture of experts (HME) type as well as from stacked generalization. Because the equivalence classes to which the SNs map the input vectors are determined by the natural clustering of the input data, the SNs learn rapidly and accurately. The fusion network also trains rapidly by reason of its simplicity. We argue, on theoretical grounds, that the accuracy of the system should be positively correlated to the product of the number of equivalence classes for all of the SNs. This network was applied, as an empirical test case, to the classification of melodies presented as direct audio events (temporal sequences) played by a human and subject, therefore, to biological variations. The audio input was divided into two modes: (a) frequency (or pitch) variation and (b) rhythm, both as functions of time. The results and observations show the technique to be very robust and support the theoretical deductions concerning accuracy.
TL;DR: This paper combines artificial neural network and genetic algorithm to mine classification rules from databases and generates rules of better performance than the decision tree approach and the number of extracted rules is fewer than that of C4.5.
Abstract: Classification, which involves finding rules that partition a given dataset into disjoint groups, is one class of data mining problems. Approaches proposed so far for mining classification rules from databases are mainly decision tree based on symbolic learning methods. In this paper, we combine artificial neural network and genetic algorithm to mine classification rules. Some experiments have demonstrated that our method generates rules of better performance than the decision tree approach and the number of extracted rules is fewer than that of C4.5.
TL;DR: The major topics of papers include data reduction, feature selection, ensembles of classifiers, natural language learning, text categorization, inductive logic programming, stochastic models, and reinforcement learning.
Abstract: We briefly review each paper of the Fourteenth International Conference on Machine Learning, along with some general observations on the conference as a whole. The major topics of papers include data reduction, feature selection, ensembles of classifiers, natural language learning, text categorization, inductive logic programming, stochastic models, and reinforcement learning.
TL;DR: Learning multiple models can improve significantly on the accuracy and stability of single models and create new models that are more accurate, stable and comprehensible.
Abstract: If it is to qualify as knowledge, a learner's output should be accurate, stable and comprehensible. Learning multiple models can improve significantly on the accuracy and stability of single models...
TL;DR: Results show that the new phase transition predictor is able to produce predictions as good as the state-of-the-art predictor in general, but does considerably better in sparsely constrained problems, particularly when the node degree variation in their constraint graphs is high.
Abstract: Constraint satisfaction is at the core of many applications, such as scheduling. The study of phase transition has benefited algorithm selection and algorithm development in constraint satisfaction. Recent research provides evidence that constraint graph topology affects where phase transitions occur in constraint satisfaction problems. In this article, a new phase transition predictor which takes constraint graph information into consideration is proposed. The new predictor allows variation in the tightness of individual constraints and node degree variation in the constraint graph. Experiments were conducted to study the usefulness of the new predictor on random binary constraint satisfaction problems. Results show that the new predictor is able to produce predictions as good as the state-of-the-art predictor in general, but does considerably better in sparsely constrained problems, particularly when the node degree variation in their constraint graphs is high.
TL;DR: A new approach for fast estimation of the position of the modal concentration, and consequently, its value, drawn from a unimodally distributed and sorted set of data, based on the formula for Inclusive Graphic Skewness is proposed.
Abstract: This study proposes a new approach for fast estimation of the position of the modal concentration, and consequently, its value, drawn from a unimodally distributed and sorted set of data. In terms of only five specific percentile values, the proposed procedure estimates the position of the modal concentration of unimodal curves with a relatively high accuracy. These five percentile values (in particular, 5th, 16th, 50th, 84th, and 95th) can be easily accessed from any given sorted distribution, regardless of the number of records in the database. The whole approach is based on the formula for Inclusive Graphic Skewness. Besides being one of the essential statistical parameters, the mode and the information about its location may be useful in other domains, such as searching non-uniform distributions, qualitative description of data, etc.
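As an illustration of the quantity this approach builds on, the sketch below computes Inclusive Graphic Skewness from the same five percentiles of a sorted list, using the standard Folk and Ward form. The abstract does not give the step that turns the skewness into a mode position, so that step is not attempted here; the helper names and the linear-interpolation percentile rule are assumptions.

```python
def percentile(sorted_values, p):
    """Percentile (p in 0..100) of an already sorted list, by linear interpolation."""
    if not sorted_values:
        raise ValueError("empty data")
    pos = (len(sorted_values) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def inclusive_graphic_skewness(sorted_values):
    """Sk_I = (P16 + P84 - 2*P50) / (2*(P84 - P16)) + (P5 + P95 - 2*P50) / (2*(P95 - P5))."""
    p5, p16, p50, p84, p95 = (percentile(sorted_values, p)
                              for p in (5, 16, 50, 84, 95))
    # Raises ZeroDivisionError if the data are constant over the central range.
    return ((p16 + p84 - 2 * p50) / (2 * (p84 - p16))
            + (p5 + p95 - 2 * p50) / (2 * (p95 - p5)))

symmetric = list(range(101))                 # evenly spread values: no skew
right_skewed = [x * x for x in range(101)]   # long upper tail: positive skew

skew_symmetric = inclusive_graphic_skewness(symmetric)
skew_right = inclusive_graphic_skewness(right_skewed)
```

Only five positional lookups are needed once the data are sorted, which matches the abstract's point that the cost does not depend on the number of records.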
TL;DR: The main problem considered in this paper consists of binarizing categorical (nominal) attributes having a very large number of values (204 in the authors' application), by grouping the L values of a categorical attribute by means of a hierarchical clustering method, which defines a new categorical attribute with only J values.
Abstract: The main problem considered in this paper consists of binarizing categorical (nominal) attributes having a very large number of values (204 in our application). A small number of relevant binary attributes are gathered from each initial attribute. Let us suppose that we want to binarize a categorical attribute v with L values, where L is large or very large. The total number of binary attributes that can be extracted from v is $2^{L-1}-1$, which in the case of a large L is prohibitive. Our idea is to select only those binary attributes that are predictive; these shall constitute a small fraction of all possible binary attributes. In order to do this, the key idea consists in grouping the L values of a categorical attribute by means of a hierarchical clustering method. To do so, we need to define a similarity between values, which is associated with their predictive power. By clustering the L values into a small number of clusters J, we define a new categorical attribute with only J values. The hierarchical clustering method that we use, AVL, makes it possible to choose a significant value for J. Now, we could consider using all the $2^{J-1}-1$ binary attributes associated with this new categorical attribute. Nevertheless, the J values are tree-structured, because we have used a hierarchical clustering method. We profit from this, and consider only about 2×J binary attributes. If L is extremely large, for complexity and statistical reasons, we might not be able to apply a clustering algorithm directly. In this case, we start by "factorizing" v into a pair $(\bar{v}^{1},\bar{v}^{2})$, each one with about $\sqrt{L}$ values. For a simple example, consider an attribute v with only four values m1, m2, m3, m4. Obviously, in this example, there is no need to factorize the set of values of v, because it has a very small number of values. Nevertheless, for illustration purposes, v could be decomposed (factorized) into 2 attributes with only two values each; the correspondence between the values of v and $(\bar{v}^{1},\bar{v}^{2})$ would be [...] J1 (resp. J2) is the number of values of $\bar{v}^{1}$ (resp. $\bar{v}^{2}$). Now, we apply a final clustering to the values of v10, and proceed as above. The solution that we propose is independent of the number of classes and can be applied to various situations. The application of ARCADE to the protein secondary structure prediction problem proves the validity of our approach.
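To make the counting argument concrete: a binary attribute derived from a categorical attribute is a split of its values into two non-empty groups, and a set of L values has 2^(L-1) - 1 such splits. The sketch below enumerates them for the four-value example and evaluates the count for L = 204. It only illustrates the combinatorics; the function names are hypothetical and nothing here reproduces the clustering method of the paper.

```python
from itertools import combinations

def count_binary_attributes(num_values):
    """Number of ways to split num_values values into two non-empty groups."""
    return 2 ** (num_values - 1) - 1

def enumerate_binary_attributes(values):
    """List every split of `values` into two non-empty groups, each split once."""
    values = list(values)
    first, rest = values[0], values[1:]
    splits = []
    # Keep the first value on the left so that (A, B) and (B, A) are not both listed.
    # The left side takes 0 .. len(rest)-1 further values, so the right is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            splits.append((left, right))
    return splits

four_value_splits = enumerate_binary_attributes(["m1", "m2", "m3", "m4"])
count_for_204 = count_binary_attributes(204)
```

For four values the enumeration is short, whereas `count_for_204` is a 62-digit number, which is why the abstract calls exhaustive binarization prohibitive and falls back on the roughly 2×J splits read off a cluster tree.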
TL;DR: It is demonstrated that by a careful design of the fitness function the global landscape becomes smoother and its correlation increases, which facilitates the search.
Abstract: This article proposes a study of inductive Genetic Programming with Decision Trees (GPDT). The theoretical underpinning is an approach to the development of fitness functions for improving the search guidance. The approach relies on analysis of the global fitness landscape structure with a statistical correlation measure. The basic idea is that the fitness landscape could be made informative enough to enable efficient search navigation. We demonstrate that by a careful design of the fitness function the global landscape becomes smoother and its correlation increases, which facilitates the search. Another claim is that the fitness function has not only to mitigate navigation difficulties, but also to guarantee maintenance of decision trees with low syntactic complexity and high predictive accuracy.
TL;DR: The size of machine-readable data sets has increased and computational methods and tools are being developed that enhance traditional statistical analysis, which has created a new range of problems and challenges for analysts, as well as new opportunities for intelligent systems in data analysis.
Abstract: Two factors have affected the work of modern data analysts more than any others. First, the size of machine-readable data sets has increased, especially during the last decade or so. Second, computational methods and tools are being developed that enhance traditional statistical analysis. These two developments have created a new range of problems and challenges for analysts, as well as new opportunities for intelligent systems in data analysis. To provide an international forum for the discussion of these topics, a series of symposia on Intelligent Data Analysis was initiated in 1995 [4]. The second Intelligent Data Analysis conference (IDA-97) was held at Birkbeck College, University of London, 4th-6th August 1997. Almost 130 people from twenty countries in four continents took part in the symposium. A total of 107 papers were submitted to the IDA-97 conference, of which 50 were accepted as either oral or poster presentations. After the conference, five papers were chosen from the conference program and their authors were invited to prepare extended versions for publication in the Intelligent Data Analysis (IDA) Journal. A second round of review provided additional feedback to the authors and the papers are now presented in this special issue.
TL;DR: A new approach to extracting plausible rules is introduced: the characterization of decision attributes given classes is extracted from databases, the classes are classified into several groups with respect to the characterization, and two kinds of sub-rules are induced and integrated into one rule for each decision attribute.
Abstract: One of the most important problems with rule induction methods is that they cannot extract rules that plausibly represent experts' decision processes. On one hand, rule induction methods induce probabilistic rules, the description length of which is too short compared with the experts' rules. On the other hand, construction of Bayesian networks generates rules that are too lengthy. In this paper, the characteristics of experts' rules are closely examined and a new approach to extract plausible rules is introduced, which consists of the following three procedures. First, the characterization of decision attributes given classes is extracted from databases and the classes are classified into several groups with respect to the characterization. Then, two kinds of sub-rules, characterization rules for each group and discrimination rules for each class in the group, are induced. Finally, those two parts are integrated into one rule for each decision attribute. The proposed method was evaluated on medical databases, the experimental results of which show that induced rules correctly represent experts' decision processes.
TL;DR: A fast and simple algorithm for approximately calculating the principal components (PCs) of a dataset, and so reducing its dimensionality, is described; it shows a fast convergence rate compared with other methods and robustness to the reordering of the samples.
Abstract: A fast and simple algorithm for approximately calculating the principal components (PCs) of a dataset and so reducing its dimensionality is described. This Simple Principal Components Analysis (SPCA) method was used for dimensionality reduction of two high-dimensional image databases, one of handwritten digits and one of handwritten Japanese characters. It was tested and compared with other techniques. On both databases SPCA shows a fast convergence rate compared with other methods and robustness to the reordering of the samples.
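The abstract does not spell out the SPCA update rule, so the sketch below does not attempt to reproduce it. It shows a different, textbook technique (batch power iteration on mean-centered data) purely to illustrate what iteratively approximating the first principal component means; every name in it is hypothetical.

```python
import math

def first_principal_component(samples, iterations=200):
    """Approximate the leading principal direction by batch power iteration."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    # Start from a fixed unit vector; a start exactly orthogonal to the
    # leading direction would not converge to it.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # w = sum_i (x_i . v) x_i, which is proportional to (covariance matrix) @ v.
        w = [0.0] * d
        for x in centered:
            projection = sum(a * b for a, b in zip(x, v))
            for j in range(d):
                w[j] += projection * x[j]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # all samples identical, or v orthogonal to the data
        v = [c / norm for c in w]
    return v

# Points lying exactly on the line y = 2x: the principal direction is (1, 2) normalized.
line_points = [(float(t), 2.0 * t) for t in range(-5, 6)]
direction = first_principal_component(line_points)
```

Because each pass sums over all samples before updating, this batch form gives the same result for any sample order; further components would need a deflation step that is omitted here.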
TL;DR: This work analyzes the questions that, in the authors' experience, end users of machine learning tend to ask of the structures inferred from their empirical data, and creates new graphical representations that show the flow of examples through a decision structure.
Abstract: Researchers in machine learning primarily use decision trees, production rules, and decision graphs for visualizing classification data, with the graphic form in which a structure is portrayed having a strong influence on comprehensibility. We analyze the questions that, in our experience, end users of machine learning tend to ask of the structures inferred from their empirical data. By mapping these questions onto visualization tasks, we have created new graphical representations that show the flow of examples through a decision structure. These knowledge visualization techniques are particularly appropriate in helping to answer the questions that users typically ask, and we describe their use in discovering new properties of a data set. In the case of decision trees, an automated software tool has been developed to construct the visualizations.
TL;DR: A method to learn Bayesian Networks from incomplete databases which allows the analyst to efficiently integrate information provided by the observed data and exogenous knowledge about the pattern of missing data is described.
Abstract: Current methods to learn Bayesian Networks from incomplete databases share the common assumption that the unreported data are missing at random. This paper describes a method, called Bound and Collapse (BC), to learn Bayesian Networks from incomplete databases which allows the analyst to efficiently integrate information provided by the observed data and exogenous knowledge about the pattern of missing data. BC starts by bounding the set of estimates consistent with the available information and then collapses the resulting set to a point estimate via a convex combination of the extreme points, with weights depending on the assumed pattern of missing data. Experiments comparing BC to Gibbs Sampling are provided.
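The two steps named in this abstract, bounding and then collapsing, can be illustrated on the simplest possible case: estimating the probability that a single binary variable is true when some entries are unreported. The sketch below is a simplification under stated assumptions: it uses plain relative frequencies rather than whatever estimator the paper uses inside a Bayesian network, and the function and parameter names are hypothetical.

```python
def bound_and_collapse(n_true, n_false, n_missing, p_missing_is_true):
    """Bound the estimate of P(X = true), then collapse the interval to a point.

    p_missing_is_true encodes the assumed pattern of missing data: the
    probability that an unreported entry would have been true.
    """
    total = n_true + n_false + n_missing
    if total == 0:
        raise ValueError("no cases")
    if not 0.0 <= p_missing_is_true <= 1.0:
        raise ValueError("p_missing_is_true must lie in [0, 1]")
    # Bound: extreme completions of the data set.
    lower = n_true / total                  # every missing entry is false
    upper = (n_true + n_missing) / total    # every missing entry is true
    # Collapse: convex combination of the two extreme points.
    estimate = p_missing_is_true * upper + (1.0 - p_missing_is_true) * lower
    return lower, upper, estimate

# 30 true, 50 false, 20 unreported; missing entries assumed equally likely either way.
lower, upper, estimate = bound_and_collapse(30, 50, 20, 0.5)

# Setting the weight to the observed frequency 30/80 mimics "missing at random".
_, _, mar_estimate = bound_and_collapse(30, 50, 20, 30 / 80)
```

With the weight set to the observed frequency, the collapsed estimate coincides with the estimate from the complete cases alone, while other weights let the analyst express a different belief about why data are missing.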
TL;DR: A key to intelligent data analysis is the ability to recognise what is important in a problem --what counts and what doesn't count.
Abstract: Data analysis is an interdisciplinary science. Traditionally its development has been driven by the areas of application, but nowadays its development is also stimulated by the ever-changing possibilities promised by progress in computer technology. Huge data sets and non-numerical data, such as text data, image data, and metadata, present both challenges and opportunities for modern data analysts. These in turn lead to new types of problems and require the development of new types of models. Intelligent data analysis also requires that one take proper advantage of the largely complementary abilities of humans and computers. Interactive graphics, an important tool for modern intelligent data analysis, nicely illustrates this: the production of such graphics, and the ability to manipulate them in real time, requires advanced computational facilities; but the ability to interpret them requires the capacity to synthesise possessed only by the human eye and mind. Intelligent data analysis also requires one to have a proper strategy for analysis. Analysis without strategy is surely one of the hallmarks of unintelligent data analysis. Likewise, a key to intelligent data analysis is the ability to recognise what is important in a problem --what counts and what doesn't count.