TL;DR: The assumption that the class imbalance problem does not only affect decision tree systems but also affects other classification systems such as Neural Networks and Support Vector Machines is investigated.
Abstract: In machine learning problems, differences in prior class probabilities -- or class imbalances -- have been reported to hinder the performance of some standard classifiers, such as decision trees. This paper presents a systematic study aimed at answering three different questions. First, we attempt to understand the nature of the class imbalance problem by establishing a relationship between concept complexity, size of the training set and class imbalance level. Second, we discuss several basic re-sampling or cost-modifying methods previously proposed to deal with the class imbalance problem and compare their effectiveness. The results obtained by such methods on artificial domains are linked to results in real-world domains. Finally, we investigate the assumption that the class imbalance problem does not only affect decision tree systems but also affects other classification systems such as Neural Networks and Support Vector Machines.
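The basic re-sampling methods mentioned in this abstract can be illustrated with a minimal sketch. The function names `random_oversample` and `random_undersample` are hypothetical and are not taken from the paper; they show plain random over- and under-sampling only, under the assumption that balancing every class to the same size is the goal.

```python
import random


def _group_by_class(samples, labels):
    """Collect the samples of each class in insertion order."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen examples until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """Drop randomly chosen examples until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        out_x.extend(rng.sample(xs, target))
        out_y.extend([y] * target)
    return out_x, out_y
```

Oversampling keeps all original examples but repeats minority ones, while undersampling discards majority examples; which trade-off works better is the kind of question the study compares empirically.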
TL;DR: ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions approximating the Pareto front in a multi-dimensional objective space, is used for clustering-based feature selection and results in models with better and clearer semantic relevance.
Abstract: Feature subset selection is important not only for the insight gained from determining relevant modeling variables but also for the improved understandability, scalability, and possibly, accuracy of the resulting models. Feature selection has traditionally been studied in supervised learning situations, with some estimate of accuracy used to evaluate candidate subsets. However, we often cannot apply supervised learning for lack of a training signal. For these cases, we propose a new feature selection approach based on clustering. A number of heuristic criteria can be used to estimate the quality of clusters built from a given feature subset. Rather than combining such criteria, we use ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions that approximate the Pareto front in a multi-dimensional objective space. Each evolved solution represents a feature subset and a number of clusters; two representative clustering algorithms, K-means and EM, are applied to form the given number of clusters based on the selected features. Experimental results on both real and synthetic data show that the method can consistently find approximate Pareto-optimal solutions through which we can identify the significant features and an appropriate number of clusters. This results in models with better and clearer semantic relevance.
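The idea of keeping several criteria separate instead of combining them can be illustrated with a Pareto-dominance filter. This is only a sketch of the non-dominated set, not of ELSA itself; the names `dominates` and `pareto_front`, and the convention that higher scores are better, are assumptions made for the example.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """candidates maps a solution name to a tuple of objective scores (higher is better)."""
    return {
        name
        for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    }


# Hypothetical feature subsets scored by (cluster quality, negated number of features).
subsets = {"A": (0.9, -5), "B": (0.8, -2), "C": (0.7, -3), "D": (0.9, -6)}
front = pareto_front(subsets)  # A and B are not dominated by any other subset
```

Every solution in the returned set represents a different trade-off, which is why a population-based search that approximates the whole front can expose several candidate feature subsets at once.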
TL;DR: This paper defines statistical tests, analyzes the statistical foundation underlying the approach, designs a fast algorithm to detect spatial outliers, and provides cost models for outlier detection procedures.
Abstract: Identification of outliers can lead to the discovery of unexpected and interesting knowledge. Existing methods are designed for detecting spatial outliers in multidimensional geometric data sets, where a distance metric is available. In this paper, we focus on detecting spatial outliers in graph structured data sets. We define statistical tests, analyze the statistical foundation underlying our approach, design a fast algorithm to detect spatial outliers, and provide cost models for outlier detection procedures. In addition, we provide experimental results from the application of our algorithm on a Minneapolis-St. Paul (Twin Cities) traffic data set to show its effectiveness and usefulness.
TL;DR: An alternative methodology for Correspondence Analysis is presented, which uses Spearman's rho to detect underlying associations and trends and which includes an example of "grade" concepts in an appendix.
Abstract: An alternative methodology for Correspondence Analysis is presented. This approach, called Grade Correspondence Analysis (GCA), utilizes Spearman's rho to detect underlying associations and trends. Two examples are presented using: (1) a contingency table (Heuer's suicide data) with cause of death, gender, and age; and (2) a survey questionnaire (data matrix) concerning employment, personal economics, computer skills, and disability level of handicapped computer specialists in Poland. GCA uses a search strategy (multi-starts / random starts) to detect trends (not forced to be orthogonal) among rows and columns. (A similar strategy permits the determination of significance levels.) Results are discussed using measures of the "representativeness" of the trends, as well as measures of their "regularity". Visualization of trends (as well as outlier trend detection) is via the concept of "overrepresentation" maps. Survey data may be measured on any non-negative scale. Meaningful disjoint aggregation (or division) of sub-populations and variables is possible. This paper is written for the practitioner and includes an example of "grade" concepts in an appendix. There is also, however, an appendix with GCA theory relating to: grade distributions; local maxima of Spearman's rho and their representativeness, regularity and regions of attraction; total positivity of order 2 (TP2); similarity measures; suitable "random references" for the determination of significance levels; and the application of GCA to non-negative data matrices.
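For readers unfamiliar with Spearman's rho, the sketch below computes the ordinary sample version for two paired lists: the Pearson correlation of their ranks, with tied values sharing an average rank. It is not the table-based grade correlation that GCA maximises, and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank vectors (assumes neither list is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks enter the computation, any monotone increasing relationship gives a value of 1 and any monotone decreasing one gives -1, which is what makes the measure suitable for detecting trends rather than strictly linear association.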
TL;DR: This work presents a computational framework for modelling this type of data, and reports experimental results of applying this framework to the analysis of gene expression data in the virology domain, producing promising results.
Abstract: Short, high-dimensional, Multivariate Time Series (MTS) data are common in many fields such as medicine, finance and science, and any advance in modelling this kind of data would be beneficial. Nowhere is this truer than functional genomics where effective ways of analysing gene expression data are urgently needed. Progress in this area could help obtain a “global” view of biological processes, and ultimately lead to a great improvement in the quality of human life. We present a computational framework for modelling this type of data, and report experimental results of applying this framework to the analysis of gene expression data in the virology domain. The framework contains a three-step modelling strategy: correlation search, variable grouping, and short MTS modelling. Novel research is involved in each step which has been individually tested on different real-world datasets in engineering and medicine. This is the first attempt to integrate all these components into a coherent computational framework, and test the framework on a very challenging application area, producing promising results.
TL;DR: The aim of this work is to show the application of a Bayesian algorithm (K2) in data mining problems as a data preparation and classification tool.
Abstract: Dealing with missing values is an important task in data mining. There are many ways to work with this kind of data, but the literature does not identify a single best one for all kinds of data sets. The aim of this work is to show the application of a Bayesian algorithm (K2) in data mining problems as a data preparation and classification tool. In this paper, the algorithm generates a Bayesian network which is used to substitute the missing values. This is done by predicting the most probable instance for the features in each object of the database. The prediction uses a heuristic Bayesian conditioning algorithm, generating a preprocessed sample. Classification is then performed on this preprocessed sample. The results of the classification with and without the data preparation are analyzed.
TL;DR: The compression ability of the PCA-based method is proved via a reconstitution process of the original SAR images from a small number of new images with a minimal loss of information.
Abstract: A new PCA-based method for an optimal representation of multi-frequency polarimetric SAR images is proposed. The method performs the simultaneous diagonalization of the signal and multiplicative noise covariance matrices via one orthogonal matrix. The covariance matrix of the multiplicative noise becomes an identity matrix, which implies that the variance of the noise in each new image is unity, and is uncorrelated between transformed images. The covariance matrix of the SAR images is transformed to a diagonal matrix whose diagonal elements are ordered in decreasing value, which means that the new images are uncorrelated and will be ordered by their variances (qualities). The theoretical analysis and the implementation procedure of the method are given. The method has been applied on real SAR images. The compression ability of the method is proved via a reconstitution process of the original SAR images from a small number of new images with a minimal loss of information.
TL;DR: The proposed methodology improves upon existing HMM clustering methods in two ways: an explicit HMM model size selection procedure is incorporated into the clustering process, and a partition selection method is developed to ensure an objective, data-driven selection of the number of clusters in the partition.
Abstract: This paper discusses a temporal data clustering system that is based on the Hidden Markov Model (HMM) methodology. The proposed methodology improves upon existing HMM clustering methods in two ways. First, an explicit HMM model size selection procedure is incorporated into the clustering process, i.e., the sizes of the individual HMMs are dynamically determined for each cluster. This improves the interpretability of cluster models and the quality of the final clustering partition results. Second, a partition selection method is developed to ensure an objective, data-driven selection of the number of clusters in the partition. The result is a heuristic sequential search control algorithm that is computationally feasible. Experiments with artificially generated data and real world ecology data show that: (i) the HMM model size selection algorithm is effective in re-discovering the structure of the generating HMMs, (ii) HMM clustering with model size selection significantly outperforms HMM clustering using uniform HMM model sizes for re-discovering clustering partition structures, and (iii) the method is able to produce interpretable and "interesting" models for real world data.
TL;DR: An effective, adaptive and robust boosting algorithm, DMBoost, is developed by optimising a bound on the generalisation error of ensembled classifiers in terms of the 2-norm of the margin slack vector.
Abstract: This paper introduces a strategy for training ensemble classifiers by analysing boosting within margin theory. We present a bound on the generalisation error of ensembled classifiers in terms of the 2-norm of the margin slack vector. We develop an effective, adaptive and robust boosting algorithm, DMBoost, by optimising this bound. The soft margin based quadratic loss function is insensitive to points having a large margin. The algorithm improves the generalisation performance of a system by ignoring the examples having small or negative margin.
We evaluate the efficacy of the proposed method by applying it to a text categorization task. Experimental results show that DMBoost performs significantly better than AdaBoost, hence validating the effectiveness of the method. Furthermore, experimental results on UCI data sets demonstrate that DMBoost generally outperforms AdaBoost.
TL;DR: This paper attempts to design an effective maintenance algorithm for sequential patterns as records are deleted, and utilizes previously discovered large sequences in the maintenance process, thus reducing the number of database rescans.
Abstract: Mining sequential patterns from temporal transaction databases attempts to find customer behavior models and to assist managers in making correct and effective decisions. The sequential patterns discovered may, however, become invalid or inappropriate when databases are updated. Conventional approaches may re-mine entire databases to get correct sequential patterns for maintenance. However, when a database is massive in size, this will require considerable computation time. In the past, Lin and Lee proposed an incremental mining algorithm for maintenance of sequential patterns as new records were inserted. In addition to record insertion, record deletion is also commonly seen in real-world applications. Processing record deletion is, however, different from processing record insertion. The former can even be thought of as the contrary of the latter. In this paper, we thus attempt to design an effective maintenance algorithm for sequential patterns as records are deleted. Our proposed algorithm utilizes previously discovered large sequences in the maintenance process, thus reducing the number of database rescans. In addition, the need for rescanning depends on the decrease in the number of customers, which is usually zero when the number of deleted records is not large. This characteristic is especially useful for dynamic database mining.
TL;DR: This work uses a clustering method that computes bi-partitions and an efficient association rule mining technique to describe the membership of examples within each cluster and proposes a technique for removing rules that are not relevant enough for the cluster characterization.
Abstract: We combine different recent data mining techniques to improve the symbolic description of unsupervised clusters. First, we use a clustering method that computes bi-partitions (a partition of examples and a related partition of attribute-value pairs). Then, we use an efficient association rule mining technique to describe the membership of examples within each cluster. We propose a technique for removing rules that are not relevant enough for the cluster characterization. An experimental validation on a real world medical data set is provided. Keywords. Conceptual clustering, association rule, characterization of clusters.
TL;DR: The experimental results show that the proposed models outperform a conventional neural network model in interest rate forecasting, and the study examines three different models based on various change-point detection methods.
Abstract: This study proposes a piecewise nonlinear model based on the segmentation of financial time series. The basic concept of the proposed model is to obtain intervals divided by change points, to identify them as change-point groups, and to use them in the forecasting model. The proposed model consists of two stages. The first stage detects successive change points in the time series dataset and forecasts change-point groups with backpropagation neural networks (BPNs). In this stage, the following three change-point detection methods are applied and compared: the parametric method, the nonparametric approach, and the model-based approach. The next stage forecasts the final output with a BPN using the groups. This study applies the proposed model to interest rate forecasting and examines three different models based on various change-point detection methods. The experimental results show that the proposed models outperform a conventional neural network model.
TL;DR: A graphical method for evaluating the quality of a feature extraction mapping based on the Bilipschitz criterion, which can be used to evaluate dimension reducing mappings for relative quality and to estimate the injectivity of the reduction map (as well as the associated reconstruction map).
Abstract: We present a graphical method for evaluating the quality of a feature extraction mapping. Based on the Bilipschitz criterion, this Bilipschitz Criterion Plot (BCP) can be used to evaluate dimension reducing mappings for relative quality and to estimate the injectivity of the reduction map (as well as the associated reconstruction map). It can also be used to survey regions where the map is locally an expansion or contraction map. The plot is easy and fast to construct, and gives much more insight than any single value can, such as the distance preservation error. We demonstrate the value of such a mapping when examining the quality of the Sammon map, Neuroscale, the autoassociative map, and a recent technique that is designed to optimize the BCP in a linear fashion, the adaptive secant basis algorithm.
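A map f is bi-Lipschitz when constants 0 < a <= b exist such that a*|x - y| <= |f(x) - f(y)| <= b*|x - y| for all pairs of points. The sketch below computes the pairwise distance ratios from which such bounds can be read off empirically. How the authors arrange these ratios into the BCP is not stated in the abstract, so this shows only the underlying quantity; `distance_ratios` is a hypothetical name.

```python
import math
from itertools import combinations


def distance_ratios(points, mapping):
    """Return |f(x) - f(y)| / |x - y| for every pair of distinct input points."""
    ratios = []
    for x, y in combinations(points, 2):
        d_in = math.dist(x, y)
        if d_in == 0:
            continue  # identical inputs carry no information about the map
        ratios.append(math.dist(mapping(x), mapping(y)) / d_in)
    return ratios


# Hypothetical dimension-reducing map: keep only the first coordinate.
def project(p):
    return (p[0],)


square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ratios = distance_ratios(square, project)
lower, upper = min(ratios), max(ratios)
```

A smallest ratio of zero means two distinct inputs collapse to the same output, so the map is not injective on the sample; ratios below 1 mark pairs where the map contracts and ratios above 1 mark pairs where it expands.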
TL;DR: This paper presents a method of searching for templates using probabilistic neural networks that has so far been applied to a data set of 2408 UK construction companies and can explain how badly a company performs and what the problem is if its financial situation is not sound.
Abstract: Other than identifying whether a company may fail or not, explaining why a company may fail is essential. The most common way of explaining is to use a template like the standards used in commercial society. Because of the existence of heteroscedasticity, it is impossible to expect that there is only one standard within an industry. For instance, it is unrealistic to use one standard to evaluate performance of both a new-born company and a fifty-year old company. This paper presents a method of searching for templates using probabilistic neural networks. Each template represents a number of companies, which have similar financial performance and therefore similar financial outcomes. A comparison between a company and a template can explain how badly a company performs and what the problem is if its financial situation is not sound. The method has so far been applied to a data set of 2408 UK construction companies.
TL;DR: The proposed procedure for designing a multilayer perceptron for predicting time series is intended to help the user in the task of specifying as simple models as possible, providing an unambiguous methodology to construct neural networks for time series forecasting.
Abstract: A procedure for designing a multilayer perceptron for predicting time series is proposed. It is based on the generation, according to a set of rules emerging from an ARIMA model previously fitted, of a set of nonlinear forecasting models. These rules are extracted from the set of non-zero coefficients in the ARIMA model, so they consider the autocorrelation structure of the time series. The proposed procedure is intended to help the user in the task of specifying as simple models as possible, providing an unambiguous methodology to construct neural networks for time series forecasting. The performance of this procedure is empirically studied by means of a comparative analysis involving time series from three domains. The first part of the experiment is very extensive and works over 33 time series from the Active Population Survey in Andalusia, Spain. The training of the multilayer perceptron is performed by three different learning rules, incorporating multiple repetitions, and the hidden layer size is determined by means of a grid search. The obtained results show a better performance of these neural network models, in comparison with pure classical statistical techniques, namely ARIMA models and exponential smoothing techniques. These results are confirmed over two more concise studies from the tourist and geodynamic domains, where we graphically illustrate the superiority of the constructed neural networks in long-term forecasting, in comparison with ARIMA models.
TL;DR: The notion of {\it weak fuzzy similarity relations}, a generalization of fuzzy similarity relations, is used to provide a more realistic description of relationships between elements for which the properties of symmetry and transitivity no longer hold.
Abstract: In 1982, Pawlak proposed the concept of {\it rough sets} with the practical purpose of representing the indiscernibility of elements. Although rough set theory built on equivalence relations has the advantage of being easy to analyze, it may not be a widely applicable model, because equivalence relations, with their properties of symmetry and transitivity, may not provide a realistic view of relationships between elements in the real world. Therefore a covering of the universe was introduced in order to represent a more realistic model. However, it is still unclear what kinds of relations may be used in defining the coverings.
In this paper, the notion of {\it weak fuzzy similarity relations}, a generalization of fuzzy similarity relations, is used to provide a more realistic description of relationships between elements for which the properties of symmetry and transitivity no longer hold. A special type (concrete example) of weak fuzzy similarity relation, called the conditional probability relation, is discussed. A generalized concept of rough set approximations is proposed based on α-coverings of the universe induced by conditional probability relations. Rough membership functions are also redefined in terms of three values: minimum, maximum and average. Their properties are also examined. In addition, by extending the concept of α-coverings of the universe, some properties and applications related to {\it Knowledge Discovery and Data Mining} (KDD) are provided. First, an application of α-redundancy of objects is proposed in order to reduce decision rules in the presence of a decision table. Next, an important concept of dependency of domain attributes is introduced, corresponding to the concept of fuzzy functional dependency.
TL;DR: This paper proposes a new technique, called B^+-Tree based Weighted Random Sampling (BTWRS), that alters the inclusion probabilities of records accordingly to allow more records from leaves, along the paths with higher fanouts, to be extracted.
Abstract: Sampling techniques are becoming increasingly important for very large databases. However, the problem of obtaining a random sample from index structures has not received much attention. In this paper, we examine sampling techniques for B^+-tree. As the fanout of each node varies, a random walk through the index structure does not produce a good representative sample of the data set. We propose a new technique, called B^+-Tree based Weighted Random Sampling (BTWRS), that alters the inclusion probabilities of records accordingly to allow more records from leaves, along the paths with higher fanouts, to be extracted. We extensively evaluated our method, and the results show that there is an improvement in BTWRS over the existing schemes in terms of the quality of the samples obtained and the efficiency of the sampling process. The proposed method can be readily adopted in existing commercial systems.
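The claim that a plain random walk gives an unrepresentative sample when fanouts vary can be checked on a toy tree. The sketch below computes, exactly, the probability that a uniform root-to-leaf walk ends at each record. The nested-list tree encoding and the name `walk_probabilities` are assumptions for illustration; the BTWRS weighting itself is not specified in the abstract and is not reproduced here.

```python
from fractions import Fraction


def walk_probabilities(node, prob=Fraction(1)):
    """Probability of reaching each record when every child (or record) is chosen uniformly.

    Internal nodes are lists of child nodes; leaves are tuples of records.
    """
    if isinstance(node, tuple):  # leaf: pick one of its records uniformly
        return {record: prob / len(node) for record in node}
    result = {}
    for child in node:  # internal node: pick one child uniformly
        result.update(walk_probabilities(child, prob / len(node)))
    return result


# Toy index: the left subtree has fanout 2, the right subtree has fanout 1.
tree = [[("a", "b"), ("c", "d", "e")], [("f",)]]
probs = walk_probabilities(tree)
```

With six records a uniform sample would give each record probability 1/6, but here record `f` is reached with probability 1/2 while `c`, `d` and `e` each get 1/12. Records under high-fanout paths are under-sampled, which is the imbalance that re-weighting the inclusion probabilities is meant to correct.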
TL;DR: An entropy-based heuristic is presented that gives higher scores for wavelengths more likely to distinguish between classes, and results are presented for four different classes, showing reasonable improvements in identifying some, but not all, of the mineral classes tested.
Abstract: The ability to identify the mineral composition of rocks and soils is an important tool for the exploration of geological sites. Even though expert knowledge is commonly used for this task, it is desirable to create automated systems with similar or better performance. For instance, NASA intends to design robots that are sufficiently autonomous to perform this task on planetary missions. Spectrometer readings provide one important source of data for identifying sites with minerals of interest. Reflectance spectrometers measure intensities of light reflected from surfaces over a range of wavelengths. Spectral intensity patterns may in some cases be sufficiently distinctive for proper identification of minerals or classes of minerals. For some mineral classes, carbonates for example, specific short spectral intervals are known to carry a distinctive signature. Finding similar distinctive spectral ranges for other mineral classes is not an easy problem. We propose and evaluate data-driven techniques in two stages: first, evaluating algorithms to identify which components are probably present in a given rock; second, trying to improve this classification by automatically searching for spectral ranges optimized for specific classes of minerals. In one set of studies, we partition the whole interval of wavelengths available in our data into sub-intervals, or bins, and use a genetic algorithm to evaluate a candidate selection of subintervals. As an alternative to these computationally expensive search techniques, we present an entropy-based heuristic that gives higher scores for wavelengths more likely to distinguish between classes. Results are presented for four different classes, showing reasonable improvements in identifying some, but not all, of the mineral classes tested.
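An entropy-based score per wavelength can be sketched as an information-gain computation. The abstract does not define the heuristic, so the median-split rule, the function names and the toy data below are all assumptions and should not be read as the authors' method.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting one wavelength's intensities at their median."""
    threshold = sorted(values)[len(values) // 2]
    low = [y for v, y in zip(values, labels) if v < threshold]
    high = [y for v, y in zip(values, labels) if v >= threshold]
    if not low or not high:
        return 0.0  # the split separates nothing
    n = len(labels)
    remainder = len(low) / n * entropy(low) + len(high) / n * entropy(high)
    return entropy(labels) - remainder


def score_wavelengths(spectra, labels):
    """spectra holds one row per sample and one intensity per wavelength."""
    return [
        information_gain([row[j] for row in spectra], labels)
        for j in range(len(spectra[0]))
    ]
```

In this sketch a wavelength whose intensities separate the classes cleanly scores close to the full label entropy, and one that carries no class information scores close to zero, so ranking wavelengths by score is far cheaper than searching over sub-interval selections.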
TL;DR: This paper shows that it is sufficient to compute only a small portion of the matrix, of size linear in the number of objects as opposed to quadratic, to guarantee a small probability of approximation error, so an effective ordering can be constructed without actually having to compute most pairwise similarities of values.
Abstract: An important issue in visualizing categorical data is how to order categorical values -- non-numeric values that do not have a natural ordering, which makes it difficult to map them to visual coordinates. The focus of this paper is on constructing categorical orderings efficiently without compromising their visual quality. In order to avoid the inherent intractability of previous discrete formulations, we consider a continuous relaxation of the problem solvable exactly using the spectral method. The latter is based on computing certain algebraic information about the similarity matrix of the dataset. However, even computing the similarity matrix itself is prohibitive for large datasets. In order to achieve greater efficiency, we propose a new multi-level scheme based on an approximate representation of the matrix. We show that it is sufficient to compute only a small portion of the matrix, of size linear in the number of objects as opposed to quadratic, to guarantee a small probability of approximation error. Thus an effective ordering can be constructed without actually having to compute {\it most} pairwise similarities of values. Experiments have been conducted to qualitatively verify the effectiveness of the resulting visualizations.
TL;DR: The paper addresses the need to guide visitors during the configuration of a product by proposing to apply data mining on the quotes for every group and find associations between different components of the product.
Abstract: Most e-commerce web sites are very large, often confusing and overwhelm the visitor with a huge amount of information. People cannot easily find what they are looking for. Moreover, the web site is presented in the same format to every visitor, irrespective of his needs. This paper proposes a model to solve the above problems. Our model divides the visitors into groups. Then it arranges web pages in the decreasing order of preference for each group and applies path prediction to find sink pages for that group. The results of these algorithms are displayed in a separate frame without modifying the site. Secondly, our paper addresses the need to guide visitors during the configuration of a product. We propose to apply data mining on the quotes for every group and find associations between different components of the product. We can then display these results as suggestions while the prospective buyer is configuring the product. These suggestions will dynamically change as each selection is made. We have discussed the data mining algorithms applicable to our model.
TL;DR: This paper continues research into a method for decomposing a large number of objects into mutually exclusive subsets where within-group dependencies are high and between-group dependencies are low; the results are promising when compared with standard statistical methods and a Hill Climbing algorithm, all applied to email log file data.
Abstract: Grouping problems arise in many industrial and medical applications; examples include bin packing, workshop layout design, and graph colouring. This type of problem has been successfully handled using Grouping Genetic Algorithms. However in problems where there are perhaps thousands of objects to be grouped, we have found that Genetic Algorithm approaches can run into problems. This paper continues our research into a method we have developed for decomposing a large number of objects into mutually exclusive subsets where within-group dependencies are high and between-group dependencies are low. The method uses an Evolutionary Algorithm approach but where the whole population is a solution to the grouping problem rather than considering many candidate solutions. This reduces the resource overheads during computer implementation and the results are promising when compared with standard statistical methods and a Hill Climbing algorithm, all applied to email log file data.
TL;DR: A new alternative based on global convexity analysis with an adequate hyperbolic filtering scheme is presented, which requires neither a starting classification, nor an a priori number of clusters or their distribution.
Abstract: Among the existing procedures for mode detection of the underlying probability density function (pdf), preliminary to unsupervised statistical clustering, those that search for modes as regions where the pdf is concave remain very interesting approaches. These techniques make use of a test that determines locally the convexity of the underlying pdf from the input patterns. However, the test area of sampling points may straddle a boundary between a convex region and a concave one, so that the assumptions for the test of the local convexity can be violated. Furthermore, this local test of convexity is very sensitive to details in the data structure and would rapidly become impracticable as the dimensionality of the data increases.
The present paper presents a new alternative based on global convexity analysis instead of local convexity testing. A recursive separable hyperbolic filter, used as the principal tool for the proposed technique, is generalized to a multidimensional space. This filter comes with a reliability criterion that allows modelling both the pdf variations and the noise attached to the density function.
Based on the characteristic theorem of convexity, the proposed technique assigns the concave label to modal regions and the convex label to valleys of the pdf according to an adequate hyperbolic filtering scheme. Modes are then extracted as concave connected components corresponding to the clusters in the mixture, and are used to assign the available observations to the clusters attached to them.
Experimental results, using real and artificially generated data sets with various complexities, demonstrate the effectiveness of the proposed method, which requires neither a starting classification, nor an a priori number of clusters or their distribution.
TL;DR: This work gives an account of how prediction accuracy for conventional local prediction methods can be understood and explains why local prediction is so difficult.
Abstract: We developed computational and theoretical methods to analyze the nature of experimental data. Our objective was to reveal how the protein secondary structure types behave in a space defined by a sequence of a certain length. The α-helix structure was only slightly more compact than the β-strand. The mean distance within the PPII structure class was the smallest, but the structure was not as compact as the others. This could be a consequence of the distance metric applied and the sensitivity of the structure to proline. In addition, this work describes some mathematical properties of the sequence space which explain the behaviour of secondary structure types in the space. This work gives an account of how prediction accuracy for conventional local prediction methods can be understood and explains why local prediction is so difficult.
TL;DR: This paper presents a meta-modelling procedure that automates the labor-intensive, and therefore slow and expensive, process of manually cataloging and classifying data.
Abstract: Most classification methods are based on the assumption that the data conforms to a stationary distribution. However, the real-world data is usually collected over certain periods of time, ranging ...
TL;DR: It is shown that the benefits of the efficient use of user defined constraints and the computation of condensed representations for frequent itemsets, e.g., the frequent closed sets, can be combined into a levelwise algorithm that can be used for the discovery of association rules in difficult cases.
Abstract: Levelwise algorithms (e.g., the APRIORI algorithm) have proved effective for association rule mining from sparse data. However, in many practical applications, the computation turns out to be intractable for the user-given frequency threshold, and the lack of focus leads to huge collections of frequent itemsets. To tackle these problems, two promising approaches have been investigated during the last four years: the efficient use of user-defined constraints and the computation of condensed representations for frequent itemsets, e.g., the frequent closed sets. We show that the benefits of these two approaches can be combined into a levelwise algorithm. It can be used for the discovery of association rules in difficult cases (dense and highly correlated data). For instance, we report an experimental validation related to the discovery of association rules with negations.
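To make the two ingredients named in the abstract above concrete, here is a minimal sketch of a plain level-wise (Apriori-style) search that prunes with both a frequency threshold and a user-supplied constraint, followed by a post-filter that keeps only the closed frequent itemsets. It is not the combined algorithm of the paper: the function names, the assumption that `constraint` is anti-monotone, and the post-hoc closedness filter are all illustrative choices.

```python
from itertools import combinations

def levelwise_frequent(transactions, min_count, constraint=lambda s: True):
    """Level-wise search for itemsets that are frequent and satisfy
    `constraint`, which is assumed to be anti-monotone (if an itemset
    satisfies it, so do all of its subsets)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    level = [frozenset([item]) for item in items]
    k = 1
    while level:
        survivors = {}
        for candidate in level:
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count and constraint(candidate):
                survivors[candidate] = count
        frequent.update(survivors)
        # Build size-(k+1) candidates whose size-k subsets all survived.
        k += 1
        next_level = set()
        for a, b in combinations(list(survivors), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in survivors
                for sub in combinations(union, k - 1)
            ):
                next_level.add(union)
        level = list(next_level)
    return frequent

def closed_sets(frequent):
    """Keep itemsets with no proper superset of equal support."""
    return {
        s: n for s, n in frequent.items()
        if not any(s < t and n == m for t, m in frequent.items())
    }

data = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
all_frequent = levelwise_frequent(data, min_count=2)
condensed = closed_sets(all_frequent)
without_c = levelwise_frequent(data, 2, constraint=lambda s: "c" not in s)
```

On this toy data the seven frequent itemsets condense to three closed ones, which carry the same support information in a smaller collection.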
TL;DR: The simulation shows that the DSVMs generalize better than the standard SVMs in forecasting non-stationary time series and use fewer support vectors, resulting in a sparser representation of the solution.
Abstract: This paper proposes a modified version of support vector machines (SVMs), called dynamic support vector machines (DSVMs), to model non-stationary time series. The DSVMs are obtained by incorporating the problem domain knowledge -- non-stationarity of time series -- into SVMs. Unlike the standard SVMs, which use fixed values of the regularization constant and the tube size in all the training data points, the DSVMs use an exponentially increasing regularization constant and an exponentially decreasing tube size to deal with structural changes in the data. The dynamic regularization constant and tube size are based on the prior knowledge that in the non-stationary time series recent data points could provide more important information than distant data points. In the experiment, the DSVMs are evaluated using both simulated and real data sets. The simulation shows that the DSVMs generalize better than the standard SVMs in forecasting non-stationary time series. Another advantage of this modification is that the DSVMs use fewer support vectors, resulting in a sparser representation of the solution.
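The effect of per-point parameters can be sketched as follows. The abstract above only states that the regularization constant increases exponentially and the tube size decreases exponentially toward recent points; the specific schedule below (plain exponentials with rates `a` and `b`), the function names, and the default values are assumptions made for illustration and are not taken from the paper.

```python
import math

def dynamic_parameters(n, c_final=10.0, eps_final=0.01, a=3.0, b=3.0):
    """Return per-point regularization constants and tube sizes.

    Point i = 1 is the oldest and point i = n the most recent. The
    exponential form is an assumed schedule, not the paper's formula.
    """
    cs, epsilons = [], []
    for i in range(1, n + 1):
        age = 1.0 - i / n  # 0.0 for the most recent point
        cs.append(c_final * math.exp(-a * age))           # grows with i
        epsilons.append(eps_final * math.exp(b * age))    # shrinks with i
    return cs, epsilons

def weighted_eps_insensitive_loss(y_true, y_pred, cs, epsilons):
    """Sum of C_i * max(0, |y_i - f_i| - eps_i) over all points."""
    return sum(
        c * max(0.0, abs(y - p) - eps)
        for y, p, c, eps in zip(y_true, y_pred, cs, epsilons)
    )

cs, epsilons = dynamic_parameters(5)
# The same unit error costs more on the most recent point than on the oldest.
truth = [0.0] * 5
old_error = weighted_eps_insensitive_loss(truth, [1.0, 0, 0, 0, 0], cs, epsilons)
new_error = weighted_eps_insensitive_loss(truth, [0, 0, 0, 0, 1.0], cs, epsilons)
```

Under such a schedule, errors on recent points are penalized more heavily and tolerated within a narrower tube, which matches the stated prior that recent data are more informative.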
TL;DR: It turns out that the method scales linearly with the number of given data points and is well suited for data mining applications where the amount of data is very large, but where the dimension of the feature space is moderately high.
Abstract: Recently we presented a new approach [20] to the classification problem arising in data mining. It is based on the regularization network approach but in contrast to other methods, which employ ansatz functions associated to data points, we use a grid in the usually high-dimensional feature space for the minimization process. To cope with the curse of dimensionality, we employ sparse grids [52]. Thus, only O(h_n^{-1} n^{d-1}) instead of O(h_n^{-d}) grid points and unknowns are involved. Here d denotes the dimension of the feature space and h_n = 2^{-n} gives the mesh size. We use the sparse grid combination technique [30] where the classification problem is discretized and solved on a sequence of conventional grids with uniform mesh sizes in each dimension. The sparse grid solution is then obtained by linear combination.
The method computes a nonlinear classifier but scales only linearly with the number of data points and is well suited for data mining applications where the amount of data is very large, but where the dimension of the feature space is moderately high. In contrast to our former work, where d-linear functions were used, we now apply linear basis functions based on a simplicial discretization. This allows us to handle more dimensions, and the algorithm needs fewer operations per data point. We further extend the method to so-called anisotropic sparse grids, where different a priori chosen mesh sizes can be used for the discretization of each attribute. This can improve the run time of the method and the approximation results in the case of data sets with different importance of the attributes.
We describe the sparse grid combination technique for the classification problem, give implementational details and discuss the complexity of the algorithm. It turns out that the method scales linearly with the number of given data points. Finally we report on the quality of the classifier built by our new method on data sets with up to 14 dimensions. We show that our new method achieves correctness rates which are competitive with those of the best existing methods.
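The saving in unknowns described above can be checked by simply counting grid points. The sketch below enumerates the component grids of a combination scheme and compares their total size with the full grid. It assumes the common form of the combination technique, with coefficients (-1)^q * C(d-1, q) on grids whose levels sum to n + (d-1) - q, levels starting at 1, and 2^l + 1 points per dimension including the boundary; these conventions and all function names are assumptions, not details quoted from the paper.

```python
from math import comb

def compositions(total, parts):
    """Yield all tuples of `parts` positive integers summing to `total`."""
    if parts == 1:
        if total >= 1:
            yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def combination_grids(n, d):
    """Return (coefficient, level vector) pairs of the assumed scheme."""
    grids = []
    for q in range(d):
        coefficient = (-1) ** q * comb(d - 1, q)
        for levels in compositions(n + (d - 1) - q, d):
            grids.append((coefficient, levels))
    return grids

def grid_points(levels):
    """Points of a regular grid with mesh size 2**-l in each dimension."""
    count = 1
    for level in levels:
        count *= 2 ** level + 1
    return count

n, d = 6, 5
grids = combination_grids(n, d)
combined_unknowns = sum(grid_points(levels) for _, levels in grids)
full_unknowns = (2 ** n + 1) ** d
print(len(grids), combined_unknowns, full_unknowns)
```

Each component grid is small and can be solved independently; the coefficients sum to one, so a constant function is reproduced exactly by the linear combination.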
TL;DR: The results suggest that the NCL rule is a useful method for improving modeling of difficult small classes, as well as for building classifiers that identify these classes from the real world data which frequently have an imbalanced class distribution.
Abstract: We studied three different methods to improve identification of small classes, which are also difficult to classify, by balancing an imbalanced class distribution with data reduction. The new method, the neighborhood cleaning (NCL) rule, outperformed simple random sampling within classes and the one-sided selection method in experiments with ten real-world data sets. All reduction methods clearly improved identification of small classes (20--30% in the true-positive rates of the three-nearest neighbor method and the C4.5 decision tree generator), but the differences between the methods were insignificant. However, the significant differences in accuracies, true-positive rates, and true-negative rates obtained from the reduced data were in favor of our method. The results suggest that the NCL rule is a useful method for improving modeling of difficult small classes, as well as for building classifiers that identify these classes from real-world data, which frequently have an imbalanced class distribution.
TL;DR: A different framework is introduced, based on Shortliffe and Buchanan's certainty factors and the new concept of very strong rules, and some intuitive properties of the new framework are discussed, showing that it can avoid the discovery of misleading rules, improving the manageability and quality of the results.
Abstract: It has been pointed out that the usual framework to assess association rules, based on support and confidence as measures of importance and accuracy, has several drawbacks. In particular, the presence of items with very high support can lead to many misleading rules, even on the order of 95% of the discovered rules in some of our experiments. In this paper we introduce a different framework, based on Shortliffe and Buchanan's certainty factors and the new concept of very strong rules, and we discuss some intuitive properties of the new framework. Both the theoretical properties and the experiments we have performed show that we can avoid the discovery of misleading rules, improving the manageability and quality of the results.
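The problem with high-support items can be seen in a small worked example. The sketch below uses the usual definition of the certainty factor of a rule A -> C in terms of confidence and the support of C; this definition is assumed here rather than quoted from the abstract above, and the transactions and function names are made up for illustration. The concept of very strong rules is not covered.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return Fraction(hits, len(transactions))

def confidence(transactions, antecedent, consequent):
    """supp(A and C) / supp(A); the antecedent must occur at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def certainty_factor(transactions, antecedent, consequent):
    """Certainty factor of A -> C (assumed standard definition)."""
    conf = confidence(transactions, antecedent, consequent)
    supp_c = support(transactions, consequent)
    if conf > supp_c:
        return (conf - supp_c) / (1 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return Fraction(0)

# Item "c" appears in 9 of 10 transactions, so almost any rule that
# predicts "c" has high confidence.
data = (
    [{"a", "c"}] * 4 + [{"a"}] + [{"b", "c", "d"}] * 2 + [{"c"}] * 3
)
conf_a_c = confidence(data, {"a"}, {"c"})      # high confidence
cf_a_c = certainty_factor(data, {"a"}, {"c"})  # yet negative
cf_b_d = certainty_factor(data, {"b"}, {"d"})  # genuinely informative rule
print(conf_a_c, cf_a_c, cf_b_d)
```

Here the rule a -> c has 80% confidence, yet seeing "a" actually lowers the chance of "c" relative to its 90% base rate, so its certainty factor is negative, while b -> d reaches the maximum value of 1.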
TL;DR: In machine learning problems, differences in prior class probabilities -- or class imbalances -- have been reported to hinder the performance of some standard classifiers, such as decision trees.
Abstract: In machine learning problems, differences in prior class probabilities -- or class imbalances -- have been reported to hinder the performance of some standard classifiers, such as decision trees. T...