TL;DR: The problem of constructing accurate decision tree models from data streams is studied with respect to drift, noise, the order of examples, and the initial parameters in different problems, and VFDTc is extended with the ability to deal with concept drift.
Abstract: In this paper we study the problem of constructing accurate decision tree models from data streams. Data streams are incremental tasks that require incremental, online, and any-time learning algorithms. One of the most successful algorithms for mining data streams is VFDT. We have extended VFDT in three directions: the ability to deal with continuous data; the use of more powerful classification techniques at tree leaves; and the ability to detect and react to concept drift. The VFDTc system can incorporate and classify new information online, with a single scan of the data, in constant time per example. The most relevant property of our system is the ability to obtain a performance similar to a standard decision tree algorithm even for medium size datasets. This is relevant due to the any-time property. We also extend VFDTc with the ability to deal with concept drift, by continuously monitoring differences between two class distributions of the examples: the distribution when a node was built and the distribution in a time window of the most recent examples. We study the sensitivity of VFDTc with respect to drift, noise, the order of examples, and the initial parameters in different problems and demonstrate its utility in large and medium data sets.
TL;DR: This paper proposes a method to extend an association-rule based approach by using sequential patterns in the SPaC method (Sequential Patterns for Classification) for text categorization, which outperforms CBA and provides better results than SVM on some corpora.
Abstract: Text categorization is a well-known task based essentially on statistical approaches using neural networks, Support Vector Machines and other machine learning algorithms. Texts are generally considered as bags of words without any order. Although these approaches have proven to be efficient, they do not provide users with comprehensive and reusable rules about their data. Such rules are, however, very important for users to describe trends in the data they have to analyze. In this framework, an association-rule based approach has been proposed by Bing Liu (CBA). We propose, in this paper, to extend this approach by using sequential patterns in the SPaC method (Sequential Patterns for Classification) for text categorization. Taking order into account allows us to represent the succession of words through a document without complex and time-consuming representations and treatments such as those performed in natural language and grammatical methods. The original method we propose here consists in mining sequential patterns in order to build a classifier. We experimentally show that our proposal is relevant, and that it is very interesting compared to other methods. In particular, our method outperforms CBA and provides better results than SVM on some corpora.
TL;DR: Data mining techniques were applied to the detection and analysis of risks potentially existing in organizations and to the use of risk information for better organizational management; the results show that data mining methods were effective for detecting risk factors.
Abstract: Organizations in our modern society grow larger and more complex to provide advanced services due to the varieties of social demands. Such organizations are highly efficient for routine work processes but are known not to be robust to unexpected situations. According to this observation, the importance of organizational risk management has been noticed in recent years. On the other hand, a large amount of data on the work processes has been automatically stored since information technology was introduced to the organizations. Thus, it has been expected that reuse of collected data should contribute to risk management for large-scale organizations. This paper proposes risk mining, in which data mining techniques are applied to the detection and analysis of risks potentially existing in organizations and to the use of risk information for better organizational management. We applied this technique to the following three medical domains: risk aversion of nurse incidents, infection control and hospital management. The results show that data mining methods were effective for detecting risk factors.
TL;DR: A novel encoding scheme that uses links to identify clusters in a partition; the links are restricted so that the objects to be clustered form a linear pseudo-graph, which achieves a one-to-one mapping between the genotype and phenotype spaces.
Abstract: Various methods have been proposed to utilize Genetic Algorithms (GA) in handling the clustering problem. GA work on encoded strings, namely chromosomes, and the representation of different clusters as a linear structure is an important issue in the usage of GA in this domain. In this paper, we present a novel encoding scheme that uses links to identify clusters in a partition. Particularly, we restrict the links so that objects to be clustered form a linear pseudo-graph. A one-to-one mapping is thus achieved between the genotype and phenotype spaces. The other feature of the proposed approach is the use of multiple objectives in the process. One of the two objectives we use is to minimize the Total Within Cluster Variation (TWCV), identical to the one used by other k-means clustering approaches. However, unlike other k-means methods, the number of clusters need not be specified in advance. Combined with a second objective, minimizing the number of clusters in a partition, our approach obtains the optimal partitions for all the possible numbers of clusters in the Pareto Optimal set returned by a single GA run. The performance of the proposed approach has been tested using two well-known data sets, namely Iris Data and Ruspini Data. The obtained results are compared with the output of the classical Group Number Encoding and it has been observed that a clear improvement has been achieved with the new representation.
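The two objectives named in this abstract can be illustrated with a minimal sketch. It assumes that TWCV is the sum of squared Euclidean distances from each object to the centroid of its cluster, as in k-means; the function names `twcv` and `objectives` and the toy points are hypothetical, and the link-based encoding itself is not reproduced here.

```python
def twcv(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


def objectives(points, labels):
    """Return the pair (TWCV, number of clusters); both are to be minimized."""
    return twcv(points, labels), len(set(labels))


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_clusters = objectives(points, [0, 0, 1, 1])
one_cluster = objectives(points, [0, 0, 0, 0])
```

Neither of the two partitions dominates the other (one has lower TWCV, the other fewer clusters), which is why a Pareto set can hold one partition per number of clusters.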
TL;DR: A lazy-learning approach that extends a traditional distance weighted k-Nearest Neighbor classification algorithm to SOs is presented, implemented in the system SO-NN (Symbolic Objects Nearest Neighbor), and evaluated on symbolic datasets.
Abstract: Symbolic data analysis aims at generalizing some standard statistical data mining methods, such as those developed for classification tasks, to the case of symbolic objects (SOs). These objects synthesize information concerning a group of individuals of a population, possibly stored in a relational database, and ensure confidentiality of original data. Classifying SOs is an important task in symbolic data analysis. In this paper a lazy-learning approach that extends a traditional distance weighted k-Nearest Neighbor classification algorithm to SOs is presented. The proposed method has been implemented in the system SO-NN (Symbolic Objects Nearest Neighbor) and evaluated on symbolic datasets.
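For orientation, the traditional distance weighted k-Nearest Neighbor rule that the abstract starts from can be sketched as below. This is not the SO-NN system: it assumes plain numeric vectors, Euclidean distance, and inverse-distance vote weights, all of which are illustrative choices. Extending it to symbolic objects would amount to passing a dissimilarity measure for SOs as the `dist` argument.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_knn(train, query, k=3, dist=euclidean):
    """Classify `query` by an inverse-distance weighted vote of its k nearest neighbours.

    `train` is a list of (features, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        d = dist(features, query)
        if d == 0.0:
            # An exact match decides the class outright (avoids division by zero).
            return label
        votes[label] += 1.0 / d
    return max(votes, key=votes.get)


# Hypothetical one-dimensional data: the single close neighbour outweighs two distant ones.
train = [((0.0,), "a"), ((3.0,), "b"), ((3.5,), "b")]
prediction = weighted_knn(train, (1.0,), k=3)
```

With unweighted voting the query would be labelled "b" by majority; the distance weights give "a" a total vote of 1.0 against 0.9 for "b".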
TL;DR: This article shows that systems employing reduction of variance as a heuristic for selecting tests during the tree construction process may exhibit pathological behaviour, and proposes an alternative heuristic that yields equally accurate but simpler trees with better explanatory power, and this at little or no additional computational cost.
Abstract: The term "model trees" is commonly used for regression trees that contain some non-trivial model in their leaves. Popular implementations of model tree learners build trees with linear regression models in their leaves. They use reduction of variance as a heuristic for selecting tests during the tree construction process. In this article, we show that systems employing this heuristic may exhibit pathological behaviour in some quite simple cases. This is not visible in the predictive accuracy of the tree, but it reduces its explanatory power. We propose an alternative heuristic that yields equally accurate but simpler trees with better explanatory power, and this at little or no additional computational cost. The resulting model tree induction algorithm is experimentally evaluated and compared with simpler and more complex approaches on a variety of synthetic and real world data sets.
TL;DR: A novel technique for identifying observations with class noise in a dataset using frequent itemsets, which is analyzed in numerous case studies using real-world software measurement datasets with either inherent or injected noise.
Abstract: The presence of a substantial number of noisy instances in a given dataset may adversely affect the hypothesis learnt from that data. Removing noisy instances prior to the construction of a classifier has been shown to improve the classification ability of a learner on new data. This paper introduces a novel technique for identifying observations with class noise in a dataset using frequent itemsets. For the given dataset, each instance is assigned a NoiseFactor, indicating a relative likelihood that it contains class noise. A frequent itemset is a set of instances with common attribute values which contains at least as many instances as a user-defined minimum support threshold. Consequently, the set of frequent itemsets contains information related to the structure and dependence between the attributes. Each frequent itemset is assigned a class, based on the proportion of instances within the itemset from each class. Instances that are contained in itemsets that have a large proportion of instances from the other class are identified as noisy. The technique proposed in this paper is analyzed in numerous case studies using real-world software measurement datasets with either inherent or injected noise. A comparison is provided with two well-known techniques for the identification of class noise: Classification Filter and Ensemble Filter. The results demonstrate that this new algorithm is very effective at identifying instances with class noise.
TL;DR: This paper studies the problem of constructing accurate decision tree models from data streams.
Abstract: In this paper we study the problem of constructing accurate decision tree models from data streams. Data streams are incremental tasks that require incremental, online, and any-time learning algori...
TL;DR: It is proved that a distributed system significantly surpasses the centralized version in terms of efficiency and it is shown that a selective materialization of indexing structure fragments strongly increases system efficiency.
Abstract: In this paper we present a Spatial Data Warehouse system that we use for aggregation and analysis of huge amounts of spatial data. The data is generated by utilities meters communicating via radio. In order to provide sufficient efficiency for our system we propose data and workload distribution as well as advanced indexing techniques. The system is based on a cascaded star model, which is a spatial development of a standard star schema and contains interconnected and often nested star schemas. The cascaded star allows efficient storage and analysis of spatial data, whose range extends from meter measurement values to weather information. The indexing tree structure and operation is tightly integrated with the spatial character of the data. Thanks to an available memory evaluating mechanism the system is very flexible with respect to the accuracy of aggregates. We also implemented an indexing structure updating mechanism. The system is written in Java; for the database we use Oracle 9i. Based on a wide variety of test results, we prove that a distributed system significantly surpasses the centralized version in terms of efficiency. We also show that a selective materialization of indexing structure fragments strongly increases system efficiency.
TL;DR: This paper is concerned with ambiguous label confusion in supervised machine learning and proposes a novel approach to solve the problem of classification ambiguity.
Abstract: Inducing a classification function from a set of examples in the form of labeled instances is a standard problem in supervised machine learning. In this paper, we are concerned with ambiguous label...
TL;DR: Perturbation was more effective than subsampling, and both clearly outperformed the bootstrapping technique in the detection of correct clustering consensus results; intelligent control of the resampling parameters can increase the achievable confidence in clustering results.
Abstract: Data resampling techniques are increasingly used for assigning confidence to clustering results, in particular for tumor class discovery based on genomic data. One factor that determines the success of this approach is the capability of a resampling scheme to simulate the sampling variability by using the information of sparse sample data. We present a method for evaluating resampling performance based on model simulations. This method was applied to results of 40 cluster validity indices and one partition stability index obtained from 12 clustering procedures including different distance measures. The results were generated for benchmark data of five statistical models, gene expression profiles of three multi-class tumor sample data sets, four data sets of the widely used UCI repository, and spatiotemporal neuroimaging data. The results suggest a ranking of the three resampling techniques analyzed: perturbation (adding noise to the data) was more effective than subsampling and both clearly outperformed the bootstrapping technique in the detection of correct clustering consensus results. Due to the consistency of the results this ranking may have impact on the selection of a resampling method for the cluster validation in future studies. Moreover, intelligent control of the resampling parameters can increase the achievable confidence in clustering results.
TL;DR: Experimental results indicate that itemset trees can now with advantage be used to answer both targeted and general queries, and that the technique compares favorably with previous attempts under a broad range of data parameters.
Abstract: One of the goals of Association Mining is to develop algorithms capable of finding frequently co-occurring groups of items ("itemsets") in transaction databases. The recently published technique of Itemset Trees expedited the processing of so-called "targeted queries" where the user is interested only in itemsets that contain certain prespecified items. However, the technique did not seem to offer any cost-effective way to find all frequent itemsets ("general queries") as is common with other association-mining algorithms. The purpose of this paper is to rectify this deficiency by a newly developed algorithm that we call IT-Mining. Experimental results indicate that itemset trees can now with advantage be used to answer both targeted and general queries, and that the technique compares favorably with previous attempts under a broad range of data parameters.
TL;DR: An algorithmic extension to the technique of Stacking for regression that prunes the ensemble set before application based on a consideration of the training accuracy and diversity of the ensemble members is investigated.
Abstract: In this paper we investigate an algorithmic extension to the technique of Stacking for regression that prunes the ensemble set before application based on a consideration of the training accuracy and diversity of the ensemble members. We evaluate two variants of this approach in comparison to the standard Stacking algorithm: one is a static approach that prunes back the ensemble to the same constant size; the other is a variable approach that prunes the ensemble to an appropriate level based on measures of accuracy and diversity of the ensemble members. We show that on average both techniques are robust in performance to their non-pruned counterpart, while having the advantage of producing smaller and less complex ensembles. In the latter respect, the static approach proved more effective, but we show that the variable approach lends itself better to further optimization.
TL;DR: The results indicate that the SEC filings may be quite ambiguous, with experienced raters disagreeing on one category for a training sample of 600 filings in about 30% of the cases; however, allowing classifications into more than one category using document level information yields accuracy of about 90% in a test sample of 200 filings.
Abstract: This study explores a system to retrieve and classify the reasons for late mandatory SEC (Securities and Exchange Commission) filings. From the source documents, the system identifies the reasons for the late filing and classifies them into one or more of seven categories. The system can be used by potential investors who have to track a large number of filings concentrated within a day or two.
Our results indicate that the SEC filings may be quite ambiguous, with experienced raters disagreeing on one category for a training sample of 600 filings in about 30% of the cases. However, allowing classifications into more than one category using document level information yields accuracy of about 90% in a test sample of 200 filings. We also show that the stock market reactions to over 9,000 late filings vary in an intuitive way according to the classified reasons.
TL;DR: A generalization of the widely used hidden Markov models, which is motivated by the need of modeling complex structures which are encountered in many natural sequences pertaining to areas such as computational molecular biology, speech/handwriting recognition and content-based information retrieval.
Abstract: We introduce in this paper a generalization of the widely used hidden Markov models (HMMs), which we name "structural hidden Markov models" (SHMM). Our approach is motivated by the need to model complex structures which are encountered in many natural sequences pertaining to areas such as computational molecular biology, speech/handwriting recognition and content-based information retrieval. We consider observations as strings that produce the structures derived by an unsupervised learning process. These observations are related in the sense that they all contribute to produce a particular structure. Four basic problems are assigned to a structural hidden Markov model: (1) probability evaluation, (2) state decoding, (3) structural decoding, and (4) parameter re-estimation. We have applied our methodology to recognize handwritten numerals. The results reported in this application show that the structural hidden Markov model outperforms the traditional hidden Markov model with a 23.9% error-rate reduction.
TL;DR: This paper proposes a physiologically motivated prediction methodology for trading off general precision with improved predictive performance in regions of higher interest, and compares the results to very accurate forecasts of the ground based geomagnetic activity index and sunspot number obtained by a neuro-fuzzy technique.
Abstract: The increased dependence of human life-world on earth upon orbiting satellites and the threat of solar events that could cause serious damage have triggered a new surge of space weather forecasting research. In this paper, we propose a physiologically motivated prediction methodology for trading off general precision with improved predictive performance in regions of higher interest. Results are compared to very accurate forecasts of ground based geomagnetic activity index (K) and sunspot number obtained by a neuro-fuzzy technique. It has been shown that we can design a powerful alert system by using computationally intelligent forecasting techniques.
TL;DR: In this article, the authors presented new textural features which are based on association rules and gave a texture representation, which is an appropriate formalism, that allows straightforward application of association rules algorithms.
Abstract: This paper presents new textural features which are based on association rules. We give a texture representation, which is an appropriate formalism, that allows straightforward application of association rules algorithms. This representation has several good properties like invariance to global lightness and invariance to rotation. Association rules capture structural and statistical information and are very convenient to identify the structures that occur most frequently and have the most discriminative power. The results from our experiments show that this representation gives comparable results to standard texture descriptions and better results than general image descriptions.
TL;DR: This paper proposes the extension of the Kolmogorov-Smirnov's binary splitting criterion to interval data, and presents some results using the pure assignment in order to examine the quality and the precision of this criterion.
Abstract: With the development of information technology, data sets often contain a very large number of observations. Symbolic data analysis treats new units that are underlying concepts on the given database or found by clustering. In this way, it is possible to reduce the size of the data set to be processed by transforming the initial classical variables into variables called symbolic variables. In symbolic data analysis, the values of the variables can be, among others, intervals. The algebraic structure of these variables leads us to adapt criteria to be able to study them. In this paper, we propose the extension of the Kolmogorov-Smirnov's binary splitting criterion to interval data. This criterion is used as a test selection metric for decision tree induction. For this criterion, the values taken by the explanatory variables have to be ordered. We have been interested in different possible orders of these interval values. We present some results using the pure assignment in order to examine the quality and the precision of this criterion. We compare this criterion to some classical criteria (Gini and entropy) in the case of pure assignment. An application in the case where the variable to be explained is a correlation is presented. We end this paper with a probabilistic method of assignment using the criterion of Kolmogorov-Smirnov.
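As background for the criterion mentioned above, the sketch below shows a Kolmogorov-Smirnov binary split on ordinary numeric values for a two-class problem: the cut point is chosen where the empirical cumulative distributions of the two classes differ most. It is only an assumed classical baseline, not the interval extension proposed in the paper; interval values would first need one of the orderings the abstract alludes to. The function name `ks_best_split` and the data are hypothetical.

```python
def ks_best_split(values, labels, positive):
    """Return (threshold, distance) maximizing |F_pos(t) - F_neg(t)| over cuts 'value <= t'."""
    pos = [v for v, lab in zip(values, labels) if lab == positive]
    neg = [v for v, lab in zip(values, labels) if lab != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    best_threshold, best_distance = None, -1.0
    # The largest value is skipped: cutting there would leave the right branch empty.
    for t in sorted(set(values))[:-1]:
        f_pos = sum(v <= t for v in pos) / len(pos)  # empirical CDF of the positive class
        f_neg = sum(v <= t for v in neg) / len(neg)  # empirical CDF of the other class
        distance = abs(f_pos - f_neg)
        if distance > best_distance:
            best_threshold, best_distance = t, distance
    return best_threshold, best_distance


# Hypothetical data: the classes are perfectly separated at value 3.
split = ks_best_split([1, 2, 3, 4, 5, 6], ["a", "a", "a", "b", "b", "b"], positive="a")
```

A distance of 1.0 means the cut separates the two classes completely; values near 0 mean the attribute barely discriminates at any threshold.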
TL;DR: A generic approach exploits collections of local patterns which satisfy some user-defined constraints in the data, and a measure of the accuracy of a given local pattern as a bi-cluster characterization pattern is introduced.
Abstract: Clustering or co-clustering techniques have been proved useful in many application domains. A weakness of these techniques remains the poor support for grouping characterization. As a result, interpreting clustering results and discovering knowledge from them can be quite hard. We consider potentially large Boolean data sets which record properties of objects and we assume the availability of a bi-partition which has to be characterized by means of a symbolic description. Our generic approach exploits collections of local patterns which satisfy some user-defined constraints in the data, and a measure of the accuracy of a given local pattern as a bi-cluster characterization pattern. We consider local patterns which are bi-sets, i.e., sets of objects associated to sets of properties. Two concrete examples are formal concepts (i.e., associated closed sets) and the so-called δ-bi-sets (i.e., an extension of formal concepts towards fault-tolerance). We introduce the idea of characterizing query which can be used by experts to support knowledge discovery from bi-partitions thanks to available local patterns. The added-value is illustrated on benchmark data and three real data sets: a medical data set and two gene expression data sets.
TL;DR: Data that appear to have different characteristics than the rest of the population are called outliers; identifying them from huge data repositories is a very complex task called outlier mining.
Abstract: Data that appear to have different characteristics than the rest of the population are called outliers. Identifying outliers from huge data repositories is a very complex task called outlier mining...
TL;DR: Minimum Sum-squared Residue for Fuzzy Co-Clustering (MSR-FCC) is proposed to address two issues faced by many existing clustering algorithms, namely the high dimensionality and the inherent fuzziness found in most real-world data.
Abstract: Clustering is often seen as a more practical but very challenging answer to the task of categorizing objects. Minimum Sum-squared Residue for Fuzzy Co-Clustering (MSR-FCC) is proposed to address two issues faced by many existing clustering algorithms, namely the high dimensionality and the inherent fuzziness found in most real-world data. MSR-FCC is able to simultaneously cluster data and features using fuzzy techniques. It suggests a new partitioning fuzzy co-clustering algorithm based on the mean squared residue approach. Besides handling overlapping clusters, MSR-FCC offers the flexibility that allows the number of data clusters to be different from the number of feature clusters, which reflects the distribution characteristics inherent in real-world data. In this paper, the mathematical formulation of MSR-FCC is derived and explained. Experiments were conducted on standard datasets to demonstrate that the proposed algorithm is able to cluster high-dimensional data with overlaps feasibly and that, at the same time, it provides a new and promising mechanism for improving the interpretability of the co-clusters through the fuzzy membership function.
TL;DR: This paper deals with next bit prediction of pseudo-random binary sequences generated by Linear Feedback Shift Register (LFSR) and LFSR-based Pseudo-Random Bit Generators (PRBG), using an inductive Machine Learning (ML) paradigm, namely C4.5.
Abstract: Random number generation is an integral part of strong cipher systems. If a pseudo-random sequence can be predicted with better than chance probability then the generator is considered to be cryptographically weak. This paper deals with next bit prediction of pseudo-random binary sequences generated by Linear Feedback Shift Register (LFSR) and LFSR-based Pseudo-Random Bit Generators (PRBG), using an inductive Machine Learning (ML) paradigm, namely C4.5, the most common and widely used inductive data mining algorithm. This machine learning technique has been introduced to convert the theoretical prediction problem into a classification problem, which we call the Classificatory Prediction problem. We further extended the use of this technique to predict the next bit without having any knowledge of subsequent bits of the PRBG, so that it can be termed a true Next Bit Predictor. The technique used is independent of the parameters and domain knowledge of the pseudo-random bit generators. The present study is a comprehensive extension of the work done by Hernandez et al. [15]. We performed meticulous experiments (over a wide range of LFSRs) and arrived at a more explanatory analysis. Our classificatory prediction results paved the way for the evolution of the next bit prediction model.
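The conversion of next bit prediction into a classification problem can be sketched as follows. The sketch assumes a small Fibonacci-style LFSR and a sliding-window encoding in which the previous `width` output bits are the attributes and the following bit is the class label; the seed, tap positions, and function names are hypothetical, and no C4.5 implementation is included. The resulting examples are the kind of table a decision-tree learner would be trained on.

```python
def lfsr_bits(seed, taps, n):
    """Generate n output bits from a Fibonacci LFSR.

    `seed` is the initial register (list of 0/1); each step outputs the last bit,
    shifts the register right, and inserts the XOR of the tapped positions at the front.
    """
    state = list(seed)
    out = []
    for _ in range(n):
        out.append(state[-1])
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        state = [feedback] + state[:-1]
    return out


def windowed_examples(bits, width):
    """Turn a bit sequence into (window, next_bit) classification examples."""
    return [(tuple(bits[i:i + width]), bits[i + width]) for i in range(len(bits) - width)]


# Hypothetical 4-bit register with taps on the last two positions (period 15).
bits = lfsr_bits([1, 0, 0, 0], taps=(2, 3), n=40)
examples = windowed_examples(bits, width=4)
```

For this particular register the label is a fixed function of the window (the XOR of its first two bits), so a learner that sees enough windows can predict every next bit; this is the sense in which a plain LFSR is cryptographically weak.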
TL;DR: A CBIR System named STIRF (Shape, Texture, Intensity-distribution with Relevance Feedback) that uses a neural network for nonlinear combination of the heterogeneous STI features and that showed better and more robust performance compared to existing CBIR systems.
Abstract: Multimedia mining primarily involves information analysis and retrieval based on implicit knowledge. The ever increasing digital image databases on the Internet have created a need for using multimedia mining on these databases for effective and efficient retrieval of images. Contents of an image can be expressed in different features such as Shape, Texture and Intensity-distribution (STI). Content Based Image Retrieval (CBIR) is an efficient retrieval of relevant images from large databases based on features extracted from the image. Most of the existing systems concentrate either on a single representation of all features or on a linear combination of these features. The paper proposes a CBIR System named STIRF (Shape, Texture, Intensity-distribution with Relevance Feedback) that uses a neural network for nonlinear combination of the heterogeneous STI features. Further, the system is self-adaptable to different applications and users based upon relevance feedback. Prior to retrieval of relevant images, each feature is first clustered independently of the others in its own space, and this helps in matching of similar images. Testing the system on a database of images with varied contents and intensive backgrounds showed good results, with most relevant images being retrieved for an image query. The system showed better and more robust performance compared to existing CBIR systems.
TL;DR: A descriptive approach is adopted for the supervised evaluation of medoid-based Voronoi partitions; the resulting criterion measures the discrimination of the classes, is parameter free, and prevents overfitting.
Abstract: Since its introduction, the nearest neighbor rule has been widely refined and there exist many techniques for prototype selection or construction. The underlying structure of such rules is the Voronoi partition induced by the prototypes. Construction of the best Voronoi partition often relies on the generalisation performance and thus faces the risk of overfitting the data. In this paper, we adopt a descriptive approach for the supervised evaluation of medoid-based Voronoi partitions. The resulting criterion measures the discrimination of the classes, is parameter free and prevents overfitting. Experiments on real and synthetic datasets illustrate these properties. Although this criterion is not related to the classifying task, the accuracy and robustness of the induced classifier are also compared with standard methods, such as the nearest neighbor rule and the linear vector quantization method.
TL;DR: A novel technique for compressing a symbolic data table using the recently emerged Compound Term Composition Algebra (CTCA) is proposed.
Abstract: Although symbolic data tables summarize huge sets of data, they can still become very large in size. This paper proposes a novel technique for compressing a symbolic data table using the recently emerged Compound Term Composition Algebra (CTCA). One advantage of CTCA is that the closed world hypotheses of its operations can lead to a remarkably high "compression ratio". Apart from having much lower storage space requirements, the compacted form allows the design of more efficient algorithms for symbolic data analysis.
TL;DR: A novel method is developed that allows the two subtasks, (1) entity recognition and (2) relation recognition, to be linked more closely together, and that improves not only relation recognition but also entity recognition to some degree.
Abstract: The entity and relation recognition, i.e. (1) assigning semantic classes (e.g., person, organization and location) to entities in a sentence, and (2) determining the relations (e.g., born-in and employee-of) held between the corresponding entities, is an important task in areas such as information extraction and question answering. Subtasks (1) and (2) are typically carried out sequentially, and this procedure is problematic: errors made during subtask (1) are propagated to subtask (2) with an accumulative effect; and in many cases information that becomes available only during subtask (2) (e.g., the class of an entity corresponds to the first argument of relation born-in (X, China)) would be helpful for subtask (1) (e.g., the class of the entity cannot be a location but a person). To address problems of this kind, this paper develops a novel method, which allows subtasks (1) and (2) to be linked more closely together. The procedure is separated into three stages. Firstly, two classifiers are employed to perform subtasks (1) and (2) independently. Secondly, the semantic class of each entity is determined by taking into account the classes of all the entities in the sentence, as computed during the previous step. This is achieved using a special model dubbed "entity relation propagation diagram" and "entity relation propagation tree". Thirdly, each relation is assigned a class by considering the semantic classes of the entities produced at the previous step. Our experimental results show that the method improves not only relation recognition but also entity recognition to some degree.
TL;DR: A simple neural network architecture is proposed as an efficient way of combining and reinforcing the discriminatory capabilities of different popular statistics commonly used in conventional hypothesis testing procedures.
Abstract: The aim of this work is to provide a new approach to the classical problem of determining whether or not a set of data has been sampled from a univariate normal distribution. A simple neural network architecture is proposed as an efficient way of combining and reinforcing the discriminatory capabilities of different popular statistics commonly used in conventional hypothesis testing procedures. Special emphasis is placed on the fact that these procedures lack a reliable measure of the degree to which the observed data supports the normality assumption. Several authors have shown that the so-called P-values are inefficient and ambiguous when dealing with this matter, so the Bayesian posterior probabilities are suggested as the best candidates to play this role. For this reason, a significant part of our work has focused on training the neural networks so that their outputs accurately approximate these probabilities.
TL;DR: This paper explores a new architecture of Bayesian classifier that can be used to understand how biological mechanisms differ with respect to time and shows that this classifier improves the classification of microarray data and ensures that the models can easily be analysed by biologists by incorporating time transparently.
Abstract: The analysis of microarray data from time-series experiments requires specialised algorithms, which take the temporal ordering of the data into account. In this paper we explore a new architecture of Bayesian classifier that can be used to understand how biological mechanisms differ with respect to time. We show that this classifier improves the classification of microarray data and at the same time ensures that the models can easily be analysed by biologists by incorporating time transparently. In this paper we focus on data that has been generated to explore different types of muscular dystrophy.
TL;DR: A new method for MRI data segmentation is proposed, which aims at improving the support of medical researchers in the context of cancer therapy, and is focused on the processing of raw output obtained by Dynamic Contrast-Enhanced MRI (DCE-MRI) techniques.
Abstract: The application of machine learning techniques to open problems in different medical research fields appears to be stimulating and fruitful, especially in the last decade. In this paper, a new method for MRI data segmentation is proposed, which aims at improving the support of medical researchers in the context of cancer therapy. In particular, our effort is focused on the processing of raw output obtained by Dynamic Contrast-Enhanced MRI (DCE-MRI) techniques. Here, morphological and functional parameters are extracted, which seem to indicate the local development of cancer. Our contribution consists in automatically organizing this output, separating MRI slice areas with different meanings in a histological sense. The technique adopted is based on the Mean-Shift paradigm, which has recently been shown to be robust and useful for different and heterogeneous segmentation tasks. Moreover, the technique appears to be well suited to numerous extensions and medically driven optimizations.
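For readers unfamiliar with the Mean-Shift paradigm mentioned above, the following is a minimal generic sketch on one-dimensional values, assuming a Gaussian kernel. It is not the segmentation pipeline of the paper; the function names, the bandwidth, and the sample values (standing in for a per-pixel parameter) are assumptions made for illustration.

```python
import math

def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Move each point to the Gaussian-weighted mean of the data until it settles on a mode."""
    modes = []
    for x in values:
        for _ in range(max_iter):
            weights = [math.exp(-((x - v) / bandwidth) ** 2 / 2) for v in values]
            new_x = sum(w * v for w, v in zip(weights, values)) / sum(weights)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes

def label_modes(modes, merge_dist):
    """Give the same label to points whose converged modes lie within merge_dist."""
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if abs(m - c) < merge_dist:
                labels.append(i)
                break
        else:
            # No existing center is close enough, so this mode starts a new segment.
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers

# Hypothetical per-pixel parameter values forming two groups.
values = [0.10, 0.12, 0.11, 0.90, 0.92, 0.88]
labels, centers = label_modes(mean_shift_1d(values, bandwidth=0.1), merge_dist=0.05)
```

The number of segments is not fixed in advance; it follows from the bandwidth, which is one reason mean shift is attractive when the number of tissue types in a slice is unknown.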
TL;DR: Novel feature subset selection methods are described, based on the estimation of feature salience, i.e., the quantification of the relative importance of individual features, in the presence of other features, for determining the classes of records in a dataset.
Abstract: In this paper we describe novel feature subset selection methods, based on the estimation of feature salience, i.e., the quantification of the relative importance of individual features, in the presence of other features, for determining the classes of records in a dataset. We present a definition of what we mean by feature salience and a method for estimating this feature salience. Five synthetic datasets were used to demonstrate the utility of the salience estimation technique. It was found that the estimation techniques produced good approximations to the calculated saliencies in most cases.
The use of feature salience as the basis of three methods of feature subset selection is described. These methods were evaluated on real world data sets by constructing classifiers using all features and comparing these with classifiers constructed using only a selected subset of features. It was found that the results compared well with other state of the art techniques and that the methods were simpler to implement and significantly faster to execute.
On average, applying our best feature subset selection method resulted in trees that used only 49% of the features used by trees constructed with the full set of features. This reduction in number of features used was associated with a 1% improvement in classifier accuracy.