TL;DR: A survey of the application areas of ROC analysis in machine learning is presented, its problems and challenges are described, and a summarized list of alternative approaches to ROC analysis is provided.
Abstract: The use of ROC (Receiver Operating Characteristics) analysis as a tool for evaluating the performance of classification models in machine learning has been increasing in the last decade. Among the most notable advances in this area are the extension of two-class ROC analysis to the multi-class case as well as the employment of ROC analysis in cost-sensitive learning. Methods now exist which take instance-varying costs into account. The purpose of our paper is to present a survey of this field with the aim of gathering important achievements in one place. In the paper, we present application areas of ROC analysis in machine learning, describe its problems and challenges, and provide a summarized list of alternative approaches to ROC analysis. In addition to the presented theory, we also provide a couple of examples intended to illustrate the described approaches.
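As a rough illustration of two-class ROC analysis, the sketch below sweeps a decision threshold over classifier scores to obtain the ROC points and then integrates them with the trapezoidal rule to get the area under the curve. The function names `roc_points` and `auc_trapezoid` are hypothetical, labels are assumed to be 1 (positive) or 0 (negative), and at least one instance of each class is assumed to be present. It does not cover the multi-class or cost-sensitive extensions mentioned in the abstract.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) found by lowering a threshold over the scores."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    # Rank instances by decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances tied at the same score are admitted together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.3]
print(auc_trapezoid(roc_points(labels, scores)))
```

In this toy example, 8 of the 9 positive/negative pairs are ranked correctly, so the area equals 8/9.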
TL;DR: 1d-SAX is proposed, a method to represent a time series as a sequence of symbols that each contain information about the average and the trend of the series on a segment; the results show that 1d-SAX improves performance using an equal quantity of information, especially when the compression rate increases.
Abstract: SAX (Symbolic Aggregate approXimation) is one of the main symbolization techniques for time series. A well-known limitation of SAX is that trends are not taken into account in the symbolization. This paper proposes 1d-SAX, a method to represent a time series as a sequence of symbols that each contain information about the average and the trend of the series on a segment. We compare the efficiency of SAX and 1d-SAX in terms of goodness-of-fit, retrieval and classification performance for querying a time series database with an asymmetric scheme. The results show that 1d-SAX improves performance using an equal quantity of information, especially when the compression rate increases.
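The idea of describing each segment by both its average and its trend can be sketched as follows. This is a simplified illustration, not the 1d-SAX procedure itself: the average is turned into a symbol using standard-normal quantile breakpoints as in plain SAX, while the slope of a least-squares line is returned as a raw number, because the abstract does not say how the trend is quantized. The function names are hypothetical, the series is assumed to be non-constant, and its length is assumed to be divisible by the number of segments.

```python
from bisect import bisect_left
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


def segment_features(segment):
    """Least-squares line fit over one segment: returns (average, slope)."""
    n = len(segment)
    t_bar = (n - 1) / 2
    avg = mean(segment)
    denom = sum((t - t_bar) ** 2 for t in range(n))
    if denom == 0:  # a single-point segment has no trend
        return avg, 0.0
    slope = sum((t - t_bar) * (x - avg) for t, x in enumerate(segment)) / denom
    return avg, slope


def sax_with_trend(series, n_segments, alphabet="abcd"):
    """Return one (symbol for the average, raw slope) pair per segment."""
    z = znorm(series)
    seg_len = len(z) // n_segments
    k = len(alphabet)
    # Breakpoints split the standard normal into k equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    result = []
    for s in range(n_segments):
        avg, slope = segment_features(z[s * seg_len:(s + 1) * seg_len])
        result.append((alphabet[bisect_left(breakpoints, avg)], slope))
    return result


print(sax_with_trend([0, 1, 2, 3, 4, 5, 6, 7], n_segments=2))
```

For the linearly increasing toy series, the two segments receive the lowest and highest symbols and share the same positive slope, which is exactly the trend information that plain SAX discards.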
TL;DR: A general mathematical framework for formalizing interestingness in a subjective manner is discussed and it is demonstrated how it can be successfully instantiated for a variety of exploratory data mining problems.
Abstract: Exploratory data mining has as its aim to assist a user in improving their understanding about the data. Considering this aim, it seems self-evident that in optimizing this process the data as well as the user need to be considered. Yet, the vast majority of exploratory data mining methods (including most methods for clustering, itemset and association rule mining, subgroup discovery, dimensionality reduction, etc.) formalize interestingness of patterns in an objective manner, disregarding the user altogether. More often than not this leads to subjectively uninteresting patterns being reported.
Here I will discuss a general mathematical framework for formalizing interestingness in a subjective manner. I will further demonstrate how it can be successfully instantiated for a variety of exploratory data mining problems. Finally, I will highlight some connections to other work, and outline some of the challenges and research opportunities ahead.
TL;DR: OVA-TWSVM can outperform the traditional OVA-SVMs classifier and experimental comparisons with other multiclass classifiers demonstrated that comparable performance could be achieved.
Abstract: The twin support vector machine classifier (TWSVM) was proposed by Jayadeva et al. for binary classification problems. TWSVM not only overcomes the difficulties in handling the problem of exemplar unbalance in binary classification problems, but is also four times faster in training a classifier than classical support vector machines. This paper proposes one-versus-all twin support vector machine classifiers (OVA-TWSVM) for multi-category classification problems by utilizing the strengths of TWSVM. OVA-TWSVM extends TWSVM to solve k-category classification problems by developing k TWSVMs, where in the ith TWSVM we only solve the Quadratic Programming Problems (QPPs) for the ith class and get the ith nonparallel hyperplane corresponding to the ith class data. OVA-TWSVM uses the well-known one-versus-all (OVA) approach to construct a corresponding twin support vector machine classifier. We analyze the efficiency of OVA-TWSVM theoretically, and perform experiments to test its efficiency on both synthetic data sets and several benchmark data sets from the UCI machine learning repository. Both the theoretical analysis and experimental results demonstrate that OVA-TWSVM can outperform the traditional OVA-SVMs classifier. Further experimental comparisons with other multiclass classifiers demonstrated that comparable performance could be achieved.
TL;DR: The most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments, and learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters.
Abstract: The problem of clustering with constraints has received considerable attention in the last decade. Indeed, several algorithms have been proposed, but only a few studies have partially compared their performances. In this work, three well-known algorithms for k-means-based clustering with soft constraints, namely Constrained Vector Quantization Error (CVQE), its variant named LCVQE, and the Metric Pairwise Constrained K-Means (MPCK-Means), are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance are presented. In terms of accuracy, LCVQE has been shown to be competitive with CVQE, while violating fewer constraints. In most of the datasets, both CVQE and LCVQE presented better accuracy compared to MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of more specific new experimental findings are discussed in the paper (e.g., deduced constraints usually do not help in finding better data partitions).
TL;DR: In this paper, dynamic weighting combined with a majority vote approach is applied to classify the unlabeled data into reliable and unreliable classes; the reliable data are then added to the training set, and the remaining data, including the unreliable data, are classified in an iterative process.
Abstract: Text categorization is one of the fundamental tasks in text mining. Classical supervised methods need a lot of labeled data to train a classifier. Since assigning labels to a large amount of data is very costly and time-consuming, it is useful to use data sets without labels. Therefore, many different semi-supervised learning methods have been studied recently. Among these semi-supervised methods, self-training is one of the important learning algorithms: it classifies unlabeled samples using a small amount of labeled ones and adds the most confident samples to the training set. In this paper, dynamic weighting combined with a majority vote approach is applied to classify the unlabeled data into reliable and unreliable classes. Then, the reliable data are added to the training set and the remaining data, including the unreliable data, are classified in an iterative process. We tested this method on the extracted features of ten common Reuters-21578 classes. Experimental results indicate that the proposed method improves classification performance and is effective.
TL;DR: Two mining algorithms for maximal correlated weight frequent patterns (MCWP), termed MCWPWA (based on Weight Ascending order) and MCWPSD (based on Support Descending order), are proposed to mine a compact and meaningful set of frequent patterns.
Abstract: Maximal frequent pattern mining has been suggested for data mining to avoid generating a huge set of frequent patterns. Conversely, weighted frequent pattern mining has been proposed to discover important frequent patterns by considering the weighted support. We propose two mining algorithms for maximal correlated weight frequent patterns (MCWP), termed MCWPWA (based on Weight Ascending order) and MCWPSD (based on Support Descending order), to mine a compact and meaningful set of frequent patterns. MCWPSD obtains an advantage in conditional database access, but may not obtain the highest weighted item of the conditional database to mine highly correlated weight frequent patterns. Thus, we suggest a technique that uses additional conditions to prune lowly correlated weight items before the subset checking process. Analyses show that our algorithms are efficient and scalable.
TL;DR: The proposed Computer Aided Diagnostic (CAD) technique has the ability to detect diabetes efficiently by analyzing the subtle changes in ECG signals that are indicative of the presence of diabetes in a patient.
Abstract: Diabetes Mellitus, often referred to as diabetes, is a chronic disease that affects a vast majority of the world population. The percentage of people affected is increasing every year. Diabetes is very difficult to cure; it can only be kept under control. In this scenario, diagnosis of diabetes is of great importance. In this work, we used Heart Rate Variability (HRV) signals obtained from ECG signals for the purpose of diagnosis of diabetes. We employed signal processing methods to extract features from the HRV signal. Since HRV signals are of nonlinear nature, we made use of Higher Order Spectrum (HOS) based features for analysis. In this paper, we have extracted the HOS features from HRV signals corresponding to normal and diabetic subjects. These selected features were fed independently to seven classifiers, namely Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Naive Bayes (NB) classifier, K-Nearest Neighbour (KNN), Probabilistic Neural Network (PNN), Fuzzy classifier, and Decision Tree (DT) classifier. The performance of these classifiers was evaluated using accuracy, sensitivity, specificity, positive predictive value, and the area under the receiver operating characteristics curve. We observed that the GMM classifier presented the highest accuracy of 90.5%, while the other classifiers presented accuracies in the range of 71.4% to 86.5%. Thus, the proposed Computer Aided Diagnostic (CAD) technique has the ability to detect diabetes efficiently by analyzing the subtle changes in ECG signals that are indicative of the presence of diabetes in a patient. Also, we have proposed unique bispectrum and bicoherence plots for normal and diabetic heart rate signals.
TL;DR: Experiments on time series forecasting show that including the constraints in the training phase particularly reduces the risk of overfitting in challenging situations with missing values or a large number of Gaussian components.
Abstract: Gaussian mixture models provide an appealing tool for time series modelling. By embedding the time series to a higher-dimensional space, the density of the points can be estimated by a mixture model. The model can directly be used for short-to-medium term forecasting and missing value imputation. The modelling setup introduces some restrictions on the mixture model, which when appropriately taken into account result in a more accurate model. Experiments on time series forecasting show that including the constraints in the training phase particularly reduces the risk of overfitting in challenging situations with missing values or a large number of Gaussian components.
TL;DR: A novel hierarchical clustering methodology is introduced, based on a dissimilarity measure derived from the Jaccard coefficient between two temporal snapshots of the network, able to detect the turning points at the beginning of the eras.
Abstract: Within the large body of research in complex network analysis, an important topic is the temporal evolution of networks. Existing approaches aim at analyzing the evolution on the global and the local scale, extracting properties of either the entire network or local patterns. In this paper, we focus on detecting clusters of temporal snapshots of a network, to be interpreted as eras of evolution. To this aim, we introduce a novel hierarchical clustering methodology, based on a dissimilarity measure derived from the Jaccard coefficient between two temporal snapshots of the network, able to detect the turning points at the beginning of the eras. We devise a framework to discover and browse the eras, in either a top-down or a bottom-up fashion, supporting the exploration of the evolution at any level of temporal resolution. We show how our approach applies to real networks and null models, by detecting eras in an evolving co-authorship graph extracted from a bibliographic dataset, a collaboration graph extracted from a cinema database, and a network extracted from a database of terrorist attacks; we illustrate how the discovered temporal clustering highlights the crucial moments when the networks witnessed profound changes in their structure. Finally, our approach is enhanced by introducing a meaningful labeling of the obtained clusters, such as the characterizing topics of each discovered era, thus adding a semantic dimension to our analysis.
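A dissimilarity derived from the Jaccard coefficient can be sketched as below, under the assumption that each snapshot is represented as a set of edges; the abstract does not state whether nodes or edges are compared. The hierarchical clustering itself is not reproduced here. As a much simpler stand-in, the sketch only flags the consecutive pair of snapshots with the largest dissimilarity as a candidate turning point. All function names are hypothetical.

```python
def jaccard_dissimilarity(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two sets; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def consecutive_dissimilarities(snapshots):
    """Dissimilarity between each snapshot and the one that follows it."""
    return [jaccard_dissimilarity(s, t) for s, t in zip(snapshots, snapshots[1:])]


def candidate_turning_point(snapshots):
    """Index of the snapshot that starts the most abrupt change (simplified heuristic)."""
    gaps = consecutive_dissimilarities(snapshots)
    return max(range(len(gaps)), key=gaps.__getitem__) + 1


# Toy evolving network: the edge set is replaced entirely at snapshot 2.
snapshots = [
    {(1, 2), (2, 3)},
    {(1, 2), (2, 3), (3, 4)},
    {(5, 6), (6, 7)},
    {(5, 6), (6, 7), (7, 8)},
]
print(consecutive_dissimilarities(snapshots))
print(candidate_turning_point(snapshots))
```

A full treatment would feed the pairwise dissimilarities into a hierarchical clustering so that eras can be browsed at several temporal resolutions, as the abstract describes.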
TL;DR: This paper proposes Dynamic MMHC, an adaptation of the "static" MMHC algorithm, known for its scalability, and illustrates the interest of this method with some experimental results.
Abstract: Dynamic Bayesian networks (DBNs) are a class of probabilistic graphical models that has become a standard tool for modeling various stochastic time-varying phenomena. Probabilistic graphical models such as 2-Time slice BNs (2T-BNs) are the most used and popular models for DBNs. Because of the complexity induced by adding the temporal dimension, DBN structure learning is a very complex task. Existing algorithms are adaptations of score-based BN structure learning algorithms but are often limited when the number of variables is high. In this paper, we focus on DBN structure learning with another family of structure learning algorithms, local search methods, known for their scalability. We propose Dynamic MMHC, an adaptation of the "static" MMHC algorithm. We illustrate the interest of this method with some experimental results.
TL;DR: In this paper, the authors propose a 7.7.7-approximation algorithm for each node, and the algorithm works well on all the nodes in the tree-line.
TL;DR: A mining algorithm, called HUM-UT (High Utility itemsets Mining based on UT-Tree), is proposed to find high utility itemsets from transactional data streams; it has better performance and is more stable under different experimental conditions than the state-of-the-art algorithm HUPMS in terms of time and space.
Abstract: High utility itemsets mining is a hot topic in data stream mining. It is essential that the mining algorithm be efficient in both time and space, because a data stream is continuous and unbounded. To the best of our knowledge, the existing algorithms require multiple database scans to mine high utility itemsets, and this hinders their efficiency. In this paper, we propose a new data structure, called UT-Tree (Utility on Tail Tree), for maintaining utility information of transaction itemsets to avoid multiple database scans. The UT-Tree is created with one database scan, and contains a fixed number of transaction itemsets; utility information is stored on tail-nodes only. Based on the proposed data structure and the sliding window approach, we propose a mining algorithm, called HUM-UT (High Utility itemsets Mining based on UT-Tree), to find high utility itemsets from transactional data streams. The HUM-UT algorithm mines high utility itemsets from the UT-Tree without an additional database scan. Experimental results show that our algorithm has better performance and is more stable under different experimental conditions than the state-of-the-art algorithm HUPMS in terms of time and space.
TL;DR: Based on an analysis of the characteristics of incremental sequential pattern mining algorithms, it is inferred that the better approach is incremental mining on the progressive database, which gives scope for future research directions.
Abstract: Sequential pattern mining is a vital problem with broad applications. However, it is also challenging, as a combinatorially high number of intermediate subsequences is generated, all of which have to be critically examined. Most of the basic solutions are based on the assumption that the mining is performed on a static database. But modern-day databases are continuously updated and are dynamic in nature. So, incremental mining of sequential patterns has become the norm. This article investigates the need for incremental mining of sequential patterns. An analytical study, focusing on the characteristics, has been made of more than twenty incremental mining algorithms. Further, we have discussed the issues associated with each of them. We infer that the better approach is incremental mining on the progressive database. The three most relevant algorithms based on this approach are also studied in depth, along with the other work done in this area. This gives scope for future research directions.
TL;DR: The proposed TweM (Tweet Miner) framework entails the discovery of hidden and high-level correlations, in the form of generalized association rules, among the content and the contextual features of posts published on Twitter (i.e., the tweets).
Abstract: The increasing availability of user-generated content coming from online communities allows the analysis of common user behaviors and trends in social network usage. This paper presents the TweM (Tweet Miner) framework that entails the discovery of hidden and high-level correlations, in the form of generalized association rules, among the content and the contextual features of posts published on Twitter (i.e., the tweets). To effectively support knowledge discovery from tweets, the TweM framework performs two main steps: (i) taxonomy generation over tweet keywords and context data and (ii) generalized association rule mining, driven by the generated taxonomy, from a sequence of tweet collections. Unlike traditional mining approaches, the generalized rule mining session performed on the current tweet collection also considers the evolution of the extracted patterns across the sequence of the previous mining sessions to prevent the discarding of rare knowledge that frequently occurs in a number of past extractions. Experiments, performed on both real Twitter posts and synthetic datasets, show the effectiveness and the efficiency of the proposed TweM framework in supporting knowledge discovery from Twitter user-generated content.
TL;DR: This paper presents a more in-depth analysis of the concept of Widened Data Mining, which aims at reducing the impact of greedy heuristics by exploring more than just one suitable solution at each step, and focuses on how diversity considerations can substantially improve results.
Abstract: This paper follows our earlier publication [1], where we introduced the idea of tuned data mining, which draws on parallel resources to improve model accuracy rather than the usual focus on speed-up. In this paper we present a more in-depth analysis of the concept of Widened Data Mining, which aims at reducing the impact of greedy heuristics by exploring more than just one suitable solution at each step. In particular we focus on how diversity considerations can substantially improve results. We again use the greedy algorithm for the set cover problem to demonstrate these effects in practice.
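The effect of exploring more than one candidate per step can be illustrated on set cover with a small beam-style search. This is only a sketch under stated assumptions: the function name `widened_set_cover` and the `width` parameter are hypothetical, partial solutions are ranked purely by the number of covered elements, and no explicit diversity criterion is applied, although diversity is the focus of the paper. With `width=1` the procedure behaves like the plain greedy algorithm.

```python
def widened_set_cover(universe, subsets, width):
    """Keep the `width` best partial covers per step instead of a single one.

    Returns the indices of the chosen subsets. Partial covers are ranked by
    how many elements of the universe they cover.
    """
    universe = frozenset(universe)
    if not universe:
        return []
    beam = [((), frozenset())]  # pairs of (chosen indices, covered elements)
    while beam:
        candidates = {}
        for chosen, covered in beam:
            for i, subset in enumerate(subsets):
                new_covered = covered | (frozenset(subset) & universe)
                if len(new_covered) > len(covered):  # only useful extensions
                    key = tuple(sorted(chosen + (i,)))
                    candidates[key] = new_covered
        # Stop as soon as any candidate at this depth covers everything.
        for key, covered in candidates.items():
            if covered == universe:
                return list(key)
        ranked = sorted(candidates.items(), key=lambda kv: -len(kv[1]))
        beam = ranked[:width]
    raise ValueError("the subsets cannot cover the universe")


universe = {1, 2, 3, 4, 5, 6}
subsets = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}]
print(widened_set_cover(universe, subsets, width=1))  # greedy behaviour
print(widened_set_cover(universe, subsets, width=2))  # widened search
```

In this toy instance the greedy choice of the largest subset forces a three-set cover, while keeping two candidates per step finds the two-set cover formed by the second and third subsets.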
TL;DR: This work proposes a multiple temporal matching approach that reveals the commonly shared features within classes, and the most differential ones across classes, relying on a new framework based on the variance/covariance criterion.
Abstract: In real applications, time series are generally of complex structure, exhibiting different global behaviors within classes. To discriminate such challenging time series, we propose a multiple temporal matching approach that reveals the commonly shared features within classes, and the most differential ones across classes. For this, we rely on a new framework based on the variance/covariance criterion to strengthen or weaken matched observations according to the induced variability within and between classes. The experiments performed on real and synthetic datasets demonstrate the ability of the multiple temporal matching approach to capture fine-grained distinctions between time series.
TL;DR: This work defines the support of an item set in this setting based on a maximum independent set approach allowing for efficient computation and shows how the enumeration and test of candidate sets can be made efficient by properly reducing the event sequences and exploiting perfect extension pruning.
Abstract: We consider the task of finding frequent patterns in parallel point processes--also known as finding frequent parallel episodes in event sequences. This task can be seen as a generalization of frequent item set mining: the co-occurrence of items or events in transactions is replaced by their imprecise co-occurrence on a continuous time scale, meaning that they occur in a limited time span from each other. We define the support of an item set in this setting based on a maximum independent set approach allowing for efficient computation. Furthermore, we show how the enumeration and test of candidate sets can be made efficient by properly reducing the event sequences and exploiting perfect extension pruning. Finally, we study how the resulting frequent item sets/patterns can be filtered for closed and maximal sets.
TL;DR: This work presents a novel framework for feature extraction that is based on the notions of entity matching and attribute frequencies, introduces different methods and metrics for ranking, combines them with existing traditional techniques, and studies their performance using real and synthetic data.
Abstract: Log information describing the items the users have selected from the set of answers a query engine returns to their queries constitutes an excellent form of indirect user feedback that has been extensively used on the web to improve the effectiveness of search engines. In this work we study how the logs can be exploited to improve the ranking of the results returned by an entity search engine. Entity search engines are becoming more and more popular as the web is changing from a web of documents into a "web of things". We show that entity search engines pose new challenges since their model is different from the one documents are based on. We present a novel framework for feature extraction that is based on the notions of entity matching and attribute frequencies. The extracted features are then used to train a ranking classifier. We introduce different methods and metrics for ranking, we combine them with existing traditional techniques and we study their performance using real and synthetic data. The experiments show that our technique provides better results in terms of accuracy.
TL;DR: This paper investigates the data mining challenge of partially ordered pattern mining of sequential data by describing OrderSpan, a new algorithm that extracts such patterns from sequential databases and overcomes some of the drawbacks of existing methods.
Abstract: Due to the complexity of the task, partially ordered pattern mining of sequential data has not been subject to much study, despite its usefulness. This paper investigates this data mining challenge by describing OrderSpan, a new algorithm that extracts such patterns from sequential databases and overcomes some of the drawbacks of existing methods. Our work consists in providing a simple and flexible framework to directly mine complex sequences of itemsets, by combining well-known properties on prefixes and suffixes. Experiments were performed on different real datasets to show the benefit of partially ordered patterns.
TL;DR: A combined Semi-Supervised and Active Learning approach for Sequence Labeling is proposed, which greatly reduces manual annotation cost in that only highly uncertain tokens need to be manually labeled, while other sequences and subsequences are labeled automatically.
Abstract: In recent years, many NLP systems and tasks have been developed using machine learning methods. In order to achieve the best performance, these systems are generally trained on a large human-annotated corpus. Since annotating such corpora is a very expensive and time-consuming procedure, manual corpus annotation has become one of the significant issues in many text-based tasks such as text mining, semantic annotation, Named Entity Recognition and, more generally, Information Extraction. Semi-supervised Learning and Active Learning are two distinct approaches that deal with the reduction of labeling costs. Given their natures, active and semi-supervised learning can produce better results when they are jointly applied. In this paper we propose a combined Semi-Supervised and Active Learning approach for Sequence Labeling which greatly reduces manual annotation cost in that only highly uncertain tokens need to be manually labeled, while other sequences and subsequences are labeled automatically. The proposed approach reduces manual annotation cost by around 90% compared with supervised learning and by 30% compared with a similar fully active learning approach. Conditional Random Field (CRF) is chosen as the underlying learning model due to its promising performance in many sequence labeling tasks. In addition, we propose a confidence measure based on the model's variance reduction that reaches a considerable accuracy for finding informative samples.
TL;DR: Labeled K-Means (LK-Means) is presented, an algorithm for supervised clustering based on a variant of K-Means that incorporates information about class labels; in most cases it outperforms the alternative techniques by a considerable margin.
Abstract: Clustering is a relevant problem in machine learning where the main goal is to locate meaningful partitions of unlabeled data. In the case of labeled data, a related problem is supervised clustering, where the objective is to locate class-uniform clusters. Most current approaches to supervised clustering optimize a score related to cluster purity with respect to class labels. In particular, we present Labeled K-Means (LK-Means), an algorithm for supervised clustering based on a variant of K-Means that incorporates information about class labels. LK-Means replaces the classical cost function of K-Means by a convex combination of the joint cost associated with: (i) a discriminative score based on class labels, and (ii) a generative score based on a traditional metric for unsupervised clustering. We test the performance of LK-Means using standard real datasets and an application for object recognition. Moreover, we also compare its performance against classical K-Means and a popular K-Medoids-based supervised clustering method. Our experiments show that, in most cases, LK-Means outperforms the alternative techniques by a considerable margin. Furthermore, LK-Means presents execution times considerably lower than the alternative supervised clustering method under evaluation.
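The convex combination mentioned in the abstract can be written generically as follows. The symbols are assumptions introduced here for illustration and are not taken from the paper: $\mathcal{C}$ denotes a clustering, $D$ the discriminative score based on class labels, $G$ the generative score based on an unsupervised clustering metric, and $\lambda$ the mixing weight.

```latex
J(\mathcal{C}) = \lambda \, D(\mathcal{C}) + (1 - \lambda) \, G(\mathcal{C}),
\qquad \lambda \in [0, 1]
```

Under this reading, $\lambda = 0$ leaves only the generative term and $\lambda = 1$ only the discriminative term, so the weight controls how strongly class labels influence the clusters.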
TL;DR: Starting from the observation that people want answers to questions, not simply data, this work explores some of the difficulties and risks which lie in the path of realising the opportunities of big data and open data.
Abstract: Big data and open data promise tremendous advances. But the media hype ignores the difficulties and the risks associated with this promise. Beginning with the observation that people want answers to questions, not simply data, I explore some of the difficulties and risks which lie in the path of realising the opportunities.
TL;DR: This work proposes a protocol to evaluate association rule quality measures themselves, and indicates that multiplicative improvement and, to a lesser extent, support and leverage (a.k.a. weighted relative accuracy) tend to obtain better results than the other measures.
Abstract: The practical success of association rule mining depends heavily on the criterion to choose among the many rules often mined. Many rule quality measures exist in the literature. We propose a protocol to evaluate the evaluation measures themselves. For each association rule, we measure the improvement in accuracy that a commonly used predictor can obtain from an additional feature, constructed according to the exceptions to the rule. We select a reference set of rules that are helpful in this sense. Then, our evaluation method takes into account both how many of these helpful rules are found near the top rules for a given quality measure, and how near the top they are. We focus on seven association rule quality measures. Our experiments indicate that multiplicative improvement and, to a lesser extent, support and leverage (a.k.a. weighted relative accuracy) tend to obtain better results than the other measures.
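Two of the measures named above, support and leverage, together with the familiar confidence, can be computed for a rule X → Y as in the sketch below. It uses the standard textbook definitions; the function names and the toy transactions are assumptions for illustration, and multiplicative improvement is left out because its definition is not given in the abstract.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """supp(X ∪ Y) / supp(X) for the rule X -> Y."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


def leverage(transactions, antecedent, consequent):
    """supp(X ∪ Y) - supp(X) * supp(Y): departure from independence."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both - support(transactions, antecedent) * support(transactions, consequent)


transactions = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}, {"c"}]
print(support(transactions, {"a", "b"}))
print(confidence(transactions, {"a"}, {"b"}))
print(leverage(transactions, {"a"}, {"b"}))
```

A leverage of zero means the antecedent and consequent co-occur exactly as often as independence would predict, which is why ranking rules by leverage can differ noticeably from ranking them by support or confidence alone.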
TL;DR: ICE-CREAM is introduced, a novel approach to action mining that explicitly relies on an automatically obtained best estimate of the causal relationships in the data to suggest actions with desirable effects.
Abstract: In many business contexts, the ultimate goal of knowledge discovery is not the knowledge itself, but putting it to use. Models or patterns found by data mining methods often require further post-processing to bring this about. For instance, in churn prediction, data mining may give a model that predicts which customers are likely to end their contract, but companies are not just interested in knowing who is likely to do so, they want to know what they can do to avoid this. The models or patterns have to be transformed into actionable knowledge. Action mining explicitly addresses this. Currently, many action mining methods rely on a predictive model, obtained through data mining, to estimate the effect of certain actions and finally suggest actions with desirable effects. A major problem with this approach is that predictive models do not necessarily reflect a causal relationship between their inputs and outputs. This makes the existing action mining methods less reliable. In this paper, we introduce ICE-CREAM, a novel approach to action mining that explicitly relies on an automatically obtained best estimate of the causal relationships in the data. Experiments confirm that ICE-CREAM performs much better than the current state of the art in action mining.
TL;DR: This work designs the pattern domain of multi-dimensional association rules, i.e., non trivial extensions of the popular association rules that may involve subsets of any dimensions in their antecedents and their consequents and proposes optimizations to support both rule extraction scalability and non redundancy.
Abstract: Graph mining methods have become quite popular and a timely challenge is to discover dynamic properties in evolving graphs or networks. We consider the so-called relational dynamic oriented graphs that can be encoded as n-ary relations with n ≥ 3 and thus represented by Boolean tensors. Two dimensions are used to encode the graph adjacency matrices and at least one other denotes time. We design the pattern domain of multi-dimensional association rules, i.e., non-trivial extensions of the popular association rules that may involve subsets of any dimensions in their antecedents and their consequents. First, we design new objective interestingness measures for such rules, which leads to different approaches for measuring the rule confidence. Second, we must compute collections of a priori interesting rules. This is considered here as a post-processing of the closed patterns that can be extracted efficiently from Boolean tensors. We propose optimizations to support both rule extraction scalability and non-redundancy. We illustrate the added value of this new data mining task to discover patterns from a real-life relational dynamic graph.
TL;DR: This paper defines group association rules and studies interestingness measures for them, which can be used to rank not only groups of individuals, but also rules within each group.
Abstract: The work described in this paper addresses the study of association rules within groups of individuals. The analysis of the characteristics and the behavior of the individuals belonging to such groups in a given database is powerful in practice, since it provides a mechanism to deal with groups rather than isolated individuals. In this paper, we define group association rules and we study interestingness measures for them. These interestingness measures can be used to rank, not only groups of individuals, but also rules within each group. We also compare the rankings provided by those different interestingness measures in order to determine which one provides a better alternative depending on the kind of situations we wish to highlight within large databases with many different and overlapping groups of individuals.
TL;DR: This paper creates a graph representation of the story evolution that is called story graphs, and investigates how graph structure can be used for detecting and discovering new developments in the story, and creates an evaluation framework which bridges the gap between temporal text mining patterns and sentences.
Abstract: With the growing number of document sets accessible online, tracking their evolution over time (story tracking) has become an increasingly interesting problem. In this paper we propose a story tracking method based on the dynamics of keyword-association graphs. We create a graph representation of the story evolution that we call story graphs, and investigate how graph structure can be used for detecting and discovering new developments in the story. First we investigate the possibly interesting graph properties for development detection. We continue by investigating how graph structure can be linked to the sentences representing developments. For this we create an evaluation framework which bridges the gap between temporal text mining patterns and sentences. We apply this framework to evaluate our method against other temporal text mining methods. Our experiments show that story graphs perform at similar levels overall, but provide distinctive advantages in some settings.
TL;DR: This paper investigates the effect of cost ratio, imbalance ratio and sample size on classification performance using a real-world French bankruptcy database and shows that the cost ratio and the level of class imbalance have a strong effect on prediction performance.
Abstract: Skewed class distribution and non-uniform misclassification cost are pervasive in many real-world domains such as bankruptcy prediction, medical diagnosis, and intrusion detection. Although class imbalance learning and cost-sensitive learning can be manipulated in a unified framework as was illustrated in previous studies, the influence of class distribution on cost-sensitive learning still needs clarification. In this paper, we investigate the effect of cost ratio, imbalance ratio and sample size on classification performance using a real-world French bankruptcy database. The results show that the cost ratio and the level of class imbalance have a strong effect on prediction performance. A near-balanced training data set is favorable when a relatively uniform cost ratio is used, whereas a near-natural class distribution is favorable when a highly uneven cost ratio is used.
TL;DR: A learning algorithm, called SELF (SEmi-supervised Learning via FCA), is presented, which performs as a multiclass classifier and a label ranker for mixed-type data containing both discrete and continuous variables, while only a few learning algorithms such as the decision tree-based classifier can directly handle mixed-type data.
Abstract: We propose a new approach for semi-supervised learning using closed set lattices, which have been recently used for frequent pattern mining within the framework of the data analysis technique of Formal Concept Analysis (FCA). We present a learning algorithm, called SELF (SEmi-supervised Learning via FCA), which performs as a multiclass classifier and a label ranker for mixed-type data containing both discrete and continuous variables, while only a few learning algorithms such as the decision tree-based classifier can directly handle mixed-type data. From both labeled and unlabeled data, SELF constructs a closed set lattice, which is a partially ordered set of data clusters with respect to subset inclusion, via FCA together with discretizing continuous variables, followed by learning classification rules through finding maximal clusters on the lattice. Moreover, it can weight each classification rule using the lattice, which gives a partial order of preference over class labels. We illustrate experimentally the competitive performance of SELF in classification and ranking compared to other learning algorithms using UCI datasets.