Top 103 papers presented at Intelligent Data Analysis in 2013

ROC analysis of classifiers in machine learning: A survey

Matjaž Majnik¹, Zoran Bosnić¹•Institutions (1)

1 May 2013

TL;DR: A survey of the application areas of ROC analysis in machine learning is presented, its problems and challenges are described, and a summarized list of alternative approaches to ROC analysis is provided.

Abstract: The use of ROC (Receiver Operating Characteristics) analysis as a tool for evaluating the performance of classification models in machine learning has been increasing in the last decade. Among the most notable advances in this area are the extension of two-class ROC analysis to the multi-class case as well as the employment of ROC analysis in cost-sensitive learning. Methods now exist which take instance-varying costs into account. The purpose of our paper is to present a survey of this field with the aim of gathering important achievements in one place. In the paper, we present application areas of ROC analysis in machine learning, describe its problems and challenges, and provide a summarized list of alternative approaches to ROC analysis. In addition to the presented theory, we also provide a couple of examples intended to illustrate the described approaches.
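As a rough illustration of the two-class ROC analysis the abstract refers to, the sketch below builds ROC points by sweeping a threshold over classifier scores and integrates them with the trapezoidal rule. It is not taken from the paper: the function names `roc_points` and `auc` are hypothetical, labels are assumed to be 0/1, and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering a threshold over the scores.

    Assumes labels are 0/1 and that at least one positive and one negative exist.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Highest score first; tied scores are consumed together as one step.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

The area computed this way equals the fraction of positive/negative pairs that the scores rank correctly, which is one reason AUC is commonly used as a threshold-independent summary.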

130 citations

Book Chapter•10.1007/978-3-642-41398-8_24•

1d-SAX: A Novel Symbolic Representation for Time Series

Simon Malinowski¹, Thomas Guyet¹, René Quiniou², Romain Tavenard³•Institutions (3)

Agrocampus Ouest¹, French Institute for Research in Computer Science and Automation², Idiap Research Institute³

17 Oct 2013

TL;DR: 1d-SAX is proposed, a method that represents a time series as a sequence of symbols that each contain information about the average and the trend of the series on a segment; results show that 1d-SAX improves performance using an equal quantity of information, especially when the compression rate increases.

Abstract: SAX (Symbolic Aggregate approXimation) is one of the main symbolization techniques for time series. A well-known limitation of SAX is that trends are not taken into account in the symbolization. This paper proposes 1d-SAX, a method to represent a time series as a sequence of symbols that each contain information about the average and the trend of the series on a segment. We compare the efficiency of SAX and 1d-SAX in terms of goodness-of-fit, retrieval and classification performance for querying a time series database with an asymmetric scheme. The results show that 1d-SAX improves performance using an equal quantity of information, especially when the compression rate increases.
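For readers unfamiliar with the baseline, here is a minimal sketch of plain SAX only (not 1d-SAX): z-normalize the series, average it over equal-length segments, and map each segment mean to a symbol using equiprobable breakpoints of the standard normal distribution. The function name `sax`, the default alphabet, and the assumption that the series length is divisible by the number of segments are illustrative choices, not details from the paper.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Plain SAX: one symbol per segment, based on the segment mean only.

    Assumes len(series) is divisible by n_segments and the series is not constant.
    """
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma for x in series]

    # Piecewise aggregate approximation: mean of each equal-length segment.
    seg_len = len(z) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # Breakpoints split the standard normal into len(alphabet) equiprobable bins.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # A symbol's index is the number of breakpoints its segment mean exceeds.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)
```

Because each symbol depends only on a segment mean, a rising and a falling segment with the same average receive the same symbol. That is the limitation the abstract says 1d-SAX addresses by also encoding the trend.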

89 citations

Book Chapter•10.1007/978-3-642-41398-8_3•

Subjective Interestingness in Exploratory Data Mining

Tijl De Bie¹•Institutions (1)

University of Bristol¹

17 Oct 2013

TL;DR: A general mathematical framework for formalizing interestingness in a subjective manner is discussed and it is demonstrated how it can be successfully instantiated for a variety of exploratory data mining problems.

Abstract: Exploratory data mining has as its aim to assist a user in improving their understanding about the data. Considering this aim, it seems self-evident that in optimizing this process the data as well as the user need to be considered. Yet, the vast majority of exploratory data mining methods, including most methods for clustering, itemset and association rule mining, subgroup discovery, dimensionality reduction, etc., formalize interestingness of patterns in an objective manner, disregarding the user altogether. More often than not this leads to subjectively uninteresting patterns being reported. Here I will discuss a general mathematical framework for formalizing interestingness in a subjective manner. I will further demonstrate how it can be successfully instantiated for a variety of exploratory data mining problems. Finally, I will highlight some connections to other work, and outline some of the challenges and research opportunities ahead.

68 citations

Journal Article•10.3233/IDA-130598•

Extending twin support vector machine classifier for multi-category classification problems

Juanying Xie¹, Kate Hone², Weixin Xie³, Xinbo Gao⁴, Yong Shi⁵, Xiaohui Liu²•Institutions (5)

Shaanxi Normal University¹, Brunel University London², Shenzhen University³, Xidian University⁴, Chinese Academy of Sciences⁵

1 Jul 2013

TL;DR: OVA-TWSVM can outperform the traditional OVA-SVMs classifier and experimental comparisons with other multiclass classifiers demonstrated that comparable performance could be achieved.

Abstract: The twin support vector machine classifier (TWSVM) was proposed by Jayadeva et al. for binary classification problems. TWSVM not only overcomes the difficulties in handling the problem of exemplar unbalance in binary classification problems, but is also four times faster in training a classifier than classical support vector machines. This paper proposes one-versus-all twin support vector machine classifiers (OVA-TWSVM) for multi-category classification problems by utilizing the strengths of TWSVM. OVA-TWSVM extends TWSVM to solve k-category classification problems by developing k TWSVMs, where in the ith TWSVM we only solve the Quadratic Programming Problems (QPPs) for the ith class and get the ith nonparallel hyperplane corresponding to the ith class data. OVA-TWSVM uses the well-known one-versus-all (OVA) approach to construct a corresponding twin support vector machine classifier. We analyze the efficiency of OVA-TWSVM theoretically, and perform experiments to test its efficiency on both synthetic data sets and several benchmark data sets from the UCI machine learning repository. Both the theoretical analysis and experimental results demonstrate that OVA-TWSVM can outperform the traditional OVA-SVMs classifier. Further experimental comparisons with other multiclass classifiers demonstrated that comparable performance could be achieved.

61 citations

Journal Article•10.3233/IDA-130590•

A study of K-Means-based algorithms for constrained clustering

Thiago F. Covões¹, Eduardo R. Hruschka¹, Joydeep Ghosh¹•Institutions (1)

University of Texas at Austin¹

1 May 2013

TL;DR: The most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments, and learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters.

Abstract: The problem of clustering with constraints has received considerable attention in the last decade. Indeed, several algorithms have been proposed, but only a few studies have partially compared their performances. In this work, three well-known algorithms for k-means-based clustering with soft constraints, namely Constrained Vector Quantization Error (CVQE), its variant named LCVQE, and the Metric Pairwise Constrained K-Means (MPCK-Means), are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance are presented. In terms of accuracy, LCVQE has been shown to be competitive with CVQE, while violating fewer constraints. In most of the datasets, both CVQE and LCVQE presented better accuracy compared to MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of more specific new experimental findings are discussed in the paper (e.g., deduced constraints usually do not help in finding better data partitions).
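One of the three evaluation criteria named in the abstract, the number of violated constraints, is simple enough to sketch. The snippet assumes the usual pairwise formulation (must-link and cannot-link pairs of point indices); the function name and argument layout are hypothetical, and this is not the authors' evaluation code.

```python
def violated_constraints(assignment, must_link, cannot_link):
    """Count pairwise constraints that a clustering breaks.

    assignment: cluster id per point, indexed by point.
    must_link / cannot_link: iterables of (i, j) index pairs.
    """
    # A must-link pair is violated when its points land in different clusters.
    broken_ml = sum(assignment[i] != assignment[j] for i, j in must_link)
    # A cannot-link pair is violated when its points share a cluster.
    broken_cl = sum(assignment[i] == assignment[j] for i, j in cannot_link)
    return broken_ml + broken_cl
```

With soft constraints, as in the algorithms compared here, a nonzero count is allowed, which is why it is reported alongside the accuracy measures.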

50 citations

Journal Article•10.3233/IDA-130584•

Semi-supervised text categorization: Exploiting unlabeled data using ensemble learning algorithms

Mohammad Reza Keyvanpour¹, Maryam Bahojb Imani¹•Institutions (1)

Alzahra University¹

1 May 2013

TL;DR: In this paper, a dynamic weighting approach alongside majority voting is applied to classify the unlabeled data into reliable and unreliable classes; the reliable data are then added to the training set and the remaining data, including the unreliable data, are classified in an iterative process.

Abstract: Text categorization is one of the fundamental tasks in text mining. Classical supervised methods need a lot of labeled data to train a classifier. Since assigning labels to a large amount of data is very costly and time consuming, it is useful to use data sets without labels. Many different semi-supervised learning methods have therefore been studied recently. Among these semi-supervised methods, self-training is one of the important learning algorithms: it classifies unlabeled samples using a small amount of labeled ones and adds the most confident samples to the training set. In this paper, a dynamic weighting approach alongside majority voting is applied to classify the unlabeled data into reliable and unreliable classes. Then, the reliable data are added to the training set and the remaining data, including the unreliable data, are classified in an iterative process. We tested this method on the extracted features of ten common Reuters-21578 classes. Experimental results indicate that the proposed method improves classification performance and is effective.

34 citations

Journal Article•10.3233/IDA-130612•

Efficient mining of maximal correlated weight frequent patterns

Unil Yun¹, Keun Ho Ryu²•Institutions (2)

Sejong University¹, Chungbuk National University²

1 Sep 2013

TL;DR: Two mining algorithms of maximal correlated weight frequent pattern (MCWP), termed MCWPWA, based on Weight Ascending order, and MCWPSD, based on Support Descending order, are proposed to mine a compact and meaningful set of frequent patterns.

Abstract: Maximal frequent pattern mining has been suggested for data mining to avoid generating a huge set of frequent patterns. Conversely, weighted frequent pattern mining has been proposed to discover important frequent patterns by considering the weighted support. We propose two mining algorithms of maximal correlated weight frequent pattern (MCWP), termed MCWPWA, based on Weight Ascending order, and MCWPSD, based on Support Descending order, to mine a compact and meaningful set of frequent patterns. MCWPSD obtains an advantage in conditional database access, but may not obtain the highest weighted item of the conditional database to mine highly correlated weight frequent patterns. Thus, we suggest a technique that uses additional conditions to prune lowly correlated weight items before the subsets checking process. Analyses show that our algorithms are efficient and scalable.

33 citations

Journal Article•10.3233/IDA-130580•

Automated detection of diabetes using higher order spectral features extracted from heart rate signals

G. Swapna¹, U. Rajendra Acharya², S. VinithaSree³, Jasjit S. Suri⁴•Institutions (4)

Government Engineering College, Sreekrishnapuram¹, Ngee Ann Polytechnic², Nanyang Technological University³, Idaho State University⁴

1 Mar 2013

TL;DR: The proposed Computer Aided Diagnostic (CAD) technique has the ability to detect diabetes efficiently by analyzing the subtle changes in ECG signals that are indicative of the presence of diabetes in a patient.

Abstract: Diabetes Mellitus, often referred to as diabetes, is a chronic disease that affects a vast majority of the world population. The percentage of people affected is increasing every year. Diabetes is very difficult to cure. It can only be kept under control. In this scenario, diagnosis of diabetes is of great importance. In this work, we used Heart Rate Variability (HRV) signals obtained from ECG signals for the purpose of diagnosis of diabetes. We employed signal processing methods to extract features from the HRV signal. Since HRV signals are of nonlinear nature, we made use of Higher Order Spectrum (HOS) based features for analysis. In this paper, we have extracted the HOS features from HRV signals corresponding to normal and diabetic subjects. These selected features were fed independently to seven classifiers, namely Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Naive Bayes classifier (NB), K-Nearest Neighbour (KNN), Probabilistic Neural Network (PNN), Fuzzy classifier and Decision Tree (DT) classifier. The performance of these classifiers was evaluated using accuracy, sensitivity, specificity, positive predictive value, and the area under the receiver operating characteristics curve measures. We observed that the GMM classifier presented the highest accuracy of 90.5%, while the other classifiers presented accuracies in the range of 71.4% to 86.5%. Thus, the proposed Computer Aided Diagnostic (CAD) technique has the ability to detect diabetes efficiently by analyzing the subtle changes in ECG signals that are indicative of the presence of diabetes in a patient. Also, we have proposed unique bispectrum and bicoherence plots for normal and diabetes heart rate signals.

33 citations

Book Chapter•10.1007/978-3-642-41398-8_15•

Gaussian Mixture Models for Time Series Modelling, Forecasting, and Interpolation

Emil Eirola¹, Amaury Lendasse¹•Institutions (1)

Aalto University¹

17 Oct 2013

TL;DR: Experiments on time series forecasting show that including the constraints in the training phase particularly reduces the risk of overfitting in challenging situations with missing values or a large number of Gaussian components.

Abstract: Gaussian mixture models provide an appealing tool for time series modelling. By embedding the time series to a higher-dimensional space, the density of the points can be estimated by a mixture model. The model can directly be used for short-to-medium term forecasting and missing value imputation. The modelling setup introduces some restrictions on the mixture model, which when appropriately taken into account result in a more accurate model. Experiments on time series forecasting show that including the constraints in the training phase particularly reduces the risk of overfitting in challenging situations with missing values or a large number of Gaussian components.
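The embedding step mentioned in the abstract can be sketched as a sliding window over the series, which turns a scalar sequence into points in a higher-dimensional space. The unit lag and the function name `delay_embed` are assumptions made for illustration; the paper's exact embedding and the constraints it places on the mixture model are not reproduced here.

```python
def delay_embed(series, dim):
    """Turn a 1-D series into overlapping windows of length dim (lag 1).

    Each window is one point in a dim-dimensional space; a density model
    could then be fitted to these points.
    """
    if dim < 1 or dim > len(series):
        raise ValueError("dim must be between 1 and len(series)")
    return [list(series[i:i + dim]) for i in range(len(series) - dim + 1)]
```

Consecutive windows share all but one coordinate. This overlap is presumably one source of the restrictions on the mixture model that the abstract says the modelling setup introduces.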

31 citations

Journal Article•10.3233/IDA-120566•

Evolving networks: Eras and turning points

Michele Berlingerio¹, Michele Coscia², Fosca Giannotti¹, Anna Monreale², Dino Pedreschi²•Institutions (2)

Istituto di Scienza e Tecnologie dell'Informazione¹, University of Pisa²

1 Jan 2013

TL;DR: A novel hierarchical clustering methodology is introduced, based on a dissimilarity measure derived from the Jaccard coefficient between two temporal snapshots of the network, able to detect the turning points at the beginning of the eras.

Abstract: Within the large body of research in complex network analysis, an important topic is the temporal evolution of networks. Existing approaches aim at analyzing the evolution on the global and the local scale, extracting properties of either the entire network or local patterns. In this paper, we focus on detecting clusters of temporal snapshots of a network, to be interpreted as eras of evolution. To this aim, we introduce a novel hierarchical clustering methodology, based on a dissimilarity measure derived from the Jaccard coefficient between two temporal snapshots of the network, able to detect the turning points at the beginning of the eras. We devise a framework to discover and browse the eras, either in top-down or a bottom-up fashion, supporting the exploration of the evolution at any level of temporal resolution. We show how our approach applies to real networks and null models, by detecting eras in an evolving co-authorship graph extracted from a bibliographic dataset, a collaboration graph extracted from a cinema database, and a network extracted from a database of terrorist attacks; we illustrate how the discovered temporal clustering highlights the crucial moments when the networks witnessed profound changes in their structure. Our approach is finally boosted by introducing a meaningful labeling of the obtained clusters, such as the characterizing topics of each discovered era, thus adding a semantic dimension to our analysis.
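To make the snapshot comparison concrete, the sketch below computes one minus the Jaccard coefficient of two edge sets. Representing a snapshot as a set of edges and using exactly `1 - J` are assumptions for illustration; the abstract only states that the dissimilarity is derived from the Jaccard coefficient, so the paper's measure may differ.

```python
def jaccard_dissimilarity(snapshot_a, snapshot_b):
    """1 - |A ∩ B| / |A ∪ B| for two snapshots given as sets of edges."""
    union = snapshot_a | snapshot_b
    if not union:
        # Two empty snapshots are treated as identical.
        return 0.0
    return 1 - len(snapshot_a & snapshot_b) / len(union)
```

A jump in this value between consecutive snapshots is the kind of signal a hierarchical clustering over time could use to separate one era from the next.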

28 citations

Book Chapter•10.1007/978-3-642-41398-8_34•

Dynamic MMHC: A Local Search Algorithm for Dynamic Bayesian Network Structure Learning

Ghada Trabelsi¹, Philippe Leray², Mounir Ben Ayed¹, Adel M. Alimi¹•Institutions (2)

University of Sfax¹, University of Nantes²

17 Oct 2013

TL;DR: This paper proposes Dynamic MMHC, an adaptation of the "static" MMHC algorithm, known for its scalability, and illustrates the interest of this method with some experimental results.

Abstract: Dynamic Bayesian networks (DBNs) are a class of probabilistic graphical models that has become a standard tool for modeling various stochastic time-varying phenomena. Probabilistic graphical models such as 2-Time slice BNs (2T-BNs) are the most used and popular models for DBNs. Because of the complexity induced by adding the temporal dimension, DBN structure learning is a very complex task. Existing algorithms are adaptations of score-based BN structure learning algorithms but are often limited when the number of variables is high. In this paper, we focus on DBN structure learning with another family of structure learning algorithms, local search methods, known for their scalability. We propose Dynamic MMHC, an adaptation of the "static" MMHC algorithm. We illustrate the interest of this method with some experimental results.

Proceedings Article•

Data Analysis Challenges in the Future Energy Domain

Frank Eichinger, Daniel Pathmaperuma, Harald Vogt, Emmanuel Müller¹•Institutions (1)

Karlsruhe Institute of Technology¹

1 Jan 2013

Journal Article•10.3233/IDA-130595•

UT-Tree: Efficient mining of high utility itemsets from data streams

Lin Feng¹, Le Wang¹, Bo Jin¹•Institutions (1)

Dalian University of Technology¹

1 Jul 2013

TL;DR: A mining algorithm called HUM-UT (High Utility itemsets Mining based on UT-Tree) is proposed to find high utility itemsets from transactional data streams; it has better performance and is more stable under different experimental conditions than the state-of-the-art algorithm HUPMS in terms of time and space.

Abstract: High utility itemsets mining is a hot topic in data stream mining. It is essential that the mining algorithm be efficient in both time and space, because a data stream is continuous and unbounded. To the best of our knowledge, the existing algorithms require multiple database scans to mine high utility itemsets, and this hinders their efficiency. In this paper, we propose a new data structure, called UT-Tree (Utility on Tail Tree), for maintaining utility information of transaction itemsets to avoid multiple database scans. The UT-Tree is created with one database scan, and contains a fixed number of transaction itemsets; utility information is stored on tail-nodes only. Based on the proposed data structure and the sliding window approach, we propose a mining algorithm, called HUM-UT (High Utility itemsets Mining based on UT-Tree), to find high utility itemsets from transactional data streams. The HUM-UT algorithm mines high utility itemsets from the UT-Tree without an additional database scan. Experimental results show that our algorithm has better performance and is more stable under different experimental conditions than the state-of-the-art algorithm HUPMS in terms of time and space.

Journal Article•10.3233/IDA-130591•

Incremental mining of sequential patterns: Progress and challenges

Bhawna Mallick¹, Deepak Garg¹, P. S. Grover•Institutions (1)

Thapar University¹

1 May 2013

TL;DR: An analytical study of more than twenty incremental sequential pattern mining algorithms, focusing on their characteristics, infers that incremental mining on the progressive database is the better approach and gives scope for future research directions.

Abstract: Sequential pattern mining is a vital problem with broad applications. However, it is also challenging, as a combinatorially high number of intermediate subsequences is generated that have to be critically examined. Most of the basic solutions are based on the assumption that the mining is performed on a static database. But modern-day databases are continuously updated and are dynamic in nature. So, incremental mining of sequential patterns has become the norm. This article investigates the need for incremental mining of sequential patterns. An analytical study, focusing on the characteristics, has been made of more than twenty incremental mining algorithms. Further, we have discussed the issues associated with each of them. We infer that the better approach is incremental mining on the progressive database. The three most relevant algorithms based on this approach are also studied in depth, along with the other work done in this area. This gives scope for future research directions.

Journal Article•10.3233/IDA-130597•

Discovering generalized association rules from Twitter

Luca Cagliero¹, Alessandro Fiori•Institutions (1)

Polytechnic University of Turin¹

1 Jul 2013

TL;DR: The proposed TweM (Tweet Miner) framework entails the discovery of hidden and high-level correlations, in the form of generalized association rules, among the content and the contextual features of posts published on Twitter (i.e., the tweets).

Abstract: The increasing availability of user-generated content coming from online communities allows the analysis of common user behaviors and trends in social network usage. This paper presents the TweM (Tweet Miner) framework that entails the discovery of hidden and high-level correlations, in the form of generalized association rules, among the content and the contextual features of posts published on Twitter (i.e., the tweets). To effectively support knowledge discovery from tweets, the TweM framework performs two main steps: (i) taxonomy generation over tweet keywords and context data and (ii) generalized association rule mining, driven by the generated taxonomy, from a sequence of tweet collections. Unlike traditional mining approaches, the generalized rule mining session performed on the current tweet collection also considers the evolution of the extracted patterns across the sequence of the previous mining sessions to prevent the discarding of rare knowledge that frequently occurs in a number of past extractions. Experiments, performed on both real Twitter posts and synthetic datasets, show the effectiveness and the efficiency of the proposed TweM framework in supporting knowledge discovery from Twitter user-generated content.

Book Chapter•10.1007/978-3-642-41398-8_20•

Diversity-Driven Widening

Violeta N. Ivanova¹, Michael R. Berthold¹•Institutions (1)

University of Konstanz¹

17 Oct 2013

TL;DR: This paper presents a more in-depth analysis of the concept of Widened Data Mining, which aims at reducing the impact of greedy heuristics by exploring more than just one suitable solution at each step, with a focus on how diversity considerations can substantially improve results.

Abstract: This paper follows our earlier publication [1], where we introduced the idea of tuned data mining which draws on parallel resources to improve model accuracy rather than the usual focus on speed-up. In this paper we present a more in-depth analysis of the concept of Widened Data Mining, which aims at reducing the impact of greedy heuristics by exploring more than just one suitable solution at each step. In particular we focus on how diversity considerations can substantially improve results. We again use the greedy algorithm for the set cover problem to demonstrate these effects in practice.
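The baseline that the abstract uses as its running example, the greedy algorithm for set cover, can be sketched as follows. This is the standard single-path heuristic, not the widened variant studied in the paper; the function name and the choice to return subset indices are illustrative assumptions.

```python
def greedy_set_cover(universe, subsets):
    """Standard greedy heuristic: repeatedly pick the subset covering the most
    still-uncovered elements. Returns the indices of the chosen subsets."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Ties are broken in favour of the lowest index.
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen
```

The heuristic commits to one best subset per step. Widening, as described in the abstract, would instead keep more than one candidate at each step, and the paper examines how choosing diverse candidates affects the result.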

Book Chapter•10.1007/978-3-642-41398-8_18•

Learning Multiple Temporal Matching for Time Series Classification

Cédric Frambourg¹, Ahlame Douzal-Chouakria¹, Eric Gaussier¹•Institutions (1)

Joseph Fourier University¹

17 Oct 2013

TL;DR: This work proposes a multiple temporal matching approach that reveals the commonly shared features within classes, and the most differential ones across classes, using a new framework based on the variance/covariance criterion.

Abstract: In real applications, time series are generally of complex structure, exhibiting different global behaviors within classes. To discriminate such challenging time series, we propose a multiple temporal matching approach that reveals the commonly shared features within classes, and the most differential ones across classes. For this, we rely on a new framework based on the variance/covariance criterion to strengthen or weaken matched observations according to the induced variability within and between classes. The experiments performed on real and synthetic datasets demonstrate the ability of the multiple temporal matching approach to capture fine-grained distinctions between time series.

Book Chapter•10.1007/978-3-642-41398-8_11•

Finding Frequent Patterns in Parallel Point Processes

Christian Borgelt, David Picado-Muiño

17 Oct 2013

TL;DR: This work defines the support of an item set in this setting based on a maximum independent set approach allowing for efficient computation and shows how the enumeration and test of candidate sets can be made efficient by properly reducing the event sequences and exploiting perfect extension pruning.

Abstract: We consider the task of finding frequent patterns in parallel point processes--also known as finding frequent parallel episodes in event sequences. This task can be seen as a generalization of frequent item set mining: the co-occurrence of items or events in transactions is replaced by their imprecise co-occurrence on a continuous time scale, meaning that they occur in a limited time span from each other. We define the support of an item set in this setting based on a maximum independent set approach allowing for efficient computation. Furthermore, we show how the enumeration and test of candidate sets can be made efficient by properly reducing the event sequences and exploiting perfect extension pruning. Finally, we study how the resulting frequent item sets/patterns can be filtered for closed and maximal sets.

Journal Article•10.3233/IDA-130609•

Entity ranking using click-log information

Davide Mottin¹, Themis Palpanas¹, Yannis Velegrakis¹•Institutions (1)

University of Trento¹

1 Sep 2013

TL;DR: This work presents a novel framework for feature extraction that is based on the notions of entity matching and attribute frequencies, introduces different methods and metrics for ranking, combines them with existing traditional techniques, and studies their performance using real and synthetic data.

Abstract: Log information describing the items the users have selected from the set of answers a query engine returns to their queries constitutes an excellent form of indirect user feedback that has been extensively used in the web to improve the effectiveness of search engines. In this work we study how the logs can be exploited to improve the ranking of the results returned by an entity search engine. Entity search engines are becoming more and more popular as the web is changing from a web of documents into a "web of things". We show that entity search engines pose new challenges since their model is different from the one documents are based on. We present a novel framework for feature extraction that is based on the notions of entity matching and attribute frequencies. The extracted features are then used to train a ranking classifier. We introduce different methods and metrics for ranking, we combine them with existing traditional techniques and we study their performance using real and synthetic data. The experiments show that our technique provides better results in terms of accuracy.

Book Chapter•10.1007/978-3-642-41398-8_17•

OrderSpan: Mining Closed Partially Ordered Patterns

Mickaël Fabrègue¹, Agnès Braud¹, Sandra Bringay², Florence Le Ber¹, Maguelonne Teisseire•Institutions (2)

University of Strasbourg¹, Centre national de la recherche scientifique²

17 Oct 2013

TL;DR: This paper investigates the data mining challenge of partially ordered pattern mining of sequential data by describing OrderSpan, a new algorithm that extracts such patterns from sequential databases and overcomes some of the drawbacks of existing methods.

Abstract: Due to the complexity of the task, partially ordered pattern mining of sequential data has not been subject to much study, despite its usefulness. This paper investigates this data mining challenge by describing OrderSpan, a new algorithm that extracts such patterns from sequential databases and overcomes some of the drawbacks of existing methods. Our work consists in providing a simple and flexible framework to directly mine complex sequences of itemsets, by combining well-known properties on prefixes and suffixes. Experiments were performed on different real datasets to show the benefit of partially ordered patterns.

Journal Article•10.3233/IDA-130577•

A two-phase hybrid of semi-supervised and active learning approach for sequence labeling

Hamed Hassanzadeh¹, Mohammad Reza Keyvanpour²•Institutions (2)

Islamic Azad University¹, Alzahra University²

1 Mar 2013

TL;DR: A combined Semi-Supervised and Active Learning approach for Sequence Labeling is proposed, which greatly reduces manual annotation cost: only highly uncertain tokens need to be manually labeled, and other sequences and subsequences are labeled automatically.

Abstract: In recent years, many NLP systems and tasks have been developed using machine learning methods. In order to achieve the best performance, these systems are generally trained on a large human-annotated corpus. Since annotating such corpora is a very expensive and time-consuming procedure, manually annotating corpora has become one of the significant issues in many text-based tasks such as text mining, semantic annotation, Named Entity Recognition and, more generally, Information Extraction. Semi-supervised Learning and Active Learning are two distinct approaches that deal with the reduction of labeling costs. Based on their natures, active and semi-supervised learning can produce better results when they are jointly applied. In this paper we propose a combined Semi-Supervised and Active Learning approach for Sequence Labeling which greatly reduces manual annotation cost in a way that only highly uncertain tokens need to be manually labeled and other sequences and subsequences are labeled automatically. The proposed approach reduces manual annotation cost by around 90% compared with supervised learning and by 30% compared with a similar fully active learning approach. Conditional Random Field (CRF) is chosen as the underlying learning model due to its promising performance in many sequence labeling tasks. In addition, we propose a confidence measure based on the model's variance reduction that reaches a considerable accuracy for finding informative samples.

Journal Article•10.3233/IDA-130618•

Enhancing K-Means using class labels

Billy Peralta¹, Pablo Espinace¹, Alvaro Soto¹•Institutions (1)

Pontifical Catholic University of Chile¹

1 Nov 2013

TL;DR: Labeled K-Means (LK-Means) is presented, an algorithm for supervised clustering based on a variant of K-Means that incorporates information about class labels and that, in most cases, outperforms the alternative techniques by a considerable margin.

Abstract: Clustering is a relevant problem in machine learning where the main goal is to locate meaningful partitions of unlabeled data. In the case of labeled data, a related problem is supervised clustering, where the objective is to locate class-uniform clusters. Most current approaches to supervised clustering optimize a score related to cluster purity with respect to class labels. In particular, we present Labeled K-Means (LK-Means), an algorithm for supervised clustering based on a variant of K-Means that incorporates information about class labels. LK-Means replaces the classical cost function of K-Means with a convex combination of two costs: (i) a discriminative score based on class labels, and (ii) a generative score based on a traditional metric for unsupervised clustering. We test the performance of LK-Means using standard real datasets and an application for object recognition. Moreover, we also compare its performance against classical K-Means and a popular K-Medoids-based supervised clustering method. Our experiments show that, in most cases, LK-Means outperforms the alternative techniques by a considerable margin. Furthermore, LK-Means has considerably lower execution times than the alternative supervised clustering method under evaluation.
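
To make the idea of a convex combination of a discriminative and a generative score concrete, here is a minimal sketch. The abstract does not give the exact scores, so everything specific below is an assumption: the name `combined_cost`, the weight `alpha`, class impurity as the discriminative term, and mean squared Euclidean distance as the generative term.

```python
from collections import Counter


def combined_cost(points, labels, assignment, centroids, alpha=0.5):
    """Hypothetical cost: alpha * discriminative + (1 - alpha) * generative."""
    n = len(points)
    # Generative term (assumed): mean squared distance to the assigned centroid.
    generative = sum(
        sum((x - c) ** 2 for x, c in zip(points[i], centroids[assignment[i]]))
        for i in range(n)
    ) / n
    # Discriminative term (assumed): fraction of points that do not belong
    # to the majority class of their cluster.
    impure = 0
    for k in set(assignment):
        members = [labels[i] for i in range(n) if assignment[i] == k]
        impure += len(members) - Counter(members).most_common(1)[0][1]
    discriminative = impure / n
    return alpha * discriminative + (1 - alpha) * generative


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
centroids = [(0.5,), (10.5,)]
assignment = [0, 0, 1, 1]
pure = combined_cost(points, ["a", "a", "b", "b"], assignment, centroids)
mixed = combined_cost(points, ["a", "b", "a", "b"], assignment, centroids)
```

With `alpha = 0` this reduces to an ordinary distance-based clustering cost, while larger values of `alpha` penalize clusters that mix classes, so `mixed` comes out higher than `pure` for the same geometry.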

Book Chapter•10.1007/978-3-642-41398-8_1•

Data, Not Dogma: Big Data, Open Data, and the Opportunities Ahead

David J. Hand¹•Institutions (1)

Winton Capital Management¹

17 Oct 2013

TL;DR: Starting from the observation that people want answers to questions, not simply data, this work explores some of the difficulties and risks that lie in the path of realising the opportunities of big data and open data.

Abstract: Big data and open data promise tremendous advances. But the media hype ignores the difficulties and the risks associated with this promise. Beginning with the observation that people want answers to questions, not simply data, I explore some of the difficulties and risks which lie in the path of realising the opportunities.

Book Chapter•10.1007/978-3-642-41398-8_7•

Evaluation of Association Rule Quality Measures through Feature Extraction

José L. Balcázar¹, Francis Dogbey²•Institutions (2)

Polytechnic University of Catalonia¹, Ghana-India Kofi Annan Centre of Excellence in ICT²

17 Oct 2013

TL;DR: This work proposes a protocol to evaluate association rule quality measures themselves, and indicates that multiplicative improvement and, to a lesser extent, support and leverage (a.k.a. weighted relative accuracy) tend to obtain better results than the other measures.

Abstract: The practical success of association rule mining depends heavily on the criterion to choose among the many rules often mined. Many rule quality measures exist in the literature. We propose a protocol to evaluate the evaluation measures themselves. For each association rule, we measure the improvement in accuracy that a commonly used predictor can obtain from an additional feature, constructed according to the exceptions to the rule. We select a reference set of rules that are helpful in this sense. Then, our evaluation method takes into account both how many of these helpful rules are found near the top rules for a given quality measure, and how near the top they are. We focus on seven association rule quality measures. Our experiments indicate that multiplicative improvement and, to a lesser extent, support and leverage (a.k.a. weighted relative accuracy) tend to obtain better results than the other measures.
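
For readers unfamiliar with the measures named here, the sketch below computes support, confidence and leverage using their standard textbook definitions on a toy transaction list. The function names and the toy data are illustrative assumptions, and multiplicative improvement is deliberately not sketched because the abstract does not define it.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """supp(A and B) / supp(A) for the rule A -> B."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)


def leverage(transactions, antecedent, consequent):
    """supp(A and B) - supp(A) * supp(B) for the rule A -> B."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint - support(transactions, antecedent) * support(transactions, consequent)


# Toy data (hypothetical): four transactions over items a, b, c.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"c"}]
rule_support = support(transactions, {"a", "b"})    # 0.5
rule_leverage = leverage(transactions, {"a"}, {"b"})  # 0.5 - 0.75 * 0.5 = 0.125
```

A positive leverage means the antecedent and consequent co-occur more often than they would if they were independent.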

Journal Article•10.3233/IDA-130621•

Causality-based cost-effective action mining

Pirooz Shamsinejadbabaki¹, Mohamad Saraee², Hendrik Blockeel³•Institutions (3)

Isfahan University of Technology¹, University of Salford², University of Copenhagen Faculty of Science³

1 Nov 2013

TL;DR: ICE-CREAM is introduced, a novel approach to action mining that explicitly relies on an automatically obtained best estimate of the causal relationships in the data to suggest actions with desirable effects.

Abstract: In many business contexts, the ultimate goal of knowledge discovery is not the knowledge itself, but putting it to use. Models or patterns found by data mining methods often require further post-processing to bring this about. For instance, in churn prediction, data mining may give a model that predicts which customers are likely to end their contract, but companies are not just interested in knowing who is likely to do so, they want to know what they can do to avoid this. The models or patterns have to be transformed into actionable knowledge. Action mining explicitly addresses this. Currently, many action mining methods rely on a predictive model, obtained through data mining, to estimate the effect of certain actions and finally suggest actions with desirable effects. A major problem with this approach is that predictive models do not necessarily reflect a causal relationship between their inputs and outputs. This makes the existing action mining methods less reliable. In this paper, we introduce ICE-CREAM, a novel approach to action mining that explicitly relies on an automatically obtained best estimate of the causal relationships in the data. Experiments confirm that ICE-CREAM performs much better than the current state of the art in action mining.

Journal Article•10.3233/IDA-120567•

Discovering descriptive rules in relational dynamic graphs

Kim-Ngan T. Nguyen¹, Loïc Cerf², Marc Plantevit³, Jean-François Boulicaut¹•Institutions (3)

Institut national des sciences Appliquées de Lyon¹, Universidade Federal de Minas Gerais², University of Lyon³

1 Jan 2013

TL;DR: This work designs the pattern domain of multi-dimensional association rules, i.e., non-trivial extensions of the popular association rules that may involve subsets of any dimensions in their antecedents and their consequents, and proposes optimizations to support both rule extraction scalability and non-redundancy.

Abstract: Graph mining methods have become quite popular and a timely challenge is to discover dynamic properties in evolving graphs or networks. We consider the so-called relational dynamic oriented graphs that can be encoded as n-ary relations with n ≥ 3 and thus represented by Boolean tensors. Two dimensions are used to encode the graph adjacency matrices and at least one other denotes time. We design the pattern domain of multi-dimensional association rules, i.e., non-trivial extensions of the popular association rules that may involve subsets of any dimensions in their antecedents and their consequents. First, we design new objective interestingness measures for such rules, which leads to different approaches for measuring the rule confidence. Second, we must compute collections of a priori interesting rules. This is treated here as a post-processing of the closed patterns that can be extracted efficiently from Boolean tensors. We propose optimizations to support both rule extraction scalability and non-redundancy. We illustrate the added value of this new data mining task by discovering patterns in a real-life relational dynamic graph.

Journal Article•10.3233/IDA-130574•

Interestingness measures for association rules within groups

Aída Jiménez¹, Fernando Berzal¹, Juan-Carlos Cubero¹•Institutions (1)

University of Granada¹

1 Mar 2013

TL;DR: This paper defines group association rules and studies interestingness measures for them, which can be used to rank not only groups of individuals but also rules within each group.

Abstract: The work described in this paper addresses the study of association rules within groups of individuals. The analysis of the characteristics and the behavior of the individuals belonging to such groups in a given database is powerful in practice, since it provides a mechanism to deal with groups rather than isolated individuals. In this paper, we define group association rules and we study interestingness measures for them. These interestingness measures can be used to rank, not only groups of individuals, but also rules within each group. We also compare the rankings provided by those different interestingness measures in order to determine which one provides a better alternative depending on the kind of situations we wish to highlight within large databases with many different and overlapping groups of individuals.

Journal Article•10.3233/IDA-120570•

Story graphs: Tracking document set evolution using dynamic graphs

Ilija Subasic¹, Bettina Berendt¹•Institutions (1)

Katholieke Universiteit Leuven¹

1 Jan 2013

TL;DR: This paper creates a graph representation of the story evolution that is called story graphs, and investigates how graph structure can be used for detecting and discovering new developments in the story, and creates an evaluation framework which bridges the gap between temporal text mining patterns and sentences.

Abstract: With the growing number of document sets accessible online, tracking their evolution over time (story tracking) has become an increasingly interesting problem. In this paper we propose a story tracking method based on the dynamics of keyword-association graphs. We create a graph representation of the story evolution that we call story graphs, and investigate how graph structure can be used for detecting and discovering new developments in the story. First, we investigate which graph properties may be of interest for development detection. We continue by investigating how graph structure can be linked to the sentences representing developments. For this we create an evaluation framework which bridges the gap between temporal text mining patterns and sentences. We apply this framework to evaluate our method against other temporal text mining methods. Our experiments show that story graphs perform at similar levels overall, but provide distinctive advantages in some settings.

Journal Article•10.3233/IDA-130587•

Influence of class distribution on cost-sensitive learning: A case study of bankruptcy analysis

Ning Chen¹, An Chen¹, Bernardete Ribeiro²•Institutions (2)

Instituto Superior de Engenharia do Porto¹, University of Coimbra²

1 May 2013

TL;DR: This paper investigates the effect of cost ratio, imbalance ratio and sample size on classification performance using a real-world French bankruptcy database and shows that the cost ratio and the level of class imbalance have a strong effect on prediction performance.

Abstract: Skewed class distribution and non-uniform misclassification cost are pervasive in many real-world domains such as bankruptcy prediction, medical diagnosis, and intrusion detection. Although class imbalance learning and cost-sensitive learning can be handled in a unified framework, as was illustrated in previous studies, the influence of class distribution on cost-sensitive learning still needs clarification. In this paper, we investigate the effect of cost ratio, imbalance ratio and sample size on classification performance using a real-world French bankruptcy database. The results show that the cost ratio and the level of class imbalance have a strong effect on prediction performance. A near-balanced training data set is favorable when a relatively uniform cost ratio is used, whereas a near-natural class distribution is favorable when a highly uneven cost ratio is used.
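
As background on how a cost ratio can enter a classifier's decision, the sketch below applies the standard minimum-expected-cost rule for two classes. It assumes that correct decisions cost nothing and that the classifier outputs a calibrated probability. The function names and example costs are hypothetical, and this is not the experimental setup of the paper.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'positive' has lower expected cost.

    Assumes zero cost for correct decisions: predicting positive costs
    (1 - p) * cost_fp in expectation, predicting negative costs p * cost_fn.
    """
    return cost_fp / (cost_fp + cost_fn)


def predict_positive(prob_positive, cost_fp, cost_fn):
    """Cost-sensitive decision for one instance."""
    return prob_positive >= cost_threshold(cost_fp, cost_fn)


# Hypothetical costs: missing a bankruptcy is 9 times worse than a false alarm.
flag_uneven = predict_positive(0.2, cost_fp=1, cost_fn=9)   # threshold 0.1
flag_uniform = predict_positive(0.2, cost_fp=1, cost_fn=1)  # threshold 0.5
```

With the uneven costs the instance is flagged as positive, whereas with uniform costs the same predicted probability falls below the threshold.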

Journal Article•10.3233/IDA-130586•

Semi-supervised learning on closed set lattices

Mahito Sugiyama¹, Akihiro Yamamoto²•Institutions (2)

Japan Society for the Promotion of Science¹, Kyoto University²

1 May 2013

TL;DR: A learning algorithm, called SELF (SEmi-supervised Learning via FCA), is presented, which performs as a multiclass classifier and a label ranker for mixed-type data containing both discrete and continuous variables, while only a few learning algorithms such as the decision tree-based classifier can directly handle mixed-type data.

Abstract: We propose a new approach for semi-supervised learning using closed set lattices, which have been recently used for frequent pattern mining within the framework of the data analysis technique of Formal Concept Analysis (FCA). We present a learning algorithm, called SELF (SEmi-supervised Learning via FCA), which performs as a multiclass classifier and a label ranker for mixed-type data containing both discrete and continuous variables, while only a few learning algorithms such as the decision tree-based classifier can directly handle mixed-type data. From both labeled and unlabeled data, SELF constructs a closed set lattice, which is a partially ordered set of data clusters with respect to subset inclusion, via FCA together with discretization of continuous variables, and then learns classification rules by finding maximal clusters on the lattice. Moreover, it can weight each classification rule using the lattice, which gives a partial order of preference over class labels. We illustrate experimentally the competitive performance of SELF in classification and ranking compared to other learning algorithms using UCI datasets.
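
The notion of a closed set in Formal Concept Analysis can be illustrated with the standard closure operator on a small binary context: take all objects that have the given attributes, then keep the attributes those objects share. This sketch shows only that textbook operator, with a hypothetical function name and toy context. It does not cover SELF's discretization, rule learning or rule weighting.

```python
def closure(context, items):
    """Closure of an attribute set in a formal context.

    context: dict mapping each object to its set of attributes.
    items: the attribute set to close.
    """
    items = set(items)
    # Extent: all objects that have every attribute in `items`.
    extent = [obj for obj, attrs in context.items() if items <= attrs]
    if not extent:
        # No object matches: by convention the closure is the full attribute set.
        return set().union(*context.values())
    # Intent: attributes shared by every object in the extent.
    return set.intersection(*(set(context[obj]) for obj in extent))


# Toy context (hypothetical): three objects over attributes a, b, c.
context = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"c"}}
closed = closure(context, {"a"})  # {"a", "b"}
```

An attribute set is closed when it equals its own closure. In this toy context `{"a"}` is not closed, because every object with `a` also has `b`.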
