Scispace (Formerly Typeset)
Showing papers presented at "Intelligent Data Analysis in 2013"
Journal Article•10.3233/IDA-130592•
ROC analysis of classifiers in machine learning: A survey

Matjaž Majnik1, Zoran Bosnić1•
University of Ljubljana1
1 May 2013
TL;DR: A survey of the application areas of ROC analysis in machine learning is presented, its problems and challenges are described, and a summarized list of alternative approaches to ROC analysis is provided.
Abstract: The use of ROC (Receiver Operating Characteristics) analysis as a tool for evaluating the performance of classification models in machine learning has been increasing in the last decade. Among the most notable advances in this area are the extension of two-class ROC analysis to the multi-class case as well as the employment of ROC analysis in cost-sensitive learning. Methods now exist which take instance-varying costs into account. The purpose of our paper is to present a survey of this field with the aim of gathering important achievements in one place. In the paper, we present application areas of ROC analysis in machine learning, describe its problems and challenges, and provide a summarized list of alternative approaches to ROC analysis. In addition to the presented theory, we also provide a couple of examples intended to illustrate the described approaches.

130 citations
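
As an illustration of the two-class ROC analysis the survey above covers, the following is a minimal sketch that builds ROC points by sweeping a decision threshold over classifier scores and computes the area under the curve with the trapezoidal rule. It is not taken from the paper; the function names `roc_points` and `auc` and the toy scores are assumptions made for this example.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) from sweeping a threshold over scores.

    labels are 1 (positive) or 0 (negative); a higher score means
    "more likely positive". Both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Rank instances from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Instances with equal scores cross the threshold together.
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a ROC curve given as (FPR, TPR) points, trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical scores for four instances (two positive, two negative).
    pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
    print(pts, auc(pts))
```

The AUC computed this way equals the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half.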

Book Chapter•10.1007/978-3-642-41398-8_24•
1d-SAX: A Novel Symbolic Representation for Time Series

Simon Malinowski1, Thomas Guyet1, René Quiniou2, Romain Tavenard3•
Agrocampus Ouest1, French Institute for Research in Computer Science and Automation2, Idiap Research Institute3
17 Oct 2013
TL;DR: 1d-SAX is proposed, a method to represent a time series as a sequence of symbols that each contain information about the average and the trend of the series on a segment; results show that 1d-SAX improves performance using an equal quantity of information, especially when the compression rate increases.
Abstract: SAX (Symbolic Aggregate approXimation) is one of the main symbolization techniques for time series. A well-known limitation of SAX is that trends are not taken into account in the symbolization. This paper proposes 1d-SAX, a method to represent a time series as a sequence of symbols that each contain information about the average and the trend of the series on a segment. We compare the efficiency of SAX and 1d-SAX in terms of goodness-of-fit, retrieval and classification performance for querying a time series database with an asymmetric scheme. The results show that 1d-SAX improves performance using an equal quantity of information, especially when the compression rate increases.

89 citations
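
To make the idea of a per-segment average and trend concrete, here is a rough sketch, not the authors' implementation. It z-normalizes a series, splits it into equal segments, maps each segment mean to one of four symbols using the standard-normal quartile breakpoints used by classic SAX, and also reports a least-squares slope per segment. In 1d-SAX the slope is quantized into symbols as well; that step is omitted here, and all function names are assumptions for illustration.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Breakpoints cutting a standard normal into 4 equiprobable regions.
BREAKPOINTS_4 = [-0.6745, 0.0, 0.6745]


def znormalize(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def segment_mean_and_slope(segment):
    """Least-squares line fit over one segment: returns (mean value, slope)."""
    n = len(segment)
    t_bar = (n - 1) / 2.0
    v_bar = mean(segment)
    denom = sum((t - t_bar) ** 2 for t in range(n))
    if denom == 0:  # single-point segment has no trend
        return v_bar, 0.0
    slope = sum((t - t_bar) * (v - v_bar) for t, v in enumerate(segment)) / denom
    return v_bar, slope


def sax_with_slopes(series, n_segments, alphabet="abcd"):
    """Return (SAX word, per-segment slopes).

    Trailing points are dropped if the length is not divisible by n_segments.
    """
    z = znormalize(series)
    seg_len = len(z) // n_segments
    word, slopes = [], []
    for k in range(n_segments):
        seg = z[k * seg_len:(k + 1) * seg_len]
        avg, slope = segment_mean_and_slope(seg)
        # The symbol index is the number of breakpoints at or below the mean.
        word.append(alphabet[bisect_right(BREAKPOINTS_4, avg)])
        slopes.append(slope)
    return "".join(word), slopes


if __name__ == "__main__":
    print(sax_with_slopes([0, 1, 2, 3, 4, 5, 6, 7], 2))
```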

Book Chapter•10.1007/978-3-642-41398-8_3•
Subjective Interestingness in Exploratory Data Mining

Tijl De Bie1•
University of Bristol1
17 Oct 2013
TL;DR: A general mathematical framework for formalizing interestingness in a subjective manner is discussed and it is demonstrated how it can be successfully instantiated for a variety of exploratory data mining problems.
Abstract: Exploratory data mining has as its aim to assist a user in improving their understanding about the data. Considering this aim, it seems self-evident that in optimizing this process the data as well as the user need to be considered. Yet, the vast majority of exploratory data mining methods (including most methods for clustering, itemset and association rule mining, subgroup discovery, dimensionality reduction, etc.) formalize interestingness of patterns in an objective manner, disregarding the user altogether. More often than not this leads to subjectively uninteresting patterns being reported. Here I will discuss a general mathematical framework for formalizing interestingness in a subjective manner. I will further demonstrate how it can be successfully instantiated for a variety of exploratory data mining problems. Finally, I will highlight some connections to other work, and outline some of the challenges and research opportunities ahead.

68 citations

Journal Article•10.3233/IDA-130598•
Extending twin support vector machine classifier for multi-category classification problems

Juanying Xie1, Kate Hone2, Weixin Xie3, Xinbo Gao4, Yong Shi5, Xiaohui Liu2 •
Shaanxi Normal University1, Brunel University London2, Shenzhen University3, Xidian University4, Chinese Academy of Sciences5
1 Jul 2013
TL;DR: OVA-TWSVM can outperform the traditional OVA-SVMs classifier and experimental comparisons with other multiclass classifiers demonstrated that comparable performance could be achieved.
Abstract: The twin support vector machine classifier (TWSVM) was proposed by Jayadeva et al. for binary classification problems. TWSVM not only overcomes the difficulties in handling the problem of exemplar unbalance in binary classification problems, but is also four times faster in training a classifier than classical support vector machines. This paper proposes one-versus-all twin support vector machine classifiers (OVA-TWSVM) for multi-category classification problems by utilizing the strengths of TWSVM. OVA-TWSVM extends TWSVM to solve k-category classification problems by developing k TWSVMs, where in the ith TWSVM we only solve the Quadratic Programming Problems (QPPs) for the ith class and get the ith nonparallel hyperplane corresponding to the ith class data. OVA-TWSVM uses the well-known one-versus-all (OVA) approach to construct a corresponding twin support vector machine classifier. We analyze the efficiency of OVA-TWSVM theoretically, and perform experiments to test its efficiency on both synthetic data sets and several benchmark data sets from the UCI machine learning repository. Both the theoretical analysis and experimental results demonstrate that OVA-TWSVM can outperform the traditional OVA-SVMs classifier. Further experimental comparisons with other multiclass classifiers demonstrated that comparable performance could be achieved.

61 citations

Journal Article•10.3233/IDA-130590•
A study of K-Means-based algorithms for constrained clustering

Thiago F. Covões1, Eduardo R. Hruschka1, Joydeep Ghosh1•
University of Texas at Austin1
1 May 2013
TL;DR: The most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments, and learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters.
Abstract: The problem of clustering with constraints has received considerable attention in the last decade. Indeed, several algorithms have been proposed, but only a few studies have partially compared their performances. In this work, three well-known algorithms for k-means-based clustering with soft constraints, namely Constrained Vector Quantization Error (CVQE), its variant LCVQE, and Metric Pairwise Constrained K-Means (MPCK-Means), are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance are presented. In terms of accuracy, LCVQE has been shown to be competitive with CVQE, while violating fewer constraints. In most of the datasets, both CVQE and LCVQE presented better accuracy compared to MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of more specific new experimental findings are discussed in the paper; for example, deduced constraints usually do not help in finding better data partitions.

50 citations
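
One of the three comparison criteria named in the abstract above, the number of violated constraints, can be sketched in a few lines. This is a generic illustration for pairwise must-link and cannot-link constraints, assuming cluster assignments are given as a list indexed by instance; the function name and the toy data are hypothetical and not from the paper.

```python
def count_violations(assignments, must_link, cannot_link):
    """Count pairwise constraints that a clustering fails to satisfy.

    assignments[i] is the cluster id of instance i; must_link and
    cannot_link are iterables of (i, j) index pairs.
    """
    violated = 0
    for i, j in must_link:
        # A must-link pair is violated when the two instances are separated.
        if assignments[i] != assignments[j]:
            violated += 1
    for i, j in cannot_link:
        # A cannot-link pair is violated when the two instances share a cluster.
        if assignments[i] == assignments[j]:
            violated += 1
    return violated


if __name__ == "__main__":
    clusters = [0, 0, 1, 1]
    print(count_violations(clusters, [(0, 1), (1, 2)], [(2, 3), (0, 3)]))
```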

Journal Article•10.3233/IDA-130584•
Semi-supervised text categorization: Exploiting unlabeled data using ensemble learning algorithms

Mohammad Reza Keyvanpour1, Maryam Bahojb Imani1•
Alzahra University1
1 May 2013
TL;DR: In this paper, dynamic weighting alongside a majority vote approach is applied to classify the unlabeled data into reliable and unreliable classes; the reliable data are then added to the training set, and the remaining data, including the unreliable data, are classified in an iterative process.
Abstract: Text categorization is one of the fundamental tasks in text mining. Classical supervised methods need a lot of labeled data to train a classifier. Since assigning labels to a large amount of data is very costly and time-consuming, it is useful to use data sets without labels. Many different semi-supervised learning methods have therefore been studied recently. Among these, self-training is an important learning algorithm that classifies unlabeled samples using a small amount of labeled ones and adds the most confident samples to the training set. In this paper, dynamic weighting alongside a majority vote approach is applied to classify the unlabeled data into reliable and unreliable classes. Then, the reliable data are added to the training set, and the remaining data, including the unreliable data, are classified in an iterative process. We tested this method on the extracted features of ten common Reuters-21578 classes. Experimental results indicate that the proposed method is effective and improves classification performance.

34 citations

Journal Article•10.3233/IDA-130612•
Efficient mining of maximal correlated weight frequent patterns

Unil Yun1, Keun Ho Ryu2•
Sejong University1, Chungbuk National University2
1 Sep 2013
TL;DR: Two mining algorithms for maximal correlated weight frequent patterns (MCWP), termed MCWPWA (based on Weight Ascending order) and MCWPSD (based on Support Descending order), are proposed to mine a compact and meaningful set of frequent patterns.
Abstract: Maximal frequent pattern mining has been suggested for data mining to avoid generating a huge set of frequent patterns. Conversely, weighted frequent pattern mining has been proposed to discover important frequent patterns by considering the weighted support. We propose two mining algorithms for maximal correlated weight frequent patterns (MCWP), termed MCWPWA (based on Weight Ascending order) and MCWPSD (based on Support Descending order), to mine a compact and meaningful set of frequent patterns. MCWPSD obtains an advantage in conditional database access, but may not obtain the highest weighted item of the conditional database to mine highly correlated weight frequent patterns. Thus, we suggest a technique that uses additional conditions to prune lowly correlated weight items before the subsets checking process. Analyses show that our algorithms are efficient and scalable.

33 citations

Journal Article•10.3233/IDA-130580•
Automated detection of diabetes using higher order spectral features extracted from heart rate signals

G. Swapna1, U. Rajendra Acharya2, S. VinithaSree3, Jasjit S. Suri4•
Government Engineering College, Sreekrishnapuram1, Ngee Ann Polytechnic2, Nanyang Technological University3, Idaho State University4
1 Mar 2013
TL;DR: The proposed Computer Aided Diagnostic (CAD) technique has the ability to detect diabetes efficiently by analyzing the subtle changes in ECG signals that are indicative of the presence of diabetes in a patient.
Abstract: Diabetes Mellitus, often referred to as diabetes, is a chronic disease that affects a large number of people worldwide. The percentage of people affected is increasing every year. Diabetes is very difficult to cure; it can only be kept under control. In this scenario, diagnosis of diabetes is of great importance. In this work, we used Heart Rate Variability (HRV) signals obtained from ECG signals for the purpose of diagnosing diabetes. We employed signal processing methods to extract features from the HRV signal. Since HRV signals are nonlinear in nature, we made use of Higher Order Spectrum (HOS) based features for analysis. In this paper, we have extracted the HOS features from HRV signals corresponding to normal and diabetic subjects. These selected features were fed independently to seven classifiers, namely Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Naive Bayes classifier (NB), K-Nearest Neighbour (KNN), Probabilistic Neural Network (PNN), Fuzzy classifier, and Decision Tree (DT) classifier. The performance of these classifiers was evaluated using accuracy, sensitivity, specificity, positive predictive value, and the area under the receiver operating characteristics curve. We observed that the GMM classifier presented the highest accuracy of 90.5%, while the other classifiers presented accuracies in the range of 71.4% to 86.5%. Thus, the proposed Computer Aided Diagnostic (CAD) technique has the ability to detect diabetes efficiently by analyzing the subtle changes in ECG signals that are indicative of the presence of diabetes in a patient. Also, we have proposed unique bispectrum and bicoherence plots for normal and diabetic heart rate signals.

33 citations
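
Four of the evaluation measures listed in the abstract above (accuracy, sensitivity, specificity, and positive predictive value) follow directly from a binary confusion matrix. The sketch below shows the standard definitions; the function name and the counts in the usage example are invented for illustration and are not results from the paper.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary diagnostic measures from confusion-matrix counts.

    tp/fn: diseased subjects classified correctly/incorrectly;
    tn/fp: healthy subjects classified correctly/incorrectly.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    print(diagnostic_metrics(tp=40, fp=10, tn=45, fn=5))
```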

Book Chapter•10.1007/978-3-642-41398-8_15•
Gaussian Mixture Models for Time Series Modelling, Forecasting, and Interpolation

Emil Eirola1, Amaury Lendasse1•
Aalto University1
17 Oct 2013
TL;DR: Experiments on time series forecasting show that including the constraints in the training phase particularly reduces the risk of overfitting in challenging situations with missing values or a large number of Gaussian components.
Abstract: Gaussian mixture models provide an appealing tool for time series modelling. By embedding the time series to a higher-dimensional space, the density of the points can be estimated by a mixture model. The model can directly be used for short-to-medium term forecasting and missing value imputation. The modelling setup introduces some restrictions on the mixture model, which when appropriately taken into account result in a more accurate model. Experiments on time series forecasting show that including the constraints in the training phase particularly reduces the risk of overfitting in challenging situations with missing values or a large number of Gaussian components.

31 citations

Journal Article•10.3233/IDA-120566•
Evolving networks: Eras and turning points

Michele Berlingerio1, Michele Coscia2, Fosca Giannotti1, Anna Monreale2, Dino Pedreschi2 •
Istituto di Scienza e Tecnologie dell'Informazione1, University of Pisa2
1 Jan 2013
TL;DR: A novel hierarchical clustering methodology is introduced, based on a dissimilarity measure derived from the Jaccard coefficient between two temporal snapshots of the network, able to detect the turning points at the beginning of the eras.
Abstract: Within the large body of research in complex network analysis, an important topic is the temporal evolution of networks. Existing approaches aim at analyzing the evolution on the global and the local scale, extracting properties of either the entire network or local patterns. In this paper, we focus on detecting clusters of temporal snapshots of a network, to be interpreted as eras of evolution. To this aim, we introduce a novel hierarchical clustering methodology, based on a dissimilarity measure derived from the Jaccard coefficient between two temporal snapshots of the network, able to detect the turning points at the beginning of the eras. We devise a framework to discover and browse the eras, in either a top-down or a bottom-up fashion, supporting the exploration of the evolution at any level of temporal resolution. We show how our approach applies to real networks and null models, by detecting eras in an evolving co-authorship graph extracted from a bibliographic dataset, a collaboration graph extracted from a cinema database, and a network extracted from a database of terrorist attacks; we illustrate how the discovered temporal clustering highlights the crucial moments when the networks witnessed profound changes in their structure. Finally, our approach is enhanced by introducing a meaningful labeling of the obtained clusters, such as the characterizing topics of each discovered era, thus adding a semantic dimension to our analysis.

28 citations
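
The abstract above builds on a dissimilarity derived from the Jaccard coefficient between two temporal snapshots. A minimal sketch of one plausible reading follows, assuming each snapshot is represented by its set of edges; the paper may define the measure over a different representation. Function names and the toy snapshots are hypothetical.

```python
def jaccard_dissimilarity(edges_a, edges_b):
    """1 - |A ∩ B| / |A ∪ B| for two snapshots given as collections of edges."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        # Two empty snapshots are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def consecutive_dissimilarities(snapshots):
    """Dissimilarity between each snapshot and the one that follows it."""
    return [
        jaccard_dissimilarity(prev, nxt)
        for prev, nxt in zip(snapshots, snapshots[1:])
    ]


if __name__ == "__main__":
    snapshots = [
        {(1, 2), (2, 3)},
        {(1, 2), (2, 3)},
        {(2, 3), (3, 4)},
    ]
    print(consecutive_dissimilarities(snapshots))
```

A large value between two consecutive snapshots is the kind of signal a clustering of snapshots could use to separate one era from the next.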

Book Chapter•10.1007/978-3-642-41398-8_34•
Dynamic MMHC: A Local Search Algorithm for Dynamic Bayesian Network Structure Learning

Ghada Trabelsi1, Philippe Leray2, Mounir Ben Ayed1, Adel M. Alimi1•
University of Sfax1, University of Nantes2
17 Oct 2013
TL;DR: This paper proposes Dynamic MMHC, an adaptation of the "static" MMHC algorithm, known for its scalability, and illustrates the interest of this method with some experimental results.
Abstract: Dynamic Bayesian networks (DBNs) are a class of probabilistic graphical models that have become a standard tool for modeling various stochastic time-varying phenomena. Probabilistic graphical models such as 2-Time slice BNs (2T-BNs) are the most used and popular models for DBNs. Because of the complexity induced by adding the temporal dimension, DBN structure learning is a very complex task. Existing algorithms are adaptations of score-based BN structure learning algorithms but are often limited when the number of variables is high. In this paper we focus on DBN structure learning with another family of structure learning algorithms, local search methods, known for their scalability. We propose Dynamic MMHC, an adaptation of the "static" MMHC algorithm. We illustrate the interest of this method with some experimental results.
Proceedings Article•
Data Analysis Challenges in the Future Energy Domain

Frank Eichinger, Daniel Pathmaperuma, Harald Vogt, Emmanuel Müller1•
Karlsruhe Institute of Technology1
1 Jan 2013
Journal Article•10.3233/IDA-130595•
UT-Tree: Efficient mining of high utility itemsets from data streams

Lin Feng1, Le Wang1, Bo Jin1•
Dalian University of Technology1
1 Jul 2013
TL;DR: A mining algorithm, called HUM-UT (High Utility itemsets Mining based on UT-Tree), is proposed to find high utility itemsets from transactional data streams; it has better performance and is more stable under different experimental conditions than the state-of-the-art algorithm HUPMS in terms of time and space.
Abstract: High utility itemsets mining is a hot topic in data stream mining. It is essential that the mining algorithm be efficient in both time and space, because a data stream is continuous and unbounded. To the best of our knowledge, the existing algorithms require multiple database scans to mine high utility itemsets, and this hinders their efficiency. In this paper, we propose a new data structure, called UT-Tree (Utility on Tail Tree), for maintaining utility information of transaction itemsets to avoid multiple database scans. The UT-Tree is created with one database scan, and contains a fixed number of transaction itemsets; utility information is stored on tail-nodes only. Based on the proposed data structure and the sliding window approach, we propose a mining algorithm, called HUM-UT (High Utility itemsets Mining based on UT-Tree), to find high utility itemsets from transactional data streams. The HUM-UT algorithm mines high utility itemsets from the UT-Tree without an additional database scan. Experimental results show that our algorithm has better performance and is more stable under different experimental conditions than the state-of-the-art algorithm HUPMS in terms of time and space.
Journal Article•10.3233/IDA-130591•
Incremental mining of sequential patterns: Progress and challenges

Bhawna Mallick1, Deepak Garg1, P. S. Grover•
Thapar University1
1 May 2013
TL;DR: Based on an analytical study of the characteristics of incremental mining algorithms, it is inferred that the better approach is incremental mining on the progressive database, which gives scope for future research directions.
Abstract: Sequential pattern mining is a vital problem with broad applications. However, it is also challenging, as a combinatorially high number of intermediate subsequences are generated that have to be critically examined. Most of the basic solutions are based on the assumption that the mining is performed on a static database. But modern databases are continuously updated and are dynamic in nature, so incremental mining of sequential patterns has become the norm. This article investigates the need for incremental mining of sequential patterns. An analytical study, focusing on their characteristics, has been made of more than twenty incremental mining algorithms. Further, we have discussed the issues associated with each of them. We infer that the better approach is incremental mining on the progressive database. The three most relevant algorithms based on this approach are also studied in depth, along with the other work done in this area. This gives scope for future research directions.
Journal Article•10.3233/IDA-130597•
Discovering generalized association rules from Twitter

Luca Cagliero1, Alessandro Fiori•
Polytechnic University of Turin1
1 Jul 2013
TL;DR: The proposed TweM (Tweet Miner) framework entails the discovery of hidden and high level correlations, in the form of generalized association rules, among the content and the contextual features of posts published on Twitter (i.e., the tweets).
Abstract: The increasing availability of user-generated content coming from online communities allows the analysis of common user behaviors and trends in social network usage. This paper presents the TweM (Tweet Miner) framework that entails the discovery of hidden and high level correlations, in the form of generalized association rules, among the content and the contextual features of posts published on Twitter (i.e., the tweets). To effectively support knowledge discovery from tweets, the TweM framework performs two main steps: (i) taxonomy generation over tweet keywords and context data and (ii) generalized association rule mining, driven by the generated taxonomy, from a sequence of tweet collections. Unlike traditional mining approaches, the generalized rule mining session performed on the current tweet collection also considers the evolution of the extracted patterns across the sequence of the previous mining sessions to prevent the discarding of rare knowledge that frequently occurs in a number of past extractions. Experiments, performed on both real Twitter posts and synthetic datasets, show the effectiveness and the efficiency of the proposed TweM framework in supporting knowledge discovery from Twitter user-generated content.
Book Chapter•10.1007/978-3-642-41398-8_20•
Diversity-Driven Widening

Violeta N. Ivanova1, Michael R. Berthold1•
University of Konstanz1
17 Oct 2013
TL;DR: This paper presents a more in-depth analysis of the concept of Widened Data Mining, which aims at reducing the impact of greedy heuristics by exploring more than just one suitable solution at each step by focusing on how diversity considerations can substantially improve results.
Abstract: This paper follows our earlier publication [1], where we introduced the idea of tuned data mining which draws on parallel resources to improve model accuracy rather than the usual focus on speed-up. In this paper we present a more in-depth analysis of the concept of Widened Data Mining, which aims at reducing the impact of greedy heuristics by exploring more than just one suitable solution at each step. In particular we focus on how diversity considerations can substantially improve results. We again use the greedy algorithm for the set cover problem to demonstrate these effects in practice.
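
For readers unfamiliar with the baseline mentioned in the abstract above, this is a minimal sketch of the plain greedy algorithm for set cover: at every step it commits to the single subset that covers the most still-uncovered elements. It does not implement widening or any diversity mechanism from the paper; the function name and the toy instance are assumptions for illustration.

```python
def greedy_set_cover(universe, subsets):
    """Return indices of subsets chosen greedily to cover the universe.

    subsets is a list of sets. Ties are broken in favour of the lowest index.
    Raises ValueError if the subsets cannot cover the universe.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset with the largest number of uncovered elements.
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        gain = uncovered & subsets[best]
        if not gain:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= gain
    return chosen


if __name__ == "__main__":
    subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
    print(greedy_set_cover({1, 2, 3, 4, 5}, subsets))
```

Because the greedy rule keeps only one candidate per step, an early choice can rule out a smaller cover; exploring several candidates per step is the behaviour the abstract describes as widening.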
Book Chapter•10.1007/978-3-642-41398-8_18•
Learning Multiple Temporal Matching for Time Series Classification

Cédric Frambourg1, Ahlame Douzal-Chouakria1, Eric Gaussier1•
Joseph Fourier University1
17 Oct 2013
TL;DR: This work proposes a multiple temporal matching approach that reveals the commonly shared features within classes, and the most differential ones across classes, relying on a new framework based on the variance/covariance criterion.
Abstract: In real applications, time series are generally of complex structure, exhibiting different global behaviors within classes. To discriminate such challenging time series, we propose a multiple temporal matching approach that reveals the commonly shared features within classes, and the most differential ones across classes. For this, we rely on a new framework based on the variance/covariance criterion to strengthen or weaken matched observations according to the induced variability within and between classes. The experiments performed on real and synthetic datasets demonstrate the ability of the multiple temporal matching approach to capture fine-grained distinctions between time series.
Book Chapter•10.1007/978-3-642-41398-8_11•
Finding Frequent Patterns in Parallel Point Processes

Christian Borgelt, David Picado-Muiño
17 Oct 2013
TL;DR: This work defines the support of an item set in this setting based on a maximum independent set approach allowing for efficient computation and shows how the enumeration and test of candidate sets can be made efficient by properly reducing the event sequences and exploiting perfect extension pruning.
Abstract: We consider the task of finding frequent patterns in parallel point processes--also known as finding frequent parallel episodes in event sequences. This task can be seen as a generalization of frequent item set mining: the co-occurrence of items or events in transactions is replaced by their imprecise co-occurrence on a continuous time scale, meaning that they occur in a limited time span from each other. We define the support of an item set in this setting based on a maximum independent set approach allowing for efficient computation. Furthermore, we show how the enumeration and test of candidate sets can be made efficient by properly reducing the event sequences and exploiting perfect extension pruning. Finally, we study how the resulting frequent item sets/patterns can be filtered for closed and maximal sets.
Journal Article•10.3233/IDA-130609•
Entity ranking using click-log information

Davide Mottin1, Themis Palpanas1, Yannis Velegrakis1•
University of Trento1
1 Sep 2013
TL;DR: This work presents a novel framework for feature extraction that is based on the notions of entity matching and attribute frequencies and introduces different methods and metrics for ranking, which combine them with existing traditional techniques and are studied using real and synthetic data.
Abstract: Log information describing the items that users have selected from the set of answers a query engine returns to their queries constitutes an excellent form of indirect user feedback that has been extensively used on the web to improve the effectiveness of search engines. In this work we study how the logs can be exploited to improve the ranking of the results returned by an entity search engine. Entity search engines are becoming more and more popular as the web is changing from a web of documents into a "web of things". We show that entity search engines pose new challenges since their model differs from the one documents are based on. We present a novel framework for feature extraction that is based on the notions of entity matching and attribute frequencies. The extracted features are then used to train a ranking classifier. We introduce different methods and metrics for ranking, combine them with existing traditional techniques, and study their performance using real and synthetic data. The experiments show that our technique provides better results in terms of accuracy.
Book Chapter•10.1007/978-3-642-41398-8_17•
OrderSpan: Mining Closed Partially Ordered Patterns

Mickaël Fabrègue1, Agnès Braud1, Sandra Bringay2, Florence Le Ber1, Maguelonne Teisseire •
University of Strasbourg1, Centre national de la recherche scientifique2
17 Oct 2013
TL;DR: This paper investigates the data mining challenge of partially ordered pattern mining of sequential data by describing OrderSpan, a new algorithm that extracts such patterns from sequential databases and overcomes some of the drawbacks of existing methods.
Abstract: Due to the complexity of the task, partially ordered pattern mining of sequential data has not been subject to much study, despite its usefulness. This paper investigates this data mining challenge by describing OrderSpan, a new algorithm that extracts such patterns from sequential databases and overcomes some of the drawbacks of existing methods. Our work consists in providing a simple and flexible framework to directly mine complex sequences of itemsets, by combining well-known properties on prefixes and suffixes. Experiments were performed on different real datasets to show the benefit of partially ordered patterns.
Journal Article•10.3233/IDA-130577•
A two-phase hybrid of semi-supervised and active learning approach for sequence labeling

Hamed Hassanzadeh1, Mohammad Reza Keyvanpour2•
Islamic Azad University1, Alzahra University2
1 Mar 2013
TL;DR: A combined Semi-Supervised and Active Learning approach for Sequence Labeling is proposed which greatly reduces manual annotation cost, in that only highly uncertain tokens need to be manually labeled while other sequences and subsequences are labeled automatically.
Abstract: In recent years, many NLP systems and tasks have been developed using machine learning methods. In order to achieve the best performance, these systems are generally trained on a large human-annotated corpus. Since annotating such corpora is a very expensive and time-consuming procedure, manual annotation has become one of the significant issues in many text-based tasks such as text mining, semantic annotation, Named Entity Recognition, and Information Extraction in general. Semi-supervised Learning and Active Learning are two distinct approaches that deal with the reduction of labeling costs. By their nature, active and semi-supervised learning can produce better results when they are applied jointly. In this paper we propose a combined Semi-Supervised and Active Learning approach for Sequence Labeling which greatly reduces manual annotation cost, in that only highly uncertain tokens need to be manually labeled while other sequences and subsequences are labeled automatically. The proposed approach reduces manual annotation cost by around 90% compared with supervised learning and by 30% compared with a similar fully active learning approach. Conditional Random Field (CRF) is chosen as the underlying learning model due to its promising performance in many sequence labeling tasks. In addition, we propose a confidence measure based on the model's variance reduction that reaches considerable accuracy in finding informative samples.
Journal Article•10.3233/IDA-130618•
Enhancing K-Means using class labels

Billy Peralta1, Pablo Espinace1, Alvaro Soto1•
Pontifical Catholic University of Chile1
1 Nov 2013
TL;DR: Labeled K-Means (LK-Means) is presented, an algorithm for supervised clustering based on a variant of K-Means that incorporates information about class labels and that, in most cases, outperforms the alternative techniques by a considerable margin.
Abstract: Clustering is a relevant problem in machine learning where the main goal is to locate meaningful partitions of unlabeled data. In the case of labeled data, a related problem is supervised clustering, where the objective is to locate class-uniform clusters. Most current approaches to supervised clustering optimize a score related to cluster purity with respect to class labels. In particular, we present Labeled K-Means (LK-Means), an algorithm for supervised clustering based on a variant of K-Means that incorporates information about class labels. LK-Means replaces the classical cost function of K-Means by a convex combination of the joint cost associated with: (i) a discriminative score based on class labels, and (ii) a generative score based on a traditional metric for unsupervised clustering. We test the performance of LK-Means using standard real datasets and an application for object recognition. Moreover, we also compare its performance against classical K-Means and a popular K-Medoids-based supervised clustering method. Our experiments show that, in most cases, LK-Means outperforms the alternative techniques by a considerable margin. Furthermore, LK-Means presents execution times considerably lower than the alternative supervised clustering method under evaluation.
Book Chapter•10.1007/978-3-642-41398-8_1•
Data, Not Dogma: Big Data, Open Data, and the Opportunities Ahead

David J. Hand1•
Winton Capital Management1
17 Oct 2013
TL;DR: Beginning with the observation that people want answers to questions, not simply data, this chapter explores some of the difficulties and risks which lie in the path of realising the opportunities of big data and open data.
Abstract: Big data and open data promise tremendous advances. But the media hype ignores the difficulties and the risks associated with this promise. Beginning with the observation that people want answers to questions, not simply data, I explore some of the difficulties and risks which lie in the path of realising the opportunities.
Book Chapter•10.1007/978-3-642-41398-8_7•
Evaluation of Association Rule Quality Measures through Feature Extraction

José L. Balcázar1, Francis Dogbey2•
Polytechnic University of Catalonia1, Ghana-India Kofi Annan Centre of Excellence in ICT2
17 Oct 2013
TL;DR: This work proposes a protocol to evaluate association rule quality measures themselves, and indicates that multiplicative improvement and, to a lesser extent, support and leverage (a.k.a. weighted relative accuracy) tend to obtain better results than the other measures.
Abstract: The practical success of association rule mining depends heavily on the criterion to choose among the many rules often mined. Many rule quality measures exist in the literature. We propose a protocol to evaluate the evaluation measures themselves. For each association rule, we measure the improvement in accuracy that a commonly used predictor can obtain from an additional feature, constructed according to the exceptions to the rule. We select a reference set of rules that are helpful in this sense. Then, our evaluation method takes into account both how many of these helpful rules are found near the top rules for a given quality measure, and how near the top they are. We focus on seven association rule quality measures. Our experiments indicate that multiplicative improvement and, to a lesser extent, support and leverage (a.k.a. weighted relative accuracy) tend to obtain better results than the other measures.
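Two of the measures named in the abstract, support and leverage, have widely used textbook definitions, sketched below over a toy transaction list. Multiplicative improvement and the paper's evaluation protocol are not reproduced, since the abstract does not define them. The function names and the example transactions are illustrative.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    antecedent: Iterable[str],
    consequent: Iterable[str],
    transactions: List[FrozenSet[str]],
) -> float:
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    return support(a | c, transactions) / support(a, transactions)


def leverage(
    antecedent: Iterable[str],
    consequent: Iterable[str],
    transactions: List[FrozenSet[str]],
) -> float:
    """supp(A and C) - supp(A) * supp(C): deviation from independence."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    return support(a | c, transactions) - support(a, transactions) * support(
        c, transactions
    )


# Toy data: four market-basket transactions (illustrative only).
transactions = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]
rule_support = support({"bread", "butter"}, transactions)
rule_leverage = leverage({"bread"}, {"butter"}, transactions)
```

For the rule bread -> butter in this toy data, the joint support is 0.5 and the product of the individual supports is 0.375, so the leverage is 0.125: the two items co-occur more often than independence would predict.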
Journal Article•10.3233/IDA-130621•
Causality-based cost-effective action mining

Pirooz Shamsinejadbabaki1, Mohamad Saraee2, Hendrik Blockeel3•
Isfahan University of Technology1, University of Salford2, University of Copenhagen Faculty of Science3
1 Nov 2013
TL;DR: ICE-CREAM is introduced, a novel approach to action mining that explicitly relies on an automatically obtained best estimate of the causal relationships in the data to suggest actions with desirable effects.
Abstract: In many business contexts, the ultimate goal of knowledge discovery is not the knowledge itself, but putting it to use. Models or patterns found by data mining methods often require further post-processing to bring this about. For instance, in churn prediction, data mining may give a model that predicts which customers are likely to end their contract, but companies are not just interested in knowing who is likely to do so, they want to know what they can do to avoid this. The models or patterns have to be transformed into actionable knowledge. Action mining explicitly addresses this. Currently, many action mining methods rely on a predictive model, obtained through data mining, to estimate the effect of certain actions and finally suggest actions with desirable effects. A major problem with this approach is that predictive models do not necessarily reflect a causal relationship between their inputs and outputs. This makes the existing action mining methods less reliable. In this paper, we introduce ICE-CREAM, a novel approach to action mining that explicitly relies on an automatically obtained best estimate of the causal relationships in the data. Experiments confirm that ICE-CREAM performs much better than the current state of the art in action mining.
Journal Article•10.3233/IDA-120567•
Discovering descriptive rules in relational dynamic graphs

Kim-Ngan T. Nguyen1, Loïc Cerf2, Marc Plantevit3, Jean-François Boulicaut1•
Institut national des sciences Appliquées de Lyon1, Universidade Federal de Minas Gerais2, University of Lyon3
1 Jan 2013
TL;DR: This work designs the pattern domain of multi-dimensional association rules, i.e., non trivial extensions of the popular association rules that may involve subsets of any dimensions in their antecedents and their consequents and proposes optimizations to support both rule extraction scalability and non redundancy.
Abstract: Graph mining methods have become quite popular and a timely challenge is to discover dynamic properties in evolving graphs or networks. We consider the so-called relational dynamic oriented graphs that can be encoded as n-ary relations with n ≥ 3 and thus represented by Boolean tensors. Two dimensions are used to encode the graph adjacency matrices and at least one other denotes time. We design the pattern domain of multi-dimensional association rules, i.e., non trivial extensions of the popular association rules that may involve subsets of any dimensions in their antecedents and their consequents. First, we design new objective interestingness measures for such rules and it leads to different approaches for measuring the rule confidence. Second, we must compute collections of a priori interesting rules. It is considered here as a post-processing of the closed patterns that can be extracted efficiently from Boolean tensors. We propose optimizations to support both rule extraction scalability and non redundancy. We illustrate the added-value of this new data mining task to discover patterns from a real-life relational dynamic graph.
Journal Article•10.3233/IDA-130574•
Interestingness measures for association rules within groups

Aída Jiménez1, Fernando Berzal1, Juan-Carlos Cubero1•
University of Granada1
1 Mar 2013
TL;DR: This paper defines group association rules and studies interestingness measures for them, which can be used to rank not only groups of individuals but also rules within each group.
Abstract: The work described in this paper addresses the study of association rules within groups of individuals. The analysis of the characteristics and the behavior of the individuals belonging to such groups in a given database is powerful in practice, since it provides a mechanism to deal with groups rather than isolated individuals. In this paper, we define group association rules and we study interestingness measures for them. These interestingness measures can be used to rank, not only groups of individuals, but also rules within each group. We also compare the rankings provided by those different interestingness measures in order to determine which one provides a better alternative depending on the kind of situations we wish to highlight within large databases with many different and overlapping groups of individuals.
Journal Article•10.3233/IDA-120570•
Story graphs: Tracking document set evolution using dynamic graphs

Ilija Subasic1, Bettina Berendt1•
Katholieke Universiteit Leuven1
1 Jan 2013
TL;DR: This paper creates a graph representation of the story evolution that is called story graphs, and investigates how graph structure can be used for detecting and discovering new developments in the story, and creates an evaluation framework which bridges the gap between temporal text mining patterns and sentences.
Abstract: With the growing number of document sets accessible online, tracking their evolution over time (story tracking) has become an increasingly interesting problem. In this paper we propose a story tracking method based on the dynamics of keyword-association graphs. We create a graph representation of the story evolution that we call story graphs, and investigate how graph structure can be used for detecting and discovering new developments in the story. First we investigate the possibly interesting graph properties for development detection. We continue by investigating how graph structure can be linked to the sentences representing developments. For this we create an evaluation framework which bridges the gap between temporal text mining patterns and sentences. We apply this framework to evaluate our method against other temporal text mining methods. Our experiments show that story graphs perform at similar levels overall, but provide distinctive advantages in some settings.
Journal Article•10.3233/IDA-130587•
Influence of class distribution on cost-sensitive learning: A case study of bankruptcy analysis

Ning Chen1, An Chen1, Bernardete Ribeiro2•
Instituto Superior de Engenharia do Porto1, University of Coimbra2
1 May 2013
TL;DR: This paper investigates the effect of cost ratio, imbalance ratio and sample size on classification performance using a real-world French bankruptcy database and shows that the cost ratio and the level of class imbalance have a strong effect on prediction performance.
Abstract: Skewed class distribution and non-uniform misclassification cost are pervasive in many real-world domains such as bankruptcy prediction, medical diagnosis, and intrusion detection. Although class imbalance learning and cost-sensitive learning can be handled in a unified framework, as was illustrated in previous studies, the influence of class distribution on cost-sensitive learning still needs clarification. In this paper, we investigate the effect of cost ratio, imbalance ratio and sample size on classification performance using a real-world French bankruptcy database. The results show that the cost ratio and the level of class imbalance have a strong effect on prediction performance. A near-balanced training data set is favorable when a relatively uniform cost ratio is used, whereas a near-natural class distribution is favorable when a highly uneven cost ratio is used.
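To make the role of the cost ratio concrete, the sketch below applies the standard minimum-expected-cost decision rule for a binary classifier with calibrated probabilities and zero cost for correct decisions. This is a generic textbook rule, not necessarily the procedure used in the paper, and the names `cost_threshold` and `predict_bankrupt` and the cost values are hypothetical.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimises expected misclassification cost.

    Predicting positive costs (1 - p) * cost_fp in expectation and predicting
    negative costs p * cost_fn, so positive is cheaper once
    p >= cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def predict_bankrupt(p_bankrupt: float, cost_fp: float, cost_fn: float) -> bool:
    """Flag a firm as bankrupt when doing so has the lower expected cost."""
    return p_bankrupt >= cost_threshold(cost_fp, cost_fn)


# Uniform costs give the usual 0.5 threshold; a 1:9 cost ratio lowers it to 0.1.
uniform_threshold = cost_threshold(cost_fp=1.0, cost_fn=1.0)
uneven_threshold = cost_threshold(cost_fp=1.0, cost_fn=9.0)
```

Under these assumptions, a firm with an estimated bankruptcy probability of 0.2 is not flagged with uniform costs but is flagged once a missed bankruptcy is taken to be nine times as costly as a false alarm.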
Journal Article•10.3233/IDA-130586•
Semi-supervised learning on closed set lattices

Mahito Sugiyama1, Akihiro Yamamoto2•
Japan Society for the Promotion of Science1, Kyoto University2
1 May 2013
TL;DR: A learning algorithm, called SELF (SEmi-supervised Learning via FCA), is presented, which performs as a multiclass classifier and a label ranker for mixed-type data containing both discrete and continuous variables, whereas only a few learning algorithms, such as the decision tree-based classifier, can directly handle mixed-type data.
Abstract: We propose a new approach for semi-supervised learning using closed set lattices, which have been recently used for frequent pattern mining within the framework of the data analysis technique of Formal Concept Analysis (FCA). We present a learning algorithm, called SELF (SEmi-supervised Learning via FCA), which performs as a multiclass classifier and a label ranker for mixed-type data containing both discrete and continuous variables, whereas only a few learning algorithms, such as the decision tree-based classifier, can directly handle mixed-type data. From both labeled and unlabeled data, SELF constructs a closed set lattice, which is a partially ordered set of data clusters with respect to subset inclusion, via FCA together with discretizing continuous variables, followed by learning classification rules through finding maximal clusters on the lattice. Moreover, it can weight each classification rule using the lattice, which gives a partial order of preference over class labels. We illustrate experimentally the competitive performance of SELF in classification and ranking compared to other learning algorithms using UCI datasets.
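The closed sets that make up such a lattice come from the standard FCA closure operator, sketched below on a toy binary context. This shows only the generic closure step; SELF's discretization of continuous variables, its lattice construction and its rule learning are not reproduced. The object and attribute names are illustrative, and the attributes are assumed to be already discrete.

```python
from typing import Dict, FrozenSet, Iterable

# A formal context: each object is mapped to the set of attributes it has.
Context = Dict[str, FrozenSet[str]]


def extent(attrs: Iterable[str], context: Context) -> FrozenSet[str]:
    """All objects that have every attribute in attrs."""
    wanted = frozenset(attrs)
    return frozenset(obj for obj, has in context.items() if wanted <= has)


def intent(objects: Iterable[str], context: Context) -> FrozenSet[str]:
    """All attributes shared by every object in objects."""
    # Start from the full attribute set so an empty object set yields all attributes.
    shared = frozenset().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def closure(attrs: Iterable[str], context: Context) -> FrozenSet[str]:
    """Smallest closed attribute set containing attrs: intent of its extent."""
    return intent(extent(attrs, context), context)


# Toy context with three objects (illustrative only).
context: Context = {
    "x1": frozenset({"red", "round", "small"}),
    "x2": frozenset({"red", "round"}),
    "x3": frozenset({"small"}),
}
red_cluster = extent({"red"}, context)
red_closure = closure({"red"}, context)
```

Here every red object is also round, so the closure of `{"red"}` is `{"red", "round"}`. The pair of that closed attribute set and its extent `{"x1", "x2"}` is one node (a data cluster) of the closed set lattice.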
...
