Showing papers presented at "Intelligent Data Analysis in 2011"
Journal Article•10.3233/IDA-2011-0485•
Data set preprocessing and transformation in a database system

Carlos Ordonez1•
University of Houston1
1 Jun 2011
TL;DR: This article presents a summary of experience and recommendations for performing data set preprocessing and transformation, the most time-consuming task in data mining projects, inside a database system, and identifies advantages and disadvantages from a practical standpoint based on data mining users' feedback.
Abstract: In general, there is a significant amount of data mining analysis performed outside a database system, which creates many data management issues. This article presents a summary of our experience and recommendations to compute data set preprocessing and transformation inside a database system (i.e., data cleaning, record selection, summarization, denormalization, variable creation, coding), which is the most time-consuming task in data mining projects. This aspect is largely ignored in the literature. We present practical issues, common solutions and lessons learned when preparing and transforming data sets with the SQL language, based on experience from real-life projects. We then provide specific guidelines to translate programs written in a traditional programming language into SQL statements. Based on successful real-life projects, we present time performance comparisons between SQL code running inside the database system and external data mining programs. We highlight which steps in data mining projects become faster when processed by the database system. More importantly, we identify advantages and disadvantages from a practical standpoint based on data mining users' feedback.

60 citations
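
As a rough illustration of the kind of in-database preparation this abstract describes, the sketch below runs cleaning, summarization, denormalization and coding as a single SQL statement. It is not taken from the article: the `customer` and `purchase` tables, their columns and their rows are hypothetical, and SQLite (through Python's standard `sqlite3` module) merely stands in for whatever database system a project uses.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE purchase (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'F'), (2, 'M'), (3, NULL);
    INSERT INTO purchase VALUES (1, 10.0), (1, 30.0), (2, 5.0), (2, NULL);
""")

# One statement covers several preparation steps:
#   cleaning        -> purchases with a NULL amount are excluded in the join
#   denormalization -> LEFT JOIN yields one row per customer
#   summarization   -> COUNT / SUM aggregate the purchases
#   coding          -> CASE turns the gender text into a 0/1 flag
rows = conn.execute("""
    SELECT c.id,
           CASE WHEN c.gender = 'F' THEN 1 ELSE 0 END AS is_female,
           COUNT(p.amount)              AS n_purchases,
           COALESCE(SUM(p.amount), 0.0) AS total_amount
    FROM customer AS c
    LEFT JOIN purchase AS p
           ON p.customer_id = c.id AND p.amount IS NOT NULL
    GROUP BY c.id, c.gender
    ORDER BY c.id
""").fetchall()

for row in rows:
    print(row)
```

Each result row is one analysis-ready record per customer, so the data set never has to leave the database to be prepared.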

Book Chapter•10.1007/978-3-642-24800-9_1•
Bisociative knowledge discovery

Michael R. Berthold1•
University of Konstanz1
29 Oct 2011
TL;DR: This article focuses on the discovery of new connections between domains (so-called bisociations), supporting the creative discovery process in a novel way; it motivates this approach, shows how it differs from classical data analysis, and concludes by briefly illustrating some types of domain-crossing connections.
Abstract: Data analysis generally focusses on finding patterns within a reasonably well connected domain of interest. In this article we focus on the discovery of new connections between domains (so-called bisociations), supporting the creative discovery process in a novel way. We motivate this approach, show the difference to classical data analysis and conclude by briefly illustrating some types of domain-crossing connections along with illustrative examples.

55 citations

Proceedings Article•
Computational sustainability

Carla P. Gomes1•
Cornell University1
29 Oct 2011
TL;DR: Computational sustainability is a new interdisciplinary research field with the overall goal of developing computational models, methods, and tools to help manage the balance between environmental, economic, and societal needs for sustainable development.
Abstract: Computational sustainability [1] is a new interdisciplinary research field with the overall goal of developing computational models, methods, and tools to help manage the balance between environmental, economic, and societal needs for sustainable development. The notion of sustainable development -- development that meets the needs of the present without compromising the ability of future generations to meet their needs -- was introduced in Our Common Future, the seminal report of the United Nations World Commission on Environment and Development, published in 1987. In this talk I will provide an overview of computational sustainability, with examples ranging from wildlife conservation and biodiversity, to poverty mitigation, to large-scale deployment and management of renewable energy sources. I will highlight overarching computational challenges at the intersection of constraint reasoning, optimization, data mining, and dynamical systems. Finally I will discuss the need for a new approach that views computational sustainability problems as "natural" phenomena, amenable to a scientific methodology, in which principled experimentation, to explore problem parameter spaces and hidden problem structure, plays as prominent a role as formal analysis.

53 citations

Journal Article•10.3233/IDA-2010-0466•
Exploring discrepancies in findings obtained with the KDD Cup '99 data set

Vegard Engen1, Jonathan Vincent1, Keith Phalp1•
Bournemouth University1
1 Apr 2011
TL;DR: An empirical investigation uncovers several underlying causes of the discrepancy in the results reported in the literature, which allows us to better interpret the current body of research, and inform recommendations for future use of the data set.
Abstract: The KDD Cup '99 data set has been widely used to evaluate intrusion detection prototypes, most based on machine learning techniques, for nearly a decade. The data set served well in the KDD Cup '99 competition to demonstrate that machine learning can be useful in intrusion detection systems. However, there are discrepancies in the findings reported in the literature. Further, some researchers have published criticisms of the data (and the DARPA data from which the KDD Cup '99 data has been derived), questioning the validity of results obtained with this data. Despite the criticisms, researchers continue to use the data due to a lack of better publicly available alternatives. Hence, it is important to identify the value of the data set and the findings from the extensive body of research based on it, which has largely been ignored by the existing critiques. This paper reports on an empirical investigation, demonstrating the impact of several methodological differences in the publicly available subsets, which uncovers several underlying causes of the discrepancy in the results reported in the literature. These findings allow us to better interpret the current body of research, and inform recommendations for future use of the data set.

53 citations

Book Chapter•10.1007/978-3-642-24800-9_38•
Analyzing emotional semantics of abstract art using low-level image features

He Zhang, Eimontas Augilius, Timo Honkela, Jorma Laaksonen, Hannes Gamper1, Henok Alene1 •
Aalto University1
29 Oct 2011
TL;DR: It is empirically demonstrated that the emotions triggered by viewing abstract art images can be predicted with reasonable accuracy by machine using a variety of low-level image descriptors such as color, shape, and texture.
Abstract: In this work, we study people's emotions evoked by viewing abstract art images based on traditional low-level image features within a binary classification framework. Abstract art is used here instead of artistic or photographic images because those contain contextual information that influences the emotional assessment in a highly individual manner. Whether an image of a cat or a mountain elicits a negative or positive response is subjective. After discussing challenges concerning image emotional semantics research, we empirically demonstrate that the emotions triggered by viewing abstract art images can be predicted with reasonable accuracy by machine using a variety of low-level image descriptors such as color, shape, and texture. The abstract art dataset that we created for this work has been made downloadable to the public.

40 citations

Journal Article•10.3233/IDA-2011-0498•
A metric for unsupervised metalearning

Jun Won Lee1, Christophe Giraud-Carrier1•
Brigham Young University1
1 Nov 2011
TL;DR: This work argues the value of unsupervised metalearning and discusses the attendant necessity of suitable similarity, or distance, functions, and uses COD to produce a clustering of 21 learning algorithms.
Abstract: We argue the value of unsupervised metalearning and discuss the attendant necessity of suitable similarity, or distance, functions. We leverage the notion of diversity among learners used in ensemble learning to design a distance function for the clustering of learning algorithms. We revisit the most popular measures of diversity and show that only one of them, Classifier Output Difference (COD), is a metric. We then use COD to produce a clustering of 21 learning algorithms, and show how this clustering differs from a clustering based on accuracy, and how it can be used to highlight interesting, sometimes unexpected, similarities among learning algorithms.

39 citations
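
A minimal sketch of the distance named in this abstract, assuming the usual definition of Classifier Output Difference as the fraction of instances on which two classifiers predict different labels. The function name and the toy prediction lists are illustrative and are not taken from the paper.

```python
def classifier_output_difference(preds_a, preds_b):
    """Fraction of instances on which two classifiers disagree (assumed COD)."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and of equal length")
    disagreements = sum(a != b for a, b in zip(preds_a, preds_b))
    return disagreements / len(preds_a)


# Toy predictions of three hypothetical classifiers on the same four instances.
a = ["x", "y", "x", "y"]
b = ["x", "x", "x", "y"]
c = ["y", "x", "x", "y"]

d_ab = classifier_output_difference(a, b)
d_bc = classifier_output_difference(b, c)
d_ac = classifier_output_difference(a, c)
print(d_ab, d_bc, d_ac)
```

Under this definition the value is a normalized Hamming distance between prediction vectors, so it is symmetric, zero for identical outputs, and satisfies the triangle inequality; a matrix of such pairwise distances could then be passed to any standard clustering routine.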

Journal Article•10.3233/IDA-2011-0499•
Irrelevant attributes and imbalanced classes in multi-label text-categorization domains

Sareewan Dendamrongvit1, Peerapon Vateekul1, Miroslav Kubat1•
University of Miami1
1 Nov 2011
TL;DR: This work followed the most commonly used approach and induced a binary classifier for each class in text categorization and noticed that performance had been impaired by two factors.
Abstract: An interesting issue in machine learning is induction in multi-label domains where each example can be labeled with two or more classes at the same time. In a work focusing on text categorization, we followed the most commonly used approach and induced a binary classifier for each class. Analyzing the results, we noticed that performance had been impaired by two factors. First, in text domains, each class is characterized by a different set of attributes; an appropriate attribute-selection technique thus has to be applied separately to each of them. Second, the individual classes often have to be induced from imbalanced training sets, a circumstance we addressed here by majority-class undersampling. The paper provides details of the induction system and reports the results of systematic experimentation.

34 citations
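
The per-class setup described in this abstract can be sketched as follows: for one target class, build a binary (one-vs-rest) training set and randomly thin the larger group. The 1:1 balance, the function name and the toy documents are assumptions made for illustration; the abstract does not state the ratio or procedure the authors used.

```python
import random


def undersample_for_class(examples, label_sets, target_class, seed=0):
    """Return a balanced list of (example, 0/1) pairs for one target class."""
    rng = random.Random(seed)
    pos = [(x, 1) for x, ys in zip(examples, label_sets) if target_class in ys]
    neg = [(x, 0) for x, ys in zip(examples, label_sets) if target_class not in ys]
    # Keep the smaller group whole and randomly thin the larger one to match
    # (assumed 1:1 ratio; other ratios are equally plausible).
    if len(pos) > len(neg):
        pos = rng.sample(pos, len(neg))
    else:
        neg = rng.sample(neg, len(pos))
    return pos + neg


# Hypothetical multi-label corpus: each document carries a set of classes.
docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
labels = [{"sports"}, {"politics"}, {"sports", "economy"},
          {"economy"}, {"politics", "economy"}, {"science"}]

train = undersample_for_class(docs, labels, "sports")
print(train)
```

Repeating the call once per class yields one balanced training set per binary classifier, which is also the natural place to run a class-specific attribute selection step.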

Journal Article•10.3233/IDA-2010-0475•
Data mining in software engineering

Maria Halkidi1, Diomidis Spinellis2, George Tsatsaronis2, Michalis Vazirgiannis2•
University of Piraeus1, Athens University of Economics and Business2
1 Aug 2011
TL;DR: This paper describes various data sources and discusses the principles and techniques of data mining as applied to software engineering data; it surveys the mining approaches that have been used and categorizes them according to the corresponding parts of the development process and the task they assist.
Abstract: The increased availability of data created as part of the software development process allows us to apply novel analysis techniques on the data and use the results to guide the process's optimization. In this paper we describe various data sources and discuss the principles and techniques of data mining as applied on software engineering data. Data that can be mined is generated by most parts of the development process: requirements elicitation, development analysis, testing, debugging, and maintenance. Based on this classification we survey the mining approaches that have been used and categorize them according to the corresponding parts of the development process and the task they assist. Thus the survey provides researchers with a concise overview of data mining techniques applied to software engineering data, and aids practitioners on the selection of appropriate data mining techniques for their work.

31 citations

Journal Article•10.3233/IDA-2010-0456•
A parallel, distributed algorithm for relational frequent pattern discovery from very large data sets

Annalisa Appice1, Michelangelo Ceci1, Antonio Turi1, Donato Malerba1•
University of Bari1
1 Jan 2011
TL;DR: An extension of a relational algorithm for multi-level frequent pattern discovery is proposed; it resorts to data sampling and distributed computation in Grid environments to overcome the computational limits of the original serial algorithm.
Abstract: The amount of data produced by ubiquitous computing applications is quickly growing, due to the pervasive presence of small devices endowed with sensing, computing and communication capabilities. Heterogeneity and strong interdependence, which characterize 'ubiquitous data', require a (multi-)relational approach to their analysis. However, relational data mining algorithms do not scale well and very large data sets are hardly processable. In this paper we propose an extension of a relational algorithm for multi-level frequent pattern discovery, which resorts to data sampling and distributed computation in Grid environments, in order to overcome the computational limits of the original serial algorithm. The set of patterns discovered by the new algorithm approximates the set of exact solutions found by the serial algorithm. The quality of approximation depends on three parameters: the proportion of data in each sample, the minimum support thresholds and the number of samples in which a pattern has to be frequent in order to be considered globally frequent. Considering that the first two parameters are hardly controllable, we focus our investigation on the third one. Theoretically derived conclusions are also experimentally confirmed. Moreover, an additional application in the context of event log mining proves the viability of the proposed approach to relational frequent pattern mining from very large data sets.

29 citations

Journal Article•10.3233/IDA-2010-0455•
PSO driven collaborative clustering: A clustering algorithm for ubiquitous environments

Benoît Depaire1, Rafael Falcon2, Koen Vanhoof1, Geert Wets1•
University of Hasselt1, University of Ottawa2
1 Jan 2011
TL;DR: Empirical analysis shows how and when this PSO-CFC approach outperforms local fuzzy clustering in the domain of ubiquitous knowledge discovery.
Abstract: The goal of this article is to introduce a collaborative clustering approach to the domain of ubiquitous knowledge discovery. This clustering approach is suitable in peer-to-peer networks where different data sites want to cluster their local data as if they had consolidated their data sets, but are prevented from doing so by privacy restrictions. Two variants exist, i.e. one for data sites with the same observations but different features and one for data sites with the same features but different observations. The technique contains two parts, i.e. a collaborative fuzzy clustering technique and a particle swarm optimization to optimize the collaboration between data sites. Empirical analysis shows how and when this PSO-CFC approach outperforms local fuzzy clustering.

25 citations

Journal Article•10.3233/IDA-2011-0483•
EclatDS: An efficient sliding window based frequent pattern mining method for data streams

Mahmood Deypir1, Mohammad Hadi Sadreddini1•
Shiraz University1
1 Jun 2011
TL;DR: Experimental results on synthetically generated and real life data streams show the superiority of the proposed method with multiple orders of magnitude in terms of runtime and memory usage with respect to other pane based sliding window algorithms.
Abstract: Mining frequent patterns over data streams is an interesting problem due to its wide application area. Researchers in this field have been facing two key challenges, namely reduction in runtime and memory usage. In this study, a novel method for efficient mining of frequent patterns over data streams is proposed. The method is based on a sliding window model which divides the window into a number of panes. This method provides a new sliding window mechanism by utilizing a set of simple short lists. Each list stores related information about an item in the sliding window. The proposed mechanism dynamically adapts itself to concept change. This method is empirically evaluated against recently proposed pane based sliding window algorithms. Experimental results on synthetically generated and real life data streams show the superiority of the proposed method with multiple orders of magnitude in terms of runtime and memory usage with respect to other pane based sliding window algorithms.
Journal Article•10.3233/IDA-2011-0489•
Component-based decision trees for classification

Boris Delibasic1, Milos Jovanovic1, Milan Vukicevic1, Milija Suknović1, Zoran Obradovic2 •
University of Belgrade1, Temple University2
1 Sep 2011
TL;DR: This study proposed two new heuristics in decision tree algorithm design, namely removal of insignificant attributes in induction process at each tree node, and usage of combined strategy for generating possible splits for decision trees, utilizing several ways of splitting together, which experimentally showed benefits.
Abstract: Typical data mining algorithms follow a so-called "black-box" paradigm, where the logic is hidden from the user so as not to overburden them. We show that "white-box" algorithms constructed with reusable components design can have significant benefits for researchers, and end users as well. We developed a component-based algorithm design platform, and used it for "white-box" algorithm construction. The proposed platform can also be used for testing algorithm parts (reusable components), and their single or joint influence on algorithm performance. The platform is easily extensible with new components and algorithms, and allows testing of partial contributions of an introduced component. We propose two new heuristics in decision tree algorithm design, namely removal of insignificant attributes in induction process at each tree node, and usage of combined strategy for generating possible splits for decision trees, utilizing several ways of splitting together, which experimentally showed benefits. Using the proposed platform we tested 80 component-based decision tree algorithms on 15 benchmark datasets and present the results of reusable components' influence on performance, and statistical significance of the differences found. Our study suggests that for a specific dataset we should search for the optimal component interplay instead of looking for the optimal among predefined algorithms.
Journal Article•10.3233/IDA-2010-0461•
Active learning and subspace clustering for anomaly detection

Karim Pichara, Alvaro Soto
1 Apr 2011
TL;DR: This paper proposes a new semi-supervised algorithm that actively learns to detect relevant anomalies by interacting with an expert user in order to obtain semantic information about user preferences.
Abstract: Today, anomaly detection is a highly valuable application in the analysis of current huge datasets. Insurance companies, banks and many manufacturing industries need systems to help humans to detect anomalies in their daily information. In general, anomalies are a very small fraction of the data; therefore their detection is not an easy task. Usually the real sources of an anomaly are given by specific values expressed on selective dimensions of datasets. Furthermore, many anomalies are not really interesting for humans, because the interestingness of anomalies is categorized subjectively by the human user. In this paper we propose a new semi-supervised algorithm that actively learns to detect relevant anomalies by interacting with an expert user in order to obtain semantic information about user preferences. Our approach is based on three main steps. First, a Bayes network identifies an initial set of candidate anomalies. Afterwards, a subspace clustering technique identifies relevant subsets of dimensions. Finally, a probabilistic active learning scheme, based on properties of the Dirichlet distribution, uses the feedback from an expert user to efficiently search for relevant anomalies. Our results, using synthetic and real datasets, indicate that, under noisy data and anomalies presenting regular patterns, our approach correctly identifies relevant anomalies.
Journal Article•10.3233/IDA-2011-0486•
Learning information diffusion model in a social network for predicting influence of nodes

Masahiro Kimura1, Kazumi Saito2, Kouzou Ohara3, Hiroshi Motoda4•
Ryukoku University1, University of Shizuoka2, Aoyama Gakuin University3, Osaka University4
1 Jul 2011
TL;DR: This work addresses the problem of estimating the parameters, from observed data in a complex social network, for an information diffusion model that takes time-delay into account, based on the popular independent cascade (IC) model, and proposes an iterative method to search for the parameters (time-delay and diffusion) that maximize this likelihood.
Abstract: We address the problem of estimating the parameters, from observed data in a complex social network, for an information diffusion model that takes time-delay into account, based on the popular independent cascade (IC) model. For this purpose we formulate the likelihood to obtain the observed data, which is a set of time-sequence data of infected (active) nodes, and propose an iterative method to search for the parameters (time-delay and diffusion) that maximize this likelihood. We first show by using a synthetic network that the proposed method outperforms the similar existing method. Next, we apply this method to problems of both (1) predicting the influence of nodes for the considered information diffusion model and (2) ranking the influential nodes. Using three large social networks, we demonstrate the effectiveness of the proposed method.
Journal Article•10.3233/IDA-2011-0494•
A fuzzy-neural approach for global CO2 concentration forecasting

Toly Chen1, Yi-Chi Wang1•
Feng Chia University1
1 Sep 2011
TL;DR: According to the experimental results, the proposed methodology improved both the precision and the accuracy of forecasting the global CO2 concentration by 28% and 91%, respectively.
Abstract: The global CO2 concentration is considered to be one of the most important causes of global warming that must be closely monitored, accurately forecasted, and controlled as well as possible. To accurately forecast the global CO2 concentration, a hybrid fuzzy linear regression (FLR) and back propagation network (BPN) approach is proposed in this study. In this proposed approach, multiple experts construct their own FLR equations from various viewpoints to forecast future global CO2 concentrations. Each FLR equation can be converted into two equivalent nonlinear programming problems to be solved. To combine these fuzzy forecasts, a two-step aggregation mechanism is applied. At the first step, fuzzy intersection is applied to combine the fuzzy global CO2 concentration forecasts into a polygon-shaped fuzzy number, in order to improve the precision. After that, a BPN is constructed to defuzzify the polygon-shaped fuzzy number and to generate a representative/crisp value, so as to enhance the accuracy. Some historical data on global CO2 concentrations were used to evaluate the effectiveness of the proposed methodology. According to the experimental results, the proposed methodology improved both the precision and the accuracy of forecasting the global CO2 concentration by 28% and 91%, respectively.
Book Chapter•10.1007/978-3-642-24800-9_26•
Data quality through model checking techniques

Mario Mezzanzanica1, Roberto Boselli1, Mirko Cesarini1, Fabio Mercorio2•
University of Milan1, University of L'Aquila2
29 Oct 2011
TL;DR: The Robust Data Quality Analysis is introduced, which exploits formal methods to support Data Quality Improvement Processes and has proved successful, by giving insights on the data quality levels and by providing suggestions on how to ameliorate the overall data quality process.
Abstract: The paper introduces the Robust Data Quality Analysis which exploits formal methods to support Data Quality Improvement Processes. The proposed methodology can be applied to data sources containing sequences of events that can be modelled by Finite State Systems. Consistency rules (derived from domain business rules) can be expressed by formal methods and can be automatically verified on data, both before and after the execution of cleansing activities. The assessment results can provide useful information to improve the data quality processes. The paper outlines the preliminary results of the methodology applied to a real case scenario: the cleansing of a very low quality database, containing the work careers of the inhabitants of an Italian province. The methodology has proved successful, by giving insights on the data quality levels and by providing suggestions on how to ameliorate the overall data quality process.
Journal Article•10.3233/IDA-2011-0488•
Classifying evolving data streams with partially labeled data

Hanen Borchani1, Pedro Larrañaga1, Concha Bielza1•
Technical University of Madrid1
1 Sep 2011
TL;DR: This paper proposes a new semi-supervised approach for handling concept-drifting data streams containing both labeled and unlabeled instances that is so general that it can be applied to different classification models.
Abstract: Recently, several approaches have been proposed to deal with the increasingly challenging task of mining concept-drifting data streams. However, most are based on supervised classification algorithms assuming that true labels are immediately and entirely available in the data streams. Unfortunately, such an assumption is often violated in real-world applications given that it is expensive or because it takes a long time to obtain all true labels. To deal with this problem, we propose in this paper a new semi-supervised approach for handling concept-drifting data streams containing both labeled and unlabeled instances. First, contrary to existing approaches, we monitor three possible kinds of drift: feature, conditional or dual drift. Drift detection is based on a hypothesis test comparing Kullback-Leibler divergence between old and recent data, whose distribution under the null hypothesis of coming from the same distribution is approximated via a bootstrap method. Then, if any drift occurs, a new classifier is learned from the recent data using the EM algorithm; otherwise, the current classifier is left unchanged. Our approach is so general that it can be applied to different classification models. Experimental studies, using the naive Bayes classifier and logistic regression, on both synthetic and real-world data sets demonstrate that our approach performs well.
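
The drift test outlined in this abstract (a Kullback-Leibler statistic whose null distribution is approximated by bootstrap) can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the authors' procedure: it handles a single discrete feature only, uses additive smoothing to keep the divergence finite, and resamples from the pooled data to mimic the "same distribution" null hypothesis. All function names, parameters and defaults are hypothetical.

```python
import math
import random
from collections import Counter


def kl_divergence(sample_p, sample_q, support, smoothing=0.5):
    """KL(P || Q) between smoothed empirical distributions over `support`."""
    counts_p, counts_q = Counter(sample_p), Counter(sample_q)
    k = len(support)
    total = 0.0
    for value in support:
        p = (counts_p[value] + smoothing) / (len(sample_p) + smoothing * k)
        q = (counts_q[value] + smoothing) / (len(sample_q) + smoothing * k)
        total += p * math.log(p / q)
    return total


def drift_p_value(old, recent, n_boot=500, seed=0):
    """Bootstrap p-value for H0: `old` and `recent` share one distribution."""
    rng = random.Random(seed)
    support = sorted(set(old) | set(recent))
    observed = kl_divergence(old, recent, support)
    pooled = list(old) + list(recent)
    exceed = 0
    for _ in range(n_boot):
        # Under H0 both windows are draws from the same (pooled) distribution.
        boot_old = rng.choices(pooled, k=len(old))
        boot_recent = rng.choices(pooled, k=len(recent))
        if kl_divergence(boot_old, boot_recent, support) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)


# Toy windows of one categorical feature: the class proportions flip.
old_window = ["a"] * 80 + ["b"] * 20
recent_window = ["a"] * 20 + ["b"] * 80

p_drift = drift_p_value(old_window, recent_window)
p_same = drift_p_value(old_window, list(old_window))
print(p_drift, p_same)
```

A small p-value would trigger relearning the classifier from recent data, while a large one leaves the current classifier unchanged, matching the decision rule stated in the abstract.
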
Journal Article•10.3233/IDA-2011-0478•
Data clustering using variable precision rough set

Iwan Tri Riyadi Yanto1, Tutut Herawan1, Mustafa Mat Deris2•
Universitas Ahmad Dahlan1, Universiti Tun Hussein Onn Malaysia2
1 Jul 2011
TL;DR: An alternative technique for clustering noisy categorical data using the Variable Precision Rough Set model is proposed, and the results show that the technique provides better performance in selecting the clustering attribute.
Abstract: Clustering a set of objects into homogeneous classes is a fundamental operation in data mining. Several cluster analysis techniques have been developed to group objects having similar characteristics. Recently, much attention has been paid to categorical data clustering, where data objects are made up of non-numerical attributes. An algorithm termed MMR using classical rough set theory was proposed to deal with problems in clustering categorical data. However, the MMR algorithm fails to handle noisy data as an integral part of databases. In this paper, an alternative technique for clustering noisy categorical data using the Variable Precision Rough Set model is proposed. The results show that the technique provides better performance in selecting the clustering attribute.
Journal Article•10.3233/IDA-2011-0491•
Lazy attribute selection: Choosing attributes at classification time

Rafael B. Pereira1, Alexandre Plastino1, Bianca Zadrozny2, Luiz Henrique de Campos Merschmann, Alex A. Freitas3 •
Federal Fluminense University1, IBM2, University of Kent3
1 Sep 2011
TL;DR: A new attribute selection strategy, based on a lazy learning approach, is proposed; it postpones the identification of relevant attributes until an instance is submitted for classification, which in most cases improves the accuracy of classification.
Abstract: Attribute selection is a data preprocessing step which aims at identifying relevant attributes for the target machine learning task, namely classification in this paper. In this paper, we propose a new attribute selection strategy, based on a lazy learning approach, which postpones the identification of relevant attributes until an instance is submitted for classification. Our strategy relies on the hypothesis that taking into account the attribute values of an instance to be classified may contribute to identifying the best attributes for the correct classification of that particular instance. Experimental results using the k-NN and Naive Bayes classifiers, over 40 different data sets from the UCI Machine Learning Repository and five large data sets from the NIPS 2003 feature selection challenge, show the effectiveness of delaying attribute selection to classification time. The proposed lazy technique in most cases improves the accuracy of classification, when compared with the analogous attribute selection approach performed as a data preprocessing step. We also propose a metric to estimate when a specific data set can benefit from the lazy attribute selection approach.
Book Chapter•10.1007/978-3-642-24800-9_19•
Prototype-based classification of dissimilarity data

Barbara Hammer1, Bassam Mokbel1, Frank-Michael Schleif1, Xibin Zhu1•
Bielefeld University1
29 Oct 2011
TL;DR: A general approach to extend unsupervised prototype-based techniques to dissimilarities is reviewed, and a new supervised prototype-based classification technique for dissimilarity data is proposed.
Abstract: Unlike many black-box algorithms in machine learning, prototype-based models offer an intuitive interface to given data sets, since prototypes can directly be inspected by experts in the field. Most techniques rely on Euclidean vectors such that their suitability for complex scenarios is limited. Recently, several unsupervised approaches have successfully been extended to general, possibly non-Euclidean data characterized by pairwise dissimilarities. In this paper, we briefly review a general approach to extend unsupervised prototype-based techniques to dissimilarities, and we transfer this approach to supervised prototype-based classification for general dissimilarity data. In particular, a new supervised prototype-based classification technique for dissimilarity data is proposed.
Journal Article•10.3233/IDA-2010-0476•
Leukemia identification from bone marrow cells images using a machine vision and data mining strategy

Jesus A. Gonzalez1, Iván Olmos, Leopoldo Altamirano1, Blanca A. Morales1, Carolina Reta1, Martha C. Galindo1, Jose E. Alonso2, Ruben Lobato2 •
National Institute of Astrophysics, Optics and Electronics1, Mexican Social Security Institute2
1 Aug 2011
TL;DR: This paper presents a method to identify leukemia from bone marrow cells images using a combined machine vision and data mining strategy and shows how the combination of descriptive features and eigenvalues helps to improve classification accuracy.
Abstract: The morphological analysis of medical images to support medical diagnosis is an important research area. This is the case of leukemia identification from bone marrow smears in which cells morphology is studied in order to classify the disease into its main family and subtype, so that a proper treatment can be indicated to the patient. In this paper we present a method to identify leukemia from bone marrow cells images using a combined machine vision and data mining strategy. Our process starts with a segmentation method to obtain leukemia cells and extract from them descriptive characteristics (geometrical, texture, statistical) and eigenvalues. We use these attributes to feed machine learning algorithms that learn to classify acute leukemia families and subtypes according to the FAB system. We show how the combination of descriptive features and eigenvalues helps to improve classification accuracy. Our method achieved accuracy above 95.5% to distinguish between the acute myeloblastic and lymphoblastic leukemia families and accuracy of 90% (and above) among five leukemia subtypes (after the acute leukemia families classification).
Book Chapter•10.1007/978-3-642-24800-9_27•
Generating automated news to explain the meaning of sensor data

[...]

Martin Molina1, Amanda Stent2, Enrique Parodi1•
Technical University of Madrid1, AT&T Labs2
29 Oct 2011
TL;DR: This paper proposes a type of web application: a virtual newspaper with automatically generated news stories that describe the meaning of quantitative sensor data that can facilitate the use of sensor data by general users and, therefore, can increase the utility of sensor network infrastructures.
Abstract: An important competence of human data analysts is to interpret and explain the meaning of the results of data analysis to end-users. However, existing automatic solutions for intelligent data analysis provide limited help to interpret and communicate information to non-expert users. In this paper we present a general approach to generating explanatory descriptions about the meaning of quantitative sensor data. We propose a type of web application: a virtual newspaper with automatically generated news stories that describe the meaning of sensor data. This solution integrates a variety of techniques from intelligent data analysis into a web-based multimedia presentation system. We validated our approach in a real world problem and demonstrate its generality using data sets from several domains. Our experience shows that this solution can facilitate the use of sensor data by general users and, therefore, can increase the utility of sensor network infrastructures.
Journal Article•10.3233/IDA-2011-0500•
Adapting non-hierarchical multilabel classification methods for hierarchical multilabel classification

[...]

Ricardo Cerri1, André C. P. L. F. de Carvalho1, Alex A. Freitas2•
University of São Paulo1, University of Kent2
1 Nov 2011
TL;DR: Two new hierarchical multilabel classification methods based on the well-known local approach for hierarchical classification are proposed, which presented promising results in experiments performed with bioinformatics datasets.
Abstract: In most classification problems, a classifier assigns a single class to each instance and the classes form a flat non-hierarchical structure, without superclasses or subclasses. In hierarchical multilabel classification problems, the classes are hierarchically structured, with superclasses and subclasses, and instances can be simultaneously assigned to two or more classes at the same hierarchical level. This article proposes two new hierarchical multilabel classification methods based on the well-known local approach for hierarchical classification. The methods are compared with two global methods and one well-known local binary classification method from the literature. The proposed methods presented promising results in experiments performed with bioinformatics datasets.
Journal Article•10.3233/IDA-2010-0454•
Mining frequent closed trees in evolving data streams

[...]

Albert Bifet1, Ricard Gavaldà2•
University of Waikato1, Polytechnic University of Catalonia2
1 Jan 2011
TL;DR: This work proposes new algorithms for adaptively mining closed rooted trees, both labeled and unlabeled, from data streams that change over time, based on an advantageous representation of trees and a low-complexity notion of relaxed closed trees, as well as ideas from Galois Lattice Theory.
Abstract: We propose new algorithms for adaptively mining closed rooted trees, both labeled and unlabeled, from data streams that change over time. Closed patterns are powerful representatives of frequent patterns, since they eliminate redundant information. Our approach is based on an advantageous representation of trees and a low-complexity notion of relaxed closed trees, as well as ideas from Galois Lattice Theory. More precisely, we present three closed tree mining algorithms in sequence: an incremental one, IncTreeMiner, a sliding-window based one, WinTreeMiner, and finally one that mines closed trees adaptively from data streams, AdaTreeMiner. By adaptive we mean here that it presents at all times the closed trees that are frequent in the current state of the data stream. To the best of our knowledge this is the first work on mining closed frequent trees in streaming data varying with time. We give a first experimental evaluation of the proposed algorithms.
Journal Article•10.3233/IDA-2010-0473•
Feature selection based on inference correlation

[...]

Dengyao Mo1, Samuel H. Huang1•
University of Cincinnati1
1 Aug 2011
TL;DR: A feature selection algorithm using sequential floating forward search based on inference correlation is presented and experiments confirm the effectiveness of the feature selection approach when compared to extant feature selection methods.
Abstract: Feature selection is a critical preprocessing step in machine learning. It contributes to cost-effective model building and improvement of model prediction performance. Generally, a feature selection algorithm requires a dependency measure and a search strategy. Extant dependency measures are mostly based on pair-wise correlation analysis, which cannot detect feature interaction. To overcome this problem, we developed a unified dependency criterion called inference correlation. The inference correlation between a set of predictor variables and a response variable can be efficiently calculated. The variables could be discrete, continuous, or mixed. Therefore, inference correlation can be applied to select features for both classification and regression problems. A feature selection algorithm using sequential floating forward search based on inference correlation is presented. Experiments of the algorithm on synthetic datasets and real-world problems confirm the effectiveness of the feature selection approach when compared to extant feature selection methods.
Book Chapter•10.1007/978-3-642-24800-9_32•
Analyzing parliamentary elections based on voting advice application data

[...]

Jaakko Talonen1, Mika Sulkava1•
Aalto University1
29 Oct 2011
TL;DR: Two databases are combined: voting advice application data and the results of the parliamentary elections in 2011, which allows us to model the values of Finnish citizens and the members of the parliament.
Abstract: The main goal of this paper is to model the values of Finnish citizens and the members of the parliament. To achieve this goal, two databases are combined: voting advice application data and the results of the parliamentary elections in 2011. First, the data is converted to a high-dimension space. Then, it is projected to two principal components. The projection allows us to visualize the main differences between the parties. The value grids are produced with a kernel density estimation method without explicitly using the questions of the voting advice application. However, we find meaningful interpretations for the axes in the visualizations with the analyzed data. Subsequently, all candidate value grids are weighted by the results of the parliamentary elections. The result can be interpreted as a distribution grid for Finnish voters' values.
Book Chapter•10.1007/978-3-642-24800-9_10•
Online writing data representation: a graph theory approach

[...]

Gilles Caporossi1, Christophe Leblay•
HEC Montréal1
29 Oct 2011
TL;DR: This paper proposes a representation technique based upon graph theory that provides a new viewpoint to understand the writing process and is aimed at representing the data provided by ScriptLog although the concepts can be applied in other contexts.
Abstract: There are currently several systems to collect online writing data in keystroke logging. Each of these systems provides reliable and very precise data. Unfortunately, due to the large amount of data recorded, it is almost impossible to analyze except for very limited recordings. In this paper, we propose a representation technique based upon graph theory that provides a new viewpoint to understand the writing process. The current application is aimed at representing the data provided by ScriptLog although the concepts can be applied in other contexts.
Book Chapter•10.1007/978-3-642-24800-9_30•
Collaboration-based function prediction in protein-protein interaction networks

[...]

Hossein Rahmani1, Hendrik Blockeel2, Andreas Bender3•
Leiden University1, Katholieke Universiteit Leuven2, University of Cambridge3
29 Oct 2011
TL;DR: This work generalizes the assumption that certain neighboring proteins tend to have "collaborative", but not necessarily the same, functions, and proposes a few methods that work under this new assumption.
Abstract: The cellular metabolism of a living organism is among the most complex systems that man is currently trying to understand. Part of it is described by so-called protein-protein interaction (PPI) networks, and much effort is spent on analyzing these networks. In particular, there has been much interest in predicting certain properties of nodes in the network (in this case, proteins) from the other information in the network. In this paper, we are concerned with predicting a protein's functions. Many approaches to this problem exist. Among the approaches that predict a protein's functions purely from its environment in the network, many are based on the assumption that neighboring proteins tend to have the same functions. In this work we generalize this assumption: we assume that certain neighboring proteins tend to have "collaborative", but not necessarily the same, functions. We propose a few methods that work under this new assumption. These methods yield better results than those previously considered, with improvements in F-measure ranging from 3% to 17%. This shows that the commonly made assumption of homophily in the network (or "guilt by association"), while useful, is not necessarily the best one can make. The assumption of collaborativeness is a useful generalization of it; it is operational (one can easily define methods that rely on it) and can lead to better results.
Journal Article•10.3233/IDA-2010-0465•
A stable credit rating model based on learning vector quantization

[...]

Ning Chen1, Armando Vieira1, Bernardete Ribeiro2, João Duarte1, João C. Neves3 •
Instituto Superior de Engenharia do Porto1, University of Coimbra2, Technical University of Lisbon3
1 Apr 2011
TL;DR: This work proposes a methodology based on learning vector quantization (LVQ) to develop a credit rating model that is applied to a French database of private companies over a period of several years and is capable to create robust and stable classes to rank companies.
Abstract: Credit rating is involved in many financial applications to estimate the creditworthiness of corporations or individuals. In addition to building accurate credit rating models, the stability of models is of significant importance to economic performance. In this work we propose a methodology based on learning vector quantization (LVQ) to develop a credit rating model. This model is applied to a French database of private companies over a period of several years. LVQ is trained and calibrated in a supervised way using data from 2006 and then applied to the remaining years. We analyze one year transition matrix and show that the model is capable to create robust and stable classes to rank companies.
Book Chapter•10.1007/978-3-642-24800-9_29•
Bisociative discovery of interesting relations between domains

[...]

Uwe Nagel1, Kilian Thiel1, Tobias Kötter1, Dawid Piatek1, Michael R. Berthold1 •
University of Konstanz1
29 Oct 2011
TL;DR: A first formalization for the detection of potentially interesting, domain-crossing relations in large, heterogeneous information repositories based purely on structural properties of a relational knowledge description is proposed.
Abstract: The discovery of surprising relations in large, heterogeneous information repositories is gaining increasing importance in real world data analysis. If these repositories come from diverse origins, forming different domains, domain bridging associations between otherwise weakly connected domains can provide insights into the data that can otherwise not be accomplished. In this paper, we propose a first formalization for the detection of such potentially interesting, domain-crossing relations based purely on structural properties of a relational knowledge description.
...
