Top 97 papers presented at Intelligent Data Analysis in 2009

Knowledge discovery from data streams

João Gama¹, Auroop R. Ganguly², Olufemi A. Omitaomu², Raju Vatsavai², Mohamed Medhat Gaber³•Institutions (3)

University of Porto¹, Oak Ridge National Laboratory², Monash University³

1 Aug 2009

TL;DR: Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams, covering the fundamentals needed to understand data streams and describing important applications such as TCP/IP traffic, GPS data, sensor networks, and customer click streams.

Abstract: Since the beginning of the Internet age and the increased use of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving and time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams. The book covers the fundamentals that are imperative to understanding data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.

828 citations

Book Chapter•10.1007/978-3-642-03915-7_22•

Adaptive Learning from Evolving Data Streams

Albert Bifet¹, Ricard Gavaldà¹•Institutions (1)

Polytechnic University of Catalonia¹

27 Aug 2009

TL;DR: A method for developing algorithms that can adaptively learn from data streams that drift over time, based on using change detectors and estimator modules at the right places and choosing implementations with theoretical guarantees in order to extend such guarantees to the resulting adaptive learning algorithm.

Abstract: We propose and illustrate a method for developing algorithms that can adaptively learn from data streams that drift over time. As an example, we take Hoeffding Tree, an incremental decision tree inducer for data streams, and use it as a basis to build two new methods that can deal with distribution and concept drift: a sliding window-based algorithm, Hoeffding Window Tree, and an adaptive method, Hoeffding Adaptive Tree. Our methods are based on using change detectors and estimator modules at the right places; we choose implementations with theoretical guarantees in order to extend such guarantees to the resulting adaptive learning algorithm. A main advantage of our methods is that they require no guess about how fast or how often the stream will drift; other methods typically have several user-defined parameters to this effect. In our experiments, the new methods never do worse, and in some cases do much better, than CVFDT, a well-known method for tree induction on data streams with drift.
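
For readers unfamiliar with the Hoeffding Tree that this abstract builds on, the sketch below shows the standard Hoeffding bound that such trees use to decide when enough stream examples have been seen to split a node. It is only an illustration of the base technique, not the paper's Hoeffding Window Tree or Hoeffding Adaptive Tree; the function names and the example gain values are assumptions made for this sketch.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability at least 1 - delta, the true
    mean of a variable with the given range lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain: float, second_gain: float,
              value_range: float, delta: float, n: int) -> bool:
    """Hypothetical helper: split once the observed gap between the two best
    attributes exceeds the Hoeffding bound for the examples seen so far."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Illustrative values only: the same observed gap justifies a split after
# 1000 examples but not after 100.
print(can_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
print(can_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100))
```

The bound shrinks as more examples arrive, which is why the split decision can be made incrementally without storing the stream.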

521 citations

Journal Article•10.3233/IDA-2009-0364•

Searching for interacting features in subset selection

Zheng Zhao¹, Huan Liu¹•Institutions (1)

Arizona State University¹

1 Apr 2009

TL;DR: This paper takes up the challenge to design a special data structure for feature quality evaluation, and to employ an information-theoretic feature ranking mechanism to efficiently handle feature interaction in subset selection.

Abstract: The evolving and adapting capabilities of robust intelligence are best manifested in its ability to learn. Machine learning enables computer systems to learn, and improve performance. Feature selection facilitates machine learning (e.g., classification) by aiming to remove irrelevant features. Feature (attribute) interaction presents a challenge to feature subset selection for classification. This is because a feature by itself might have little correlation with the target concept, but when it is combined with some other features, they can be strongly correlated with the target concept. Thus, the unintentional removal of these features may result in poor classification performance. It is computationally intractable to handle feature interactions in general. However, the presence of feature interaction in a wide range of real-world applications demands practical solutions that can reduce high-dimensional data while preserving feature interactions. In this paper, we take up the challenge to design a special data structure for feature quality evaluation, and to employ an information-theoretic feature ranking mechanism to efficiently handle feature interaction in subset selection. We conduct experiments to evaluate our approach by comparing with some representative methods, perform a lesion study to examine the critical components of the proposed algorithm to gain insights, and investigate related issues such as data structure, ranking, time complexity, and scalability in search of interacting features.
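
The feature-interaction problem described in this abstract can be seen on a toy XOR dataset: each feature alone carries no information about the target, while the pair determines it completely. The sketch below only illustrates that phenomenon with plain mutual information; it is not the paper's data structure or ranking mechanism, and the function name and toy data are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences,
    estimated from empirical frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())


# Toy data (an assumption for this sketch): the target is the XOR of two features.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
target = [a ^ b for a, b in zip(f1, f2)]

mi_f1 = mutual_information(f1, target)                    # single feature
mi_f2 = mutual_information(f2, target)                    # single feature
mi_pair = mutual_information(list(zip(f1, f2)), target)   # both features jointly
print(mi_f1, mi_f2, mi_pair)
```

A selector that scores features one at a time would discard both f1 and f2 here, which is the failure mode the abstract warns about.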

116 citations

Proceedings Article•

Intelligent data analysis

João Gama¹, Auroop R. Ganguly², Olufemi A. Omitaomu², Raju Vatsavai², Mohamed Medhat Gaber³•Institutions (3)

University of Porto¹, Oak Ridge National Laboratory², Monash University³

1 Jan 2009

TL;DR: Intelligent Data Analysis invites submission of research and application articles that comply with the Aims and Scope of the journal; articles that discuss the development of new AI architectures, methodologies, and techniques and their applications to the field of data analysis are preferred.

Abstract: Aims and Scope: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas In particular, papers are preferred that discuss development of new AI related data analysis architectures, methodologies, and techniques and their applications to various domains. Papers published in this journal are geared heavily towards applications, with an anticipated split of 70% of the papers published being applications-oriented research and the remaining 30% containing more theoretical research. Submission of Papers: Authors are requested to submit their paper electronically as an email attachment to the Editor-in-Chief: submissions@ida-ij.com. Intelligent Data Analysis invites submission of research and application articles that comply with the Aims and Scope of the journal. In particular, articles that discuss development of new AI architectures, methodologies, and techniques and their applications to the field of data analysis are preferred. Manuscripts are received with the understanding that their content is unpublished material and is not being submitted for publication elsewhere. Further, it is understood that each co-author has made substantial contributions to the work described and that each accepts joint responsibility for publication. Subscription Information: Intelligent Data Analysis (ISSN 1088-467x) will be published in 1 volume of 6 issues in 2014 (Volume 18). Institutional subscription (online only): €1050 / US$1415. Institutional subscription (print only): €1110 / US$1499 (including postage and handling). Institutional subscription (print & online): €1320 / US$1782 (including postage and handling). Individual subscription (online only): €126 / US$170.

99 citations

Book Chapter•10.1007/978-3-642-03915-7_26•

Mining Frequent Gradual Itemsets from Large Databases

Lisa Di-Jorio¹, Anne Laurent¹, Maguelonne Teisseire•Institutions (1)

University of Montpellier¹

27 Aug 2009

TL;DR: This paper formally defines gradual association rules, proposes an original lattice-based approach, and introduces the GRITE algorithm for efficiently extracting gradual itemsets from huge volumes of complex numerical data.

Abstract: Mining gradual rules plays a crucial role in many real-world applications where huge volumes of complex numerical data must be handled, e.g., biological databases, survey databases, data streams or sensor readings. Gradual rules highlight complex order correlations of the form "The more/less X, then the more/less Y". Such rules have been studied since the early 70's, mostly in the fuzzy logic domain, where the main efforts have been focused on how to model and use such rules. However, mining gradual rules remains challenging because of the exponential combination space to explore. In this paper, we tackle the particular problem of handling huge volumes by proposing scalable methods. First, we formally define gradual association rules and we propose an original lattice-based approach. The GRITE algorithm is proposed for extracting gradual itemsets in an efficient manner. An experimental study on large-scale synthetic and real datasets is performed, showing the efficiency and interest of our approach.

89 citations

Book Chapter•10.1007/978-3-642-03915-7_8•

Context-Based Distance Learning for Categorical Data Clustering

Dino Ienco¹, Ruggero G. Pensa¹, Rosa Meo¹•Institutions (1)

University of Turin¹

27 Aug 2009

TL;DR: The key intuition of this work is that the distance between two values of a categorical attribute A_i can be determined by the way in which the values of the other attributes A_j are distributed in the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of A_i, a low distance value is obtained.

Abstract: Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute A_i can be determined by the way in which the values of the other attributes A_j are distributed in the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of A_i, a low distance value is obtained. We also propose a solution to the critical point of the choice of the attributes A_j. We validate our approach on various real-world and synthetic datasets, by embedding our distance learning method in both a partitional and a hierarchical clustering algorithm. Experimental results show that our method is competitive with state-of-the-art categorical data clustering approaches.
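
One simple reading of this intuition is sketched below: for each value of a target attribute, estimate the distribution of a single context attribute among the objects holding that value, then compare two values by the Euclidean distance between their distributions. This is an assumption-laden illustration, not the paper's exact distance or its context-selection step; the function names, the single context attribute, and the toy rows are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def conditional_distributions(rows, target, context):
    """For each value of column `target`, the empirical distribution of
    column `context` among the rows holding that value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    return {value: {ctx: c / sum(ctr.values()) for ctx, c in ctr.items()}
            for value, ctr in counts.items()}


def context_distance(rows, target, context, a, b):
    """Euclidean distance between the context distributions of values a and b."""
    dists = conditional_distributions(rows, target, context)
    pa, pb = dists[a], dists[b]
    keys = set(pa) | set(pb)
    return sqrt(sum((pa.get(k, 0.0) - pb.get(k, 0.0)) ** 2 for k in keys))


# Toy rows (colour, size), invented for this sketch.
rows = [("red", "small"), ("red", "small"),
        ("crimson", "small"), ("crimson", "small"),
        ("blue", "large"), ("blue", "large")]

d_close = context_distance(rows, target=0, context=1, a="red", b="crimson")
d_far = context_distance(rows, target=0, context=1, a="red", b="blue")
print(d_close, d_far)
```

Values that co-occur with the same context values ("red" and "crimson") end up at distance zero even though the attribute itself has no ordering.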

68 citations

Journal Article•10.3233/IDA-2009-0371•

An overview of advances in reliability estimation of individual predictions in machine learning

Zoran Bosnić¹, Igor Kononenko¹•Institutions (1)

University of Ljubljana¹

1 Apr 2009

TL;DR: The main part of the paper presents two classes of reliability estimation approaches and summarizes the relevant terminology, which is often used in this and related research fields.

Abstract: In Machine Learning, estimation of the predictive accuracy for a given model is most commonly approached by analyzing the average accuracy of the model. In general, the predictive models do not provide accuracy estimates for their individual predictions. The reliability estimates of individual predictions require the analysis of various model and instance properties. In the paper we give an overview of the approaches for estimation of individual prediction reliability. We start by summarizing three research fields that provided ideas and motivation for our work: (a) approaches to perturbing learning data, (b) the usage of unlabeled data in supervised learning, and (c) the sensitivity analysis. The main part of the paper presents two classes of reliability estimation approaches and summarizes the relevant terminology, which is often used in this and related research fields.

67 citations

Book•10.1007/978-3-642-03915-7•

Advances in Intelligent Data Analysis VIII

Niall M. Adams¹, Céline Robardet, Arno Siebes², Jean-François Boulicaut•Institutions (2)

Imperial College London¹, Utrecht University²

1 Aug 2009

TL;DR: Adaptive Learning from Evolving Data Streams and an Application of Intelligent Data Analysis Techniques to a Large Software Engineering Dataset are considered.

Abstract: This edited volume contains the 35 contributions to the 8th Symposium on Intelligent Data Analysis, organized in Lyon at the end of August and beginning of September.

58 citations

Journal Article•10.3233/IDA-2009-0373•

Novelty detection with application to data streams

Eduardo J. Spinosa¹, André Ponce de Leon F. de Carvalho¹, João Gama²•Institutions (2)

University of São Paulo¹, University of Porto²

1 Aug 2009

TL;DR: Results show that the proposed approach to novelty detection is capable of identifying novel concepts that are pure and correspond to real classes, disregarding unrepresentative clusters and outliers.

Abstract: This paper presents and evaluates an approach to novelty detection that addresses it as the problem of identifying novel concepts in a continuous learning scenario, as an extension to a single-class classification problem. OLINDDA, an OnLIne Novelty and Drift Detection Algorithm that implements this approach, uses efficient standard clustering algorithms to continuously generate candidate clusters among examples that were not explained by the current known concepts. Clusters complying with a validation criterion that takes cohesiveness and representativeness into account are initially identified as concepts. By merging similar concepts, OLINDDA may enhance the representation of some concepts as it advances toward its final goal of describing novel emerging concepts in an unsupervised way. The proposed approach is experimentally evaluated by the use of several measures taken throughout the learning process. Results show that it is capable of identifying novel concepts that are pure and correspond to real classes, disregarding unrepresentative clusters and outliers.

58 citations

Book Chapter•10.1007/978-3-642-03915-7_34•

Efficient Vertical Mining of Frequent Closures and Generators

Laszlo Szathmary¹, Petko Valtchev¹, Amedeo Napoli, Robert Godin¹•Institutions (1)

Université du Québec à Montréal¹

27 Aug 2009

TL;DR: The proposed algorithm, Touch, handles FCI mining and FG mining separately; experimental results indicate that it is highly efficient and outperforms its levelwise competitors.

Abstract: The effective construction of many association rule bases requires the computation of both frequent closed and frequent generator itemsets (FCIs/FGs). However, only a few miners address both concerns, typically by applying levelwise breadth-first traversal. As depth-first traversal is known to be superior, we examine here depth-first FCI/FG-mining. The proposed algorithm, Touch, deals with both tasks separately, i.e., uses a well-known vertical method, Charm, to extract FCIs and a novel one, Talky-G, to extract FGs. The respective outputs are matched in a post-processing step. Experimental results indicate that Touch is highly efficient and outperforms its levelwise competitors.
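
To make the two notions in this abstract concrete, the brute-force sketch below enumerates the frequent itemsets of a tiny transaction set and marks which are closed (no proper superset with the same support) and which are generators (no proper subset with the same support). It is a definitional illustration only and does not reproduce Touch, Charm, or Talky-G; the function names and the toy transactions are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Map each frequent itemset (including the empty set) to its support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(0, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count:
                result[candidate] = count
    return result


def closed_and_generators(freq):
    """Split frequent itemsets into closed itemsets and generators."""
    closed = {s for s, c in freq.items()
              if not any(s < t and c == ct for t, ct in freq.items())}
    generators = {s for s, c in freq.items()
                  if not any(t < s and c == ct for t, ct in freq.items())}
    return closed, generators


# Toy transactions, invented for this sketch.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
freq = frequent_itemsets(transactions, min_count=2)
closed, generators = closed_and_generators(freq)
print(sorted(map(sorted, closed)))
print(sorted(map(sorted, generators)))
```

In this toy example {c} is a generator whose closure is {a, c}: both occur in exactly two transactions, so {c} is not closed and {a, c} is not a generator.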

51 citations

Book Chapter•10.1007/978-3-642-03915-7_25•

Ontology-Driven KDD Process Composition

Claudia Diamantini¹, Domenico Potena¹, Emanuele Storti¹•Institutions (1)

Marche Polytechnic University¹

27 Aug 2009

TL;DR: This paper introduces a goal-driven procedure for automatically composing algorithms, based on the exploitation of KDDONTO, an ontology formalizing the domain of KDD algorithms, which makes it possible to generate valid and non-trivial processes.

Abstract: One of the most interesting challenges in the Knowledge Discovery in Databases (KDD) field is giving support to users in the composition of tools for forming a valid and useful KDD process. Such an activity implies that users have both to choose tools suitable to their knowledge discovery problem, and to compose them for designing the KDD process. To this end, they need expertise and knowledge about functionalities and properties of all KDD algorithms implemented in available tools. In order to support users in this heavy activity, in this paper we introduce a goal-driven procedure for automatically composing algorithms. The proposed procedure is based on the exploitation of KDDONTO, an ontology formalizing the domain of KDD algorithms, allowing us to generate valid and non-trivial processes.

Journal Article•10.3233/IDA-2009-0376•

Spatio-Temporal Sensor Graphs (STSG): A data model for the discovery of spatio-temporal patterns

Betsy George¹, James M. Kang¹, Shashi Shekhar¹•Institutions (1)

University of Minnesota¹

1 Aug 2009

TL;DR: Spatio-Temporal Sensor Graphs (STSG) is proposed to model sensor data at the conceptual and physical levels, which allows the properties of edges and nodes to be modeled as a time series of measurement data.

Abstract: Developing a model that facilitates the representation and knowledge discovery on sensor data presents many challenges. With sensors reporting data at a very high frequency, resulting in large volumes of data, there is a need for a model that is memory efficient. Since sensor data is spatio-temporal in nature, the model must also support the time dependence of the data. Balancing the conflicting requirements of simplicity, expressiveness and storage efficiency is challenging. The model should also provide adequate support for the formulation of efficient algorithms for knowledge discovery. Though spatio-temporal data can be modeled using time expanded graphs, this model replicates the entire graph across time instants, resulting in high storage overhead and computationally expensive algorithms. In this paper, we propose Spatio-Temporal Sensor Graphs (STSG) to model sensor data at the conceptual, logical and physical levels. This model allows the properties of edges and nodes to be modeled as a time series of measurement data. Data at each instant would consist of the measured value and the expected error. Also, we evaluate the model using methods to find interesting patterns such as growing hotspots in sensor data and present analytical comparison of the algorithms with methods based on existing models.

Journal Article•10.3233/IDA-2009-0384•

A new generic basis of factual and implicative association rules

Sadok Ben Yahia, Ghada Gasmi¹, Engelbert Mephu Nguifo¹•Institutions (1)

University of Lille¹

1 Dec 2009

TL;DR: This paper introduces a novel informative generic basis of association rules, conveying two types of knowledge: "factual" and "implicative" and presents a valid and complete axiomatic system allowing one to infer the set of all association rules.

Abstract: The extremely large number of association rules that can be drawn from even reasonably sized datasets has spurred the development of sharper techniques to reduce the size of the reported rule sets. In this context, the battery of results provided by Formal Concept Analysis (FCA) has made it possible to define "irreducible" nuclei of association rule subsets, better known as generic bases. From such a condensed, reduced-size set of association rules, it is possible to infer all association rules, commonly via an adequate axiomatic system. In this paper, we introduce a novel informative generic basis of association rules, conveying two types of knowledge: "factual" and "implicative". We also present a valid and complete axiomatic system allowing one to infer the set of all association rules. Results of the experiments carried out on real-life datasets have shown substantial gains in terms of compactness of the introduced generic basis.

Journal Article•10.3233/IDA-2009-0365•

A new ranking procedure by incomplete pairwise comparisons using preference subsets

Lev V. Utkin

1 Apr 2009

TL;DR: A method for ranking alternatives or objects by incomplete pairwise comparisons using random set theory, along with its extensions, is proposed; the method is extended to the case of independent groups of experts.

Abstract: A method for ranking alternatives or objects by incomplete pairwise comparisons using random set theory, and its extensions, are proposed in the paper. The main feature of the method is that it allows us to deal with comparisons of arbitrary groups of alternatives. The method is extended to the case of independent groups of experts. The imprecise Dirichlet model is also used to make cautious decisions in several cases. Various numerical examples illustrate the proposed method and its extensions.

Book Chapter•10.1007/978-3-642-03915-7_35•

Isotonic Classification Trees

Rémon Kamp¹, Ad Feelders¹, Nicola Barile¹•Institutions (1)

Utrecht University¹

27 Aug 2009

TL;DR: A new algorithm for learning isotonic classification trees that relabels non-monotone leaf nodes by performing the isotonic regression on the collection of leaf nodes is proposed.

Abstract: We propose a new algorithm for learning isotonic classification trees. It relabels non-monotone leaf nodes by performing the isotonic regression on the collection of leaf nodes. In case two leaf nodes with a common parent have the same class after relabeling, the tree is pruned in the parent node. Since we consider problems with ordered class labels, all results are evaluated on the basis of L1 prediction error. We experimentally compare the performance of the new algorithm with standard classification trees.

Journal Article•10.3233/IDA-2009-0374•

Context-aware adaptive data stream mining

Pari Delir Haghighi¹, Arkady Zaslavsky¹, Shonali Krishnaswamy¹, Mohamed Medhat Gaber¹, Seng Loke²•Institutions (2)

Monash University¹, La Trobe University²

1 Aug 2009

TL;DR: This paper presents a general approach for context-aware adaptive mining of data streams that aims to dynamically and autonomously adjust data stream mining parameters according to changes in context and situations.

Abstract: In resource-constrained devices, adaptation of data stream processing to variations of data rates and availability of resources is crucial for consistency and continuity of running applications. However, to enhance and maximize the benefits of adaptation, there is a need to go beyond mere computational and device capabilities to encompass the full spectrum of context-awareness. This paper presents a general approach for context-aware adaptive mining of data streams that aims to dynamically and autonomously adjust data stream mining parameters according to changes in context and situations. We perform intelligent and real-time analysis of data streams generated from sensors, underpinned by context-aware adaptation. A prototype of the proposed architecture is implemented and evaluated in the paper through a real-world scenario in the area of healthcare monitoring.

Journal Article•10.3233/IDA-2009-0358•

Mining constraint-based patterns using automatic relaxation

Arnaud Soulet¹, Bruno Crémilleux²•Institutions (2)

François Rabelais University¹, University of Caen Lower Normandy²

1 Jan 2009

TL;DR: It is proved that this set of constraints, called primitive-based constraints, is a superclass not only of both kinds of monotone constraints and their boolean combinations but also of other classes such as convertible and succinct constraints.

Abstract: Constraint-based mining is an active field of research and a necessary step towards interactive and successful KDD processes. The limitations of the task lie in the restricted languages available for describing the mined patterns and in the limited ability to express varied constraints. In practice, current approaches focus on a single language, and the most generic frameworks mine, individually or simultaneously, a monotone and an anti-monotone constraint. In this paper, we propose a generic framework dealing with any partially ordered language and a large set of constraints. We prove that this set of constraints, called primitive-based constraints, is a superclass not only of both kinds of monotone constraints and their boolean combinations but also of other classes such as convertible and succinct constraints. We show that the primitive-based constraints can be efficiently mined thanks to a relaxation method based on virtual patterns which summarize the specificities of the search space. Indeed, this approach automatically deduces pruning conditions having suitable monotone properties and thus these conditions can be pushed into usual constraint mining algorithms. We study the optimal relaxations. Finally, we provide an experimental illustration of the efficiency of our proposal by experimenting with it in several contexts.

Journal Article•10.3233/IDA-2009-0370•

On pushing weight constraints deeply into frequent itemset mining

Unil Yun¹•Institutions (1)

Chungbuk National University¹

1 Apr 2009

TL;DR: This paper proposes two efficient algorithms for mining weighted frequent itemsets in which the main approaches are to push weight constraints into the Apriori algorithm and the pattern growth algorithm respectively, and shows how to maintain the downward closure property in mining weighted frequent itemsets.

Abstract: There have been many studies on mining frequent itemsets (or patterns) in the data mining field because of its broad applications in mining association rules, correlations, graph patterns, constraint-based frequent patterns, sequential patterns, and many other data mining tasks. One of the major challenges in frequent pattern mining is the huge number of result patterns. As the minimum threshold becomes lower, an exponentially large number of itemsets are generated. Therefore, pruning unimportant patterns effectively in the mining process is one of the main topics in frequent pattern mining. In weighted frequent pattern mining, not only support but also weight is used, so that important patterns can be detected. In this paper, we propose two efficient algorithms for mining weighted frequent itemsets in which the main approaches are to push weight constraints into the Apriori algorithm and the pattern growth algorithm respectively. Additionally, we show how to maintain the downward closure property in mining weighted frequent itemsets. In our approach, the normalized weights within the weight range are used according to the importance of items. A weight range is used to restrict weights of items and a minimum weight is utilized to balance between weight and support of items for pruning the search space. Our approach generates fewer but important weighted frequent itemsets in large databases, particularly dense databases with low minimum supports. An extensive performance study shows that our algorithm outperforms previous mining algorithms. In addition, it is efficient and scalable.
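
The difficulty with weights and downward closure can be sketched as follows. The code assumes, purely for illustration, that an itemset's weight is the average of its item weights and that weighted support is support times weight; the paper's exact definitions and pruning conditions may differ. Under those assumptions, multiplying support by the largest item weight gives an upper bound that no superset can exceed, so it can be used for safe pruning. All names and data below are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def itemset_weight(itemset, weights):
    """Assumed definition: average weight of the items in the itemset."""
    return sum(weights[i] for i in itemset) / len(itemset)


def weighted_support(itemset, transactions, weights):
    """Assumed definition: support scaled by the itemset weight."""
    return support(itemset, transactions) * itemset_weight(itemset, weights)


def may_have_frequent_superset(itemset, transactions, weights, min_wsup):
    """Safe pruning test: support never grows when items are added, and no
    itemset weight exceeds the largest item weight, so this product bounds the
    weighted support of the itemset and of all its supersets."""
    return support(itemset, transactions) * max(weights.values()) >= min_wsup


# Toy data, invented for this sketch.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
weights = {"a": 0.5, "b": 1.0, "c": 0.25}

ws_ab = weighted_support({"a", "b"}, transactions, weights)
keep_c = may_have_frequent_superset({"c"}, transactions, weights, min_wsup=0.4)
keep_abc = may_have_frequent_superset({"a", "b", "c"}, transactions, weights, min_wsup=0.4)
print(ws_ab, keep_c, keep_abc)
```

Note that {c} itself has a low weighted support here but must still be kept for extension, because adding a heavier item can raise the average weight; this is exactly why plain weighted support is not anti-monotone.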

Book Chapter•10.1007/978-3-642-03915-7_19•

Feature Extraction and Selection from Vibration Measurements for Structural Health Monitoring

Janne Toivola¹, Jaakko Hollmén¹•Institutions (1)

Helsinki University of Technology¹

27 Aug 2009

TL;DR: This work investigates the problem of extracting features from lightweight wireless acceleration sensors by selecting random sets of features and estimating probabilistic classifiers for damage detection purposes, and assesses the relevance of the features in a large population of classifiers.

Abstract: Structural Health Monitoring (SHM) aims at monitoring buildings or other structures and assessing their condition, alerting about new defects in the structure when necessary. For instance, vibration measurements can be used for monitoring the condition of a bridge. We investigate the problem of extracting features from lightweight wireless acceleration sensors. On-line algorithms for frequency domain monitoring are considered, and the resulting features are combined to form a large bank of candidate features. We explore the feature space by selecting random sets of features and estimating probabilistic classifiers for damage detection purposes. We assess the relevance of the features in a large population of classifiers. The methods are assessed with real-life data from a wooden bridge model, where structural problems are simulated with small added weights.

Book Chapter•10.1007/978-3-642-03915-7_15•

ART-Based Neural Networks for Multi-label Classification

Elena Sapozhnikova¹•Institutions (1)

University of Konstanz¹

27 Aug 2009

TL;DR: This paper presents a novel approach based on ART (Adaptive Resonance Theory) neural networks that improves the multi-label classification performance of Fuzzy ARTMAP and ARAM algorithms and shows the effectiveness of the proposed approach.

Abstract: Multi-label classification is an active and rapidly developing research area of data analysis. It becomes increasingly important in such fields as gene function prediction, text classification or web mining. This task corresponds to classification of instances labeled by multiple classes rather than just one. Traditionally, it was solved by learning independent binary classifiers for each class and combining their outputs to obtain multi-label predictions. Alternatively, a classifier can be directly trained to predict a label set of an unknown size for each unseen instance. Recently, several direct multi-label machine learning algorithms have been proposed. This paper presents a novel approach based on ART (Adaptive Resonance Theory) neural networks. The Fuzzy ARTMAP and ARAM algorithms were modified in order to improve their multi-label classification performance and were evaluated on benchmark datasets. Comparison of experimental results with the results of other multi-label classifiers shows the effectiveness of the proposed approach.

Book Chapter•10.1007/978-3-642-03915-7_14•

Condensed Representation of Sequential Patterns According to Frequency-Based Measures

Marc Plantevit¹, Bruno Crémilleux¹•Institutions (1)

University of Caen Lower Normandy¹

27 Aug 2009

TL;DR: This paper defines an exact condensed representation according to the frequency-based measures and shows how to infer the best patterns according to these measures, i.e., the patterns which maximize them.

Abstract: Condensed representations of patterns are at the core of many data mining works and there are a lot of contributions handling data described by items. In this paper, we tackle sequential data and we define an exact condensed representation for sequential patterns according to the frequency-based measures. These measures are often used, typically in order to evaluate classification rules. Furthermore, we show how to infer the best patterns according to these measures, i.e., the patterns which maximize them. These patterns are immediately obtained from the condensed representation so that this approach is easily usable in practice. Experiments conducted on various datasets demonstrate the feasibility and the interest of our approach.

Book Chapter•10.1007/978-3-642-03915-7_2•

Analyzing the Localization of Retail Stores with Complex Systems Tools

Pablo Jensen¹•Institutions (1)

École normale supérieure de Lyon¹

27 Aug 2009

TL;DR: From pure location data, network analysis leads to a community structure that closely follows the commercial classification of the US Department of Labor. The interaction network makes it possible to build a 'quality' index of optimal location niches for stores, which has been empirically tested.

Abstract: Measuring the spatial distribution of locations of many entities (trees, atoms, economic activities, ...), and, more precisely, the deviations from purely random configurations, is a powerful method to unravel their underlying interactions. I study here the spatial organization of retail commercial activities. From pure location data, network analysis leads to a community structure that closely follows the commercial classification of the US Department of Labor. The interaction network makes it possible to build a 'quality' index of optimal location niches for stores, which has been empirically tested.

Journal Article•10.3233/IDA-2009-0379•

A probabilistic framework for automatic term recognition

Wilson Wong¹, Wei Liu¹, Mohammed Bennamoun¹•Institutions (1)

University of Western Australia¹

1 Dec 2009

TL;DR: This work proposes a probabilistic framework for formalising and combining qualitative evidence based on explicitly defined term characteristics to produce a new termhood measure that demonstrates consistently better precision, recall and accuracy compared to three other existing ad-hoc measures.

Abstract: Term recognition identifies domain-relevant terms which are essential for discovering domain concepts and for the construction of terminologies required by a wide range of natural language applications. Many techniques have been developed in an attempt to numerically determine or quantify termhood based on term characteristics. Some of the apparent shortcomings of existing techniques are the ad-hoc combination of termhood evidence, mathematically-unfounded derivation of scores and implicit assumptions concerning term characteristics. We propose a probabilistic framework for formalising and combining qualitative evidence based on explicitly defined term characteristics to produce a new termhood measure. Our qualitative and quantitative evaluations demonstrate consistently better precision, recall and accuracy compared to three other existing ad-hoc measures.

Book Chapter•10.1007/978-3-642-03915-7_11•

Subgroup Discovery for Test Selection: A Novel Approach and Its Application to Breast Cancer Diagnosis

Marianne Mueller¹, Romer Rosales², Harald Steck², Sriram Krishnan², Bharat Rao², Stefan Kramer¹•Institutions (2)

Technische Universität München¹, Siemens²

27 Aug 2009

TL;DR: In experiments on breast cancer diagnosis data, it was shown that the proposed approach to test selection, based on the discovery of subgroups of patients sharing the same optimal test, is faster than the baseline algorithm APRIORI-SD while preserving its accuracy.

Abstract: We propose a new approach to test selection based on the discovery of subgroups of patients sharing the same optimal test, and present its application to breast cancer diagnosis. Subgroups are defined in terms of background information about the patient. We automatically determine the best t subgroups a patient belongs to, and choose the test proposed by the majority of them. We introduce the concept of prediction quality to measure how accurate the test outcome is regarding the disease status. The quality of a subgroup is then the best mean prediction quality of its members (choosing the same test for all). Incorporating the quality computation in the search heuristic enables a significant reduction of the search space. In experiments on breast cancer diagnosis data, we showed that it is faster than the baseline algorithm APRIORI-SD while preserving its accuracy.

Journal Article•10.3233/IDA-2009-0399•

Metric learning for semi-supervised clustering using pairwise constraints and the geometrical structure of data

Mahdieh Soleymani Baghshah¹, Saeed Bagheri Shouraki¹•Institutions (1)

Sharif University of Technology¹

1 Dec 2009

TL;DR: The proposed metric learning method can improve the performance of semi-supervised clustering algorithms, and experimental results on real-world data sets show its effectiveness.

Abstract: Metric learning is a powerful approach for semi-supervised clustering. In this paper, a metric learning method considering both pairwise constraints and the geometrical structure of data is introduced for semi-supervised clustering. First, a smooth metric is found (based on an optimization problem) using positive constraints as supervisory information. Then, an extension of this method employing both positive and negative constraints is introduced. As opposed to the existing methods, the extended method can consider both positive and negative constraints while also considering the topological structure of data. The proposed metric learning method can improve the performance of semi-supervised clustering algorithms. Experimental results on real-world data sets show the effectiveness of this method.

Journal Article•10.3233/IDA-2009-0368•

Mining periodic patterns in spatio-temporal sequences at different time granularities

Sezin Karli¹, Yucel Saygin¹•Institutions (1)

Sabancı University¹

1 Apr 2009

TL;DR: Experimental results show that the proposed techniques are indeed effective and efficient for mining periodic spatio-temporal patterns at different time granularities.

Abstract: With the advancement of technology, it is now easy to collect the location information of mobile users over time. Spatio-temporal data mining techniques were proposed in the literature for the extraction of patterns from spatio-temporal data. However, current techniques can only extract patterns of the finest time granularity, and therefore overlook potential patterns available at coarser time granularities. In this work, we propose two techniques to allow mining at different time granularities. Experimental results show that the proposed techniques are indeed effective and efficient for mining periodic spatio-temporal patterns at different time granularities.

Journal Article•10.3233/IDA-2009-0390•

Quality indices for (practical) clustering evaluation

Margarida G. M. S. Cardoso¹, André Ponce de Leon F. de Carvalho²•Institutions (2)

ISCTE – University Institute of Lisbon¹, University of São Paulo²

1 Oct 2009

TL;DR: This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation, and presents some commonly used indices, as well as indices recently proposed in the literature.

Abstract: Clustering quality or validation indices allow the evaluation of the quality of clustering in order to support the selection of a specific partition or clustering structure in its natural unsupervised environment, where the real solution is unknown or not available. In this paper, we investigate the use of quality indices mostly based on the concepts of clusters' compactness and separation, for the evaluation of clustering results (partitions in particular). This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation. After presenting some commonly used indices, as well as indices recently proposed in the literature, key issues regarding the practical use of quality indices are addressed. A general methodological approach is presented which considers the identification of appropriate indices thresholds. This general approach is compared with the simple use of quality indices for evaluating a clustering solution.
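
As one concrete instance of an index built on compactness and separation, the sketch below computes the Davies-Bouldin index for two candidate partitions of the same points. The abstract does not say which indices the paper covers, so the choice of this index, the function names, and the toy data are assumptions. It requires Python 3.8 or later for `math.dist`.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition (lower means better separated).

    For each cluster, compactness is the mean distance of its points to its
    centroid, and separation is the distance between centroids. The index
    averages, over clusters, the worst (largest) compactness/separation ratio.
    Needs at least two clusters with distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, center) for p in c) / len(c)
        for c, center in zip(clusters, centers)
    ]
    k = len(clusters)
    worst_ratios = []
    for i in range(k):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / k


# Two partitions of the same four points (assumed toy data).
good_partition = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
poor_partition = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]

db_good = davies_bouldin(good_partition)
db_poor = davies_bouldin(poor_partition)
```

Comparing the two values ranks the partitions, but it does not say whether either value is good in absolute terms, which is why the abstract raises the question of appropriate index thresholds.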

Journal Article•10.3233/IDA-2009-0367•

SOM-based data analysis of speculative attacks' real effects

Ismael E. Arciniegas Rueda, F. Arciniegas

1 Apr 2009

TL;DR: This paper used self-organizing maps (SOM) to search for meaningful associations between speculative attacks' real effects and 28 variables that characterize the economic, financial, legal, and socio-political structure of the country at the onset of the attack.

Abstract: In some cases, currency crises are followed by strong recessions (e.g., recent Asian and Argentinean crises), but in other cases they are not. This paper uses Self-Organizing Maps (SOM) to search for meaningful associations between speculative attacks' real effects and 28 variables that characterize the economic, financial, legal, and socio-political structure of the country at the onset of the attack. SOM is a neural network-based generalization of Principal Component Analysis (PCA) that provides an efficient non-linear projection of the multidimensional data space on a curved surface. This paper finds a strong association of speculative attacks' real effects with fundamentals and the banking sector structure.

Journal Article•10.3233/IDA-2009-0369•

Models for association rules based on clustering and correlation

Carlos Ordonez¹•Institutions (1)

University of Houston¹

1 Apr 2009

TL;DR: The clustering model estimates support accurately given a sufficiently large number of clusters, and it is more accurate than correlation, except for sets of two items; itemset support can be bounded and approximated from both models.

Abstract: Association rules require models to understand their relationship to statistical properties of the data set. In this work, we study mathematical relationships between association rules and two fundamental techniques: clustering and correlation. Each cluster represents an important itemset. We show the sufficient statistics for clustering and correlation on binary data sets are the linear sum of points and the quadratic sum of points, respectively. We prove itemset support can be bounded and approximated from both models. Support bounds and support estimation obey the set downward closure property for fast bottom-up search for frequent itemsets. Both models can be efficiently computed with sparse matrix computations. Experiments with real and synthetic data sets evaluate model accuracy and speed. The clustering model estimates support accurately given a sufficiently large number of clusters, and it is more accurate than correlation, except for sets of two items. Accuracy increases as the number of clusters grows, but decreases as the minimum support threshold decreases. Once built, the clustering model represents a faster alternative to the traditional A-priori algorithm and the correlation model for mining associations. The correlation model is faster to compute than clustering, but it is less accurate. Time complexity to compute both models is linear in data set size, whereas dimensionality marginally impacts time when analyzing large transaction data sets.
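
The two sufficient statistics named in the abstract can be illustrated on a small binary transaction table. On binary data the linear sum gives per-item counts and the quadratic sum gives pairwise co-occurrence counts. The upper bound used below is the standard downward-closure argument (an itemset cannot be more frequent than any pair it contains); it is an assumption for illustration and not necessarily the bound proved in the paper. All names and data are hypothetical.

```python
from itertools import combinations


def sufficient_statistics(rows):
    """Return (n, linear sum, quadratic sum) for a list of 0/1 rows.

    linear[i] counts rows containing item i; quadratic[i][j] counts rows
    containing both items i and j. Only the items present in each row are
    visited, which mirrors a sparse computation.
    """
    d = len(rows[0])
    linear = [0] * d
    quadratic = [[0] * d for _ in range(d)]
    for row in rows:
        present = [i for i, value in enumerate(row) if value]
        for i in present:
            linear[i] += 1
            for j in present:
                quadratic[i][j] += 1
    return len(rows), linear, quadratic


def support_upper_bound(itemset, n, linear, quadratic):
    """Upper bound on the support of a non-empty itemset from pairwise counts."""
    items = sorted(itemset)
    if len(items) == 1:
        return linear[items[0]] / n
    return min(quadratic[i][j] for i, j in combinations(items, 2)) / n


def exact_support(itemset, rows):
    """Fraction of rows that contain every item of the itemset."""
    return sum(all(row[i] for i in itemset) for row in rows) / len(rows)


# Toy transactions over four items (assumed data).
rows = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
]
n, linear, quadratic = sufficient_statistics(rows)
bound_triple = support_upper_bound({0, 1, 2}, n, linear, quadratic)
exact_triple = exact_support({0, 1, 2}, rows)
```

For pairs the bound is exact, since the quadratic sum stores the pair count directly; for larger itemsets it only bounds the true support from above.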

Journal Article•10.3233/IDA-2009-0382•

Approximate minimum spanning tree clustering in high-dimensional space

Chih Lai¹, Taras Rafa, Dwight E. Nelson¹•Institutions (1)

University of St. Thomas (Minnesota)¹

1 Dec 2009

TL;DR: An approximate clustering method in which a new Approximate MST (AMST) is repeatedly built in at most (d+1) iterations from two sources: a new Hilbert curve created from carefully shifted N data points, and a previous AMST which holds cumulative vicinity information derived from earlier iterations.

Abstract: Minimum spanning tree (MST) clustering sequentially inserts the nearest points in the $\mathbb{R}^{d}$ space into a list, which is then divided into clusters by using desired criteria. This insertion order, however, can be relaxed provided approximately nearby points in a condensed area are adjacently inserted into a list before distant points in other areas. Based on this observation, we propose an approximate clustering method in which a new Approximate MST (AMST) is repeatedly built in at most (d+1) iterations from two sources: a new Hilbert curve created from carefully shifted N data points, and a previous AMST which holds cumulative vicinity information derived from earlier iterations. Although the final AMST may not completely match a true MST built from an $O(N^{2})$ algorithm, most mismatches occur locally within individual data groups and are therefore unimportant for clustering. Our experiments on synthetic datasets and animal motion vectors extracted from surveillance videos show that high-quality clusters can be efficiently obtained from this approximation method.
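
For reference, the exact baseline that the abstract compares against can be sketched as follows: build a true MST with an $O(N^{2})$ Prim's algorithm, then split it into clusters. This is not the Hilbert-curve approximation proposed in the paper, and the splitting criterion (removing the k-1 longest edges) is an assumption, since the abstract only mentions "desired criteria". Function names and data are hypothetical. It requires Python 3.8 or later for `math.dist`.

```python
import math


def minimum_spanning_tree(points):
    """Exact MST via Prim's algorithm in O(N^2); returns (length, u, v) edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from the tree to each point
    best_parent = [-1] * n       # tree endpoint of that cheapest link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best_dist.__getitem__)
        in_tree[u] = True
        if best_parent[u] >= 0:
            edges.append((best_dist[u], best_parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_parent[v] = u
    return edges


def mst_clusters(points, k):
    """Split the MST into k clusters by dropping its k-1 longest edges (assumed criterion)."""
    edges = sorted(minimum_spanning_tree(points))
    kept = edges[: len(points) - k]
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, a, b in kept:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy points (assumed): two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
clusters = mst_clusters(points, 3)
```

The long edges that get cut are the ones joining different groups, which is consistent with the abstract's observation that mismatches inside a group matter little for the resulting clusters.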
