Top 42 papers presented at Intelligent Data Analysis in 2004

Journal Article•10.3233/IDA-2004-8305•

Learning drifting concepts: Example selection vs. example weighting

1 Aug 2004

TL;DR: This paper proposes several methods for handling concept drift with support vector machines that robustly select an appropriate window size, example selection, or example weighting.

Abstract: For many learning tasks where data is collected over an extended period of time, its underlying distribution is likely to change. A typical example is information filtering, i.e. the adaptive classification of documents with respect to a particular user interest. Both the interest of the user and the document content change over time. A filtering system should be able to adapt to such concept changes. This paper proposes several methods to handle such concept drifts with support vector machines. The methods either maintain an adaptive time window on the training data [13], select representative training examples, or weight the training examples [15]. The key idea is to automatically adjust the window size, the example selection, and the example weighting, respectively, so that the estimated generalization error is minimized. The approaches are both theoretically well-founded as well as effective and efficient in practice. Since they do not require complicated parameterization, they are simpler to use and more robust than comparable heuristics. Experiments with simulated concept drift scenarios based on real-world text data compare the new methods with other window management approaches. We show that they can effectively select an appropriate window size, example selection, and example weighting, respectively, in a robust way. We also explain how the proposed example selection and weighting approaches can be turned into incremental approaches. Since most evaluation methods for machine learning, like e.g. cross-validation, assume that the examples are independent and identically distributed, which is clearly unrealistic in the case of concept drift, alternative evaluation schemes are used to estimate and optimize the performance of each learning step within the concept drift handling frameworks as well as to evaluate and compare the different frameworks.

521 citations
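The window-adjustment idea described in the abstract above can be illustrated with a minimal sketch. It is not the paper's method: the paper uses support vector machines and an estimate of the generalization error, whereas this sketch assumes a toy one-dimensional threshold classifier and simply holds out the newest batch to estimate the error. The function names `train_threshold`, `error_rate`, and `select_window` are hypothetical.

```python
from statistics import mean


def train_threshold(examples):
    """Fit a toy 1-D classifier: threshold halfway between the class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    if not pos or not neg:
        # Degenerate window: always predict the only class seen.
        only = 1 if pos else 0
        return lambda x: only
    mean_pos, mean_neg = mean(pos), mean(neg)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos >= mean_neg:
        return lambda x: 1 if x >= threshold else 0
    return lambda x: 1 if x < threshold else 0


def error_rate(model, examples):
    """Fraction of (x, y) examples the model misclassifies."""
    return sum(model(x) != y for x, y in examples) / len(examples)


def select_window(batches):
    """Choose how many of the most recent training batches to keep.

    The newest batch is held out (an assumption of this sketch) and each
    candidate window of older batches is scored by its error on it.
    Returns (window_size, estimated_error), or None with fewer than 2 batches.
    """
    newest = batches[-1]
    best = None
    for size in range(1, len(batches)):
        window = [ex for batch in batches[-1 - size:-1] for ex in batch]
        err = error_rate(train_threshold(window), newest)
        if best is None or err < best[1]:  # ties keep the smaller window
            best = (size, err)
    return best
```

On a stream whose labels flip partway through, this sketch keeps only the batches collected after the flip, because including older batches raises the error measured on the newest batch.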

Proceedings Article•10.5555/1293831.1293836•

Learning drifting concepts: Example selection vs. example weighting

Ralf Klinkenberg

1 Aug 2004

TL;DR: For many learning tasks where data is collected over an extended period of time, its underlying distribution is likely to change. A typical example is information filtering, i.e. the adaptive class...

176 citations

Journal Article•10.3233/IDA-2004-8406•

Why machine learning algorithms fail in misuse detection on KDD intrusion detection data set

Maheshkumar Sabhnani¹, Gursel Serpen¹•Institutions (1)

University of Toledo¹

1 Sep 2004

TL;DR: Analysis results clearly suggest that no pattern classification or machine learning algorithm can be trained successfully with the KDD data set to perform misuse detection for user-to-root or remote-to-local attack categories.

Abstract: A large set of machine learning and pattern classification algorithms trained and tested on KDD intrusion detection data set failed to identify most of the user-to-root and remote-to-local attacks, as reported by many researchers in the literature. In light of this observation, this paper aims to expose the deficiencies and limitations of the KDD data set to argue that this data set should not be used to train pattern recognition or machine learning algorithms for misuse detection for these two attack categories. Multiple analysis techniques are employed to demonstrate, both objectively and subjectively, that the KDD training and testing data subsets represent dissimilar target hypotheses for user-to-root and remote-to-local attack categories. These techniques consisted of switching the roles of original training and testing data subsets to develop a decision tree classifier, cross-validation on merged training and testing data subsets, and qualitative and comparative analysis of rules generated independently on training and testing data subsets through the C4.5 decision tree algorithm. Analysis results clearly suggest that no pattern classification or machine learning algorithm can be trained successfully with the KDD data set to perform misuse detection for user-to-root or remote-to-local attack categories. It is further noted that the analysis techniques employed to assess the similarity between the two target hypotheses represented by the training and the testing data subsets can readily be generalized to data set pairs in other problem domains.

172 citations

Journal Article•10.3233/IDA-2004-8602•

A surface-based approach for classification of 3D neuroanatomic structures

Li Shen¹, James Ford¹, Fillia Makedon¹, Andrew J. Saykin¹•Institutions (1)

Dartmouth College¹

1 Dec 2004

TL;DR: A threshold-free receiver operating characteristic (ROC) approach is employed as an alternative evaluation of classification results as well as a new method for visualizing discriminative patterns to help medical diagnosis in practice.

Abstract: We present a new framework for 3D surface object classification that combines a powerful shape description method with suitable pattern classification techniques. Spherical harmonic parameterization and normalization techniques are used to describe a surface shape and derive a dual high dimensional landmark representation. A point distribution model is applied to reduce the dimensionality. Fisher's linear discriminants and support vector machines are used for classification. Several feature selection schemes are proposed for learning better classifiers. After showing the effectiveness of this framework using simulated shape data, we apply it to real hippocampal data in schizophrenia and perform extensive experimental studies by examining different combinations of techniques. We achieve best leave-one-out cross-validation accuracies of 93% (whole set, N = 56) and 90% (right-handed males, N = 39), respectively, which are competitive with the best results in previous studies using different techniques on similar types of data. Furthermore, to help medical diagnosis in practice, we employ a threshold-free receiver operating characteristic (ROC) approach as an alternative evaluation of classification results as well as propose a new method for visualizing discriminative patterns.

84 citations

Journal Article•10.3233/IDA-2004-8204•

A global optimal algorithm for class-dependent discretization of continuous data

Lili Liu¹, Andrew K. C. Wong¹, Yang Wang•Institutions (1)

University of Waterloo¹

1 Apr 2004

TL;DR: The proposed method, OCDD, or Optimal Class-Dependent Discretization, finds the global optimum of the objective functions and can be used to discretize continuous variables for many existing inductive learning systems.

Abstract: This paper presents a new method to convert continuous variables into discrete variables for inductive machine learning. The method can be applied to pattern classification problems in machine learning and data mining. The discretization process is formulated as an optimization problem. We first use the normalized mutual information that measures the interdependence between the class labels and the variable to be discretized as the objective function, and then use fractional programming (iterative dynamic programming) to find its optimum. Unlike the majority of class-dependent discretization methods in the literature which only find the local optimum of the objective functions, the proposed method, OCDD, or Optimal Class-Dependent Discretization, finds the global optimum. The experimental results demonstrate that this algorithm is very effective in classification when coupled with popular learning systems such as C4.5 decision trees and Naive-Bayes classifier. It can be used to discretize continuous variables for many existing inductive learning systems.

44 citations

Journal Article•10.3233/IDA-2004-8203•

A new approach to hierarchical clustering and structuring of data with Self-Organizing Maps

Elias Pampalk¹, Gerhard Widmer¹, Alvin Chan²•Institutions (2)

Austrian Research Institute for Artificial Intelligence¹, DSO National Laboratories²

1 Apr 2004

TL;DR: This work presents a novel approach to reveal the inherent hierarchical structure of data using multiple SOMs together with heuristics which optimize the stability, and introduces the Tension and Mapping Ratio extension to exploit specific characteristics of the SOM based on the topology preservation.

Abstract: The Self-Organizing Map (SOM) is a powerful tool for exploratory data analysis which has been employed in a wide range of data mining applications. We present a novel approach to reveal the inherent hierarchical structure of data using multiple SOMs together with heuristics which optimize the stability. In particular, we address shortcomings of the Growing Hierarchical Self-Organizing Map (GHSOM) regarding the decision which areas in the hierarchical structure need to be represented by a finer granularity and which areas do not. We introduce the Tension and Mapping Ratio extension to exploit specific characteristics of the SOM based on the topology preservation. As a main result, in contrast to the GHSOM, the inherent hierarchical structure of the data is revealed without requiring the user to define a threshold parameter which controls the map sizes of the individual SOMs. We evaluate our approach using data from real-world data mining projects in the music domain.

40 citations

Book Chapter•10.1007/978-3-540-45231-7_35•

Large scale mining of molecular fragments with wildcards

Heiko Hofer, Christian Borgelt¹, Michael R. Berthold²•Institutions (2)

Otto-von-Guericke University Magdeburg¹, University of Konstanz²

1 Oct 2004

TL;DR: Two extensions of a molecular-fragment approach to classifying novel molecules as active or inactive are presented: a special treatment of rings and a method that finds fragments with wildcards based on chemical expert knowledge.

Abstract: The main task of drug discovery is to find novel bioactive molecules, i.e., chemical compounds that, for example, protect human cells against a virus. One way to support solving this task is to analyze a database of known and tested molecules with the aim to build a classifier that predicts whether a novel molecule will be active or inactive, so that future chemical tests can be focused on the most promising candidates. In [1] an algorithm for constructing such a classifier was proposed that uses molecular fragments to discriminate between active and inactive molecules. In this paper we present two extensions of this approach: A special treatment of rings and a method that finds fragments with wildcards based on chemical expert knowledge.

38 citations

Journal Article•10.3233/IDA-2004-8306•

Sequential learning in neural networks: A review and a discussion of pseudorehearsal based methods

Anthony Robins¹•Institutions (1)

University of Otago¹

1 Aug 2004

TL;DR: This review explores the topic of sequential learning, where information to be learned and retained arrives in separate episodes over time, in the context of artificial neural networks, and examines the pseudorehearsal mechanism, which is an effective solution to the catastrophic forgetting problem in back propagation type networks.

Abstract: In this review we explore the topic of sequential learning, where information to be learned and retained arrives in separate episodes over time, in the context of artificial neural networks. Most neural networks handle this kind of task very badly, as new learning completely disrupts information previously learned by the network. This problem, known as "catastrophic forgetting", has received a lot of attention in the literature. We illustrate the catastrophic forgetting effect, and summarise possible solutions. In particular, we review the literature relating to the pseudorehearsal mechanism, which is an effective solution to the catastrophic forgetting problem in back propagation type networks. We then review similar issues of capacity, forgetting, and the use of pseudorehearsal in Hopfield type networks. Finally, we briefly discuss these issues in the context of cognition, and summarise interesting topics for further research.

37 citations

Journal Article•10.3233/IDA-2004-8302•

Incremental learning and concept drift in INTHELEX

Floriana Esposito¹, Stefano Ferilli¹, Nicola Fanizzi¹, Teresa Maria Altomare Basile¹, N. Di Mauro¹ (+1 more)•Institutions (1)

University of Bari¹

1 Aug 2004

TL;DR: This work presents a new approach to learning in presence of concept drift, and in particular a special version of the incremental system INTHELEX purposely designed to implement such a technique.

Abstract: Real-world tasks often involve a continuous flow of new information that affects the learned theory, a situation that classical batch (one-step) learning systems are hardly suitable to handle. On the contrary, incremental (also called "on-line") techniques are able to deal with such a situation by exploiting refinement operators. In many cases deep knowledge about the world is not available: Either incomplete information is available at the time of initial theory generation, or the nature of the concepts evolves dynamically. The latter situation is the most difficult to handle since time evolution needs to be considered. This work presents a new approach to learning in presence of concept drift, and in particular a special version of the incremental system INTHELEX purposely designed to implement such a technique. Its behavior in this context has been checked and analyzed by running it on two different datasets.

27 citations

Journal Article•10.3233/IDA-2004-8105•

A trimmed mean approach to finding spatial outliers

Tianming Hu¹, Sam Yuan Sung¹•Institutions (1)

National University of Singapore¹

1 Jan 2004

TL;DR: A local trimmed mean approach to evaluating the spatial outlier factor which is the degree that a site is outlying compared to its neighbors is proposed and empirical results show this approach is significantly better than scatter plot, and slightly better than spatial statistic.

Abstract: Outlier detection concerns discovering unusual data whose behavior is exceptional compared to other data. In contrast to non-spatial outliers, which only consider non-spatial attributes, spatial outliers are defined to be those sites which are very different from their neighbors, where neighbors are defined in terms of spatial attributes, i.e., locations. In this paper, we propose a local trimmed mean approach to evaluating the spatial outlier factor, which is the degree to which a site is outlying compared to its neighbors. The structure of our approach strictly follows the general spatial data model, which states that spatial data consist of trend, dependence and error. We empirically demonstrate that the trimmed mean is more outlier-resistant than the median in estimating sample location, and it is employed to estimate spatial trend in our approach. In addition to using the 1st order neighbors in computing error, we also use higher order neighbors to estimate spatial trend. With the true outlier factor supposed to be given by the spatial error model, we compare our approach with spatial statistic and scatter plot. Experimental results on two real datasets show our approach is significantly better than scatter plot, and slightly better than spatial statistic.

27 citations
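As a rough illustration of why a trimmed mean is a reasonable location estimate for a neighborhood that may itself contain outliers, the following sketch compares a site's value with the trimmed mean of its neighbors. It is a simplification, not the paper's estimator: the trend, dependence, and error decomposition and the higher-order neighbors are omitted, and the names `trimmed_mean` and `outlier_factor` as well as the default trimming proportion are assumptions.

```python
def trimmed_mean(values, proportion=0.2):
    """Mean after dropping the lowest and highest `proportion` of the values."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def outlier_factor(site_value, neighbor_values, proportion=0.2):
    """Absolute gap between a site's value and the trimmed mean of its neighbors."""
    return abs(site_value - trimmed_mean(neighbor_values, proportion))
```

For the values 1, 2, 3, 4, 100, the plain mean is 22, while the trimmed mean with a proportion of 0.2 drops one value from each end and gives 3.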

Journal Article•10.3233/IDA-2004-8102•

A dynamic approach to adjusting lengths of intervals in fuzzy time series forecasting

Kun-Huang Huarng¹, Hui-Kuang Yu¹•Institutions (1)

Feng Chia University¹

1 Jan 2004

TL;DR: This study proposes a dynamic approach to adjusting lengths of intervals in fuzzy time series forecasting, thus capturing fuzzy relationships more appropriately, and shows that this dynamic approach can be applied to improve fuzzy time series forecasting.

Abstract: Fuzzy time series models have been proposed to model linguistic observations and have been extended to model numerical observations as well. Many factors are believed to affect fuzzy time series forecasting. The formulation of fuzzy relationships and the lengths of intervals for observations are considered two of them. Hence, how to cover both issues simultaneously is important for the improvement of forecasting results. This study proposes a dynamic approach to adjusting lengths of intervals in fuzzy time series forecasting, thus capturing fuzzy relationships more appropriately. These fuzzy relationships can then be used to improve forecasting. Enrollment and stock index forecasting are used to demonstrate the effectiveness of the dynamic approach. Empirical results show that this dynamic approach can be applied to improve fuzzy time series forecasting.

Journal Article•10.3233/IDA-2004-8304•

Towards a machine learning approach based on incremental concept formation

Mondher Maddouri

1 Aug 2004

TL;DR: A new learning approach is introduced that improves incremental concept formation and has the advantage of handling data addition, data deletion, data update, attribute addition, and attribute deletion.

Abstract: In many real-world learning problems the data flows continuously and learning algorithms should be able to respond to this circumstance: the induced concept description should gradually change over time. In this paper, we outline some existing incremental learners based on the theory of Formal Concept Analysis (FCA). Then, we introduce a new learning approach that improves incremental concept formation. This approach has the advantage of handling data addition, data deletion, data update, attribute addition, and attribute deletion. Finally, we apply the proposed approach to the problem of cancer diagnosis. We measure the effect of incrementality on the quality of the discovered rules using cross-validation.

Book Chapter•10.1007/978-3-540-45231-7_6•

Resolving rule conflicts with double induction

Tony Lindgren¹, Henrik Boström¹•Institutions (1)

Royal Institute of Technology¹

1 Oct 2004

TL;DR: Experiments show that this method significantly outperforms both the CN2 approach and naive Bayes at resolving conflicts between classification rules.

Abstract: When applying an unordered set of classification rules, the rules may assign more than one class to a particular example. Previous methods of resolving such conflicts between rules include using the most frequent class of the examples covered by the conflicting rules (as done in CN2) and using naive Bayes to calculate the most probable class. An alternative way of solving this problem is presented in this paper: by generating new rules from the examples covered by the conflicting rules. These newly induced rules are then used for classification. Experiments on a number of domains show that this method significantly outperforms both the CN2 approach and naive Bayes.
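The frequency-based baseline mentioned in the abstract above (the one attributed to CN2) can be sketched as follows. This is a minimal illustration, not the CN2 implementation: it assumes each rule stores per-class counts of the training examples it covers under a hypothetical `covered` key, and it simply sums those counts, so an example covered by several rules is counted once per rule.

```python
from collections import Counter


def resolve_by_frequency(conflicting_rules):
    """Pick the class most frequent among examples covered by the conflicting rules."""
    totals = Counter()
    for rule in conflicting_rules:
        # rule["covered"] maps class label -> number of covered training examples
        totals.update(rule["covered"])
    return totals.most_common(1)[0][0]
```

The double-induction method proposed in the paper instead induces new rules from the covered examples; that step depends on the rule learner and is not shown here.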

Journal Article•10.3233/IDA-2004-8502•

Perceptron and SVM learning with generalized cost models

Peter Geibel¹, Ulf Brefeld², Fritz Wysotzki¹•Institutions (2)

Technical University of Berlin¹, Humboldt University of Berlin²

1 Oct 2004

TL;DR: A cost-sensitive perceptron learning rule for non-separable classes is derived that can be extended to multi-modal classes (DIPOL), and a natural cost-sensitive extension of the support vector machine (SVM) is presented.

Abstract: Learning algorithms from the fields of artificial neural networks and machine learning typically do not take any costs into account, or allow only costs depending on the classes of the examples that are used for learning. As an extension of class dependent costs, we consider costs that are example dependent, i.e. feature and class dependent. We derive a cost-sensitive perceptron learning rule for non-separable classes that can be extended to multi-modal classes (DIPOL), and present a natural cost-sensitive extension of the support vector machine (SVM). We also derive an approach for including example dependent costs into an arbitrary cost-insensitive learning algorithm by sampling according to modified probability distributions.
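One simple way to picture the sampling idea in the last sentence of the abstract is to resample the training set with probability proportional to each example's cost and then hand the resampled set to any cost-insensitive learner. The abstract does not specify the modified distribution, so cost-proportional sampling is an assumption here, and `resample_by_cost` is a hypothetical name.

```python
import random


def resample_by_cost(examples, costs, n, seed=None):
    """Draw n examples with replacement, each with probability proportional to its cost."""
    rng = random.Random(seed)  # local generator so a seed gives reproducible draws
    return rng.choices(examples, weights=costs, k=n)
```

Examples with zero cost are never drawn, and an example with twice the cost of another is drawn about twice as often on average.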

Book Chapter•10.1007/978-3-540-45231-7_11•

Topology and intelligent data analysis

Vanessa Robins¹, J. Abernethy², N. Rooney², Elizabeth Bradley²•Institutions (2)

Australian National University¹, University of Colorado Boulder²

1 Oct 2004

TL;DR: This paper shows how topology, properly reformulated for a finite-precision world, can be useful in intelligent data analysis tasks.

Abstract: A broad range of mathematical techniques, ranging from statistics to fuzzy logic, have been used to great advantage in intelligent data analysis. Topology - the fundamental mathematics of shape - has to date been conspicuously absent from this repertoire. This paper shows how topology, properly reformulated for a finite-precision world, can be useful in intelligent data analysis tasks.

Journal Article•10.3233/IDA-2004-8104•

Weighted Instance Typicality Search (WITS): A nearest neighbor data reduction algorithm

Brent D. Morring¹, Tony Martinez•Institutions (1)

Brigham Young University¹

1 Jan 2004

TL;DR: The WITS algorithm achieved the highest average accuracy, showed fewer catastrophic failures, and stored an average of 71% fewer instances than DROP-5, the next most competitive algorithm in terms of accuracy and catastrophic failures; and the C-WITS algorithm provides a user-defined parameter that gives the user control over the training-time vs. accuracy balance.

Abstract: Two disadvantages of the standard nearest neighbor algorithm are 1) it must store all the instances of the training set, thus creating a large memory footprint and 2) it must search all the instances of the training set to predict the classification of a new query point, thus it is slow at run time. Much work has been done to remedy these shortcomings. This paper presents a new algorithm WITS (Weighted-Instance Typicality Search) and a modified version, Clustered-WITS (C-WITS), designed to address these issues. Data reduction algorithms address both issues by storing and using only a portion of the available instances. WITS is an incremental data reduction algorithm with O(n^2) complexity, where n is the training set size. WITS uses the concept of Typicality in conjunction with Instance-Weighting to produce minimal nearest neighbor solutions. WITS and C-WITS are compared to three other state of the art data reduction algorithms on ten real-world datasets. WITS achieved the highest average accuracy, showed fewer catastrophic failures, and stored an average of 71% fewer instances than DROP-5, the next most competitive algorithm in terms of accuracy and catastrophic failures. The C-WITS algorithm provides a user-defined parameter that gives the user control over the training-time vs. accuracy balance. This modification makes C-WITS more suitable for large problems, the very problems data reduction algorithms are designed for. On two large problems (10,992 and 20,000 instances), C-WITS stores only a small fraction of the instances (0.88% and 1.95% of the training data) while maintaining generalization accuracies comparable to the best accuracies reported for these problems.

Journal Article•10.3233/IDA-2004-8606•

Nonmetric multidimensional scaling: Neural networks versus traditional techniques

M. C. van Wezel¹, Walter A. Kosters²•Institutions (2)

Erasmus University Rotterdam¹, Leiden University²

1 Dec 2004

TL;DR: This paper considers various methods for nonmetric multidimensional scaling, including monotone regression by a monotone neural network whose weights are estimated with sequential quadratic programming, and gives an experimental comparison of the methods on various synthetic and real-life datasets.

Abstract: In this paper we consider various methods for nonmetric multidimensional scaling. We focus on the nonmetric phase, for which we consider various alternatives: Kruskal's nonmetric phase, Guttman's nonmetric phase, monotone regression by monotone splines, and monotone regression by a monotone neural network. All methods are briefly described. We use sequential quadratic programming to estimate the weights of the neural network. An experimental comparison of the methods is given for various synthetic and real-life datasets. The monotone neural network performs comparably to the traditional methods.

Journal Article•10.3233/IDA-2004-8405•

An improved genetic algorithm adopting immigration operator

Wenxian Yang¹•Institutions (1)

Nottingham Trent University¹

1 Sep 2004

TL;DR: In order to further improve the convergence performance of available genetic algorithms (GAs), a new operator, namely Immigration Operator (IO), was proposed and an improved genetic algorithm was developed.

Abstract: In order to further improve the convergence performance of available genetic algorithms (GAs), a new operator, namely the Immigration Operator (IO), is proposed in this paper. Using the IO, an improved genetic algorithm was developed. To verify the effectiveness of the IO in improving the evolutionary performance of the algorithm, two benchmark problems were adopted. The first is the typical simulation problem of searching for the maximum value of the advanced Goldstein & Price function in a prescribed region. The second is the well-known Traveling Salesman Problem (TSP). Subsequently, the improved algorithm was applied to search for effective criteria for monitoring the working condition of engine valves. The object inspected in the experiments was the sixth exhaust valve of a 6135-type diesel engine. Both the simulated and practical experiments suggest that, after adopting the IO, a higher rate of convergence is achieved by the improved algorithm. Particularly in solving TSP problems, the crossover operator is handicapped in avoiding the morbid solution (i.e. the same city is visited multiple times in the same tour). In contrast, the IO provides an additional driving force for the evolution.

Proceedings Article•10.5555/1293789.1293791•

A surface-based approach for classification of 3D neuroanatomic structures

Li Shen, James Ford, Fillia Makedon, Andrew Saykin

1 Dec 2004

TL;DR: A new framework for 3D surface object classification is presented that combines a powerful shape description method with suitable pattern classification techniques and spherical harmonic parameterizatio...

Journal Article•10.3233/IDA-2004-8303•

Adaptive Ripple Down Rules method based on minimum description length principle

Tetsuya Yoshida¹, Takuya Wada¹, Hiroshi Motoda¹, Takashi Washio¹•Institutions (1)

Osaka University¹

1 Aug 2004

TL;DR: An adaptive Ripple Down Rules method based on the Minimum Description Length Principle aiming at knowledge acquisition in a dynamically changing environment is proposed and knowledge deletion is carried out as well as knowledge acquisition so that useless knowledge is properly discarded.

Abstract: Ripple Down Rules (RDR) is a knowledge acquisition method that can directly acquire and encode knowledge from human experts. It is an incremental acquisition method and each new piece of knowledge is added as an exception to the existing knowledge base. Past research on the RDR method assumes that the problem domain is stable. This is not the case in reality, especially when an environment changes. Things change over time. This paper proposes an adaptive Ripple Down Rules method based on the Minimum Description Length Principle aiming at knowledge acquisition in a dynamically changing environment. We consider both the change in class distribution on a domain and the change in knowledge source as typical changes in the environment. When class distribution changes, some pieces of knowledge previously acquired become worthless, and the existence of such knowledge may hinder acquisition of new knowledge. In our approach knowledge deletion is carried out as well as knowledge acquisition so that useless knowledge is properly discarded. To cope with the change in knowledge source, RDR knowledge based systems can be constructed adaptively by acquiring knowledge from both domain experts and data. By incorporating inductive learning methods, knowledge can be acquired (learned) even when only one of data or experts is available, by switching the knowledge source from domain experts to data and vice versa at any time during knowledge acquisition. Since experts need not be available all the time, this contributes to reducing personnel costs. Experiments were conducted by simulating the change in knowledge source and the change in class distribution using the datasets in the UCI repository. The results show that it is worth following this path.

Journal Article•10.3233/IDA-2004-8407•

Maintenance of generalized association rules with multiple minimum supports

Ming-Cheng Tseng¹, Wen-Yang Lin¹•Institutions (1)

I-Shou University¹

1 Sep 2004

TL;DR: Two algorithms are proposed, UD_Cumulate and UD_Stratify, which can incrementally update the discovered generalized association rules with non-uniform support specification and are capable of effectively reducing the number of candidate sets and database re-scanning.

Abstract: Mining generalized association rules among items in the presence of taxonomy has been recognized as an important model in data mining. Earlier work on generalized association rules confined the minimum supports to be uniformly specified for all items or items within the same taxonomy level. This constraint would restrain an expert from discovering more interesting but much less supported association rules. In our previous work, we have addressed this problem and proposed two algorithms, MMS_Cumulate and MMS_Stratify. In this paper, we examined the problem of maintaining the discovered multi-supported, generalized association rules when new transactions are added into the original database. We proposed two algorithms, UD_Cumulate and UD_Stratify, which can incrementally update the discovered generalized association rules with non-uniform support specification and are capable of effectively reducing the number of candidate sets and database re-scanning. Empirical evaluation showed that UD_Cumulate and UD_Stratify are 2-6 times faster than running MMS_Cumulate or MMS_Stratify on the updated database afresh.

Proceedings Article•10.5555/1293797.1293804•

Topology and intelligent data analysis

V. Robins, J. Abernethy, N. Rooney, E. Bradley

1 Oct 2004

TL;DR: A broad range of mathematical techniques, ranging from statistics to fuzzy logic, have been used to great advantage in intelligent data analysis, and topology - the fundamental mathematics of shape - is studied.

Journal Article•10.3233/IDA-2004-8206•

Prediction of oil well production: A multiple-neural-network approach

H. H. Nguyen¹, Christine W. Chan¹, M. Wilson¹•Institutions (1)

University of Regina¹

1 Apr 2004

TL;DR: An application using both single and multiple interval prediction models implemented with artificial neural networks to estimate the future production performance of oil wells showed that an MNN model performed better than a single neural network model for long-term predictions.

Abstract: This study presents an application using both single and multiple interval prediction models implemented with artificial neural networks to estimate the future production performance of oil wells. The single interval prediction model was developed using NOL (Gensym Corp., USA). The multiple neural network (MNN) model is a novel approach that combines a group of neural networks, with each component neural network being responsible for predicting a different time period. The approach is designed to improve the accuracy of long-term predictions. In addition to conducting both short-term and long-term prediction of oil production, the study also investigates different approaches for modeling the application domain parameters. The MNN model for prediction of future well performance is applied to the time series data obtained from four pools of wells in the southwestern region of Saskatchewan, Canada. The results showed that an MNN model performed better than a single neural network model for long-term predictions.
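
The structure of "one component model per time period" can be sketched as follows. To keep the example short and dependency-free, each component is a one-variable least-squares line rather than a neural network, so this shows only the per-horizon arrangement and not the MNN model itself. The function names, the horizons, and the synthetic decline curve are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def fit_per_horizon(series, horizons):
    """Fit one independent model per forecast horizon h (in time steps)."""
    models = {}
    for h in horizons:
        inputs = series[:-h]   # value at time t
        targets = series[h:]   # value at time t + h
        models[h] = fit_line(inputs, targets)
    return models

def forecast(models, last_value):
    """Each component predicts its own horizon directly from the last value."""
    return {h: a * last_value + b for h, (a, b) in models.items()}

# Synthetic monthly production with a 10% decline per step (made-up data).
production = [100 * 0.9 ** t for t in range(24)]
models = fit_per_horizon(production, horizons=[1, 6, 12])
print(forecast(models, production[-1]))
```

Because each horizon has its own model, a long-range prediction does not depend on feeding earlier predictions back in, which is one plausible reason such an arrangement could help long-term accuracy.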

Journal Article•10.3233/IDA-2004-8604•

Improving the prediction performance of customer behavior through multiple imputation

Hyunju Noh¹, Min-jung Kwak, Ingoo Han¹•Institutions (1)

KAIST¹

1 Dec 2004

TL;DR: This study introduces the multiple imputation technique and reports two experiments applying several imputation methods to real cases in the electronic customer relationship management domain, the first with missing covariates and the second with missing targets.

Abstract: Various predictive modeling approaches based on customer information may be used to select suitable targets for a promoted product and turn customers into purchasers. However, there is a fundamental problem: incomplete data, which can yield biased results and degrade the accuracy of those approaches. So far, methods such as case deletion and mean substitution have been applied to handle incomplete datasets in various domains. Those approaches are simple and easy to implement but may also produce biased results. Recently, multiple imputation has been suggested as a method to overcome the flaws of traditional treatments by reflecting the uncertainty of the missing values in the incomplete dataset. This study introduces the multiple imputation technique and reports two experiments applying several imputation methods to real cases in the electronic customer relationship management domain, the first with missing covariates and the second with missing targets. According to the experimental results, the multiple-imputation-based approaches performed better than the traditional approaches in both case studies. In particular, the multiple imputation technique proved more effective on the dataset with a high missing rate than on the one with a low missing rate.
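
A minimal sketch of the multiple imputation workflow follows: complete the data several times, estimate on each completed copy, then pool the results with Rubin's rules (pooled estimate is the mean of the estimates; total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance). The imputation step here simply draws from the observed values, which is a simplification that tends to understate uncertainty and is not the procedure used in the study; the function names and data are assumptions made for illustration.

```python
import random
import statistics

def impute_once(values, rng):
    """Fill each missing value (None) with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

def pool(estimates, variances):
    """Combine per-imputation estimates and variances with Rubin's rules."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)  # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

def multiply_impute_mean(values, m=5, seed=0):
    """Estimate the mean of a variable with missing entries using m imputations."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(values, rng)
        estimates.append(statistics.mean(completed))
        # Variance of the sample mean on this completed dataset.
        variances.append(statistics.variance(completed) / len(completed))
    return pool(estimates, variances)

spend = [10.0, None, 14.0, 12.0, None, 16.0]  # made-up customer spend values
print(multiply_impute_mean(spend))
```

In contrast, mean substitution would fill both gaps with the same value and report no between-imputation variance at all, which is the understated uncertainty the abstract refers to.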

Journal Article•10.3233/IDA-2004-8605•

On the relationships between user profiles and navigation sessions in virtual communities: A data-mining approach

Simone Garatti¹, Sergio M. Savaresi¹, Sergio Bittanti¹, Luca La Brocca•Institutions (1)

Polytechnic University of Milan¹

1 Dec 2004

TL;DR: The analysis and data mining of a large data-set related to a very popular Italian Virtual Community is presented, which provides a complete and well-rounded picture of the Virtual Community.

Abstract: This paper presents the analysis and data mining of a large data-set related to a very popular Italian Virtual Community. The Community comprises more than half a million registered users, each characterized by a unique nickname and a personal "profile" filled in voluntarily during the registration procedure. Two data-sets have been considered: the Data-Base of the Users (nicknames and profiles), and the log-file of the server hosting the Community web-site. The work consists of three main parts: 1) analysis and clustering of the User Data-Base; 2) sessionization of the log-file and clustering of the navigation session database; 3) correlation of User clusters and navigation session clusters. This analysis provides a complete and well-rounded picture of the Virtual Community.
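
Sessionization, the second step listed above, is commonly done by grouping log entries per user and starting a new session whenever the gap between consecutive requests exceeds a timeout. The sketch below assumes that approach; the 30-minute timeout, the log layout `(user, timestamp, page)`, and the names used are assumptions made for illustration and are not stated in the abstract.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed, commonly used threshold

def sessionize(log, timeout=SESSION_TIMEOUT):
    """Split (user, timestamp_seconds, page) entries into per-user sessions.

    Returns a list of (user, [pages]) tuples, one per navigation session.
    """
    by_user = defaultdict(list)
    for user, timestamp, page in log:
        by_user[user].append((timestamp, page))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # chronological order within each user
        current = [hits[0][1]]
        for (previous_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - previous_ts > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

log = [
    ("anna", 0, "/home"),
    ("anna", 600, "/chat"),
    ("bob", 100, "/home"),
    ("anna", 5000, "/profile"),
]
for user, pages in sessionize(log):
    print(user, pages)
```

Each resulting session could then be turned into a feature vector (for example, counts of visited page types) before clustering.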

Journal Article•10.3233/IDA-2004-8106•

Identification of discriminative features in the EEG

Peter Meinicke¹, Thomas Hermann, Holger Bekel, Horst M. Müller, Sabine Weiss², Helge Ritter•Institutions (2)

University of Göttingen¹, University of Vienna²

1 Jan 2004

TL;DR: The results correlate well with results from coherence analysis and strongly indicate that these new methods are well suited for uncovering cognitively relevant features in EEG signals.

Abstract: An important step for the correlation of EEG signals with cognitive processes is the identification of discriminative features in the EEG signal. In this paper we utilize independent component analysis (ICA) for feature extraction and selection. Our specific ICA technique is based on a nonparametric source representation which in particular allows for modelling of multimodal feature distributions as generally required for the analysis of mixed data from different experiment conditions. To demonstrate the potential of the resulting ICA feature selection scheme we report results from an analysis of psycholinguistic experiments on the discrimination of speech perception from perception of so-called pseudo speech signals and demonstrate how the obtained ICA features can be further analyzed with the technique of sonification. Our results correlate well with results from coherence analysis and strongly indicate that these new methods are well suited for uncovering cognitively relevant features in EEG signals.

Journal Article•10.3233/IDA-2004-8301•

Incremental learning and concept drift: Editor's introduction: Guest-editorial

Miroslav Kubat¹, João Gama², Paul E. Utgoff³•Institutions (3)

University of Miami¹, University of Porto², University of Massachusetts Amherst³

1 Aug 2004

TL;DR: This special issue of Intelligent Data Analysis is dedicated to machine learning systems capable of dealing with concept drift, and focuses on those learning scenarios in which the system must induce the concept from timestamped training data.

Abstract: A complex problem in data analysis is the time-varying nature of many realistic domains. In many real-world learning problems, training data become available in batches over time, or even flow steadily, as in user-modeling tasks, dynamic control systems, web-mining, and time series analysis. In these applications, learning algorithms should be able to adjust the decision model dynamically whenever new data become available. This is the scenario that motivates dedicating this special issue of Intelligent Data Analysis to machine learning systems capable of dealing with concept drift. To narrow the domain of interest, we focus on those learning scenarios in which the system must induce the concept from timestamped training data. A brute-force algorithm relearns the concept from scratch each time a new example becomes available. This poses several problems. Learning from scratch wastes computational resources. Moreover, in non-stationary environments, the system should take into account the fact that only the most recent examples are relevant to the actual target concept. A less expensive approach would employ an incremental learning technique that adapts the previously induced concept model by incorporating the experience obtained from newly available examples. An incremental learning system can be used with some success for domains in which the underlying instance distribution evolves, especially if there is an abundance of examples that are representative of the most recent version of the target concept. Yet, in domains where the change is substantial and there is a paucity of recent examples, the system needs to be able to discount or even forget older examples, and adjust what has been induced from them. The task is more difficult than it appears.
When learning in time-varying domains, the system needs to modify the internal concept representation not only as more examples become available, but also in response to suspected changes in the definition of the target concept. It is of paramount importance that the system be able to distinguish between the situation in which new examples only help to fine-tune the existing concept model, and the situation in which the new examples are indicative of a shift in the target concept. To complicate matters even further, the system should not be misled by noise. Over the past decade, many researchers have become interested in this task, and the results of their work have appeared in diverse journals and conferences. By organizing this special issue, we wanted to concentrate several alternative approaches in the same volume in order to give the interested reader a better idea about the state-of-the-art of the relevant algorithms, applications, and evaluation methods. We believe that the five articles that appear here satisfy this goal.
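
The effect of forgetting older examples can be illustrated with a deliberately simple sketch: a learner that predicts the majority label among the examples it remembers, with and without a fixed-size window. This is not one of the methods from the special issue; the class name, the window size, and the label stream are assumptions made for illustration.

```python
from collections import Counter, deque

class WindowedMajorityLearner:
    """Predict the most frequent label among the remembered examples.

    With window_size=None every example is kept (no forgetting); with a
    finite window_size only the most recent examples are remembered.
    """

    def __init__(self, window_size=None):
        self.window = deque(maxlen=window_size)

    def learn(self, label):
        # Appending to a full deque silently discards the oldest example.
        self.window.append(label)

    def predict(self):
        return Counter(self.window).most_common(1)[0][0]

# The target concept shifts from "A" to "B" after ten examples.
stream = ["A"] * 10 + ["B"] * 4

forgetful = WindowedMajorityLearner(window_size=3)
hoarder = WindowedMajorityLearner(window_size=None)
for label in stream:
    forgetful.learn(label)
    hoarder.learn(label)

print(forgetful.predict())  # follows the recent concept
print(hoarder.predict())    # still dominated by outdated examples
```

A fixed window is the crudest form of forgetting: too small and the learner is misled by noise, too large and it adapts slowly, which is why choosing or adapting the window is itself a research question.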

Journal Article•10.3233/IDA-2004-8402•

A Tabu Clustering algorithm for Intrusion Detection

Yong Guo Liu¹, Xiao Feng Liao², Xue Ming Li², Zhong Fu Wu²•Institutions (2)

Shanghai Jiao Tong University¹, Chongqing University²

1 Sep 2004

TL;DR: A new detection algorithm is proposed, the Intrusion Detection Based on Tabu Clustering (IDBTC) algorithm, which can automatically set up clusters and detect intrusions by labeling normal and abnormal groups.

Abstract: Traditional methods of intrusion detection lack extensibility in the face of changing network configurations and adaptability in the face of unknown intrusion types. Meanwhile, current machine-learning algorithms for intrusion detection need labeled data for training, so they are computationally expensive and are sometimes misled by artificial data. To solve these problems, this paper proposes a new detection algorithm, the Intrusion Detection Based on Tabu Clustering (IDBTC) algorithm. It can automatically set up clusters and detect intrusions by labeling normal and abnormal groups. Computer simulations show that this algorithm is effective for intrusion detection.

Journal Article•10.3233/IDA-2004-8205•

Efficiently mining Maximal Frequent Sets in dense databases for discovering association rules

Krishnamoorthy Srikumar¹, Bharat Bhasker¹•Institutions (1)

Indian Institute of Management Ahmedabad¹

1 Apr 2004

TL;DR: MaxDomino, an algorithm for mining Maximal Frequent Sets (MFS) for discovering association rules in dense databases, is presented; it outperforms GenMax at higher support levels.

Abstract: We present MaxDomino, an algorithm for mining Maximal Frequent Sets (MFS) for discovering association rules in dense databases. The algorithm uses the novel concepts of dominancy factor and collapsibility of transaction to mine MFS efficiently. Unlike the traditional bottom-up approach with look-aheads, MaxDomino employs a top-down strategy with selective bottom-up search for mining MFS. Using a set of benchmark dense datasets created by the University of California, Irvine, we demonstrate that MaxDomino outperforms GenMax (which itself performs better than other known algorithms) at higher support levels. Our algorithm is especially efficient for dense databases.
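
For readers unfamiliar with the term, a maximal frequent set is a frequent itemset that has no frequent proper superset. The brute-force sketch below only illustrates that definition on a tiny example; it is not MaxDomino, it uses neither dominancy factor nor collapsibility, and it would not scale to dense databases. The function names and transactions are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Enumerate all itemsets contained in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    frequent = []
    for k in range(1, len(items) + 1):
        level = [
            frozenset(candidate)
            for candidate in combinations(items, k)
            if sum(1 for t in transactions if frozenset(candidate) <= t) >= min_count
        ]
        if not level:
            # No frequent k-itemset implies no frequent larger itemset.
            break
        frequent.extend(level)
    return frequent

def maximal(frequent):
    """Keep only the itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
all_frequent = frequent_itemsets(transactions, min_count=2)
print(len(all_frequent), [sorted(s) for s in maximal(all_frequent)])
```

Here seven itemsets are frequent but a single maximal set summarizes them all, which is why mining MFS directly is attractive when the data are dense.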

Journal Article•10.3233/IDA-2004-8603•

Visualization of evolutionary computation processes from a population perspective

Hsu-Chih Wu¹, Chuen-Tsai Sun¹, Sih-Shin Lee¹•Institutions (1)

National Chiao Tung University¹

1 Dec 2004

TL;DR: A visualization framework for genetic algorithms is investigated that visualizes evolutionary processes from a population viewpoint, rather than in chromosomal or problem spaces; it is both user-friendly and extendable to other problems and models that address population changes over time.

Abstract: The authors investigate a visualization framework for genetic algorithms that expresses their evolutionary processes. Our framework differs from most existing methods in that it visualizes evolutionary processes from a population viewpoint, rather than in chromosomal or problem spaces. A simple sexual selection model is used for demonstration purposes. We propose four visualization methods based on the framework. These tools show how evolutionary trends and population characteristics can be depicted visually. The framework is both user-friendly and extendable to other problems and models that address population changes over time.