Top 44 Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery papers published in 2011

Journal Article•10.1002/WIDM.8•

Classification and regression trees

Wei-Yin Loh¹•Institutions (1)

01 Jan 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This article gives an introduction to the subject of classification and regression trees by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.

Abstract: Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 14-23 DOI: 10.1002/widm.8 This article is categorized under: Technologies > Classification; Technologies > Machine Learning; Technologies > Prediction; Technologies > Statistical Fundamentals
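
The recursive partitioning described in this abstract can be illustrated with a single split step. The following is a minimal sketch, not the algorithm of any package reviewed in the article: it assumes one numeric predictor, a constant (mean) prediction in each partition, and squared error as the criterion. The names `sse` and `best_split` are hypothetical.

```python
from typing import List, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared differences between each value and the group mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Return (threshold, total_sse) for the best split of the form x <= threshold.

    Assumption: each side of the split predicts the mean of its responses,
    so the quality of a split is the summed squared error of both sides.
    If no split improves on the unsplit data, the threshold is NaN.
    """
    pairs = sorted(zip(x, y))
    best = (float("nan"), sse(y))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical predictor values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


if __name__ == "__main__":
    x = [1, 2, 3, 10, 11, 12]
    y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
    print(best_split(x, y))
```

A full regression tree would apply `best_split` again to each resulting partition until a stopping rule is met; a classification tree would replace squared error with a misclassification-based criterion.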

18,764 citations

Journal Article•10.1002/WIDM.2•

Robust statistics for outlier detection

Peter J. Rousseeuw¹, Mia Hubert¹•Institutions (1)

Katholieke Universiteit Leuven¹

01 Jan 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: An overview of several robust methods and outlier detection tools for univariate, low‐dimensional, and high‐dimensional data, such as estimation of location and scatter, linear regression, principal component analysis, and classification, is presented.

Abstract: When analyzing data, outlying observations cause problems because they may strongly influence the result. Robust statistics aims at detecting the outliers by searching for the model fitted by the majority of the data. We present an overview of several robust methods and outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 73-79 DOI: 10.1002/widm.2 This article is categorized under: Algorithmic Development > Biological Data Mining; Algorithmic Development > Spatial and Temporal Data Mining; Application Areas > Health Care; Technologies > Structure Discovery and Clustering
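
For the univariate case, the idea of fitting the majority of the data and flagging what deviates from it can be sketched with the median and the median absolute deviation (MAD). This is an illustrative assumption, not necessarily the procedure the article presents: the function name `mad_outliers`, the consistency factor 1.4826 (which makes the MAD comparable to a standard deviation for normal data), and the cutoff of 3 are conventional choices made here.

```python
from statistics import median
from typing import List


def mad_outliers(data: List[float], cutoff: float = 3.0) -> List[float]:
    """Flag values whose robust z-score exceeds the cutoff.

    Location is estimated by the median and scale by 1.4826 * MAD, so a
    single extreme value cannot distort either estimate the way it
    distorts the mean and standard deviation.
    """
    center = median(data)
    mad = median([abs(v - center) for v in data])
    scale = 1.4826 * mad
    if scale == 0:
        return []  # more than half the values coincide; no usable scale
    return [v for v in data if abs(v - center) / scale > cutoff]


if __name__ == "__main__":
    readings = [9.8, 10.1, 10.0, 9.9, 10.2, 55.0]
    print(mad_outliers(readings))
```

The mean of the example readings is 17.5, pulled far from the bulk of the data by one observation, while the median stays near 10.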

735 citations

Journal Article•10.1002/WIDM.30•

Density-based clustering

Hans-Peter Kriegel¹, Peer Kröger¹, Jörg Sander², Arthur Zimek¹•Institutions (2)

Ludwig Maximilian University of Munich¹, University of Alberta²

01 May 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: Clustering is the task of identifying groups or clusters in a data set; in density-based clustering, a cluster is a set of data objects spread in the data space over a contiguous region of high density of objects.

Abstract: Clustering refers to the task of identifying groups or clusters in a data set. In density-based clustering, a cluster is a set of data objects spread in the data space over a contiguous region of high density of objects. Density-based clusters are separated from each other by contiguous regions of low density of objects. Data objects located in low-density regions are typically considered noise or outliers. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 231–240 DOI: 10.1002/widm.30 This article is categorized under: Technologies > Structure Discovery and Clustering
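
The notions in this abstract (contiguous high-density regions as clusters, low-density objects as noise) can be sketched in a DBSCAN-style procedure. The abstract does not name a specific algorithm, so the radius `eps`, the density threshold `min_pts`, and the function name `density_cluster` are assumptions made for illustration.

```python
from math import dist
from typing import List, Sequence

NOISE = -1


def density_cluster(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Assign a cluster label to each point, or NOISE for low-density points.

    A point is "dense" (a core point) if at least min_pts points, itself
    included, lie within distance eps. Clusters grow by following chains
    of core points, which yields contiguous high-density regions.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    if is_core[q]:
                        stack.append(q)  # only core points extend the region
        cluster += 1
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (5, 5)]
    print(density_cluster(pts, eps=1.5, min_pts=3))
```

The pairwise neighbor search is quadratic in the number of points, which is acceptable for a sketch; practical implementations typically use a spatial index.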

721 citations

Journal Article•10.1002/WIDM.12•

Multivariate random forests

Mark R. Segal¹, Yuanyuan Xiao¹•Institutions (1)

University of California, San Francisco¹

01 Jan 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: The genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree‐structured techniques is outlined and an illustrative example from ecology is provided that showcases the improved fit and enhanced interpretation afforded by the random forest framework.

Abstract: Random forests have emerged as a versatile and highly accurate classification and regression methodology, requiring little tuning and providing interpretable outputs. Here, we briefly outline the genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree-structured techniques. We elaborate on aspects of prediction error and attendant tuning parameter issues. However, our emphasis is on extending the random forest schema to the multiple response setting. We provide a simple illustrative example from ecology that showcases the improved fit and enhanced interpretation afforded by the random forest framework.

266 citations

Journal Article•10.1002/WIDM.25•

Identifying patterns in spatial information: a survey of methods

Shashi Shekhar¹, Michael R. Evans¹, James M. Kang¹, Pradeep Mohan¹•Institutions (1)

University of Minnesota¹

01 May 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This paper explores the emerging field of spatial data mining, focusing on different methods to extract patterns from spatial information, and concludes with a look at future research needs.

Abstract: Explosive growth in geospatial data and the emergence of new spatial technologies emphasize the need for automated discovery of spatial knowledge. Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful patterns from large spatial databases. The complexity of spatial data and implicit spatial relationships limits the usefulness of conventional data mining techniques for extracting spatial patterns. In this paper, we explore the emerging field of spatial data mining, focusing on different methods to extract patterns from spatial information. We conclude with a look at future research needs.

203 citations

Journal Article•10.1002/WIDM.43•

Evolutionary multiobjective optimization

Carlos A. Coello Coello¹•Institutions (1)

CINVESTAV¹

01 Sep 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This paper presents a very short introduction to multiobjective evolutionary algorithms, covering their basic concepts and main components, with a focus on selection mechanisms, diversity maintenance mechanisms, and elitism in a multi‐objective context.

Abstract: This paper presents a very short introduction to multiobjective evolutionary algorithms, including their basic concepts and their main components. The discussion focuses on algorithmic design and, therefore, the issues discussed include selection mechanisms, diversity maintenance mechanisms, and elitism in a multi-objective context. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 444-447 DOI: 10.1002/widm.43

116 citations

Journal Article•10.1002/WIDM.15•

Choosing the number of clusters

Boris Mirkin¹,²•Institutions (2)

Birkbeck, University of London¹, National Research University – Higher School of Economics²

01 May 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: The paper reviews published work on the issue with respect to mixture of distributions, partition, especially in k‐means clustering, and hierarchical cluster structures and some perspective directions for further developments are outlined.

Abstract: The issue of determining ‘the right number of clusters’ is attracting ever growing interest. The paper reviews published work on the issue with respect to mixture of distributions, partition, especially in k-means clustering, and hierarchical cluster structures. Some perspective directions for further developments are outlined.

103 citations

Journal Article•10.1002/WIDM.31•

Mining uncertain data

Carson K. Leung¹•Institutions (1)

University of Manitoba¹

01 Jul 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: Recent algorithmic developments in mining frequent patterns from probabilistic databases of uncertain data are reviewed.

Abstract: As an important data mining and knowledge discovery task, association rule mining searches for implicit, previously unknown, and potentially useful pieces of information—in the form of rules revealing associative relationships—that are embedded in the data. In general, the association rule mining process comprises two key steps. The first key step, which mines frequent patterns (i.e., frequently occurring sets of items) from data, is more computationally intensive than the second key step of using the mined frequent patterns to form association rules. In the early days, many developed algorithms mined frequent patterns from traditional transaction databases of precise data such as shopping market basket data, in which the contents of databases are known. However, we are living in an uncertain world, in which uncertain data can be found almost everywhere. Hence, in recent years, researchers have paid more attention to frequent pattern mining from probabilistic databases of uncertain data. In this paper, we review recent algorithmic development on mining uncertain data in these probabilistic databases for frequent patterns. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 316–329 DOI: 10.1002/widm.31
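
To make the contrast between precise and uncertain transaction data concrete, the sketch below computes the expected support of an itemset. Expected support is one common way to define frequency over probabilistic databases; the abstract does not state which definitions the review covers, so the formulation, the assumption that items occur independently, and the name `expected_support` are all assumptions made here.

```python
from math import prod
from typing import Dict, Iterable, List


def expected_support(itemset: Iterable[str], transactions: List[Dict[str, float]]) -> float:
    """Sum, over all transactions, of the probability that the whole itemset occurs.

    Each transaction maps an item to its existential probability. Under the
    independence assumption, the probability that all items of the itemset
    occur together is the product of their individual probabilities; an
    item absent from a transaction has probability 0.
    """
    items = list(itemset)
    return sum(prod(t.get(item, 0.0) for item in items) for t in transactions)


if __name__ == "__main__":
    db = [
        {"a": 0.9, "b": 0.5},
        {"a": 0.4, "c": 1.0},
        {"a": 1.0, "b": 0.8},
    ]
    print(expected_support(["a", "b"], db))
```

With precise data every probability is 0 or 1, and expected support reduces to the ordinary count of transactions containing the itemset. An itemset would then be called frequent when its expected support meets a user-chosen minimum.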

61 citations

Journal Article•10.1002/WIDM.11•

A study of methodological issues in design and development of rule-based systems: proposal of a new approach

Antoni Ligęza¹, Grzegorz J. Nalepa¹•Institutions (1)

AGH University of Science and Technology¹

01 Mar 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: A novel, consistent, three‐phase methodology incorporating conceptual, logical, and physical design is outlined, and tools supporting the complete design and development process are presented.

Abstract: Rule-based systems (RBSs) constitute a powerful technology for declarative encoding and automated processing of large bodies of knowledge. A typical RBS consists of a knowledge base containing facts and production rules, and an inference engine managing the reasoning process. Despite their simple conceptual scheme, design and development of an RBS often turn out to be an unexpectedly complex task. This paper presents an overview of issues concerning design and development of such systems. Differences between RBSs and classical software are exemplified, and design and implementation issues are analyzed. A novel, consistent, three-phase methodology incorporating conceptual, logical, and physical design is outlined. Moreover, tools supporting the complete design and development process are presented. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 117-137 DOI: 10.1002/widm.11

54 citations

Journal Article•10.1002/WIDM.18•

On the discovery of association rules by means of evolutionary algorithms

María José del Jesus¹, José A. Gámez², Pedro González¹, José M. Puerta²•Institutions (2)

University of Jaén¹, University of Castilla–La Mancha²

01 Sep 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: Association rule learning is a data mining task that tries to discover interesting relations between variables in large databases; this article reviews the use of evolutionary algorithms for that task and the main applications of association rule evolutionary learning covered by the specialized bibliography.

Abstract: Association rule learning is a data mining task that tries to discover interesting relations between variables in large databases. A review of association rule learning is presented that focuses on the use of evolutionary algorithms not only applied to Boolean variables but also to categorical and quantitative ones. The use of fuzzy rules in the evolutionary algorithms for association rule learning is also described. Finally, the main applications of association rule evolutionary learning covered by the specialized bibliography are reviewed. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 397-415 DOI: 10.1002/widm.18

51 citations

Journal Article•10.1002/WIDM.20•

Sensor selection to support practical use of health-monitoring smart environments

Diane J. Cook¹, Lawrence B. Holder¹•Institutions (1)

Washington State University¹

01 Jul 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: In this article, the authors employ data mining techniques to look at the problem of sensor selection for activity recognition in smart homes and analyze the results based on six datasets collected in five distinct smart home environments.

Abstract: The data mining and pervasive sensing technologies found in smart homes offer unprecedented opportunities for providing health monitoring and assistance to individuals experiencing difficulties in living independently at home. In order to monitor the functional health of smart home residents, we need to design technologies that recognize and track activities that people normally perform as part of their daily routines. One question that frequently arises, however, is how many smart home sensors are needed and where should they be placed in order to accurately recognize activities? We employ data mining techniques to look at the problem of sensor selection for activity recognition in smart homes. We analyze the results based on six datasets collected in five distinct smart home environments. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 339–351 DOI: 10.1002/widm.20

Journal Article•10.1002/WIDM.27•

Contrast and change mining

Mirko Boettcher¹•Institutions (1)

Otto-von-Guericke University Magdeburg¹

01 May 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This article provides an overview of recent works on methods for change analysis, thereby focusing on contrast mining and change mining, the two emerging subfields of contemporary data mining research.

Abstract: Because the world with its markets, innovations, and customers is changing faster than ever before, the key to survival for businesses is the ability to detect, assess, and respond to changing conditions timely and intelligently. Understanding changes and reacting to or acting upon them therefore become a strategic issue not only for companies but also in many other domains. The corresponding need for knowledge has been answered by data mining research by proposing a multitude of methods for analyzing different aspects of change. This article provides an overview of recent works on methods for change analysis, thereby focusing on contrast mining and change mining, the two emerging subfields of contemporary data mining research. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 215–230 DOI: 10.1002/widm.27

Journal Article•10.1002/WIDM.47•

Genetic algorithms for clustering and fuzzy clustering

Sanghamitra Bandyopadhyay

01 Nov 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: Two techniques, one with fixed number of clusters and another with a variable number of fuzzy clusters, are described along with some experimental results on numerical as well as image data sets.

Abstract: Clustering has been an area of intensive research for several decades because of its multifaceted applications in innumerable domains. Clustering can be either Boolean, where a single data point belongs to exactly one cluster, or fuzzy, where a single data point can have nonzero belongingness to more than one cluster. Traditionally, optimization of some well‐defined objective function has been the standard approach in both clustering and fuzzy clustering. Hence, researchers have investigated the utility of evolutionary computing and related techniques in this regard. The different approaches differ in their choice of the objective function and/or the optimization strategy used. In particular, clustering using genetic algorithms (GAs) has attracted the attention of researchers, and has been studied extensively. This paper presents a short review of some of the different approaches to GA‐based clustering. Two techniques, one with a fixed number of clusters and another with a variable number of fuzzy clusters, are described along with some experimental results on numerical as well as image data sets. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 524–531 DOI: 10.1002/widm.47

Journal Article•10.1002/WIDM.9•

Accelerating data mining workloads: current approaches and future challenges in system architecture design

Alok Choudhary¹, Daniel Honbo¹, Prabhat Kumar¹, Berkin Ozisikyilmaz¹, Sanchit Misra¹, Gokhan Memik¹•Institutions (1)

Northwestern University¹

01 Jan 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: Experiments have shown that heterogeneous architectures employing GPUs or FPGAs can result in significant application speedups over homogenous CPU‐based systems, while increasing performance per watt.

Abstract: Conventional systems based on general-purpose processors cannot keep pace with the exponential increase in the generation and collection of data. It is therefore important to explore alternative architectures that can provide the computational capabilities required to analyze ever-growing datasets. Programmable graphics processing units (GPUs) offer computational capabilities that surpass even high-end multi-core central processing units (CPUs), making them well suited for floating-point- or integer-intensive and data parallel operations. Field-programmable gate arrays (FPGAs), which can be reconfigured to implement an arbitrary circuit, provide the capability to specify a customized datapath for any task. The multiple granularities of parallelism offered by FPGA architectures, as well as their high internal bandwidth, make them suitable for low complexity parallel computations. GPUs and FPGAs can serve as coprocessors for data mining applications, allowing the CPU to offload computationally intensive tasks for faster processing. Experiments have shown that heterogeneous architectures employing GPUs or FPGAs can result in significant application speedups over homogenous CPU-based systems, while increasing performance per watt.

Journal Article•10.1002/WIDM.38•

The use of decision trees for cost-sensitive classification: an empirical study in software quality prediction

Naeem Seliya¹, Taghi M. Khoshgoftaar²•Institutions (2)

University of Michigan¹, Florida Atlantic University²

01 Sep 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: In this paper, the authors investigated six cost-sensitive learning techniques: AdaCost, Adc2, Csb2, MetaCost, Weighting, and Random Undersampling (RUS).

Abstract: This empirical study investigates two commonly used decision tree classification algorithms in the context of cost-sensitive learning. A review of the literature shows that the cost-based performance of a software quality prediction model is usually determined after the model-training process has been completed. In contrast, we incorporate cost-sensitive learning during the model-training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with, or without, any cost-sensitive learning technique. The paper investigates six different cost-sensitive learning techniques: AdaCost, Adc2, Csb2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data include 15 software measurement datasets obtained from several high-assurance systems. In addition to providing a unique insight into the cost-based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model-training process. The practical appeal of this research is that it provides a software quality practitioner with a clear process for how to consider, during model training, and analyze, during model evaluation, the cost-based performance of a defect prediction model. RUS is ranked as the best cost-sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 448-459 DOI: 10.1002/widm.38

Journal Article•10.1002/WIDM.21•

Information enhancement for data mining

Shichao Zhang¹•Institutions (1)

Zhejiang Normal University¹

01 Jul 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This study first reviews the current research on imputing missing values, and then experimentally evaluates the techniques and demonstrates the efficiency of missing data imputation techniques to enhance information in the process of pattern discovery from datasets with missing values.

Abstract: Information enhancement techniques are desired in many areas such as data mining, machine learning, business intelligence, and web data analysis. Information enhancement mainly includes the following topics: data cleaning, data preparation and transformation, missing values imputation, feature and instance selection, feature construction, treatment of noisy and inconsistent data, data integration, data collection and housing, information enhancement, web data availability, web data capture and representation, and the others. It is impossible to outline all the research topics in a single paper. In this study, we discuss the information enhancement for data mining with existing missing data imputation techniques. We first review the current research on imputing missing values, and then experimentally evaluate the techniques and demonstrate the efficiency of missing data imputation techniques to enhance information in the process of pattern discovery from datasets with missing values. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 284–295 DOI: 10.1002/widm.21

Journal Article•10.1002/WIDM.4•

Multivariate image mining

Julia Herold¹, Christian Loyek¹, Tim Wilhelm Nattkemper¹•Institutions (1)

Bielefeld University¹

01 Jan 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This review article summarizes the different imaging technologies and recently published approaches to MVI mining with a special focus on biomedical applications.

Abstract: Because of recent advances in sensor technology and a rapid increase in storage capacities, a growing number of intensity values can be recorded and associated with pixel coordinates using new imaging technologies. This growth in dimension can be observed in different scientific areas and this new category of images is referred to as multivariate images (MVIs). In these images, an almost arbitrary number of variables is associated with each pixel that represent, for instance, signal values at different time points or for different spectral bands or for different imaging parameters or modalities. Thus, these images can no longer be interpreted as gray value images or red, green, blue color images, and new information technologies are needed. In this review article, we summarize the different imaging technologies and recently published approaches to MVI mining with a special focus on biomedical applications. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 2-13 DOI: 10.1002/widm.4

Journal Article•10.1002/WIDM.46•

Mining flexible‐receptor molecular docking data

Karina S. Machado, Ana T. Winck, Duncan D. Ruiz, Osmar Norberto de Souza

01 Nov 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: The work on the application of classification decision trees, regression model tree, and association rules using properly preprocessed data of the FFR molecular docking results is reviewed, and it is explained how data mining techniques can support the acceleration of molecular docking simulations of FFR models.

Abstract: Knowledge discovery in databases has become an integral part of practically every aspect of bioinformatics research, which usually produces, and has to process, very large amounts of data. Rational drug design is one of the current scientific areas that has greatly benefited from bioinformatics, particularly the step that analyzes receptor–ligand interactions via molecular docking simulations. An important challenge is the inclusion of the receptor flexibility, since such simulations can become computationally very demanding. We have represented this explicit flexibility as a series of different conformations derived from a molecular dynamics simulation trajectory of the receptor. This model has been termed the fully flexible receptor (FFR) model. In our studies, the receptor is the enzyme InhA from Mycobacterium tuberculosis, which is the major drug target for the treatment of tuberculosis. The FFR model of InhA (named FFR_InhA) was docked to four ligands, namely, nicotinamide adenine dinucleotide, pentacyano(isoniazid)ferrate II, triclosan, and ethionamide, thus generating very large amounts of data, which need to be mined to produce useful knowledge to help accelerate drug discovery and development. Very little work has been done in this area. In this article, we review our work on the application of classification decision trees, regression model tree, and association rules using properly preprocessed data of the FFR molecular docking results, and show how they can provide an improved understanding of the FFR_InhA-ligand behavior. Furthermore, we explain how data mining techniques can support the acceleration of molecular docking simulations of FFR models. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 532–541 DOI: 10.1002/widm.46

Journal Article•10.1002/WIDM.26•

Similarity-based data mining in files of two-dimensional chemical structures using fingerprint measures of molecular resemblance

Peter Willett¹•Institutions (1)

University of Sheffield¹

01 May 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: The similarity measures considered here are based on the use of a fingerprint representation of molecular structure, where a fingerprint is a vector encoding the presence of fragment substructures in a molecule and where the similarity between pairs of such fingerprints is computed using an association coefficient such as the Tanimoto coefficient.

Abstract: This paper reviews the use of measures of intermolecular similarity for processing databases of chemical structures, which play an important role in the discovery of new drugs by the pharmaceutical industry. The similarity measures considered here are based on the use of a fingerprint representation of molecular structure, where a fingerprint is a vector encoding the presence of fragment substructures in a molecule and where the similarity between pairs of such fingerprints is computed using an association coefficient such as the Tanimoto coefficient. The Similar Property Principle provides the basic rationale for the use of similarity methods in three important chemoinformatics applications—similarity searching, database clustering, and molecular diversity analysis. Similarity searching enables the identification of those molecules in a database that are most similar to a user-defined, biologically active query molecule, with data fusion providing an effective way of combining the results of multiple similarity searches. Cluster analysis, typically using the Jarvis–Patrick, Ward, or divisive k-means clustering methods, enables the cost-effective selection of molecules for biological testing, for property prediction and for investigating database overlap. Molecular diversity analysis, typically using cluster-based, dissimilarity-based, or optimization-based approaches, enables the identification of structurally diverse sets of molecules, so as to ensure that the full chemical space spanned by a database is tested in the search for novel bioactive molecules. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 241–251 DOI: 10.1002/widm.26
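
The fingerprint comparison and similarity searching described in this abstract can be sketched as follows. A fingerprint is represented here as the set of indices of the fragment bits that are switched on, and the Tanimoto coefficient is the size of the intersection divided by the size of the union. The function names, the toy molecule names, and the convention of returning 0.0 for two empty fingerprints are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two binary fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention: two empty fingerprints share nothing
    return len(a & b) / union


def similarity_search(
    query: Set[int], database: Dict[str, Set[int]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Rank database molecules by decreasing similarity to the query fingerprint."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    db = {
        "mol_a": {1, 2, 3, 4},
        "mol_b": {3, 4, 5, 6},
        "mol_c": {7, 8},
    }
    for name, score in similarity_search({1, 2, 3, 4}, db, top_k=2):
        print(name, round(score, 3))
```

The same pairwise coefficient can feed the clustering and diversity analyses mentioned in the abstract, for example by using one minus the coefficient as a dissimilarity.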

Journal Article•10.1002/WIDM.44•

Evolutionary computation for training set selection

Nicolás García-Pedrajas¹•Institutions (1)

University of Córdoba (Spain)¹

01 Nov 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: The main algorithms that have been developed for decision trees, artificial neural networks, and other classifiers are presented and the relevant issue of the scalability of these methods to very large datasets is discussed.

Abstract: Instance selection is becoming increasingly relevant because of the large amount of data that is constantly being produced in many fields of research. Two basic approaches exist for instance selection: instance selection as a method for prototype selection for instance-based methods (such as k-nearest neighbors) and instance selection for obtaining the training set for classifiers that require a learning process (such as decision trees or neural networks). In this paper, we review the methods that have been developed thus far for the latter approach within the field of evolutionary computation. Different groups of learning algorithms require different instance selectors to suit their learning/search biases. This requirement may render many instance selection algorithms useless if their philosophy of design is not suitable for the problem at hand. Evolutionary algorithms do not assume any structure of the data or any behavior of the classifier but instead adapt the instance selection to the performance of the classifier. They are therefore very suitable for training set selection. The main algorithms that have been developed for decision trees, artificial neural networks, and other classifiers are presented. We also discuss the relevant issue of the scalability of these methods to very large datasets. Although current algorithms are useful for fairly large datasets, scaling problems are found when the number of instances is in the hundreds of thousands or millions. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 512–523 DOI: 10.1002/widm.44

Journal Article•10.1002/WIDM.36•

Similarity measures for sequential data

Konrad Rieck¹•Institutions (1)

Technical University of Berlin¹

01 Jul 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This paper reviews three major classes of similarity measures: edit distances, bag‐of‐word models, and string kernels. It presents these classes and their underlying comparisons in detail, highlights their advantages and differences, and provides basic algorithms supporting practical applications.

Abstract: Expressive comparison of strings is a prerequisite for analysis of sequential data in many areas of computer science. However, comparing strings and assessing their similarity is not a trivial task, and there exist several contrasting approaches for defining similarity measures over sequential data. In this paper, we review three major classes of such similarity measures: edit distances, bag-of-word models, and string kernels. Each of these classes originates from a particular application domain and models similarity of strings differently. We present these classes and their underlying comparisons in detail, highlight their advantages and differences, and provide basic algorithms supporting practical applications. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 296–304 DOI: 10.1002/widm.36
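
As a rough illustration of the measure classes named in this abstract, the sketch below computes a Levenshtein edit distance and an inner product over n-gram counts (a simple bag-of-n-grams comparison, often called a spectrum kernel). It is not taken from the paper; the function names and the default n-gram length are assumptions chosen for the example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ngram_counts(s: str, n: int = 2) -> Counter:
    """Bag of overlapping n-grams (hypothetical helper; n=2 is an arbitrary default)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def spectrum_kernel(a: str, b: str, n: int = 2) -> int:
    """Inner product of the two n-gram count vectors."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
```

The edit distance compares strings position by position, while the n-gram kernel ignores where substrings occur and only counts shared content, which is one way the classes model similarity differently.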

Journal Article•10.1002/WIDM.7•

Tech mining: Text mining and visualization tools, as applied to nanoenhanced solar cells

[...]

Alan L. Porter¹, Ying Guo², Denise Chiavatta³•Institutions (3)

Search Technologies¹, Beijing Institute of Technology², Georgia Institute of Technology³

01 Mar 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: Select analyses and visualizations of relevant ST&I data on the topics of nanoenhanced, thin‐film solar cells and dye‐sensitized solar cells help identify complementary and competitive research activity, evaluate research productivity, assess research interdisciplinarity, understand nanotechnology developmental trajectories, and identify and forecast promising nanoapplications.

Abstract: ‘Tech mining’ is a multistep process for the analysis of science, technology, & innovation (‘ST&I’) information resources. It uses text mining, visualization, and communication tools to provide the empirical knowledge necessary to address management of technology questions. Tech mining can help assess mature or emerging fields of science and technology, such as nanotechnology. Here, we depict select analyses and visualizations of relevant ST&I data on the topics of nanoenhanced, thin-film solar cells and dye-sensitized solar cells. These analyses help identify complementary and competitive research activity, evaluate research productivity, assess research interdisciplinarity, understand nanotechnology developmental trajectories, and identify and forecast promising nanoapplications. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 172-181 DOI: 10.1002/widm.7

Journal Article•10.1002/WIDM.37•

Understanding online groups through social media

[...]

Geoffrey Barbier¹, Lei Tang¹, Huan Liu¹•Institutions (1)

Arizona State University¹

01 Jul 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This paper highlights some ongoing research efforts aiming at understanding groups through social media using data mining techniques to provide insights into group behavior and divulge a group's characteristics.

Abstract: Multiple fields including sociology, anthropology, and business are interested in understanding group behavior. Applying data mining techniques to social media can help provide insights into group behavior and divulge a group's characteristics by identifying a group, developing a profile for a group, revealing the sentiment of a group, and detailing a group's composition. The ability to accomplish these tasks has practical business and scientific applications such as understanding customers better and providing new insights into influence propagation, as well as the ability to accurately categorize groups over time. This paper highlights some ongoing research efforts aiming at understanding groups through social media. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 330–338 DOI: 10.1002/widm.37

Journal Article•10.1002/WIDM.3•

Soft clustering for information retrieval applications

[...]

G. Bordogna¹, Gabriella Pasi¹•Institutions (1)

University of Milan¹

01 Mar 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This paper overviews soft clustering algorithms applied in the context of information retrieval (IR) and introduces the expectation maximization and fuzzy c‐means algorithms, and some of their extensions defined to overcome their main drawbacks when applied for organizing large document collections.

Abstract: This paper overviews soft clustering algorithms applied in the context of information retrieval (IR). First, a motivation of the utility of soft clustering approaches in IR is discussed. Then, an outline of the two main flat soft approaches, namely probabilistic clustering and fuzzy clustering, is described. Specifically, the expectation maximization and fuzzy c-means algorithms are introduced, and some of their extensions defined to overcome their main drawbacks when applied for organizing large document collections. Finally, soft hierarchical clustering algorithms designed for generating taxonomies of documents are introduced. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 138-146 DOI: 10.1002/widm.3
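
To make the idea of soft (fuzzy) membership concrete, here is a minimal sketch of the standard fuzzy c-means membership update for a single point, given fixed cluster centers. It is an illustration under assumptions, not the paper's code: the function name is hypothetical, distances are Euclidean, and the fuzzifier `m` defaults to 2.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Degrees of membership of `point` in each cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with one or more centers belongs only to those centers.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1))
    return [1.0 / sum((di / dj) ** exponent for dj in dists) for di in dists]
```

Unlike a hard assignment, a document lying between two topics receives a nonzero membership in both clusters, which is the property that makes soft clustering attractive for document collections.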

Journal Article•10.1002/WIDM.35•

Mining of protein contact maps for protein fold prediction

[...]

Durga Bhavani S¹, Suvarnavani K², Somdatta Sinha³•Institutions (3)

University UCINF¹, Velagapudi Ramakrishna Siddhartha Engineering College², Centre for Cellular and Molecular Biology³

01 Jul 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: A simple and computationally inexpensive algorithm based on a triangle subdivision method is proposed to extract additional features from the contact map, and the results show great promise in developing a new and simple tool for the challenging problem of fold prediction.

Abstract: The three-dimensional structure of proteins is useful for carrying out the biophysical and biochemical functions in a cell. Approaches to protein structure/fold prediction typically extract amino acid sequence features, and machine learning approaches are then applied to the classification problem. Protein contact maps are two-dimensional representations of the contacts among the amino acid residues in the folded protein structure. This paper highlights the need for a systematic study of these contact networks. Mining of contact maps to derive features pertaining to fold information offers a new mechanism for fold discovery from the protein sequence via the contact maps. These ideas are explored in the structural class of all-alpha proteins to identify structural elements. A simple and computationally inexpensive algorithm based on a triangle subdivision method is proposed to extract additional features from the contact map. The method successfully characterizes the off-diagonal interactions in the contact map for predicting specific ‘folds’. The decision tree classification results show great promise in developing a new and simple tool for the challenging problem of fold prediction. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 362–368 DOI: 10.1002/widm.35
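
For readers unfamiliar with contact maps, the sketch below builds a binary contact map from residue coordinates and counts off-diagonal contacts. It does not reproduce the paper's triangle subdivision method; the 8.0 distance cutoff, the `min_separation` parameter, and the function names are assumptions made for illustration.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary matrix: entry (i, j) is 1 when residues i and j lie within `threshold`."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) <= threshold else 0
             for j in range(n)]
            for i in range(n)]


def off_diagonal_contacts(cmap, min_separation=4):
    """Count contacts between residues at least `min_separation` apart in sequence."""
    n = len(cmap)
    return sum(cmap[i][j]
               for i in range(n)
               for j in range(i + min_separation, n))
```

Contacts near the diagonal mostly reflect neighbors along the chain, so features aimed at fold information usually look at entries far from the diagonal, where residues distant in sequence come together in space.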

Journal Article•10.1002/WIDM.39•

Distance-preserving dimensionality reduction

[...]

Li Yang¹•Institutions (1)

Western Michigan University¹

01 Sep 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: An overview of basic concepts and principles that deal with the problem of mapping high‐dimensional data to low‐dimensional space such that distances between all or some pairs of data points are preserved.

Abstract: This paper presents an overview of basic concepts and principles that deal with the problem of mapping high-dimensional data to low-dimensional space such that distances between all or some pairs of data points are preserved. It introduces related techniques and systematizes today's methods into linear methods, methods using iterative optimization, methods preserving exact distances, methods using geodesic distances, and methods using alignments of local models. It discusses these methods by focusing on their basic ideas, by summarizing their common features and differences, and by comparing their strengths and weaknesses. This paper assumes no familiarity with dimensionality reduction. The main text should be readable by people with little technical background. Technical information of important algorithms is briefly presented in sidebars, the reading of which assumes basics in statistics and matrix computation. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 369-380 DOI: 10.1002/widm.39
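
One common way to quantify how well a low-dimensional mapping preserves distances is a raw stress value, the sum of squared differences between corresponding pairwise distances. The sketch below is a generic illustration rather than a method from the paper; the function name is an assumption, and Euclidean distance is used in both spaces.

```python
import math
from itertools import combinations


def raw_stress(high, low):
    """Sum of squared differences between pairwise distances in the two spaces."""
    return sum((math.dist(high[i], high[j]) - math.dist(low[i], low[j])) ** 2
               for i, j in combinations(range(len(high)), 2))
```

A value of zero means every pairwise distance is preserved exactly; methods based on iterative optimization typically try to drive a quantity of this kind toward zero.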

Journal Article•10.1002/WIDM.34•

Fuzzy machine learning and data mining

[...]

Eyke Hüllermeier¹•Institutions (1)

University of Marburg¹

01 Jul 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: The aim of this paper is to assess the relevance of fuzzy set theory and fuzzy logic for these fields, highlighting potential contributions without concealing alleged limitations and shortcomings of current research.

Abstract: The development of methods for machine learning and data mining has attracted increasing attention in the fuzzy set community in recent years. The aim of this paper is to assess the relevance of fuzzy set theory and fuzzy logic for these fields, highlighting potential contributions without concealing alleged limitations and shortcomings of current research. To this end, some typical applications of fuzzy logic will be reviewed, followed by a more systematic discussion of possible benefits of fuzzy methods. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 269–283 DOI: 10.1002/widm.34

Journal Article•10.1002/WIDM.17•

Introducing WIREs Data Mining and Knowledge Discovery

[...]

Witold Pedrycz¹•Institutions (1)

University of Alberta¹

01 Jan 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This article is categorized under: Algorithmic Development > Biological Data Mining; Algorithmic Development > Spatial and Temporal Data Mining; Application Areas > Health Care; Technologies > Structure Discovery and Clustering.

Abstract: This article is categorized under: Algorithmic Development > Biological Data Mining; Algorithmic Development > Spatial and Temporal Data Mining; Application Areas > Health Care; Technologies > Structure Discovery and Clustering

Journal Article•10.1002/WIDM.29•

Gene expression modular analysis: an overview from the data mining perspective

[...]

Alberto Pascual-Montano¹•Institutions (1)

Spanish National Research Council¹

01 Sep 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This article makes the reader aware of the main problems that still persist and provides a description of the methodologies that are applied for classification, clustering, and functional exploration of gene expression data.

Abstract: In this review, we discuss the main problems and state-of-the-art solutions applied to the field of gene expression. Specific data analysis workflows have been developed in parallel with the technology and currently cover a very wide spectrum of methods and applications needed to give answers to a lot of scientific questions that this type of data are producing. Computer science and, more specifically, the data mining area is still benefiting from a large set of real-case scenarios to apply and develop new ideas and tools for discovering biological knowledge and new information from this experimental data. In this article, we make the reader aware of the main problems that still persist and provide a description of the methodologies that are applied for classification, clustering, and functional exploration of gene expression data. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 381-396 DOI: 10.1002/widm.29

Journal Article•10.1002/WIDM.22•

Ensembles of case-based reasoning classifiers in high-dimensional biological domains

[...]

Niloofar Arshadi¹, Igor Jurisica¹•Institutions (1)

Ontario Institute for Cancer Research¹

01 Mar 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: An ensemble for case‐based reasoning (E4CBR) approach where an ensemble of CBR classifiers is combined with clustering and feature selection, which demonstrates that the aggregation method outperforms the existing CBR aggregation methods.

Abstract: In order to extend the capabilities of case-based reasoning (CBR), we implemented an ensemble for case-based reasoning (E4CBR) approach where an ensemble of CBR classifiers is combined with clustering and feature selection. We first select a subset of features of all the cases, and then cluster the cases into disjoint groups, where each group of cases forms the case-base of one of the member classifiers. Finally, in each case-base, a subset of features is ‘locally’ selected individually. To predict the label of an unseen case, each classifier in the ensemble provides a prediction, and the aggregation component of E4CBR combines the predictions by weighting each classifier using a CBR approach: a classifier with more cases similar to the test case receives a higher weight. We evaluated E4CBR on four publicly available biological data sets, and also compared the classification error of E4CBR with a single CBR classifier. In our experiments, we use TA3, a computational framework for CBR systems. Our results show that E4CBR reduces the classification error of our CBR classifier. On the basis of empirical results, our aggregation method outperforms the existing CBR aggregation methods. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 164-171 DOI: 10.1002/widm.22
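
The aggregation idea described in this abstract, where a member classifier with more cases similar to the test case gets a higher weight, can be sketched roughly as follows. This is not E4CBR or TA3 code: the abstract does not say how similarity is measured, so the Euclidean `radius` test, the data layout, and the function name are all assumptions for illustration.

```python
import math
from collections import defaultdict


def weighted_vote(test_case, members, radius=1.0):
    """Combine member predictions, weighting each by its number of similar cases.

    `members` is a list of (case_base, predicted_label) pairs, where each
    case_base is a list of feature tuples (hypothetical layout).
    """
    scores = defaultdict(float)
    for case_base, label in members:
        # Assumed similarity test: a case is "similar" if it lies within `radius`.
        weight = sum(1 for case in case_base
                     if math.dist(case, test_case) <= radius)
        scores[label] += weight
    return max(scores, key=scores.get)
```

For example, if one member holds three cases near the test case and predicts "A" while another holds a single nearby case and predicts "B", the vote returns "A".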
