TL;DR: This article gives an introduction to the subject of classification and regression trees by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.
TL;DR: An overview is presented of several robust methods and outlier detection tools for univariate, low‐dimensional, and high‐dimensional data, covering estimation of location and scatter, linear regression, principal component analysis, and classification.
TL;DR: This article defines clustering as the task of identifying groups or clusters in a data set; in density-based clustering, a cluster is a set of data objects spread in the data space over a contiguous region of high density of objects.
TL;DR: The genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree‐structured techniques is outlined, and an illustrative example from ecology is provided that showcases the improved fit and enhanced interpretation afforded by the random forest framework.
Abstract: Random forests have emerged as a versatile and highly accurate classification and regression methodology, requiring little tuning and providing interpretable outputs. Here, we briefly outline the genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree-structured techniques. We elaborate on aspects of prediction error and attendant tuning parameter issues. However, our emphasis is on extending the random forest schema to the multiple response setting. We provide a simple illustrative example from ecology that showcases the improved fit and enhanced interpretation afforded by the random forest framework.
TL;DR: This paper explores the emerging field of spatial data mining, focusing on different methods to extract patterns from spatial information, and concludes with a look at future research needs.
Abstract: Explosive growth in geospatial data and the emergence of new spatial technologies emphasize the need for automated discovery of spatial knowledge. Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful patterns from large spatial databases. The complexity of spatial data and implicit spatial relationships limits the usefulness of conventional data mining techniques for extracting spatial patterns. In this paper, we explore the emerging field of spatial data mining, focusing on different methods to extract patterns from spatial information. We conclude with a look at future research needs.
TL;DR: This paper presents a very short introduction to multiobjective evolutionary algorithms, covering their basic concepts and main components, such as selection mechanisms, diversity maintenance mechanisms, and elitism in a multi‐objective context.
TL;DR: The paper reviews published work on the issue with respect to mixture of distributions, partition, especially in k‐means clustering, and hierarchical cluster structures; some perspective directions for further developments are outlined.
Abstract: The issue of determining ‘the right number of clusters’ is attracting ever growing interest. The paper reviews published work on the issue with respect to mixture of distributions, partition, especially in k-means clustering, and hierarchical cluster structures. Some perspective directions for further developments are outlined.
TL;DR: Recent algorithmic developments on mining frequent patterns from probabilistic databases of uncertain data are reviewed.
TL;DR: A novel, consistent, three‐phase methodology incorporating conceptual, logical, and physical design is outlined, and tools supporting the complete design and development process are presented.
TL;DR: Association rule learning is a data mining task that tries to discover interesting relations between variables in large databases; the main applications of association rule evolutionary learning covered by the specialized bibliography are reviewed.
TL;DR: In this article, the authors employ data mining techniques to look at the problem of sensor selection for activity recognition in smart homes and analyze the results based on six datasets collected in five distinct smart home environments.
TL;DR: This article provides an overview of recent works on methods for change analysis, thereby focusing on contrast mining and change mining, the two emerging subfields of contemporary data mining research.
TL;DR: Two techniques, one with a fixed number of clusters and another with a variable number of fuzzy clusters, are described along with some experimental results on numerical as well as image data sets.
TL;DR: Experiments have shown that heterogeneous architectures employing GPUs or FPGAs can result in significant application speedups over homogenous CPU‐based systems, while increasing performance per watt.
Abstract: Conventional systems based on general-purpose processors cannot keep pace with the exponential increase in the generation and collection of data. It is therefore important to explore alternative architectures that can provide the computational capabilities required to analyze ever-growing datasets. Programmable graphics processing units (GPUs) offer computational capabilities that surpass even high-end multi-core central processing units (CPUs), making them well suited for floating-point- or integer-intensive and data parallel operations. Field-programmable gate arrays (FPGAs), which can be reconfigured to implement an arbitrary circuit, provide the capability to specify a customized datapath for any task. The multiple granularities of parallelism offered by FPGA architectures, as well as their high internal bandwidth, make them suitable for low complexity parallel computations. GPUs and FPGAs can serve as coprocessors for data mining applications, allowing the CPU to offload computationally intensive tasks for faster processing. Experiments have shown that heterogeneous architectures employing GPUs or FPGAs can result in significant application speedups over homogenous CPU-based systems, while increasing performance per watt.
TL;DR: In this paper, the authors investigated six cost-sensitive learning techniques: AdaCost, Adc2, Csb2, MetaCost, Weighting, and Random Undersampling (RUS).
TL;DR: This study first reviews the current research on imputing missing values, then experimentally evaluates the techniques and demonstrates the efficiency of missing data imputation techniques in enhancing information during pattern discovery from datasets with missing values.
TL;DR: This review article summarizes the different imaging technologies and recently published approaches to MVI mining with a special focus on biomedical applications.
TL;DR: Work on the application of classification decision trees, regression model trees, and association rules to properly preprocessed data from FFR molecular docking results is reviewed, and it is explained how data mining techniques can support the acceleration of molecular docking simulations of FFR models.
TL;DR: The similarity measures considered here are based on the use of a fingerprint representation of molecular structure, where a fingerprint is a vector encoding the presence of fragment substructures in a molecule and where the similarity between pairs of such fingerprints is computed using an association coefficient such as the Tanimoto coefficient.
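As a rough illustration of the fingerprint comparison described in this entry, the sketch below computes the Tanimoto coefficient, c / (a + b - c), where a and b are the numbers of fragments present in each molecule and c is the number present in both. Representing a fingerprint as a set of "on" bit positions, the function name `tanimoto`, and the value returned for two empty fingerprints are assumptions made for this example, not details taken from the reviewed paper.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints.

    Each fingerprint is given as the set of bit positions that are "on",
    i.e. the fragment substructures present in the molecule (an assumed
    representation for this sketch).
    """
    if not fp_a and not fp_b:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    common = len(fp_a & fp_b)  # fragments present in both molecules
    return common / (len(fp_a) + len(fp_b) - common)


# Two hypothetical fingerprints that share two of their three fragments.
mol_1 = {1, 2, 3}
mol_2 = {2, 3, 4}
similarity = tanimoto(mol_1, mol_2)
```

The coefficient ranges from 0 (no shared fragments) to 1 (identical fingerprints), which is what makes it usable for ranking molecules by similarity to a query.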
TL;DR: The main algorithms that have been developed for decision trees, artificial neural networks, and other classifiers are presented and the relevant issue of the scalability of these methods to very large datasets is discussed.
TL;DR: This paper reviews three major classes of similarity measures: edit distances, bag‐of‐word models, and string kernels. It presents these classes and the underlying comparisons in detail, highlights advantages and differences, and provides basic algorithms supporting practical applications.
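To make the first of these classes concrete, here is a minimal sketch of one well-known edit distance, the Levenshtein distance, computed by dynamic programming. The choice of Levenshtein with unit costs and the function name `levenshtein` are assumptions for illustration; the reviewed paper may present other variants.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn s into t (unit costs assumed)."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            curr.append(min(
                prev[j] + 1,         # delete char_s
                curr[j - 1] + 1,     # insert char_t
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


distance = levenshtein("kitten", "sitting")
```

Keeping only two rows of the table brings memory use down to the length of the second string, while the running time stays proportional to the product of the two lengths.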
TL;DR: Select analyses and visualizations of relevant ST&I data on the topics of nanoenhanced, thin‐film solar cells and dye‐sensitized solar cells help identify complementary and competitive research activity, evaluate research productivity, assess research interdisciplinarity, understand nanotechnology developmental trajectories, and identify and forecast promising nanoapplications.
TL;DR: This paper highlights some ongoing research efforts aiming at understanding groups through social media using data mining techniques to provide insights into group behavior and divulge a group's characteristics.
TL;DR: This paper overviews soft clustering algorithms applied in the context of information retrieval (IR) and introduces the expectation maximization and fuzzy c‐means algorithms, and some of their extensions defined to overcome their main drawbacks when applied for organizing large document collections.
TL;DR: A simple and computationally inexpensive algorithm based on a triangle subdivision method is proposed to extract additional features from the contact map, and results show great promise in developing a new and simple tool for the challenging problem of fold prediction.
TL;DR: An overview is given of basic concepts and principles that deal with the problem of mapping high‐dimensional data to low‐dimensional space such that distances between all or some pairs of data points are preserved.
TL;DR: The aim of this paper is to assess the relevance of fuzzy set theory and fuzzy logic for these fields, highlighting potential contributions without concealing alleged limitations and shortcomings of current research.
TL;DR: This article is categorized under: Algorithmic Development > Biological Data Mining; Algorithmic Development > Spatial and Temporal Data Mining; Application Areas > Health Care; Technologies > Structure Discovery and Clustering.
Abstract: This article is categorized under: Algorithmic Development > Biological Data Mining; Algorithmic Development > Spatial and Temporal Data Mining; Application Areas > Health Care; Technologies > Structure Discovery and Clustering.
TL;DR: This article makes the reader aware of the main problems that still persist and provides a description of the methodologies that are applied for classification, clustering, and functional exploration of gene expression data.
TL;DR: An ensemble for case‐based reasoning (E4CBR) approach is presented in which an ensemble of CBR classifiers is combined with clustering and feature selection; its aggregation method is shown to outperform the existing CBR aggregation methods.