Intelligent Data Analysis


Papers published on a yearly basis

Papers

Journal Article•10.1016/S1088-467X(97)00008-5•

Feature Selection for Classification


Manoranjan Dash¹, Huan Liu¹•Institutions (1)

National University of Singapore¹

1 May 1997

TL;DR: This survey identifies the future research areas in feature selection, introduces newcomers to this field, and paves the way for practitioners who search for suitable methods for solving domain-specific real-world applications.


Abstract: Feature selection has been the focus of interest for quite some time and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a comprehensive overview of many existing methods from the 1970's to the present. It identifies four steps of a typical feature selection method, and categorizes the different existing methods in terms of generation procedures and evaluation functions, and reveals hitherto unattempted combinations of generation procedures and evaluation functions. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for comparative study. The strengths and weaknesses of different methods are explained. Guidelines for applying feature selection methods are given based on data types and domain characteristics. This survey identifies the future research areas in feature selection, introduces newcomers to this field, and paves the way for practitioners who search for suitable methods for solving domain-specific real-world applications.


3,443 citations

Journal Article•10.3233/IDA-2002-6504•

The class imbalance problem: A systematic study


Nathalie Japkowicz¹, Shaju Stephen¹•Institutions (1)

University of Ottawa¹

1 Oct 2002

TL;DR: This paper investigates whether the class imbalance problem affects not only decision tree systems but also other classification systems such as Neural Networks and Support Vector Machines.


Abstract: In machine learning problems, differences in prior class probabilities -- or class imbalances -- have been reported to hinder the performance of some standard classifiers, such as decision trees. This paper presents a systematic study aimed at answering three different questions. First, we attempt to understand the nature of the class imbalance problem by establishing a relationship between concept complexity, size of the training set and class imbalance level. Second, we discuss several basic re-sampling or cost-modifying methods previously proposed to deal with the class imbalance problem and compare their effectiveness. The results obtained by such methods on artificial domains are linked to results in real-world domains. Finally, we investigate the assumption that the class imbalance problem does not only affect decision tree systems but also affects other classification systems such as Neural Networks and Support Vector Machines.


3,439 citations
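
The abstract above mentions basic re-sampling methods for class imbalance without naming them. As an illustration only, the following minimal sketch shows random oversampling, one common re-sampling technique; it is an assumption that this matches any method the paper evaluates, and the function name `random_oversample` and its `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many examples as the largest class.

    Illustrative sketch only; not taken from the paper.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # All original examples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


if __name__ == "__main__":
    X = [1, 2, 3, 4, 5]
    y = [0, 0, 0, 0, 1]
    X_bal, y_bal = random_oversample(X, y)
    print(Counter(y_bal))
```

Oversampling changes only the training distribution; a cost-modifying method would instead leave the data untouched and weight misclassification errors per class.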

Journal Article•10.3233/IDA-2007-11508•

Toward accurate dynamic time warping in linear time and space


Stan Salvador¹, Philip K. Chan²•Institutions (2)

General Dynamics¹, Florida Institute of Technology²

1 Oct 2007

TL;DR: This paper introduces FastDTW, an approximation of DTW that has a linear time and space complexity and shows a large improvement in accuracy over existing methods.


Abstract: Dynamic Time Warping (DTW) has a quadratic time and space complexity that limits its use to small time series. In this paper we introduce FastDTW, an approximation of DTW that has a linear time and space complexity. FastDTW uses a multilevel approach that recursively projects a solution from a coarser resolution and refines the projected solution. We prove the linear time and space complexity of FastDTW both theoretically and empirically. We also analyze the accuracy of FastDTW by comparing it to two other types of existing approximate DTW algorithms: constraints (such as Sakoe-Chiba Bands) and abstraction. Our results show a large improvement in accuracy over existing methods.


1,733 citations
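
The quadratic time and space cost that the abstract above attributes to DTW comes from filling a full cost matrix over every pair of positions in the two series. The sketch below shows that standard dynamic-programming DTW, not FastDTW itself; the function name `dtw_distance` and the use of absolute difference as the local cost are assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Standard dynamic-programming DTW between two numeric sequences.

    Uses O(len(a) * len(b)) time and space. This is the baseline that
    FastDTW approximates, not the FastDTW algorithm.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Constraint-based approximations such as Sakoe-Chiba Bands, which the abstract names as a comparison point, fill only the cells of this matrix that lie near the diagonal.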

Book•10.1201/EBK1439826119•

Knowledge discovery from data streams


João Gama¹, Auroop R. Ganguly², Olufemi A. Omitaomu², Raju Vatsavai², Mohamed Medhat Gaber³, +1 more•Institutions (3)

University of Porto¹, Oak Ridge National Laboratory², Monash University³

1 Aug 2009

TL;DR: Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams, covers the fundamentals needed to understand data streams, and describes important applications, such as TCP/IP traffic, GPS data, sensor networks and customer click streams.


Abstract: Since the beginning of the Internet age and the increased use of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving and time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams. The book covers the fundamentals that are imperative to understanding data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.


828 citations

Book Chapter•10.1007/3-540-44816-0_31•

Active Hidden Markov Models for Information Extraction


Tobias Scheffer¹, Christian Decomain, Stefan Wrobel¹•Institutions (1)

Otto-von-Guericke University Magdeburg¹

13 Sep 2001

TL;DR: This paper considers the more challenging task of learning hidden Markov models (HMMs) when only partially (sparsely) labeled documents are available for training, and describes an EM style algorithm for learning HMMs from partially labeled data.


Abstract: Information extraction from HTML documents requires a classifier capable of assigning semantic labels to the words or word sequences to be extracted. If completely labeled documents are available for training, well-known Markov model techniques can be used to learn such classifiers. In this paper, we consider the more challenging task of learning hidden Markov models (HMMs) when only partially (sparsely) labeled documents are available for training. We first give a detailed account of the task and its appropriate loss function, and show how it can be minimized given an HMM. We describe an EM-style algorithm for learning HMMs from partially labeled data. We then present an active learning algorithm that selects "difficult" unlabeled tokens and asks the user to label them. We study empirically by how much active learning reduces the required data labeling effort, or increases the quality of the learned model achievable with a given amount of user effort.


674 citations


Year	Papers
2021	106
2020	146
2019	82
2018	192
2017	161
2016	151

