Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics

doi:10.1016/J.ESWA.2011.12.043

Journal Article10.1016/J.ESWA.2011.12.043

Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics

Victoria López, +3 more

- 01 Jun 2012

- Expert Systems With Applications

- Vol. 39, Iss: 7, pp 6585-6608

321

TL;DR: This work analyzes the performance of data level proposals against algorithm level proposals focusing in cost-sensitive models and versus a hybrid procedure that combines those two approaches to show that an unique approach among the rest cannot be highlighted.

Abstract: Class imbalance is among the most persistent complications which may confront the traditional supervised learning task in real-world applications. The problem occurs, in the binary case, when the number of instances in one class significantly outnumbers the number of instances in the other class. This situation is a handicap when trying to identify the minority class, as the learning algorithms are not usually adapted to such characteristics. The approaches to deal with the problem of imbalanced datasets fall into two major categories: data sampling and algorithmic modification. Cost-sensitive learning solutions incorporating both the data and algorithm level approaches assume higher misclassification costs with samples in the minority class and seek to minimize high cost errors. Nevertheless, there is not a full exhaustive comparison between those models which can help us to determine the most appropriate one under different scenarios. The main objective of this work is to analyze the performance of data level proposals against algorithm level proposals focusing in cost-sensitive models and versus a hybrid procedure that combines those two approaches. We will show, by means of a statistical comparative analysis, that we cannot highlight an unique approach among the rest. This will lead to a discussion about the data intrinsic characteristics of the imbalanced classification problem which will help to follow new paths that can lead to the improvement of current models mainly focusing on class overlap and dataset shift in imbalanced classification.

Chat with Paper

AI Agents for this Paper

Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps

Citations

Artificial neural networks

Andrea Roli

- 09 Mar 2012

TL;DR: Artificial neural networks (ANNs) constitute a class of flexible nonlinear models designed to mimic biological neural systems as mentioned in this paper, and they have been widely used in computer vision applications.

...read moreread less

2.9K

Journal Article•10.1016/J.ESWA.2016.12.035

Learning from class-imbalanced data

Guo Haixiang, +5 more

- 01 May 2017

- Expert Systems With Applications

TL;DR: An in depth review of rare event detection from an imbalanced learning perspective and a comprehensive taxonomy of the existing application domains of im balanced learning are provided.

...read moreread less

2K

•Journal Article•10.1613/JAIR.1.11192

SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary

Alberto Fernández, +3 more

- 01 Jan 2018

- Journal of Artificial Intelligence Resea...

TL;DR: The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered "de facto" standard in the framework of learning from imbalanced data because of its simplicity in the design, as well as its robustness when applied to different type of problems.

...read moreread less

1.6K

Journal Article•10.1016/J.INS.2013.07.007

An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics

Victoria López, +4 more

- 20 Nov 2013

- Information Sciences

TL;DR: This work carries out a thorough discussion on the main issues related to using data intrinsic characteristics in this classification problem, and introduces several approaches and recommendations to address these problems in conjunction with imbalanced data.

...read moreread less

1.6K

Journal Article•10.1016/j.eswa.2016.12.035

Learning from class-imbalanced data: Review of methods and applications

Haixiang Guo, +5 more

- 01 May 2017

- Expert Systems With Applications

TL;DR: An in depth review of rare event detection from an imbalanced learning perspective and a comprehensive taxonomy of the existing application domains of im balanced learning are provided.

...read moreread less

932

...

Expand

References

•Journal Article•10.1023/A:1022627411411

Support-Vector Networks

Corinna Cortes, +1 more

- 15 Sep 1995

- Machine Learning

TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated and the performance of the support- vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

...read moreread less

42K

Statistical learning theory

Vladimir Vapnik

- 01 Jan 1998

TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

...read moreread less

30.4K

•Journal Article•10.1613/JAIR.953

SMOTE: synthetic minority over-sampling technique

Nitesh V. Chawla, +3 more

- 01 Jan 2002

- Journal of Artificial Intelligence Resea...

TL;DR: In this article, a method of over-sampling the minority class involves creating synthetic minority class examples, which is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

...read moreread less

27.7K

•Book

C4.5: Programs for Machine Learning

J. Ross Quinlan

- 15 Oct 1992

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and over hitting.

...read moreread less

27.2K

•Journal Article•10.1109/TIT.1967.1053964

Nearest neighbor pattern classification

Thomas M. Cover, +1 more

- 01 Jan 1967

- IEEE Transactions on Information Theory

TL;DR: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points, so it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.

...read moreread less

15.2K