Revisiting Feature Selection with Data Complexity
Ngan Thi Dong, Megha Khosla
- 01 Oct 2020
- pp 211-216
8
TL;DR: This paper compares feature selection methods on 27 publicly available datasets, evaluated over a range of numbers of selected features with classification as the downstream task, and shows that the performance of all studied feature selection methods is highly correlated with the error rate of a nearest-neighbor-based classifier.
Abstract: The identification of biomarkers or predictive features that are indicative of a specific biological or disease state is a major research topic in biomedical applications. Several feature selection (FS) methods, ranging from simple univariate methods to recent deep-learning methods, have been proposed to select a minimal set of the most predictive features. However, the main question of which method to use, and when, remains unanswered. We study this problem from the perspective of data complexity and ask whether data complexity measures can be used to guide the selection of the most suitable method. We perform a comparative study of 11 feature selection methods over 27 publicly available datasets, evaluated over a range of numbers of selected features using classification as the downstream task. We show empirically that, with regard to classification, the performance of all studied feature selection methods is highly correlated with the error rate of a nearest-neighbor-based classifier. We also argue that the studied complexity measures are not suitable for determining the optimal number of relevant features. Looking closely at several other aspects, we provide recommendations for choosing a particular FS method for a given dataset.
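The nearest-neighbor error rate mentioned in the abstract can be illustrated with a small sketch. It assumes a leave-one-out, Euclidean 1-nearest-neighbor classifier, which may differ from the exact complexity measure used in the paper. The function names `loo_1nn_error` and `select_columns` and the toy data are hypothetical, chosen only for illustration.

```python
from math import dist


def loo_1nn_error(X, y):
    """Leave-one-out error rate of a 1-nearest-neighbor classifier.

    Assumption: Euclidean distance; the paper's measure may be defined differently.
    """
    n = len(X)
    if n < 2:
        raise ValueError("need at least two samples")
    errors = 0
    for i in range(n):
        # Index of the closest sample other than i itself.
        j = min((k for k in range(n) if k != i), key=lambda k: dist(X[i], X[k]))
        if y[j] != y[i]:
            errors += 1
    return errors / n


def select_columns(X, cols):
    """Keep only the feature columns listed in cols (a stand-in for an FS method's output)."""
    return [[row[c] for c in cols] for row in X]


# Toy data (invented): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, -3.0], [0.2, 4.0], [1.0, -4.0], [1.1, 5.5], [1.2, -2.5]]
y = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print("all features:", loo_1nn_error(X, y))
    print("feature 0 only:", loo_1nn_error(select_columns(X, [0]), y))
```

On this toy data the error rate drops once the noisy feature is removed, which shows how such a measure responds to the choice of selected features.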
Citations
Private Graph Extraction via Feature Explanations
TL;DR: In this paper, the authors propose graph reconstruction attacks with post-hoc feature explanations and investigate how attack performance differs across three classes of explanation methods: gradient-based, perturbation-based, and surrogate-model-based methods.
Investigation of Capsule-Inspired Neural Network Approaches for DNA Methylation
Joshua J. Levy, Youdinghuan Chen, Nasim Azizgolshani, Curtis L. Petersen, Alexander J. Titus, Erika L. Moen, Louis J. Vaickus, Lucas A. Salas, Brock C. Christensen
TL;DR: Deep-learning software is presented that groups CpGs into user-specified or predefined biologically relevant groupings related to diagnostic and prognostic outcomes, presenting opportunities to increase the interpretability of disease mechanisms through biologically relevant annotations.
MethylSPWNet and MethylCapsNet: Biologically Motivated Organization of DNAm Neural Network, Inspired by Capsule Networks
Joshua J. Levy, Youdinghuan Chen, Nasim Azizgolshani, Curtis L. Petersen, Alexander J. Titus, Erika L. Moen, Louis J. Vaickus, Lucas A. Salas, Brock C. Christensen
TL;DR: MethylCapsNet and MethylSPWNet as discussed by the authors group CpGs into biologically relevant capsules, such as gene promoter context, CpG island relationship, or user-defined groupings, and relate them to diagnostic and prognostic outcomes.
The impact of training data selection on the software defect prediction performance and data complexity
TL;DR: This study compared 13 training data selection methods on 61 projects using six classification algorithms and measured the data complexity using six complexity measures, concluding that critically selecting the training data method could improve the performance of the prediction model.
2
References
Regression Shrinkage and Selection via the Lasso
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Regularization and variable selection via the elastic net
Hui Zou, Trevor Hastie
TL;DR: It is shown that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation, and an algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications
Therese Sørlie, Charles M. Perou, Robert Tibshirani, Turid Aas, Stephanie Geisler, Hilde Johnsen, Trevor Hastie, Michael B. Eisen, Matt van de Rijn, Stefanie S. Jeffrey, T. Thorsen, Hanne Quist, John C. Matese, Patrick O. Brown, David Botstein, Per Eystein Lønning, Anne Lise Børresen-Dale
TL;DR: Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.
•Proceedings Article
A Comparative Study on Feature Selection in Text Categorization
Yiming Yang, Jan O. Pedersen
- 08 Jul 1997
TL;DR: This paper finds strong correlations between the DF, IG, and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when computing these measures is too expensive.
5.6K
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.
Uri Alon, Naama Barkai, Daniel A. Notterman, Kurt C. Gish, S. Ybarra, David H. Mack, A. J. Levine
TL;DR: In this paper, a two-way clustering algorithm was applied to both the genes and the tissues, revealing broad coherent patterns that suggest a high degree of organization underlying gene expression in these tissues.