Gene selection and classification of microarray data using random forest
TL;DR: It is shown that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
read more
Abstract: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
read more
Chat with Paper
AI Agents for this Paper
Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps
Citations
A review of feature selection techniques in bioinformatics
TL;DR: A basic taxonomy of feature selection techniques is provided, providing their use, variety and potential in a number of both common as well as upcoming bioinformatics applications.
5.4K
Random forest in remote sensing: A review of applications and future directions
Mariana Belgiu,Lucian Drăguţ +1 more
TL;DR: This review has revealed that RF classifier can successfully handle high data dimensionality and multicolinearity, being both fast and insensitive to overfitting.
5.2K
Bias in random forest variable importance measures: Illustrations, sources and a solution
TL;DR: An alternative implementation of random forests is proposed, that provides unbiased variable selection in the individual classification trees, that can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
Conditional variable importance for random forests
TL;DR: A new, conditional permutation scheme is developed for the computation of the variable importance measure that reflects the true impact of each predictor variable more reliably than the original marginal approach.
Single-cell transcriptomics of 20 mouse organs creates a "Tabula Muris"
Overall coordination,Logistical coordination,Library preparation,Cell type annotation,Principal investigators +4 more
TL;DR: A compendium of single-cell transcriptomic data from the model organism Mus musculus that comprises more than 100,000 cells from 20 organs and tissues is presented, representing a new resource for cell biology and enabling the direct and controlled comparison of gene expression in cell types that are shared between tissues.
References
•Journal Article
R: A language and environment for statistical computing.
TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.
410.8K
Random Forests
Leo Breiman
- 01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
LIBSVM: A library for support vector machines
Chih-Chung Chang,Chih-Jen Lin +1 more
TL;DR: Issues such as solving SVM optimization problems theoretical convergence multiclass classification probability estimates and parameter selection are discussed in detail.
•Book
The Elements of Statistical Learning
Trevor Hastie,Robert Tibshirani,Jerome H. Friedman +2 more
- 01 Jan 2001
29.4K
•Book
Classification and regression trees
Leo Breiman
- 01 Jan 1983
TL;DR: The methodology used to construct tree structured rules is the focus of a monograph as mentioned in this paper, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
22.7K
Related Papers (5)
[...]
Leo Breiman
- 01 Oct 2001
Andy Liaw,Matthew C. Wiener +1 more
- 01 Jan 2007
[...]
Leo Breiman
- 01 Aug 1996