Gene selection and classification of microarray data using random forest

doi:10.1186/1471-2105-7-3

Open Access • Journal Article • 10.1186/1471-2105-7-3


Ramon Diaz-Uriarte, +1 more

- 06 Jan 2006

- BMC Bioinformatics

- Vol. 7, Iss. 1, p. 3

3K

TL;DR: It is shown that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.

Abstract: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
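The abstract does not spell out the gene selection procedure, so the following is only a minimal sketch of one way importance-driven selection can work: fit a forest, rank genes by an importance measure, discard the least important ones, and repeat while tracking out-of-bag (OOB) error. To stay dependency-free it uses one-split trees (stumps) and Gini-decrease importance instead of full trees. The function names, the 20% drop fraction, the stopping rule, and the toy data are all illustrative assumptions, not the authors' implementation.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def fit_stump(X, y, rows, genes):
    """Return (gain, gene, threshold) with the largest Gini decrease on `rows`."""
    parent = gini([y[i] for i in rows])
    best = (0.0, None, None)
    for g in genes:
        values = sorted({X[i][g] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in rows if X[i][g] <= t]
            right = [y[i] for i in rows if X[i][g] > t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if parent - child > best[0]:
                best = (parent - child, g, t)
    return best

def stump_forest(X, y, genes, n_trees=200, seed=0):
    """Return (oob_error, importance) for a forest of one-split trees."""
    rng = random.Random(seed)
    n = len(X)
    mtry = max(1, round(len(genes) ** 0.5))  # candidate genes tried per tree
    importance = {g: 0.0 for g in genes}
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        gain, g, t = fit_stump(X, y, rows, rng.sample(genes, mtry))
        if g is None:  # no useful split found in this bootstrap sample
            continue
        importance[g] += gain / n_trees
        left = Counter(y[i] for i in rows if X[i][g] <= t).most_common(1)[0][0]
        right = Counter(y[i] for i in rows if X[i][g] > t).most_common(1)[0][0]
        for i in set(range(n)) - set(rows):  # out-of-bag samples vote
            votes[i][left if X[i][g] <= t else right] += 1
    scored = [i for i in range(n) if votes[i]]
    errors = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return errors / max(1, len(scored)), importance

def backward_elimination(X, y, drop_fraction=0.2, min_genes=2):
    """Repeatedly drop the least important genes; return (genes, oob_error)."""
    genes = list(range(len(X[0])))
    history = []
    while True:
        error, importance = stump_forest(X, y, genes)
        history.append((list(genes), error))
        if len(genes) <= min_genes:
            break
        n_drop = max(1, int(len(genes) * drop_fraction))
        n_keep = max(min_genes, len(genes) - n_drop)
        genes = sorted(genes, key=lambda g: importance[g], reverse=True)[:n_keep]
    best_error = min(e for _, e in history)
    # Smallest gene set whose OOB error matches the best error seen.
    return min((h for h in history if h[1] <= best_error), key=lambda h: len(h[0]))

# Toy data (hypothetical): gene 0 separates the two classes, genes 1-9 are noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[label * 2.0 + data_rng.gauss(0, 0.3)] + [data_rng.gauss(0, 1) for _ in range(9)]
     for label in y]
selected, oob_error = backward_elimination(X, y)
print("selected genes:", selected, "OOB error:", oob_error)
```

On real expression data one would normally use an established random forest implementation with fully grown trees; the point here is only the loop structure, in which each round's importance ranking decides which genes survive and the OOB error decides which gene set is reported.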


Citations

•Journal Article•10.1093/BIOINFORMATICS/BTM344

A review of feature selection techniques in bioinformatics

Yvan Saeys, +2 more

- 10 Sep 2007

- Bioinformatics

TL;DR: A basic taxonomy of feature selection techniques is provided, and their use, variety and potential are discussed for a number of common as well as upcoming bioinformatics applications.


5.4K

•Journal Article•10.1016/J.ISPRSJPRS.2016.01.011

Random forest in remote sensing: A review of applications and future directions

Mariana Belgiu, +1 more

- 01 Apr 2016

- Isprs Journal of Photogrammetry and Remo...

TL;DR: This review has revealed that the RF classifier can successfully handle high data dimensionality and multicollinearity, being both fast and insensitive to overfitting.


5.2K

•Journal Article•10.1186/1471-2105-8-25

Bias in random forest variable importance measures: Illustrations, sources and a solution

Carolin Strobl, +3 more

- 25 Jan 2007

- BMC Bioinformatics

TL;DR: An alternative implementation of random forests is proposed that provides unbiased variable selection in the individual classification trees and can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.


3.3K

•Journal Article•10.1186/1471-2105-9-307

Conditional variable importance for random forests

Carolin Strobl, +4 more

- 11 Jul 2008

- BMC Bioinformatics

TL;DR: A new, conditional permutation scheme is developed for the computation of the variable importance measure that reflects the true impact of each predictor variable more reliably than the original marginal approach.


3.1K

•Journal Article•10.1038/S41586-018-0590-4

Single-cell transcriptomics of 20 mouse organs creates a "Tabula Muris"

Overall coordination, +4 more

- 18 Oct 2018

- Nature

TL;DR: A compendium of single-cell transcriptomic data from the model organism Mus musculus that comprises more than 100,000 cells from 20 organs and tissues is presented, representing a new resource for cell biology and enabling the direct and controlled comparison of gene expression in cell types that are shared between tissues.


2.5K




Related Papers (5)

Random Forests

Classification and Regression by randomForest

Bagging predictors

Classification and Regression Trees.

Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.