Supervised dimensionality reduction for big data.

doi:10.1038/S41467-021-23102-2

Open Access•Journal Article•10.1038/S41467-021-23102-2

Joshua T. Vogelstein, +6 more

- 17 May 2021

- Nature Communications

- Vol. 12, Iss. 1, p. 2872

41

TL;DR: Linear Optimal Low-Rank Projection (LOL) extends principal component analysis (PCA) by incorporating class-conditional moment estimates into the low-dimensional projection.

Abstract: To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-rank projection, incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.
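The idea in the abstract, augmenting a PCA-style projection with class-conditional means, can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `lol_projection`, the choice of the first class as the reference for mean differences, and the QR orthonormalisation step are all hypothetical choices made here for clarity.

```python
import numpy as np


def lol_projection(X, y, n_components):
    """Sketch of a mean-augmented low-rank projection (hypothetical helper).

    X: (n_samples, n_features) data matrix; y: (n_samples,) class labels.
    Returns a (n_features, n_components) matrix with orthonormal columns.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)

    # Class-conditional means, one row per class: shape (K, p).
    means = np.stack([X[y == c].mean(axis=0) for c in classes])

    # Mean differences relative to the first class: shape (K - 1, p).
    # Assumption: any fixed reference class would serve for this sketch.
    deltas = means[1:] - means[0]

    # Centre every sample by the mean of its own class.
    class_index = np.searchsorted(classes, y)
    X_centered = X - means[class_index]

    # Leading right singular vectors of the class-centred data
    # (the PCA-like directions of within-class variance).
    n_var = max(n_components - deltas.shape[0], 0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Stack the mean differences on top of the variance directions,
    # then orthonormalise the combined basis with a reduced QR.
    basis = np.vstack([deltas, Vt[:n_var]])[:n_components]
    Q, _ = np.linalg.qr(basis.T)
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 200))      # far more features than samples
    y = np.repeat([0, 1], 20)
    X[y == 1, 0] += 3.0                 # classes differ along feature 0
    A = lol_projection(X, y, n_components=3)
    Z = X @ A                           # low-dimensional representation
    print(Z.shape)
```

The projected data `Z` can then be passed to any downstream classifier. Because only class means and a truncated SVD are needed, the cost is comparable to PCA, which is consistent with the scalability the abstract emphasises; the paper itself should be consulted for the exact estimator and its guarantees.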

Citations

10.1021/acs.chemrev.3c00302

2D Materials in Flexible Electronics: Recent Advances and Future Prospectives.

Ajit Kumar Katiyar, +6 more

- 06 Dec 2023

- Chemical Reviews

TL;DR: 2D materials are highly promising for flexible electronics due to their unique properties and compatibility with other materials. They enable the creation of various flexible electronic devices, including wearable electronics and foldable displays.

68

10.1259/bjr.20230211

Artificial intelligence (AI) and machine learning (ML) in precision oncology: a review on enhancing discoverability through multiomics integration.

Lise Wei, +8 more

- 03 Sep 2023

- British Journal of Radiology

TL;DR: Different categories of multiomics data and their roles in diagnosis and therapy are presented and AI-based data fusion methods and modeling methods as well as different validation schemes are illustrated.

38

Journal Article•10.1038/s41467-023-41512-2

First fully-automated AI/ML virtual screening cascade implemented at a drug discovery centre in Africa

Gemma Turon, +4 more

- 15 Dec 2022

- bioRxiv

TL;DR: ZairaChem, an artificial intelligence (AI)- and machine learning (ML)-based tool to train small-molecule activity prediction models, is presented and it is shown how computational profiling of compounds, prior to synthesis and experimental testing, can increase the rate of progression by up to 40%.

35

Journal Article•10.1126/science.adk5864

Large-scale chemoproteomics expedites ligand discovery and predicts ligand behavior in cells

Fabian Offensperger, +30 more

- 26 Apr 2024

- Science

TL;DR: The proteome-wide binding preferences of more than 400 small-molecule fragments are determined through a chemoproteomics strategy that is based on treatment of intact cells, and an ML framework to build models that can predict how fragments interact with native proteins on a proteome-wide scale is developed.

34

Journal Article•10.1016/j.jneumeth.2022.109745

A review on Alzheimer’s disease classification from normal controls and mild cognitive impairment using structural MR images

Neha Garg, +2 more

- 01 Nov 2022

- Journal of Neuroscience Methods

TL;DR: This review of structural MRI-based studies for AD detection compares the performance of various feature extraction methods and observes that wavelet transform-based feature extraction gives promising results for AD classification.

31

References

Journal Article•10.1023/A:1010933404324

Random Forests

Leo Breiman

- 01 Oct 2001

TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.

113.1K

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

- 03 Dec 2012

TL;DR: A deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification.

88.4K

Journal Article•10.1111/J.2517-6161.1996.TB02080.X

Regression Shrinkage and Selection via the Lasso

Robert Tibshirani

- 01 Jan 1996

- Journal of the royal statistical society...

TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.

45.4K

Posted Content

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Martín Abadi, +39 more

- 01 Jan 2015

- arXiv: Distributed, Parallel, and Cluste...

TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.

13.6K

Journal Article•10.1109/34.598228

Eigenfaces vs. Fisherfaces: recognition using class specific linear projection

Peter N. Belhumeur, +2 more

- 01 Jul 1997

- IEEE Transactions on Pattern Analysis an...

TL;DR: A face recognition algorithm which is insensitive to large variation in lighting direction and facial expression is developed, based on Fisher's linear discriminant and produces well separated classes in a low-dimensional subspace, even under severe variations in lighting and facial expressions.

12.3K

Related Papers (5)

Stable bagging feature selection on medical data

A Selective Overview of Variable Selection in High Dimensional Feature Space.

Entropy-Scaling Search of Massive Biological Data

Principal component analysis: a review and recent developments

Stepwise classification of cancer samples using clinical and molecular data