Journal Article | DOI: 10.1126/science.adg7492
Accurate proteome-wide missense variant effect prediction with AlphaMissense
Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, Rosalia G. Schneider, Andrew W. Senior, John M. Jumper, Demis Hassabis, Pushmeet Kohli, Žiga Avsec +15 more
TL;DR: AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity, achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data.
Abstract: The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.
Description
Editor’s summary
Single–amino acid changes in proteins sometimes have little effect but can often lead to problems in protein folding, activity, or stability. Only a small fraction of variants have been experimentally investigated, but there are vast amounts of biological sequence data that are suitable for use as training data for machine learning approaches. Cheng et al. developed AlphaMissense, a deep learning model that builds on the protein structure prediction tool AlphaFold2 (see the Perspective by Marsh and Teichmann). The model is trained on population frequency data and uses sequence and predicted structural context, all of which contribute to its performance. The authors evaluated the model against related methods using clinical databases not included in the training and demonstrated agreement with multiplexed assays of variant effect. Predictions for all single–amino acid substitutions in the human proteome are provided as a community resource.
Michael A. Funk
AlphaFold fine-tuned on human and primate population variant frequency databases predicts variant pathogenicity.
INTRODUCTION
Genome sequencing has revealed extensive genetic variation in human populations. Missense variants are genetic variants that alter the amino acid sequence of proteins. Pathogenic missense variants disrupt protein function and reduce organismal fitness, while benign missense variants have limited effect.
RATIONALE
Classifying these variants is an important ongoing challenge in human genetics. Of more than 4 million observed missense variants, only an estimated 2% have been clinically classified as pathogenic or benign, while the vast majority of them are of unknown clinical significance. This limits the diagnosis of rare diseases, as well as the development or application of clinical treatments that target the underlying genetic cause. Machine learning approaches could close the variant interpretation gap by exploiting patterns in biological data to predict the pathogenicity of unannotated variants. Specifically, AlphaFold, which accurately predicts protein structure from protein sequence, may be used as a foundation to predict the pathogenicity of variants on proteins.
RESULTS
We developed AlphaMissense to leverage advances on multiple fronts: (i) unsupervised protein language modeling to learn amino acid distributions conditioned on sequence context; (ii) incorporating structural context by using an AlphaFold-derived system; and (iii) fine-tuning on weak labels from population frequency data, thereby avoiding bias from human-curated annotations. AlphaMissense achieves state-of-the-art missense pathogenicity predictions in clinical annotation, de novo disease variants, and experimental assay benchmarks without explicitly training on such data. As a resource to the community, we provide a database of predictions for all possible single amino acid substitutions in the human proteome. We classify 32% of all missense variants as likely pathogenic and 57% as likely benign using a cutoff yielding 90% precision on the ClinVar dataset, thereby providing a confident prediction for most human missense variants. We show how this resource can be used to accelerate research in multiple fields. Molecular biologists could use the database as a starting point for designing and interpreting experiments that probe saturating amino acid substitutions across the human proteome. Human geneticists could combine gene-level AlphaMissense predictions with population cohort–based approaches to quantify the functional significance of genes, especially for shorter human genes where cohort-based approaches lack statistical power. Finally, clinicians could benefit from the boost in coverage of confidently classified pathogenic variants when prioritizing de novo variants for rare disease diagnostics, and AlphaMissense predictions could inform studies of complex trait genetics that use annotations of rare, likely deleterious variants.
CONCLUSION
AlphaMissense predictions may illuminate the molecular effects of variants on protein function, contribute to the identification of pathogenic missense mutations and previously unknown disease-causing genes, and increase the diagnostic yield of rare genetic diseases. AlphaMissense will also foster further development of specialized protein variant effect predictors from structure prediction models.
AlphaMissense pathogenicity prediction. AlphaMissense takes as input a missense variant and predicts its pathogenicity. We fine-tuned AlphaFold on human and primate variant population frequency data and calibrated the confidence on known disease variants. AlphaMissense predicts the probability of a missense variant being pathogenic and classifies it as either likely benign, likely pathogenic, or uncertain. We provide predictions for all possible human missense variants as a resource for the community.
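The text above describes three operations on per-variant pathogenicity scores: choosing cutoffs that reach 90% precision on a labeled set, assigning each variant to likely benign, likely pathogenic, or uncertain, and averaging scores per gene. The following Python sketch illustrates one way these steps could fit together. It is not the authors' procedure: the function names, the tie handling, the toy scores and labels, and the gene names are all hypothetical, and the resulting cutoffs are not the published AlphaMissense thresholds.

```python
from collections import defaultdict
from statistics import mean


def precision_cutoff(scores, labels, target=0.90):
    """Return the lowest cutoff such that calling every variant with
    score >= cutoff positive reaches at least `target` precision,
    or None if no cutoff does. `labels` are True for the positive class."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best, true_pos = None, 0
    for n_called, (score, is_positive) in enumerate(ranked, start=1):
        true_pos += bool(is_positive)
        # Evaluate only after the last variant in a run of tied scores.
        if n_called < len(ranked) and ranked[n_called][0] == score:
            continue
        if true_pos / n_called >= target:
            best = score
    return best


def classify(score, benign_max, pathogenic_min):
    """Three-way call from a score and two calibrated cutoffs."""
    if score >= pathogenic_min:
        return "likely_pathogenic"
    if score <= benign_max:
        return "likely_benign"
    return "uncertain"


def gene_mean_scores(records):
    """Average score per gene from (gene, score) pairs."""
    by_gene = defaultdict(list)
    for gene, score in records:
        by_gene[gene].append(score)
    return {gene: mean(values) for gene, values in by_gene.items()}


# Toy calibration set (invented): True means clinically pathogenic.
scores = [0.05, 0.10, 0.20, 0.40, 0.50, 0.60, 0.80, 0.90, 0.95]
labels = [False, False, False, True, False, True, True, True, True]

# Pathogenic cutoff: precision of "score >= cutoff" calls.
pathogenic_min = precision_cutoff(scores, labels)

# Benign cutoff: mirror the problem by negating scores and flipping labels,
# so the precision is that of "score <= cutoff" benign calls.
benign_max = -precision_cutoff([-s for s in scores], [not y for y in labels])

calls = [classify(s, benign_max, pathogenic_min) for s in (0.15, 0.45, 0.70)]

# Hypothetical gene-level aggregation.
gene_means = gene_mean_scores(
    [("GENE_A", 0.25), ("GENE_A", 0.75), ("GENE_B", 0.875)]
)
```

Treating the benign cutoff as the mirror image of the pathogenic one keeps a single calibration routine for both classes. Variants that fall between the two cutoffs stay in the uncertain class, which is why the likely benign and likely pathogenic fractions reported above do not sum to 100%.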