Accurate proteome-wide missense variant effect prediction with AlphaMissense

doi:10.1126/science.adg7492

Journal Article

Jun Cheng, +15 more

- 19 Sep 2023

- Science

- Vol. 381

774

TL;DR: AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity, achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data.

Abstract: The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.

Editor’s summary (Michael A. Funk): Single–amino acid changes in proteins sometimes have little effect but can often lead to problems in protein folding, activity, or stability. Only a small fraction of variants have been experimentally investigated, but there are vast amounts of biological sequence data that are suitable for use as training data for machine learning approaches. Cheng et al. developed AlphaMissense, a deep learning model that builds on the protein structure prediction tool AlphaFold2 (see the Perspective by Marsh and Teichmann). The model is trained on population frequency data and uses sequence and predicted structural context, all of which contribute to its performance. The authors evaluated the model against related methods using clinical databases not included in the training and demonstrated agreement with multiplexed assays of variant effect. Predictions for all single–amino acid substitutions in the human proteome are provided as a community resource.

AlphaFold fine-tuned on human and primate population variant frequency databases predicts variant pathogenicity.

INTRODUCTION: Genome sequencing has revealed extensive genetic variation in human populations. Missense variants are genetic variants that alter the amino acid sequence of proteins. Pathogenic missense variants disrupt protein function and reduce organismal fitness, while benign missense variants have limited effect.

RATIONALE: Classifying these variants is an important ongoing challenge in human genetics. Of more than 4 million observed missense variants, only an estimated 2% have been clinically classified as pathogenic or benign, while the vast majority of them are of unknown clinical significance. This limits the diagnosis of rare diseases, as well as the development or application of clinical treatments that target the underlying genetic cause. Machine learning approaches could close the variant interpretation gap by exploiting patterns in biological data to predict the pathogenicity of unannotated variants. Specifically, AlphaFold, which accurately predicts protein structure from protein sequence, may be used as a foundation to predict the pathogenicity of variants on proteins.

RESULTS: We developed AlphaMissense to leverage advances on multiple fronts: (i) unsupervised protein language modeling to learn amino acid distributions conditioned on sequence context; (ii) incorporating structural context by using an AlphaFold-derived system; and (iii) fine-tuning on weak labels from population frequency data, thereby avoiding bias from human-curated annotations. AlphaMissense achieves state-of-the-art missense pathogenicity predictions in clinical annotation, de novo disease variants, and experimental assay benchmarks without explicitly training on such data. As a resource to the community, we provide a database of predictions for all possible single amino acid substitutions in the human proteome. We classify 32% of all missense variants as likely pathogenic and 57% as likely benign using a cutoff yielding 90% precision on the ClinVar dataset, thereby providing a confident prediction for most human missense variants. We show how this resource can be used to accelerate research in multiple fields. Molecular biologists could use the database as a starting point for designing and interpreting experiments that probe saturating amino acid substitutions across the human proteome. Human geneticists could combine gene-level AlphaMissense predictions with population cohort–based approaches to quantify the functional significance of genes, especially for shorter human genes where cohort-based approaches lack statistical power. Finally, clinicians could benefit from the boost in coverage of confidently classified pathogenic variants when prioritizing de novo variants for rare disease diagnostics, and AlphaMissense predictions could inform studies of complex trait genetics that use annotations of rare, likely deleterious variants.

CONCLUSION: AlphaMissense predictions may illuminate the molecular effects of variants on protein function, contribute to the identification of pathogenic missense mutations and previously unknown disease-causing genes, and increase the diagnostic yield of rare genetic diseases. AlphaMissense will also foster further development of specialized protein variant effect predictors from structure prediction models.

Figure caption: AlphaMissense pathogenicity prediction. AlphaMissense takes as input a missense variant and predicts its pathogenicity. We fine-tuned AlphaFold on human and primate variant population frequency data and calibrated the confidence on known disease variants. AlphaMissense predicts the probability of a missense variant being pathogenic and classifies it as either likely benign, likely pathogenic, or uncertain. We provide predictions for all possible human missense variants as a resource for the community.
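
The three-way classification described above depends on score cutoffs chosen to reach 90% precision on labeled variants. The Python sketch below illustrates one way such cutoffs could be derived and applied. It is not the authors' calibration procedure, which this text does not specify. The function names, the toy scores, and the toy labels are all hypothetical, and the resulting cutoffs are not the published AlphaMissense thresholds.

```python
from typing import Optional, Sequence


def cutoff_for_precision(
    scores: Sequence[float],
    is_positive: Sequence[bool],
    target: float = 0.90,
) -> Optional[float]:
    """Lowest cutoff c such that calls `score >= c` reach `target` precision.

    Returns None when no cutoff reaches the target.
    """
    ranked = sorted(zip(scores, is_positive), key=lambda p: p[0], reverse=True)
    best: Optional[float] = None
    true_pos = 0
    for n_called, (score, positive) in enumerate(ranked, start=1):
        true_pos += int(positive)
        # Tied scores are called together, so evaluate only at the end of a tie run.
        if n_called < len(ranked) and ranked[n_called][0] == score:
            continue
        if true_pos / n_called >= target:
            best = score
    return best


def benign_cutoff_for_precision(
    scores: Sequence[float],
    is_pathogenic: Sequence[bool],
    target: float = 0.90,
) -> Optional[float]:
    """Highest cutoff c such that calls `score <= c` are benign with `target` precision."""
    # Mirror the problem: negate the scores and flip the labels, then reuse the routine above.
    mirrored = cutoff_for_precision(
        [-s for s in scores], [not y for y in is_pathogenic], target
    )
    return None if mirrored is None else -mirrored


def classify(score: float, benign_cutoff: float, pathogenic_cutoff: float) -> str:
    """Map a pathogenicity score to one of the three classes named in the text."""
    # If the two cutoffs ever overlapped, the pathogenic call would take priority.
    if score >= pathogenic_cutoff:
        return "likely pathogenic"
    if score <= benign_cutoff:
        return "likely benign"
    return "uncertain"


# Hypothetical toy data: model scores and clinical labels (True = pathogenic).
TOY_SCORES = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
TOY_LABELS = [True, True, True, True, False, True, False, False, False, False]

PATHOGENIC_CUTOFF = cutoff_for_precision(TOY_SCORES, TOY_LABELS)
BENIGN_CUTOFF = benign_cutoff_for_precision(TOY_SCORES, TOY_LABELS)

for s in (0.85, 0.50, 0.10):
    print(s, classify(s, BENIGN_CUTOFF, PATHOGENIC_CUTOFF))
```

Scores between the two cutoffs are left as "uncertain". This is why a calibrated classifier can give a confident call for most variants, but not all of them.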

Citations

Journal Article•10.1016/j.jaci.2024.08.014

Evolving Spectrum of Adenosine Deaminase Deficiency: Assessing Genotype Pathogenicity Based on Expressed ADA Activity of 46 Variants

Inés Santistéban, +8 more

- 01 Aug 2024

- The Journal of Allergy and Clinical Immu...

TL;DR: This study evaluates the pathogenicity of 46 ADA variants based on expressed ADA activity, addressing the clinical and genetic heterogeneity of adenosine deaminase deficiency and its implications for asymptomatic infants with uncertain prognosis and variant significance.

Journal Article•10.1002/advs.202402918

MuToN Quantifies Binding Affinity Changes upon Protein Mutations by Geometric Deep Learning

Yongmei Hu, +1 more

- 12 Jul 2024

- Advanced Science

TL;DR: A geometric deep learning framework called MuToN is introduced for quantifying protein binding affinity change upon residue mutations, designed with geometric attention networks, and it captures changes in the protein binding interfaces of mutated complexes and assesses the allosteric effects of amino acids.

Journal Article•10.1093/bib/bbaf536

Data-efficient protein mutational effect prediction with weak supervision by molecular simulation and protein language models

Teppei Deguchi, +5 more

- 31 Aug 2025

- Briefings in Bioinformatics

Abstract: Machine learning-based protein mutational effect prediction is widely used in protein engineering and pathogenicity prediction, but training data scarcity remains a major challenge due to high costs of experimental measurements. A previous study proposed data augmentation using computational estimates by molecular simulation. However, this approach has been limited to predicting mutational effects on thermostability. Here, we present a new data augmentation method that combines molecular simulation with zero-shot prediction computed by protein language models. These computational estimates serve as ‘weak’ training data to supplement experimental training data. Our method dynamically adjusts the weight and inclusion of weak training data based on available experimental training data. This reduces potential negative impacts of weak training data while extending applicability to diverse protein properties such as binding affinity and enzymatic activity. Benchmark tests demonstrate that our method improves prediction accuracy particularly when experimental training data are scarce. These results indicate the capability of our approach to advance protein engineering and pathogenicity prediction in small data regimes.

Journal Article•10.1038/s41467-025-57858-8

Deep learning prioritizes cancer mutations that alter protein nucleocytoplasmic shuttling to drive tumorigenesis.

Yongqiang Zheng, +7 more

- 14 Mar 2025

- Nature Communications

TL;DR: Researchers develop a deep learning model to identify cancer mutations that alter protein nucleocytoplasmic shuttling, finding that these mutations enrich functional genetic variations and disrupt critical signaling interactions involved in tumorigenesis.

Journal Article•10.1002/1873-3468.70175

RAD50 missense variants differentially affect the DNA damage response and mitotic progression

Hanna Redeker, +8 more

- 01 Oct 2025

- FEBS Letters

Abstract: RAD50 is the central protein of the MRN complex and crucial in DNA double‐strand break repair. RAD50 deficiency causes a genomic instability disorder characterized by microcephaly and stunted growth. Using lentiviral constructs, we investigated whether cancer‐related RAD50 missense variants can complement the delayed damage response after exposure to the chemotherapeutic agent epirubicin and/or mitotic progression in RAD50‐deficient fibroblasts. Eight missense variants, all capable of forming an MRN complex, supported the DNA damage response and mitotic features to different extents, indicating these functions are separable. Three variants showed both an impaired epirubicin response and slowed cell division in the likely pathogenic range. Assessing RAD50 missense variants with distinct functional readouts may help to further elucidate their differential roles in immunodeficiency and cancer and could improve therapeutic strategies. Impact statement RAD50 has a strong impact on DNA repair and cancer therapy. Here, we analyse RAD50 missense variants at four functional levels. Some variants showed an impaired epirubicin response and mitotic progression in the pathological range, while for others these endpoints were separable. Functional heterogeneity of RAD50 variants could contribute to clinical variability.

References

•Journal Article•10.1038/S41586-021-03819-2

Highly accurate protein structure prediction with AlphaFold

John M. Jumper, +33 more

- 15 Jul 2021

- Nature

TL;DR: AlphaFold uses a novel deep learning architecture to predict protein structures with an accuracy competitive with experimental structures in the majority of cases, including cases in which no homologous structure is available.

28.2K

•Journal Article•10.1038/GIM.2015.30

Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology.

Sue Richards, +12 more

- 05 Mar 2015

- Genetics in Medicine

TL;DR: Because of the increased complexity of analysis and interpretation of clinical genetic testing described in this report, the ACMG strongly recommends that clinical molecular genetic testing be performed in a Clinical Laboratory Improvement Amendments–approved laboratory, with results interpreted by a board-certified clinical molecular geneticist or molecular genetic pathologist or the equivalent.

26.2K

•Journal Article•10.1038/NATURE19057

Analysis of protein-coding genetic variation in 60,706 humans

Monkol Lek, +106 more

- 18 Aug 2016

- Nature

TL;DR: The aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC) provides direct evidence for the presence of widespread mutational recurrence.

10.2K

•Journal Article•10.1038/S41586-020-2308-7

The mutational constraint spectrum quantified from variation in 141,456 humans

Konrad J. Karczewski, +64 more

- 27 May 2020

- Nature

TL;DR: A catalogue of predicted loss-of-function variants in 125,748 whole-exome and 15,708 whole-genome sequencing datasets from the Genome Aggregation Database (gnomAD) reveals the spectrum of mutational constraints that affect these human protein-coding genes.

7.9K

•Journal Article•10.1038/S41586-018-0579-Z

The UK Biobank resource with deep phenotyping and genomic data

Clare Bycroft, +21 more

- 11 Oct 2018

- Nature

TL;DR: Deep phenotype and genome-wide genetic data from 500,000 UK Biobank participants are described, including population structure and relatedness in the cohort, and imputation that increases the number of testable variants to 96 million.

7.6K
