Journal Article, DOI: 10.1126/science.adg7492
Accurate proteome-wide missense variant effect prediction with AlphaMissense
Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, Rosalia G. Schneider, Andrew W. Senior, John M. Jumper, Demis Hassabis, Pushmeet Kohli, Žiga Avsec
TL;DR: AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity, achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data.
Abstract: The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.

Description

Editor’s summary
Single–amino acid changes in proteins sometimes have little effect but can often lead to problems in protein folding, activity, or stability. Only a small fraction of variants have been experimentally investigated, but there are vast amounts of biological sequence data that are suitable for use as training data for machine learning approaches. Cheng et al. developed AlphaMissense, a deep learning model that builds on the protein structure prediction tool AlphaFold2 (see the Perspective by Marsh and Teichmann). The model is trained on population frequency data and uses sequence and predicted structural context, all of which contribute to its performance. The authors evaluated the model against related methods using clinical databases not included in the training and demonstrated agreement with multiplexed assays of variant effect. Predictions for all single–amino acid substitutions in the human proteome are provided as a community resource. (Michael A. Funk)

AlphaFold fine-tuned on human and primate population variant frequency databases predicts variant pathogenicity.

INTRODUCTION
Genome sequencing has revealed extensive genetic variation in human populations. Missense variants are genetic variants that alter the amino acid sequence of proteins. Pathogenic missense variants disrupt protein function and reduce organismal fitness, while benign missense variants have limited effect.

RATIONALE
Classifying these variants is an important ongoing challenge in human genetics. Of more than 4 million observed missense variants, only an estimated 2% have been clinically classified as pathogenic or benign, while the vast majority of them are of unknown clinical significance. This limits the diagnosis of rare diseases, as well as the development or application of clinical treatments that target the underlying genetic cause. Machine learning approaches could close the variant interpretation gap by exploiting patterns in biological data to predict the pathogenicity of unannotated variants. Specifically, AlphaFold, which accurately predicts protein structure from protein sequence, may be used as a foundation to predict the pathogenicity of variants on proteins.

RESULTS
We developed AlphaMissense to leverage advances on multiple fronts: (i) unsupervised protein language modeling to learn amino acid distributions conditioned on sequence context; (ii) incorporating structural context by using an AlphaFold-derived system; and (iii) fine-tuning on weak labels from population frequency data, thereby avoiding bias from human-curated annotations. AlphaMissense achieves state-of-the-art missense pathogenicity predictions in clinical annotation, de novo disease variants, and experimental assay benchmarks without explicitly training on such data. As a resource to the community, we provide a database of predictions for all possible single amino acid substitutions in the human proteome. We classify 32% of all missense variants as likely pathogenic and 57% as likely benign using a cutoff yielding 90% precision on the ClinVar dataset, thereby providing a confident prediction for most human missense variants. We show how this resource can be used to accelerate research in multiple fields. Molecular biologists could use the database as a starting point for designing and interpreting experiments that probe saturating amino acid substitutions across the human proteome. Human geneticists could combine gene-level AlphaMissense predictions with population cohort–based approaches to quantify the functional significance of genes, especially for shorter human genes where cohort-based approaches lack statistical power. Finally, clinicians could benefit from the boost in coverage of confidently classified pathogenic variants when prioritizing de novo variants for rare disease diagnostics, and AlphaMissense predictions could inform studies of complex trait genetics that use annotations of rare, likely deleterious variants.

CONCLUSION
AlphaMissense predictions may illuminate the molecular effects of variants on protein function, contribute to the identification of pathogenic missense mutations and previously unknown disease-causing genes, and increase the diagnostic yield of rare genetic diseases. AlphaMissense will also foster further development of specialized protein variant effect predictors from structure prediction models.

AlphaMissense pathogenicity prediction. AlphaMissense takes as input a missense variant and predicts its pathogenicity. We fine-tuned AlphaFold on human and primate variant population frequency data and calibrated the confidence on known disease variants. AlphaMissense predicts the probability of a missense variant being pathogenic and classifies it as either likely benign, likely pathogenic, or uncertain. We provide predictions for all possible human missense variants as a resource for the community.
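The three-way classification described above (likely benign, likely pathogenic, or uncertain, with cutoffs chosen to yield 90% precision on labeled variants) can be illustrated with a small sketch. This is one plausible reading of the idea, not the authors' calibration procedure: the function names, the toy scores, and the toy labels below are all hypothetical, and the resulting cutoffs are not the published AlphaMissense thresholds. Note that the 32% and 57% figures in the text sum to the 89% quoted in the abstract.

```python
from typing import Optional, Sequence


def cutoff_for_precision(
    scores: Sequence[float], is_positive: Sequence[bool], target: float = 0.90
) -> Optional[float]:
    """Lowest cutoff t such that calling `score >= t` positive reaches `target` precision.

    Returns None if no cutoff reaches the target. Hypothetical helper.
    """
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    best = None
    true_positives = 0
    for n_called, (score, positive) in enumerate(ranked, start=1):
        true_positives += positive
        # Evaluate precision only once a run of tied scores is complete.
        end_of_tie = n_called == len(ranked) or ranked[n_called][0] != score
        if end_of_tie and true_positives / n_called >= target:
            best = score  # keep lowering the cutoff while precision holds
    return best


def classify(score: float, benign_cutoff: float, pathogenic_cutoff: float) -> str:
    """Map a pathogenicity score to one of the three classes named in the text."""
    if score >= pathogenic_cutoff:
        return "likely_pathogenic"
    if score <= benign_cutoff:
        return "likely_benign"
    return "uncertain"


# Toy labeled set standing in for clinically annotated variants (invented values).
toy_scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_pathogenic = [True, True, True, True, False, True, False, False, False, False]

pathogenic_cutoff = cutoff_for_precision(toy_scores, toy_pathogenic)
# The benign side reuses the same routine on negated scores and flipped labels.
benign_cutoff = -cutoff_for_precision(
    [-s for s in toy_scores], [not p for p in toy_pathogenic]
)

calls = [classify(s, benign_cutoff, pathogenic_cutoff) for s in toy_scores]
confident_fraction = sum(c != "uncertain" for c in calls) / len(calls)
```

On this toy set the cutoffs come out at 0.70 (pathogenic) and 0.30 (benign), leaving two of ten variants uncertain. In practice the labels would come from an annotated resource such as ClinVar, and the choice of precision target trades coverage against confidence.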
Citations
RP9 revisited; RP9 p.(H137L) remains a likely cause of dominant splicing factor-Retinitis Pigmentosa
Leon Chang, James A Poulter, Andrew R Webster, Gavin Arno, Rajarshi Mukherjee, Andrew Lotery, Alison J Hardcastle, Christopher M. Watson, Chris F Inglehearn
Abstract: Variants in six pre-mRNA processing factors cause autosomal dominant Retinitis Pigmentosa (adRP). The RP9 gene encodes a seventh splicing factor, and in 2002, we published RP9 variants c.410A>T; p.(H137L) and c.509A>G; p.(D170G) as likely causes of adRP in a large multigenerational RP9-linked family and a single case, respectively. It has since been suggested these variants might be artefacts due to simultaneous amplification of the RP9P pseudogene, and no further pathogenic variants have been reported. We therefore rescreened two members of the RP9-linked family by genome sequencing. Examination of the 2 Mb locus defined by crossovers in the original family revealed no other plausible causative variants. Alignment of both short and long-read sequences confirmed that p.(H137L) is in the RP9 gene, not the pseudogene. Screening for p.(H137L) in 1961 RP/Rod-cone dystrophy (RCD) cases from the Leeds patient cohort and UK 100,000 Genomes Project (100kGP) database revealed four further carriers. Including the original family, this variant was therefore present in 5/1962 RP/RCD probands, and is absent from gnomAD, constituting statistically significant enrichment in RP cases. Long-read sequencing of p.(H137L) in available carriers showed this is a UK founder allele. The RP9 p.(D170G) allele was also confirmed as gene, not pseudogene, derived, but is present in 22 individuals in the 100kGP cohort, none with RP, as well as >200 individuals in gnomAD and Biobank, suggesting it is non-pathogenic. In conclusion, RP9 p.(H137L) is strongly associated with RP and remains the only plausible variant accounting for the condition in a large multi-generation adRP family.
Recessive genomic and phenotypic variation in consanguineous families with cerebral palsy
Pritha Bisarad, Yung-chün Wang, Peter T Skidmore, Carolina I. Galaz-Montoya, Sara A Lewis, Bader Alhaddad, Nahyun Kong, Dominic Julian, Helen Magee, Tyler N. Kruer, Yuhan Xie, Wangjie Zheng, Boyang Li, James Liu, Anjali Revanur, Khadijah Bakur, Saghar Ghasemi Firouzabadi, Sarina Sharbatkhori, Abbas Tafakhori, Ehsan Taghiabadi, Ermia Nezaminargabad, Shohreh Vosoogh, Javad Jamshidi, Serajaddin Arefnia, Seyed Ahmad Hosseini, Alireza Khajehmirzaei, Faezeh Jamali, Azadeh Ahmadifard, Hamidreza Khodadadi, Parvaneh Daneshmand, Saeed Bohlega, Sateesh Maddirevula, Mais O Hashem, Mustafa A. Salih, Inaam N Mohmed, Heinrich Sticht, Sara Peres Morias, Joana Damásio, Mariana Santos, José Leal Loureiro, Rita G. Rodrigues, Giovanni Stevanin, Mehdi Benkirane, Benjamin Dauriat, Nicholas Head, Julia Baptista, Saeid Shahhosseini, Farhan Mohammad, Hongyu Zhao, Sergio Padilla-Lopez, Fowzan Alkuraya, Somayeh Bakhtiari, Michael C. Kruer, Sheng Chih Jin, Hossein Darvish
- 06 Nov 2025
TL;DR: This study identifies 134 putative causative genes in 188 consanguineous families with cerebral palsy, with 110 known and 24 novel genes, highlighting recessive inheritance, phenotypic expansions, and new pathways in the disorder.
Dysregulated fetal heart contraction underlies complex genetics of congenital heart disease
Hongjun Shi, Xiaomin Luo, Lifeng Liu, Rong Han, Xiangyang Li
- 10 Nov 2023
TL;DR: The etiology of congenital heart disease (CHD) involves dysregulated fetal heart contraction and complex genetics. This study identified novel oligogenic causes of CHD through a large-scale ENU-based forward dominant screen in mice and analysis of whole exomes from humans. The findings highlight the importance of early hemodynamic perturbations in the developing heart as a major risk factor for CHD.
Investigation of missense mutation-related type 1 diabetes mellitus through integrating genomic databases and bioinformatic approach
Dyonisa Nasirochmi Pakha, Ratih Dewi Yudhani, Lalu Muhammad Irham
TL;DR: Investigation of missense mutation-related T1DM through integrating genomic databases and bioinformatic approach identifies nine missense variants and three variants with potential impact on disease progression.
Non-infectious mixed cryoglobulinemia as a new clinical presentation of mutation in the gene encoding coatomer subunit alpha: a case report of two adult sisters
Ksymena Leśniak, Rafał Płoski, Małgorzata Rydzanicz, Aleksandra Rymarz, Arkadiusz Lubas, Tomasz Syryło, Stanisław Niemczyk
TL;DR: Two adult sisters with non-infectious cryoglobulinemic vasculitis exhibit distinct clinical phenotypes, associated with a rare genetic variant in the COPA gene, encoding coatomer subunit alpha, expanding the spectrum of COPA syndrome manifestations in adults.