Regularized sequence-context mutational trees capture variation in mutation rates across the human genome

Q: What are the observed differences in polymorphism probabilities across human populations in the NYGC resequencing of 1KG Phase III?

Polymorphism probabilities were compared across human populations in the NYGC resequencing of 1KG Phase III. Few edges were quantifiably different across continental groups, and those that were observed were largely confined to relatively small context windows, where tests would be expected to be well powered (e.g., 3- and 5-mers). The largest population-specific effect was found in East Asians, where ATACCTC>A polymorphism probabilities were roughly 2.7 times higher than in European, African, or South Asian models. The study also noted that the observed differences were largely confounded by batch effects in the original 1KG sequencing data, which were not present in the New York Genome Center resequencing project. Large differences may still exist at certain sequence context sizes that the analysis could not reliably capture.

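Q: What are the limitations of previous methods in estimating mutation probabilities in expanded sequence contexts?

Accepted Answer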

Previous methods for estimating mutation probabilities in expanded sequence contexts are limited in three respects: scalability, regularization, and uncertainty. First, the size of the model increases exponentially with the number of nucleotides included, which imposes both computational and statistical-power limits. Second, not every sequence context is likely to matter: biological intuition favors a more parsimonious model in which only a subset of contexts contributes meaningfully to observed variation. This is particularly important when inferring somatic and de novo mutation rates, or in other data-sparse situations. Third, previous methods do not directly report the uncertainty that arises from multinomial variance and heterogeneity in larger sequence contexts. As context size expands, each context has less data behind it, so point estimates become unreliable. No previous method addresses all three limitations at once. Deep-learning frameworks and IUPAC-motif-based clustering approaches tackle scalability and regularization, but none explicitly estimates the uncertainty of parameters. The CIPI model addresses uncertainty but focuses on smaller context-window motifs and does not scale to larger context windows or contemporary population genomics data sets. A method is therefore needed that handles all three limitations simultaneously and provides uncertainty estimates in expanded sequence contexts.
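The exponential growth mentioned above can be made concrete with a quick count. This is a minimal sketch that assumes substitutions are folded onto two reference bases (reverse complements merged) with three possible alternate alleles each. The folding convention and the function name are illustrative assumptions, not details taken from the source.

```python
def n_context_parameters(k: int, folded: bool = True) -> int:
    """Number of (context, alternate allele) pairs for a k-mer window.

    Assumption: with folding, the focal base is restricted to 2 of the
    4 nucleotides (reverse complements are merged); each focal base has
    3 possible alternate alleles, and every flanking position is free.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    focal_choices = 2 if folded else 4
    return focal_choices * 3 * 4 ** (k - 1)


for k in (1, 3, 5, 7, 9):
    print(k, n_context_parameters(k))
```

Under these assumptions, each additional flanking nucleotide multiplies the number of parameters by four, while the amount of data per context shrinks accordingly.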

Q: How does the tree-based sequence-context model capture polymorphism probabilities?

Accepted Answer

The tree-based sequence-context model captures polymorphism probabilities with a rooted, tree-based graph in which each substitution class is represented separately. Each level of the tree represents a larger window of sequence, alternating between adding a nucleotide on the 3' end for even-sized contexts and on the 5' end for odd-sized contexts. Non-root edges represent the log-transformed, multiplicative shift in polymorphism probability, while the root edge corresponds to an estimated base polymorphism probability. The model uses a Bayesian formulation to generate posterior distributions for polymorphism probabilities, which provides uncertainty around parameter estimates. Regularization is built into the estimation of tree edges through a spike-and-slab prior, and the fraction of posterior samples falling in the slab versus the spike is estimated for each edge. An adaptive Metropolis-within-Gibbs Markov chain Monte Carlo sampling scheme is used to sample from and estimate the posterior distribution of the model. Parameters are estimated level by level, exploiting the conditional dependency structure of the hierarchical tree, which keeps mutation probabilities identifiable and aids convergence.
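The path structure described above can be sketched in a few lines. This is an illustrative reading of the description, not Baymer's actual code: the function names, the dictionary layout for edges, and the numeric values are all hypothetical.

```python
import math


def nested_contexts(kmer: str) -> list[str]:
    """Return the 1-mer, 2-mer, ..., k-mer contexts nested inside `kmer`.

    The focal nucleotide is the middle base of an odd-length k-mer.
    Even-sized windows add one base on the 3' side, odd-sized windows
    add one base on the 5' side, following the description above.
    """
    if len(kmer) % 2 == 0:
        raise ValueError("expected an odd-length k-mer centred on the focal base")
    left = right = len(kmer) // 2
    contexts = [kmer[left:right + 1]]
    for size in range(2, len(kmer) + 1):
        if size % 2 == 0:
            right += 1   # extend toward the 3' end
        else:
            left -= 1    # extend toward the 5' end
        contexts.append(kmer[left:right + 1])
    return contexts


def polymorphism_probability(kmer, root_log_prob, edges):
    """Add the root edge and all non-root log-shifts along the path.

    `edges` maps a context string to its log multiplicative shift;
    contexts that are absent are treated as zero (no shift).
    """
    path = nested_contexts(kmer)[1:]  # skip the 1-mer handled by the root edge
    log_p = root_log_prob + sum(edges.get(c, 0.0) for c in path)
    return math.exp(log_p)


# Hypothetical numbers for a single substitution class.
edges = {"TAC": math.log(2.0), "GTACG": math.log(0.5)}
print(nested_contexts("GTACG"))
print(polymorphism_probability("GTACG", math.log(0.01), edges))
```

Because the shifts are stored on the log scale, an edge of zero leaves the parent probability unchanged, which is what makes a sparse (regularized) tree a natural fit.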

Q: How do asymmetric models compare to symmetric models in terms of edges and parsimony?

Accepted Answer

When an odd-length context is expanded by two nucleotides, the asymmetric model requires 20 edges per context (4 for the first added nucleotide plus 16 for the second), compared with 16 edges per context in the symmetric model. Despite having more total edges in its tree architecture, the asymmetric model includes approximately 38% fewer edges with high confidence, which suggests greater parsimony. Asymmetric models also fit holdout data better than symmetric models, specifically in situations where there is sufficient data to estimate 8-mer edges but insufficient data to confidently estimate 9-mer rates. Overall, asymmetric models provide a more efficient and parsimonious approach to model inference in Baymer graphs.
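The per-context edge counts follow from simple counting. As a small sketch (function names are illustrative), expanding both flanks at once creates one edge per pair of added bases, while expanding one flank at a time, as the alternating 3'/5' tree does, creates an intermediate level as well:

```python
def edges_added_symmetric() -> int:
    """Expand by both flanks at once: one edge per (5' base, 3' base) pair."""
    return 4 * 4


def edges_added_stepwise() -> int:
    """Expand one flank at a time: 4 edges for the first added base,
    then 4 more under each of those 4 children for the second base."""
    first_level = 4
    second_level = first_level * 4
    return first_level + second_level


print(edges_added_symmetric(), edges_added_stepwise())  # 16 20
```

The stepwise tree therefore has more edges available, so any gain in parsimony has to come from regularization setting many of them to zero.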

Q: How does Baymer estimate posterior distributions for each parameter?

Accepted Answer

Baymer uses a Bayesian formulation that yields a posterior distribution, rather than a point estimate, for the polymorphism probability at each sequence context, so uncertainty in the underlying rates is carried through the inference. Simulations are used to assess how often the posterior distribution captures the simulated values, which checks the coverage of the estimated probabilities. Regularization produces parsimonious models that capture most of the information with the fewest non-zero parameters, which helps in cases with limited data or rare sequence contexts. The robustness of the inferred rates is evaluated by comparing probabilities between holdout sets and independently trained models; these show strong correlation and a tight distribution. This supports the model's transferability across data sets and highlights the influence of bases proximal to the focal site on polymorphism probabilities.
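The idea of checking how often a posterior captures simulated values can be illustrated with a toy example. Note the substitution: this sketch uses a single conjugate Beta-binomial model with a uniform prior, not Baymer's tree model or its MCMC sampler, and all names and numbers are hypothetical.

```python
import random


def credible_interval(successes, trials, n_draws=400, level=0.95, rng=random):
    """Monte Carlo credible interval from a Beta(1 + k, 1 + n - k) posterior.

    Assumption: a uniform Beta(1, 1) prior on the polymorphism probability.
    """
    draws = sorted(
        rng.betavariate(1 + successes, 1 + trials - successes)
        for _ in range(n_draws)
    )
    lo_idx = round((1 - level) / 2 * n_draws)
    hi_idx = round((1 + level) / 2 * n_draws) - 1
    return draws[lo_idx], draws[hi_idx]


def coverage(true_p, trials, n_reps=200, seed=0):
    """Fraction of simulated data sets whose interval contains `true_p`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        # Simulate the number of polymorphic sites among `trials` sites.
        k = sum(rng.random() < true_p for _ in range(trials))
        lo, hi = credible_interval(k, trials, rng=rng)
        hits += lo <= true_p <= hi
    return hits / n_reps


print(coverage(true_p=0.02, trials=2000))
```

A well-calibrated posterior should give a coverage close to the nominal level (here 95%); values far below it would indicate overconfident estimates.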

Q: What impact do 9-mer edges have on polymorphism probabilities?

Accepted Answer

9-mer edges have a considerable impact on polymorphism probabilities in extended sequence contexts. The 9-mer level accounts for the largest absolute number of edges (7,189 edges with PIP > 0.95) and is enriched for larger effect sizes. This trend holds across mutation types, although to varying degrees. On holdout data, the 9-mer Baymer models substantially improved the likelihood and gave the best fit compared with models using smaller contexts.
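The PIP > 0.95 filter can be sketched as follows, assuming the spike is a point mass at zero so that a posterior draw of exactly 0.0 means the edge was excluded in that sample. That assumption, the function names, and the example draws are illustrative and not taken from the source.

```python
def posterior_inclusion_probability(samples):
    """Fraction of posterior draws in which the edge is non-zero (in the slab)."""
    if not samples:
        raise ValueError("need at least one posterior draw")
    return sum(1 for s in samples if s != 0.0) / len(samples)


def high_confidence_edges(edge_samples, threshold=0.95):
    """Keep edges whose PIP exceeds `threshold`."""
    return {
        ctx: pip
        for ctx, draws in edge_samples.items()
        if (pip := posterior_inclusion_probability(draws)) > threshold
    }


# Hypothetical posterior draws for two edges (0.0 means the spike was sampled).
edge_samples = {
    "GTACG": [0.41, 0.38, 0.0, 0.44] * 5,   # PIP = 0.75
    "TAC": [0.69, 0.71, 0.68, 0.70] * 5,    # PIP = 1.0
}
print(high_confidence_edges(edge_samples))  # {'TAC': 1.0}
```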

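Q: What are the identified motifs enriched in the highest or lowest 1% of 9-mer polymorphism probabilities?

Accepted Answer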

Motifs enriched in the highest or lowest 1% of 9-mer polymorphism probabilities include repeat-rich motifs, as well as motifs with flanks extending 4 base pairs from the focal nucleotide. The analysis recapitulated previously reported motifs, which underscores the utility of expanded sequence context windows for modeling mutability. Motifs with p < 0.0001 were treated as significant when relating sequence context motifs to changes in polymorphism probability.

Q: How can a sequence context model capture de novo mutational rates effectively?

Accepted Answer

A sequence context model can capture de novo mutation rates effectively if its formulation handles data sparsity, and Baymer was applied to develop a model that best captures rates of de novo mutation across the genome. With de novo variants partitioned into even and odd base pairs, the overall likelihood in the testing set improved substantially for 5-mer windows compared with 3-mers. Baymer did not select any sequence context feature beyond the 5-mer level with PIP > 0.95, which illustrates the importance of including only informative contexts to avoid overfitting.

Previous work has shown that de novo mutation probabilities can be inferred using rare-variant polymorphism data from population samples as a proxy. Building variant partitions from larger sample sizes, closely matched ancestry, and rare variants can yield a transferable model and robust rate estimates. At the 3-mer level, the model trained on the de novo training set outperformed all other models, despite having fewer variants contributing to it. For larger windows of context, however, several polymorphism partitions explained the data better than models trained directly on de novo events. Models trained exclusively on singletons and ALL-2 performed considerably worse than the rest across all windows of sequence context. Stricter quality filters improved model performance, but did not surpass the de novo training model at the 3-mer level. Training on a population-matched sample that excludes singletons (NFE-2+) best predicted rates of de novo mutation in 5-mer or larger contexts, better than models trained on de novo events directly. When each partition was downsampled to match the number of variants in the de novo training set, partitions that included NFE exclusively performed better than the entirety of gnomAD ('ALL'), which included a more diverse panel of individuals within Europe. This indicates that variants derived from samples with more closely matched ancestries are the most informative for capturing de novo mutation rates.
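Comparisons like these rest on scoring each trained model by its likelihood on held-out data. The following is a minimal sketch of that kind of evaluation, assuming a simple per-context binomial likelihood and a parity-based split of positions; the data layout, function names, and numbers are hypothetical and are not the study's actual pipeline.

```python
import math


def split_even_odd(variants):
    """Partition (position, context) records by the parity of the position."""
    even = [v for v in variants if v[0] % 2 == 0]
    odd = [v for v in variants if v[0] % 2 == 1]
    return even, odd


def holdout_log_likelihood(counts, probs):
    """Binomial log-likelihood (up to a constant) of holdout counts.

    `counts` maps context -> (n_mutated, n_sites); `probs` maps
    context -> predicted mutation probability from a trained model.
    """
    total = 0.0
    for ctx, (k, n) in counts.items():
        p = probs[ctx]
        total += k * math.log(p) + (n - k) * math.log1p(-p)
    return total


# Hypothetical holdout counts and two competing models.
holdout = {"TAC": (30, 1000), "GAC": (10, 1000)}
coarse_model = {"TAC": 0.02, "GAC": 0.02}   # ignores the 5' base
fine_model = {"TAC": 0.03, "GAC": 0.01}     # distinguishes the two contexts
print(holdout_log_likelihood(holdout, coarse_model))
print(holdout_log_likelihood(holdout, fine_model))
```

The model with the higher holdout log-likelihood explains the unseen data better. A larger context window only wins this comparison when there is enough training data to estimate its extra parameters reliably.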

Q: How does a grafted tree approach improve de novo mutational probability estimates?

Accepted Answer

A grafted tree approach improves de novo mutational probability estimates by combining the strengths of de novo and polymorphism-based models. It exploits the nested tree structure of polymorphism probability models, which allows specific branches to be interchanged. The grafted tree can therefore include edges in expanded sequence contexts where the de novo model lacks power because of sparsity. By combining 1- to 3-mer edges estimated from the de novo training data with 4- to 9-mer edges estimated from the NFE-2+ data, the grafted tree model fits the holdout de novo data better than either the NFE-2+ model or the de novo model alone.
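The grafting step can be pictured as merging two edge tables by context size. In this sketch each model is a plain dictionary from context string to log-scale edge value. That representation, the function name, and the numbers are hypothetical and chosen only for illustration.

```python
def graft_trees(shallow_model, deep_model, max_shallow_size=3):
    """Combine edges from two trees that share the same nested structure.

    Edges for contexts up to `max_shallow_size` come from `shallow_model`;
    edges for larger contexts come from `deep_model`.
    """
    grafted = {
        ctx: edge
        for ctx, edge in shallow_model.items()
        if len(ctx) <= max_shallow_size
    }
    grafted.update(
        {ctx: edge for ctx, edge in deep_model.items() if len(ctx) > max_shallow_size}
    )
    return grafted


# Hypothetical edge values for one substitution class.
de_novo_edges = {"A": -4.6, "AC": 0.10, "TAC": 0.25, "TACG": 0.90}
nfe_edges = {"A": -4.2, "TAC": 0.20, "TACG": 0.05, "GTACG": -0.30}
print(graft_trees(de_novo_edges, nfe_edges))
```

In this toy case the 1- to 3-mer edges keep the de novo values, while the 4-mer and 5-mer edges are taken from the polymorphism-based table.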

Q: How do human polymorphism models compare with chimpanzee and gorilla models?

Accepted Answer

Human polymorphism models show similar rates of polymorphism at higher orders of sequence context across closely related great ape species. Even so, species-matched 9-mer models outperformed all other models in both the chimpanzee and gorilla tests. Although human 9-mer models are outperformed at the 9-mer level, they have higher likelihood than chimpanzee 7-mer models on chimpanzee data and than gorilla 5-mer models on gorilla data. These results suggest that within-species models best capture variability in observed polymorphism levels.

Most frequently asked questions

1. What are the limitations of previous methods in estimating mutation probabilities in expanded sequence contexts?

2. How does the tree-based sequence-context model capture polymorphism probabilities?

3. How do asymmetric models compare to symmetric models in terms of edges and parsimony?

4. How does Baymer estimate posterior distributions for each parameter?

5. What impact do 9-mer edges have on polymorphism probabilities?

6. What are the identified motifs enriched in the highest or lowest 1% of 9-mer polymorphism probabilities?

7. What are the observed differences in polymorphism probabilities across human populations in the NYGC resequencing of 1KG Phase III?

8. How can a sequence context model capture de novo mutational rates effectively?

9. How does a grafted tree approach improve de novo mutational probability estimates?

10. How do human polymorphism models compare with chimpanzee and gorilla models?
