1. What are the limitations of previous methods in estimating mutation probabilities in expanded sequence contexts?
The limitations of previous methods in estimating mutation probabilities in expanded sequence contexts fall into three areas: scalability, regularization, and uncertainty. First, the size of the model grows exponentially with the number of nucleotides included, which limits both computational feasibility and statistical power. Second, although every sequence context could in principle be meaningful, biological intuition favors a more parsimonious model in which only a subset of contexts contributes meaningfully to observed variation. This is particularly important when inferring somatic and de novo mutation rates, or in other data-sparse situations. Third, previous methods do not directly report the uncertainty that arises from multinomial variance and heterogeneity in larger sequence contexts. As context size expands, each context has effectively less data and its estimate carries more uncertainty, making point estimates unreliable. No previous method addresses all three limitations at once. Deep-learning frameworks and IUPAC-motif-based clustering approaches have tackled scalability and regularization, but none explicitly estimates parameter uncertainty. The CIPI model addresses uncertainty but focuses on smaller context-window motifs and does not scale to larger context windows or contemporary population genomics data sets. A method is therefore needed that addresses all three limitations simultaneously and provides uncertainty estimates in expanded sequence contexts.
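The scalability point can be made concrete with a small sketch. It counts raw k-mer contexts over A, C, G and T and shows how a fixed number of observed sites is spread across them. The function names and the figure of ten million sites are illustrative assumptions, not values from the source, and the count ignores any collapsing of reverse-complement contexts.

```python
def n_contexts(k: int) -> int:
    """Number of distinct k-mer sequence contexts over A, C, G, T."""
    return 4 ** k


def mean_sites_per_context(n_sites: int, k: int) -> float:
    """Average number of observed sites per context if spread evenly."""
    return n_sites / n_contexts(k)


# Hypothetical data set size, chosen only for illustration.
N_SITES = 10_000_000

for k in (3, 5, 7, 9):
    print(k, n_contexts(k), round(mean_sites_per_context(N_SITES, k), 1))
```

Each extra nucleotide multiplies the number of contexts by four, so the data available per context shrinks by the same factor.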
2. How does the tree-based sequence-context model capture polymorphism probabilities?
The tree-based sequence-context model captures polymorphism probabilities with a rooted, tree-based graph in which each substitution class is represented distinctly. Each level of the tree corresponds to a larger sequence window than the level above it, alternating between adding a nucleotide on the 3' end for even-sized contexts and on the 5' end for odd-sized contexts. Non-root edges represent the log-transformed, multiplicative shift in polymorphism probability, while the root edge corresponds to an estimated base polymorphism probability. The model uses a Bayesian formulation to generate posterior distributions for polymorphism probabilities, which provides uncertainty around the parameter estimates. Regularization enters the parameter estimation procedure for tree edges through a spike-and-slab prior, and the fraction of posterior samples that fall in the slab rather than the spike is estimated for each edge. An adaptive Metropolis-within-Gibbs Markov chain Monte Carlo scheme samples from the posterior distribution of the model. Parameters are estimated level by level, leveraging the conditional dependency structure of the hierarchical tree, which keeps the mutation probabilities identifiable and aids convergence.
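A minimal sketch of how a probability could be read off such a tree is given below, under stated assumptions. The focal site is taken to sit at index (n - 1) // 2 of an n-mer, edges are stored in a plain dictionary keyed by context, and an absent edge is treated as a zero shift. The names `path_to_root`, `polymorphism_probability` and `edge_shifts` are hypothetical and are not Baymer's API. The sketch covers a single substitution class and does no inference.

```python
import math


def path_to_root(context: str) -> list[str]:
    """Nested sub-contexts from the 1-mer focal base out to `context`.

    Even-sized windows are reached by extending the 3' end and
    odd-sized windows by extending the 5' end.
    """
    focal = (len(context) - 1) // 2  # assumed focal-site position
    left, right = focal, focal       # inclusive bounds of the window
    path = [context[left:right + 1]]
    while right - left + 1 < len(context):
        next_size = right - left + 2
        if next_size % 2 == 0:
            right += 1  # even-sized context: add a 3' nucleotide
        else:
            left -= 1   # odd-sized context: add a 5' nucleotide
        path.append(context[left:right + 1])
    return path


def polymorphism_probability(context, log_base, edge_shifts):
    """exp(root log-probability + sum of log shifts along the path)."""
    shifts = (edge_shifts.get(c, 0.0) for c in path_to_root(context)[1:])
    return math.exp(log_base + sum(shifts))


# Toy values, for illustration only: the CG edge doubles the base rate.
print(path_to_root("ACG"))
print(polymorphism_probability("ACG", math.log(0.01), {"CG": math.log(2.0)}))
```

Because shifts add on the log scale, an edge with a shift of zero leaves the parent context's probability unchanged, which is what makes sparse, regularized trees possible.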
3. How do asymmetric models compare to symmetric models in terms of edges and parsimony?
When expanding an odd-length context by two nucleotides, asymmetric models require 20 edges per context (4 for the first added nucleotide plus 16 for the second), compared to 16 edges per context in symmetric models. Despite having more total edges in the tree architecture, asymmetric models include approximately 38% fewer overall edges with high confidence, which suggests greater parsimony. Asymmetric models also fit holdout data better than symmetric models, specifically in situations where there is sufficient data to estimate 8-mer edges but insufficient data to confidently estimate 9-mer rates. Overall, asymmetric models provide a more efficient and parsimonious approach to model inference in the context of Baymer graphs.
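The edge counts can be derived directly from the tree structure, assuming each added flanking position can take any of four nucleotides. The function names are illustrative only.

```python
NUCLEOTIDES = 4


def symmetric_edges_per_context() -> int:
    """One step adds a 5' and a 3' nucleotide together: 4 * 4 children."""
    return NUCLEOTIDES ** 2


def asymmetric_edges_per_context() -> int:
    """Two steps, one nucleotide each: 4 children, then 4 for each child."""
    first_level = NUCLEOTIDES
    second_level = first_level * NUCLEOTIDES
    return first_level + second_level


print(symmetric_edges_per_context(), asymmetric_edges_per_context())
```

The asymmetric architecture carries the extra intermediate level (the even-sized contexts), so it has more edges in total even though fewer of them are retained with high confidence.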
4. How does Baymer estimate posterior distributions for each parameter?
Baymer estimates a posterior distribution, rather than a single point estimate, for each parameter, so the probability of polymorphism at each sequence context carries explicit uncertainty. This allows the model to represent the underlying rates together with their uncertainty, which makes inference more robust. Simulations are used to assess how often the posterior distribution captures the simulated values, that is, the coverage of the estimated probabilities. By incorporating regularization, Baymer creates parsimonious models that capture most of the information with the fewest non-zero parameters, which addresses cases with limited data or rare sequence contexts. The robustness of the inferred rates is evaluated by comparing probabilities between holdout sets and independently trained models, which show strong correlation and a tight distribution of probabilities. This supports the model's transferability across data sets and highlights how strongly the bases proximal to the focal site influence polymorphism probabilities.
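The simulation-based coverage check can be illustrated with a deliberately simplified stand-in. The sketch below does not use Baymer's model. It substitutes a conjugate Beta-binomial model for a single context, with a Beta(1, 1) prior, and measures how often a 95% credible interval contains the simulated true probability. All names and parameter values are assumptions made for illustration.

```python
import random


def credible_interval(successes, trials, level=0.95, draws=1000, rng=random):
    """Monte Carlo equal-tailed credible interval under a Beta(1, 1) prior."""
    samples = sorted(
        rng.betavariate(1 + successes, 1 + trials - successes)
        for _ in range(draws)
    )
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper


def coverage(true_p, trials, n_sims=200, seed=0):
    """Fraction of simulated data sets whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Simulate the number of polymorphic sites among `trials` sites.
        successes = sum(rng.random() < true_p for _ in range(trials))
        lower, upper = credible_interval(successes, trials, rng=rng)
        hits += lower <= true_p <= upper
    return hits / n_sims


print(coverage(true_p=0.02, trials=500))
```

A well-calibrated posterior should give coverage close to the nominal level. Coverage far below it would indicate that the reported uncertainty is too narrow.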