phastSim: efficient simulation of sequence evolution for pandemic-scale datasets

doi:10.1101/2021.03.15.435416

Open Access•Posted Content•10.1101/2021.03.15.435416

De Maio N, +6 more

- 16 Mar 2021

- bioRxiv

13

TL;DR: phastSim is based on the Gillespie approach and implements an efficient multi-layered search tree structure. It achieves high computational efficiency by exploiting the fact that only a small proportion of the genome is likely to mutate on each branch of the phylogeny.

Abstract: Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including new ones that we developed to more realistically model SARS-CoV-2 genome evolution.
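
To make the Gillespie approach named in the abstract concrete, here is a minimal sketch of simulating substitutions along a single branch. It is not phastSim's code: the function name `gillespie_branch`, the `JC_RATES` table (a Jukes-Cantor-like model with equal rates) and the linear scan over sites are assumptions chosen for brevity. The naive scan recomputes every site rate at each event, which is the cost a search tree structure is meant to avoid.

```python
import random

# Hypothetical Jukes-Cantor-like rates: every substitution has rate 1/3.
JC_RATES = {(a, b): 1.0 / 3.0 for a in "ACGT" for b in "ACGT" if a != b}


def gillespie_branch(genome, rates, branch_length, rng):
    """Simulate substitutions along one branch (naive illustrative version).

    genome: string of nucleotides at the top of the branch.
    rates: dict mapping (from_allele, to_allele) -> substitution rate.
    Returns (final_genome, events), where each event is
    (time, position, old_allele, new_allele) with 0-based positions.
    """
    genome = list(genome)
    events = []
    t = 0.0
    while True:
        # Total rate of leaving the current allele, for every site.
        site_rates = [
            sum(r for (a, _), r in rates.items() if a == nuc) for nuc in genome
        ]
        total = sum(site_rates)
        if total <= 0:
            break
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total)
        if t > branch_length:
            break
        # Pick the mutating site, then the new allele, proportionally to rate.
        pos = rng.choices(range(len(genome)), weights=site_rates)[0]
        old = genome[pos]
        targets = [(b, r) for (a, b), r in rates.items() if a == old]
        new = rng.choices(
            [b for b, _ in targets], weights=[r for _, r in targets]
        )[0]
        events.append((t, pos, old, new))
        genome[pos] = new
    return "".join(genome), events
```

When branches are short, most calls return after drawing a single waiting time that overshoots the branch length, so very few mutation events need to be sampled per branch.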

Figures

Fig 1. Example genome tree and genome tree search. An example genome search tree for ancestral genome ACGGT. Blue nodes are terminal and red nodes are internal. Inside each node we represent on top the genome positions represented by the node; at the center inside terminal nodes we show the allele of the node; at the bottom of nodes is their total rate. Under each terminal node we show the example relevant mutation rates. The black arrows show an example sampling of one mutation event. A parameter R is assigned an initial random number sampled uniformly between 0 and the total rate 8.1, in this case it is R = 4.7. As we move downward, the value of R can decrease, as described in Algorithm 2, determining which site will mutate and how. Here, an initial R = 4.7 results in the sampling of a G→T mutation at genome position 4.
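
The descent described in the Fig 1 caption can be sketched as follows. This is a simplified binary version under stated assumptions, not the paper's Algorithm 2: the `Node` class, the helper names and all rate values are hypothetical, and the rates do not reproduce those in the figure (the total here is 8.0, not 8.1).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    rate: float                      # total mutation rate below this node
    left: Optional["Node"] = None    # children are set on internal nodes only
    right: Optional["Node"] = None
    position: Optional[int] = None   # genome position, terminal nodes only
    allele: Optional[str] = None     # current allele, terminal nodes only
    mutations: Dict[str, float] = field(default_factory=dict)


def leaf(position, allele, mutations):
    """Terminal node: one site with its per-target mutation rates."""
    return Node(rate=sum(mutations.values()), position=position,
                allele=allele, mutations=dict(mutations))


def internal(left, right):
    """Internal node: its rate is the sum of its children's rates."""
    return Node(rate=left.rate + right.rate, left=left, right=right)


def sample_mutation(root, r):
    """Map r in [0, root.rate) to (position, old_allele, new_allele)."""
    node = root
    while node.left is not None:
        if r < node.left.rate:
            node = node.left
        else:
            r -= node.left.rate      # skip the left subtree's share of rate
            node = node.right
    for new_allele, rate in node.mutations.items():
        if r < rate:
            return node.position, node.allele, new_allele
        r -= rate
    # Floating-point rounding can leave r marginally too large;
    # fall back to the last listed mutation.
    return node.position, node.allele, new_allele


# Hypothetical search tree for the ancestral genome ACGGT (1-based positions).
leaves = [
    leaf(1, "A", {"C": 0.5, "G": 1.0, "T": 0.5}),
    leaf(2, "C", {"A": 0.5, "G": 0.5, "T": 1.0}),
    leaf(3, "G", {"A": 1.0, "C": 0.25, "T": 0.25}),
    leaf(4, "G", {"A": 1.0, "C": 0.25, "T": 0.25}),
    leaf(5, "T", {"A": 0.25, "C": 0.5, "G": 0.25}),
]
root = internal(internal(internal(leaves[0], leaves[1]), leaves[2]),
                internal(leaves[3], leaves[4]))
```

With these made-up rates, `sample_mutation(root, 6.9)` descends to position 4 and selects G→T. In a simulation, `r` would be drawn uniformly between 0 and `root.rate`, and each draw costs a number of steps proportional to the tree depth rather than to the genome length.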

Fig 5. Comparison of running times of different simulators in a SARS-CoV-2 scenario using different evolutionary models. On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the model used for simulations: “nucleotide” is a nucleotide substitution model without variation; “nuc+10cat” is a nucleotide model with 10 rate categories;

Fig 6. Comparison of running times of INDELible and phastSim simulators in a SARS-CoV-2 scenario with indels. In this scenario we compare phastSim against INDELible-m1 and INDELible-m2 (the only other methods considered here that model indels), with uniform insertion and deletion rates of 0.1 and with an indel length distribution of Geo(0.5). Each boxplot represents ten replicates.

Fig 7. Comparison of running times of different simulators in a SARS-CoV-2 scenario after rescaling the tree branch lengths by different factors. On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the rescaling factor we use to make the phylogenetic tree branch lengths longer or shorter. Colors are as in Fig 3. To aid the visual comparison, we use trees of different sizes for different simulators: 1,000 tips for INDELible; 5,000 tips for Seq-Gen; 100,000 tips for phastSim.

Table 1. Indel length distribution options

Fig 3. Comparison of running times of different simulators in a scenario similar to SARS-CoV-2 data. On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the number of tips simulated. Each boxplot represents ten replicates. We do not run the most demanding simulators when each replicate would take substantially more than 1 minute to run. In blue is the computational demand for generating the random trees with a customised version of NGESH [39] distributed within the phastSim package; sequence simulation is performed conditional on these simulated trees. In red is the time to run phastSim with a concise output, and in orange is the time for phastSim with additionally generating a FASTA format output. In green is the demand of pyvolve, and in purple of Seq-Gen. In yellow and brown are respectively the time for running INDELible with method 1 (matrix exponentiation) and method 2 (Gillespie approach).

Citations

•Journal Article•10.1038/s41588-023-01368-0

Maximum likelihood pandemic-scale phylogenetics

Nicola De Maio, +5 more

- 10 Apr 2023

- Nature Genetics

TL;DR: MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster and requiring at least 100 times less memory on large datasets.

32

•Journal Article•10.1371/journal.pcbi.1010409

VGsim: Scalable viral genealogy simulator for global pandemic

Vladimir Shchur, +5 more

- 01 Aug 2022

- PLOS Computational Biology

TL;DR: A new fast simulator, VGsim, is introduced; it addresses the problem of simulating genealogies under epidemiological models and can model complex population structure, epistasis and immunity escape.

10

•Posted Content•10.1101/2021.04.21.21255891

VGsim: scalable viral genealogy simulator for global pandemic

Vladimir Shchur, +5 more

- 27 Apr 2021

- medRxiv

TL;DR: VGsim is a fast simulator of viral genealogies for the COVID-19 pandemic, in which the simulation is split into two phases. During the forward run, the algorithm generates a chain of events reflecting the dynamics of the pandemic using a hierarchical version of the Gillespie algorithm.

9

•Journal Article•10.1101/2024.03.25.586631

Seamless, rapid and accurate analyses of outbreak genomic data using Split K-mer Analysis (SKA)

Romain Derelle, +9 more

- 29 Mar 2024

- bioRxiv

TL;DR: Split K-mer Analysis (SKA2) is described, a method which supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies and has the potential to become the tool of choice for genotyping bacteria.

7

•Posted Content•10.1101/2022.11.17.516989

Running ahead of evolution - AI based simulation for predicting future high-risk SARS-CoV-2 variants

Jie Chen, +19 more

- 12 Jan 2023

- bioRxiv

TL;DR: The authors developed a multi-constraint variation prediction framework to simulate SARS-CoV-2 RBD mutations, reaching a peak performance of 366.8 PFLOPS with 96.5% scalability and a 493.9× speedup.

3

References

Journal Article•10.1021/J100540A008

Exact Stochastic Simulation of Coupled Chemical Reactions

Daniel T. Gillespie

- 01 Dec 1977

- The Journal of Physical Chemistry

TL;DR: In this article, a simulation algorithm for the stochastic formulation of chemical kinetics is proposed, which uses a rigorously derived Monte Carlo procedure to numerically simulate the time evolution of a given chemical system.

11.5K

Book Chapter•10.1016/B978-1-4832-3211-9.50009-7

CHAPTER 24 – Evolution of Protein Molecules

Thomas H. Jukes

- 01 Jan 1969

11.1K

Journal Article•10.1007/BF02101694

Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.

Masami Hasegawa, +2 more

- 01 Jan 1985

- Journal of Molecular Evolution

TL;DR: A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed, and this dating may pose a problem for the widely believed hypothesis that the bipedal creature Australopithecus afarensis, which lived some 3.7 million years ago, was ancestral to man and evolved after the human-ape splitting.

8.9K