phastSim: efficient simulation of sequence evolution for pandemic-scale datasets
Nicola De Maio,Lukas Weilguny,Conor R Walker,Yatish Turakhia,Russell Corbett-Detig,Nick Goldman +6 more
TL;DR: phastSim is based on the Gillespie approach and implements an efficient multi-layered search tree structure that achieves high computational efficiency by exploiting the fact that only a small proportion of the genome is likely to mutate on each branch of the phylogeny.
Abstract: Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. >100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including new ones that we developed to more realistically model SARS-CoV-2 genome evolution.
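To make the "Gillespie approach" mentioned in the abstract concrete, here is a minimal sketch of simulating substitutions along a single branch. It is not phastSim's code: the function name `evolve_branch`, the toy rate matrix `RATES`, and the linear site lookup are all assumptions made for illustration, and the lookup is exactly the step a search tree would speed up.

```python
import random

# Hypothetical toy substitution rates: RATES[a][b] is the rate of a -> b.
RATES = {
    "A": {"C": 0.2, "G": 0.4, "T": 0.1},
    "C": {"A": 0.1, "G": 0.1, "T": 0.5},
    "G": {"A": 1.0, "C": 0.3, "T": 0.5},
    "T": {"A": 0.6, "C": 2.0, "G": 0.5},
}

def evolve_branch(genome, branch_length, rates=RATES, rng=random):
    """Gillespie simulation along one branch.

    Returns the evolved genome and a list of (time, position, old, new)
    events, with 0-based positions.
    """
    seq = list(genome)
    site_rates = [sum(rates[a].values()) for a in seq]
    total = sum(site_rates)
    events = []
    # Waiting time to the first event is exponential with the total rate.
    t = rng.expovariate(total)
    while t < branch_length:
        # Pick the mutating site with probability proportional to its rate
        # (naive linear scan over all sites).
        pos = rng.choices(range(len(seq)), weights=site_rates)[0]
        old = seq[pos]
        targets = list(rates[old])
        new = rng.choices(targets, weights=[rates[old][b] for b in targets])[0]
        seq[pos] = new
        events.append((t, pos, old, new))
        # The new allele changes this site's rate, hence the total rate.
        new_rate = sum(rates[new].values())
        total += new_rate - site_rates[pos]
        site_rates[pos] = new_rate
        t += rng.expovariate(total)
    return "".join(seq), events
```

When branches are short, the loop body runs rarely, so most of the cost in this naive version lies in setting up and scanning per-site rates; avoiding that work for sites that do not mutate is the motivation the abstract gives for the search tree.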
Figures

Fig 1. Example genome tree and genome tree search. An example genome search tree for ancestral genome ACGGT. Blue nodes are terminal and red nodes are internal. Inside each node we represent on top the genome positions represented by the node; at the center inside terminal nodes we show the allele of the node; at the bottom of nodes is their total rate. Under each terminal node we show the example relevant mutation rates. The black arrows show an example sampling of one mutation event. A parameter R is assigned an initial random number sampled uniformly between 0 and the total rate 8.1, in this case it is R = 4.7. As we move downward, the value of R can decrease, as described in Algorithm 2, determining which site will mutate and how. Here, an initial R = 4.7 results in the sampling of a G→T mutation at genome position 4. 
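The descent described in the Fig 1 caption can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 2: it assumes a plain binary tree rather than the multi-layered structure, and the rate values in `RATES` are made up (they are not the values shown in the figure), chosen only so that the ancestral genome ACGGT has total rate 8.1 and R = 4.7 leads to a G→T mutation at position 4, as in the caption. The names `Node`, `build`, and `sample` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-allele substitution rates (NOT taken from the figure).
RATES = {
    "A": {"C": 0.2, "G": 0.4, "T": 0.1},
    "C": {"A": 0.1, "G": 0.1, "T": 0.5},
    "G": {"A": 1.0, "C": 0.3, "T": 0.5},
    "T": {"A": 0.6, "C": 2.0, "G": 0.5},
}

@dataclass
class Node:
    start: int                      # 0-based, half-open range [start, end)
    end: int
    total: float                    # summed mutation rate of the range
    allele: Optional[str] = None    # set only on terminal nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(genome, start=0, end=None):
    """Build a binary search tree whose nodes store total rates."""
    if end is None:
        end = len(genome)
    if end - start == 1:
        allele = genome[start]
        return Node(start, end, sum(RATES[allele].values()), allele=allele)
    mid = (start + end) // 2
    left, right = build(genome, start, mid), build(genome, mid, end)
    return Node(start, end, left.total + right.total, left=left, right=right)

def sample(root, r):
    """Map r in [0, root.total) to (1-based position, old allele, new allele)."""
    node = root
    while node.allele is None:
        if r < node.left.total:
            node = node.left
        else:
            r -= node.left.total    # skip the left subtree's share of the rate
            node = node.right
    for target, rate in RATES[node.allele].items():
        if r < rate:
            break
        r -= rate
    # If rounding pushed r past the last interval, the last target is used.
    return node.start + 1, node.allele, target
```

In use, `R` would be drawn with `random.uniform(0, root.total)`; calling `sample(build("ACGGT"), 4.7)` walks down the tree, reducing `R` at each right turn. After a mutation, the totals stored on the path from the affected terminal node to the root would need updating, which this sketch omits.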
Fig 5. Comparison of running times of different simulators in a SARS-CoV-2 scenario using different evolutionary models. On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the model used for simulations: “nucleotide” is a nucleotide substitution model without variation; “nuc+10cat” is a nucleotide model with 10 rate categories; 
Fig 6. Comparison of running times of INDELible and phastSim simulators in a SARS-CoV-2 scenario with indels. In this scenario we compare phastSim against INDELible-m1 and INDELible-m2 (the only other methods considered here that model indels), with uniform insertion and deletion rate of 0.1 and with indel length distribution of Geo(0.5). Each boxplot represents ten replicates. 
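As an illustration of the Geo(0.5) indel length distribution named in the Fig 6 caption, the sketch below draws lengths by inverse-transform sampling. It assumes the geometric distribution is supported on 1, 2, 3, ... with P(L = k) = (1 - p)^(k - 1) p; the caption does not state the parametrization, and `sample_indel_length` is a hypothetical helper, not part of phastSim or INDELible.

```python
import math
import random

def sample_indel_length(p=0.5, rng=random):
    """Draw an indel length from a geometric distribution on {1, 2, ...}.

    Assumed parametrization: P(L = k) = (1 - p)**(k - 1) * p, mean 1 / p.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if p == 1.0:
        return 1
    u = rng.random()  # uniform on [0, 1)
    # Inverse of the geometric CDF; max() guards the u == 0 corner case.
    return max(1, math.ceil(math.log(1.0 - u) / math.log(1.0 - p)))
```

Under this parametrization, Geo(0.5) gives a mean indel length of 2, with half of all indels having length 1.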
Fig 7. Comparison of running times of different simulators in a SARS-CoV-2 scenario after rescaling the tree branch lengths by different factors. On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the rescaling factor we use to make the phylogenetic tree branch lengths longer or shorter. Colors are as in Fig 3. To aid the visual comparison, we use trees of different sizes for different simulators: 1,000 tips for INDELible; 5,000 tips for Seq-Gen; 100,000 tips for phastSim. 
Table 1. Indel length distribution options
Fig 3. Comparison of running times of different simulators in a scenario similar to SARS-CoV-2 data. On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the number of tips simulated. Each boxplot represents ten replicates. We do not run the most demanding simulators when each replicate would take substantially more than 1 minute to run. In blue is the computational demand for generating the random trees with a customised version of NGESH [39] distributed within the phastSim package; sequence simulation is performed conditional on these simulated trees. In red is the time to run phastSim with a concise output, and in orange is the time for phastSim with additionally generating a FASTA format output. In green is the demand of pyvolve, and in purple of Seq-Gen. In yellow and brown are respectively the time for running INDELible with method 1 (matrix exponentiation) and method 2 (Gillespie approach).
Citations
Maximum likelihood pandemic-scale phylogenetics
Nicola De Maio,Prabhav Kalaghatgi,Yatish Turakhia,Russell Corbett-Detig,Bui Quang Minh,Nick Goldman +5 more
TL;DR: MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster and requiring at least 100 times less memory on large datasets.
VGsim: Scalable viral genealogy simulator for global pandemic
Vladimir Shchur,Vadim Spirin,D. S. Sirotkin,Evgeni Burovski,Nicola De Maio,Russell Corbett-Detig +5 more
TL;DR: A new fast simulator, VGsim, is introduced which addresses the problem of simulating genealogies under epidemiological models and can model complex population structure, epistasis and immunity escape.
VGsim: scalable viral genealogy simulator for global pandemic
Shchur Vladimir,Spirin Vadim,Pokrovskiy Victor,Burovski Evgeni,De Maio Nicola,Corbett-Detig Russell +5 more
TL;DR: VGsim is a fast simulator for the COVID-19 pandemic whose simulation is split into two phases. During the forward run, the algorithm generates a chain of events reflecting the dynamics of the pandemic using a hierarchical version of the Gillespie algorithm.
Seamless, rapid and accurate analyses of outbreak genomic data using Split K-mer Analysis (SKA)
Romain Derelle,Johanna von Wachsmann,Tommi Mäklin,Joel Hellewell,Timothy Russell,Ajit Lalvani,Leonid Chindelevitch,Nicholas J. Croucher,Simon R. Harris,John A. Lees +9 more
TL;DR: Split K-mer Analysis (SKA2) is described, a method which supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies and has the potential to become the tool of choice for genotyping bacteria.
Running ahead of evolution - AI based simulation for predicting future high-risk SARS-CoV-2 variants
Jie Chen,Zhiwei Nie,Yu Wang,Kai Wang,Fan Xu,Zhiheng Hu,Bin Zheng,Zhennan Wang,Guoli Song,Jingyi Zhang,Jie Fu,Xiansong Huang,Zhongqi Wang,Zhixiang Ren,Qiankun Wang,Daixi Li,Dong-Qing Wei,Bin Zhou,Chao Yang,Yonghong Tian +19 more
TL;DR: The authors developed a multi-constraint variation prediction framework to simulate SARS-CoV-2 RBD mutations, reaching a peak performance of 366.8 PFLOPS with 96.5% scalability and a 493.9× speedup.
References
Exact Stochastic Simulation of Coupled Chemical Reactions
TL;DR: In this article, a simulation algorithm for the stochastic formulation of chemical kinetics is proposed, which uses a rigorously derived Monte Carlo procedure to numerically simulate the time evolution of a given chemical system.
Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.
TL;DR: A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed, and this dating may pose a problem for the widely believed hypothesis that the bipedal creature Australopithecus afarensis, which lived some 3.7 million years ago, was ancestral to man and evolved after the human-ape splitting.
The Complete Genome Sequence of Escherichia coli K-12
Frederick R. Blattner,Guy Plunkett,Craig A. Bloch,Nicole T. Perna,Valerie Burland,Monica Riley,Julio Collado-Vides,Jeremy D. Glasner,Christopher K. Rode,George F. Mayhew,Jason Gregor,Nelson Wayne Davis,Heather A. Kirkpatrick,Michael A. Goeden,Debra J. Rose,Bob Mau,Ying Shao +16 more
TL;DR: The 4,639,221-base pair sequence of Escherichia coli K-12 is presented and reveals ubiquitous as well as narrowly distributed gene families; many families of similar genes within E. coli are also evident.
A Method for the Construction of Minimum-Redundancy Codes
David A. Huffman
- 01 Sep 1952
TL;DR: A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.