Optimizing reduced-space sequence analysis.
Raymond Wheeler,Richard Hughey +1 more
TL;DR: Analyzing optimal checkpoint placement improves both checkpoint algorithms: the improved row checkpoint algorithm performs up to one half the computation of the original, and the improved diagonal checkpoint algorithm performs up to 35% fewer computational steps than the original.
Abstract: Motivation: Dynamic programming is the core algorithm of sequence comparison, alignment and linear hidden Markov model (HMM) training. For a pair of sequence lengths m and n, the problem can be solved readily in O(mn) time and O(mn) space. The checkpoint algorithm introduced by Grice et al. (CABIOS, 13, 45–53, 1997) runs in O(Lmn) time and O(Lm n^{1/L}) space, where L is a positive integer determined by m, n, and the amount of available workspace. The algorithm is appropriate for many string comparison problems, including all-paths and single-best-path hidden Markov model training, and is readily parallelizable. The checkpoint algorithm has a diagonal version that can solve the single-best-path alignment problem in O(mn) time and O(m + n) space. Results: In this work, we improve performance by analyzing optimal checkpoint placement. The improved row checkpoint algorithm performs up to one half the computation of the original algorithm. The improved diagonal checkpoint algorithm performs up to 35% fewer computational steps than the original. We modified the SAM hidden Markov modeling package to use the improved row checkpoint algorithm. For a fixed sequence length, the new version is up to 33% faster for all-paths and 56% faster for single-best-path HMM training, depending on sequence length and allocated memory. Over a typical set of protein sequence lengths, the improvement is ∼10%. Availability: The SAM hidden Markov modeling package is freely available for academic use from http://www.cse.ucsc.edu/research/compbio/sam.html. The C++ code used to find optimal checkpoint placements is available from http://www.cse.ucsc.edu/research/kestrel.
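The row checkpoint idea described in the abstract can be illustrated with a minimal sketch. It assumes the simplest two-level case (L = 2) with evenly spaced checkpoints roughly every √n rows and a unit-cost edit distance as the scoring scheme; it does not reproduce the optimized checkpoint placement that the paper analyzes, nor the SAM implementation. The names `checkpoint_align` and `_next_row` are hypothetical.

```python
import math

def _next_row(prev, ai, b):
    """Compute one DP row of unit-cost edit distance from the previous row."""
    row = [prev[0] + 1]
    for j in range(1, len(b) + 1):
        row.append(min(prev[j] + 1,                        # delete ai
                       row[j - 1] + 1,                     # insert b[j-1]
                       prev[j - 1] + (ai != b[j - 1])))    # match or substitute
    return row

def checkpoint_align(a, b):
    """Return (distance, ops) using two-level row checkpoints.

    Only about sqrt(n) checkpoint rows plus one block of about sqrt(n)
    recomputed rows are held at any time, instead of all n + 1 rows.
    """
    n, m = len(a), len(b)
    k = max(1, math.isqrt(n))          # assumed spacing: evenly, ~sqrt(n) rows

    # Forward pass: keep only rows 0, k, 2k, ...
    checkpoints = {0: list(range(m + 1))}
    row = checkpoints[0]
    for i in range(1, n + 1):
        row = _next_row(row, a[i - 1], b)
        if i % k == 0:
            checkpoints[i] = row
    distance = row[m]

    # Traceback: recompute one block of rows at a time from its checkpoint.
    ops, i, j = [], n, m
    block_start, block = None, []
    while i > 0 and j > 0:
        s = ((i - 1) // k) * k         # checkpoint at or below row i - 1
        if s != block_start:
            block_start, block = s, [checkpoints[s]]
            for r in range(s + 1, min(s + k, n) + 1):
                block.append(_next_row(block[-1], a[r - 1], b))
        cur, prev = block[i - s], block[i - 1 - s]
        if cur[j] == prev[j - 1] + (a[i - 1] != b[j - 1]):
            ops.append("M" if a[i - 1] == b[j - 1] else "S")
            i, j = i - 1, j - 1
        elif cur[j] == prev[j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    ops.extend("D" * i)                # leftover prefix of a is deleted
    ops.extend("I" * j)                # leftover prefix of b is inserted
    ops.reverse()
    return distance, ops

if __name__ == "__main__":
    print(checkpoint_align("kitten", "sitting"))
```

In this sketch each block of rows is recomputed at most once during traceback, so the total work is about twice that of the plain O(mn) algorithm, in line with the O(Lmn) time bound for L = 2.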
Citations
MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes
TL;DR: It is shown that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome‐wide SNP data or smaller amounts of data typical in fine‐mapping studies, and it is illustrated how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole genome shotgun sequencing technologies.
A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals
TL;DR: It is demonstrated that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation.
1.7K
Genotype Imputation with Millions of Reference Samples
TL;DR: A genotype imputation method that scales to millions of reference samples and achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants.
1.1K
Implementing EM and Viterbi algorithms for Hidden Markov Model in linear memory
TL;DR: A memory sparse version of the Baum-Welch algorithm with modifications to the original probabilistic table topologies to make memory use independent of sequence length (and linearly dependent on state number) and a linear memory implementation of the Viterbi decoding algorithm.
A linear memory algorithm for Baum-Welch training.
István Miklós,Irmtraud M. Meyer +1 more
TL;DR: This article proposes the first linear-space algorithm for Baum-Welch training, which requires O(M) memory and O(L M T_max (T + E)) time per iteration for a hidden Markov model with M states, T free transition and E free emission parameters, and an input sequence of length L, where T_max is the maximum number of states that any state is connected to.
References
Identification of common molecular subsequences.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
11.3K
•Book
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
Richard Durbin,Sean R. Eddy,Anders Krogh,Graeme Mitchison +3 more
- 01 Feb 2005
TL;DR: This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally to probabilistic methods of sequence analysis.
4.5K
The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999.
Amos Marc Bairoch,Rolf Apweiler +1 more
TL;DR: The Human Proteomics Initiative (HPI), a major project to annotate all known human sequences according to the quality standards of SWISS-PROT, is described.
The String-to-String Correction Problem
TL;DR: An algorithm is presented which solves the string-to-string correction problem in time proportional to the product of the lengths of the two strings.
3.5K
Hidden Markov models in computational biology: applications to protein modeling
TL;DR: The results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels, which play an important role in excitation-contraction coupling.
2.1K