Practical methods for constructing suffix trees
Yuanyuan Tian, Sandeep Tata, Richard A. Hankins, Jignesh M. Patel
- 01 Sep 2005
- Vol. 14, Iss: 3, pp 281-299
TL;DR: This paper presents a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and shows that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.
Abstract: Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including exact and approximate string matching and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, the processor cache behavior of theoretically optimal suffix tree construction methods turns out to be poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) worst-case complexity outperforms popular linear-time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n^2) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in the literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.
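To make the idea of a top-down construction with O(n^2) worst-case cost concrete, here is a minimal in-memory sketch in Python. It is an illustration under stated assumptions, not the authors' TDD implementation: it has no buffer management or disk layout, and the names `Node`, `build_top_down`, `contains`, and `leaf_starts` are hypothetical. The sketch groups suffixes by their next character, extends each edge over the longest prefix the group shares, and recurses on each group.

```python
from typing import Dict, List


class Node:
    """A suffix tree node; only leaves record a suffix start position."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}  # edge label -> child node
        self.suffix_start = -1  # >= 0 only for leaves


def build_top_down(text: str, terminator: str = "$") -> Node:
    """Build a suffix tree by recursive partitioning (worst case O(n^2))."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator  # a unique terminator puts every suffix at a leaf
    root = Node()
    _expand(s, root, list(range(len(s))), 0)
    return root


def _expand(s: str, node: Node, suffixes: List[int], depth: int) -> None:
    # Partition the suffixes by the character found `depth` positions in.
    groups: Dict[str, List[int]] = {}
    for start in suffixes:
        groups.setdefault(s[start + depth], []).append(start)
    for ch in sorted(groups):
        group = groups[ch]
        child = Node()
        if len(group) == 1:
            # A single suffix: the leaf edge carries the rest of that suffix.
            start = group[0]
            child.suffix_start = start
            node.children[s[start + depth:]] = child
            continue
        # Extend the edge while every suffix in the group agrees.
        lcp = 1
        while len({s[start + depth + lcp] for start in group}) == 1:
            lcp += 1
        first = group[0]
        node.children[s[first + depth:first + depth + lcp]] = child
        _expand(s, child, group, depth + lcp)


def contains(root: Node, pattern: str) -> bool:
    """Exact-match query: walk edges from the root, consuming the pattern."""
    node, i = root, 0
    while i < len(pattern):
        for label, child in node.children.items():
            if label[0] == pattern[i]:
                k = min(len(label), len(pattern) - i)
                if label[:k] != pattern[i:i + k]:
                    return False
                node, i = child, i + k
                break
        else:
            return False  # no outgoing edge starts with the next character
    return True


def leaf_starts(node: Node) -> List[int]:
    """Collect the start positions of all suffixes below `node`, sorted."""
    if not node.children:
        return [node.suffix_start]
    found: List[int] = []
    for child in node.children.values():
        found.extend(leaf_starts(child))
    return sorted(found)
```

For example, `contains(build_top_down("banana"), "ana")` walks the edges labelled `a` and `na`. The recursion depth grows with the depth of the tree, so this sketch is only suitable for short inputs.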