Practical methods for constructing suffix trees

doi:10.1007/S00778-005-0154-8

Open Access•Journal Article•10.1007/S00778-005-0154-8

Yuanyuan Tian, +3 more

- 01 Sep 2005

- Vol. 14, Iss: 3, pp 281-299

99

TL;DR: This paper presents a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and shows that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.

Abstract: Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized.

In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n^2) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.
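To make the idea of a top-down, quadratic worst-case construction concrete, here is a minimal in-memory sketch in Python. It is not the paper's TDD implementation: it has no buffer management or compact on-disk layout, and the names `build_top_down`, `find_occurrences`, and the `$` terminator are assumptions chosen for illustration. It only shows the general technique of partitioning suffixes by their next character, extending each edge over the longest common prefix of its group, and recursing.

```python
from typing import Dict, List, Tuple, Union

# An internal node maps the first character of each outgoing edge to
# (edge_label, child). A leaf is the integer start position of its suffix.
Node = Dict[str, Tuple[str, Union[dict, int]]]


def build_top_down(text: str, terminator: str = "$") -> Node:
    """Build a suffix tree by recursively partitioning groups of suffixes.

    Illustrative sketch only; worst-case time is quadratic in len(text)
    (for example on a long run of one repeated character).
    """
    if terminator in text:
        raise ValueError("terminator must not occur in text")
    s = text + terminator

    def expand(suffixes: List[int], depth: int) -> Node:
        # Every suffix in `suffixes` shares its first `depth` characters.
        groups: Dict[str, List[int]] = {}
        for i in suffixes:
            groups.setdefault(s[i + depth], []).append(i)

        node: Node = {}
        for ch, group in groups.items():
            if len(group) == 1:
                # A single suffix: the edge carries the rest of it.
                i = group[0]
                node[ch] = (s[i + depth:], i)
                continue
            # Extend the edge while all suffixes in the group agree.
            # The unique terminator guarantees this loop stops in bounds.
            lcp = 1
            while len({s[i + depth + lcp] for i in group}) == 1:
                lcp += 1
            start = group[0] + depth
            node[ch] = (s[start:start + lcp], expand(group, depth + lcp))
        return node

    return expand(list(range(len(s))), 0)


def _leaves(node: Union[dict, int]) -> List[int]:
    """Collect the suffix start positions below a node."""
    if isinstance(node, int):
        return [node]
    found: List[int] = []
    for _label, child in node.values():
        found.extend(_leaves(child))
    return found


def find_occurrences(tree: Node, pattern: str) -> List[int]:
    """Return the sorted start positions of exact matches of `pattern`."""
    node: Union[dict, int] = tree
    remaining = pattern
    while remaining:
        if not isinstance(node, dict) or remaining[0] not in node:
            return []
        label, child = node[remaining[0]]
        k = min(len(label), len(remaining))
        if label[:k] != remaining[:k]:
            return []
        remaining = remaining[k:]
        node = child
    return sorted(_leaves(node))
```

For example, `find_occurrences(build_top_down("banana"), "ana")` walks the edges labelled `a` and `na` and returns the leaf positions `[1, 3]`. Each recursive call scans only the suffixes of one group, which is the kind of localized access pattern the abstract contrasts with the pointer-chasing of linear-time methods; the sketch itself makes no performance claim.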

Citations

Journal Article•10.1109/TKDE.2010.76

Efficient Periodicity Mining in Time Series Databases Using Suffix Trees

Faraz Rasheed, +2 more

- 01 Jan 2011

- IEEE Transactions on Knowledge and Data ...

TL;DR: This paper presents an algorithm that detects symbol, sequence (partial), and segment (full cycle) periodicity in time series; it is generally more time-efficient and noise-resilient than existing algorithms.

134

Journal Issue•10.1002/SPE.V40:11

Fine-grained management of software artefacts: the ADAMS system

Andrea De Lucia, +3 more

- 01 Oct 2010

- Software - Practice and Experience

TL;DR: The traceability layer in ADAMS is used to propagate events concerning changes to an artefact to the dependent artefacts, thus increasing context-awareness in the project and providing valuable support for high-level documentation and traceability management.

105

Proceedings Article•10.1145/1247480.1247572

Genome-scale disk-based suffix tree indexing

Benjarath Phoophakdee, +1 more

- 11 Jun 2007

TL;DR: TRELLIS is a novel disk-based suffix tree algorithm that scales up to genome-scale sequences; it can index the entire human genome using 2GB of memory in about 4 hours and can recover all its suffix links within 2 hours.

97

Journal Issue•10.1002/SPE.V40:11

A survey of the research on power management techniques for high-performance systems

Yongpeng Liu, +1 more

- 01 Oct 2010

- Software - Practice and Experience

TL;DR: The basic mechanisms that underlie power management techniques are reviewed and the new opportunities and problems presented by the recent adoption of virtualization techniques are discussed.

86

Journal Article•10.1007/S00778-015-0409-Y

GPU-accelerated string matching for database applications

Evangelia Sitaridi, +1 more

- 01 Oct 2016

TL;DR: This work focuses on the efficient implementation of string matching operators common in SQL queries and studies the cache memory efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in the GPU memory.

45

References

Journal Article•10.1093/NAR/GKH131

UniProt: the Universal Protein knowledgebase

Rolf Apweiler, +14 more

- 01 Jan 2004

- Nucleic Acids Research

TL;DR: The Swiss-Prot, TrEMBL and PIR protein database activities have united to form the Universal Protein Knowledgebase (UniProt), which is to provide a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and query interfaces.

8.3K

Journal Article•10.1186/GB-2004-5-2-R12

Versatile and open software for comparing large genomes

Stefan Kurtz, +6 more

- 30 Jan 2004

- Genome Biology

TL;DR: The newest version of MUMmer easily handles comparisons of large eukaryotic genomes at varying evolutionary distances, as demonstrated by applications to multiple genomes.

5.7K

Monograph•10.1017/CBO9780511574931

Algorithms on Strings, Trees, and Sequences: Suffix Trees and Their Uses

Dan Gusfield

- 01 Jan 1997

2.8K

Journal Article•10.1093/NAR/29.22.4633

REPuter: the manifold applications of repeat analysis on a genomic scale.

Stefan Kurtz, +5 more

- 15 Nov 2001

- Nucleic Acids Research

TL;DR: The wide scope of repeat analysis is circumscribed using applications in five different areas of sequence analysis: checking fragment assemblies, searching for low copy repeats, finding unique sequences, comparing gene structures, and mapping of cDNA/EST sequences.

2.3K

Proceedings Article•10.1109/SWAT.1973.13

Linear pattern matching algorithms

Peter Weiner

- 15 Oct 1973

TL;DR: A linear time algorithm for obtaining a compacted version of a bi-tree associated with a given string is presented, and it is shown how to use it to solve several pattern matching problems, including some from [4], in linear time.

2.1K