Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

doi:10.1145/3375890

Open Access•Journal Article•10.1145/3375890

Travis Gagie, +2 more

- 15 Jan 2020

- Journal of the ACM

- Vol. 67, Iss: 1, pp 1-54

188

TL;DR: This article shows how to extend the Run-Length FM-index so that it locates the occ occurrences of a pattern in O(occ log log n) time within O(r) space; in experiments, the resulting index outperforms the space-competitive alternatives by 1 to 2 orders of magnitude in time.

Abstract: Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and locates them in optimal time as well, O(m + occ). By further raising the space by an O(w / log σ) factor, where σ is the alphabet size and w = Ω(log n) is the word size of the RAM machine in bits, we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length e in the almost-optimal time O(log(n/r) + e log(σ)/w). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time O(log(n/r)), and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1 to 2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1 to 2 orders of magnitude in space and/or 2 to 3 in time.
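To make the quantities in the abstract concrete, the following is a minimal Python sketch of the measure r (number of BWT runs) and of the count and locate queries via FM-index backward search. It is not the article's index: it stores a full suffix array and uses linear-time rank scans, so it takes O(n) space and gives none of the O(r)-space or time guarantees described above. All function names are hypothetical, and the text is assumed to end with a unique terminator `$` that sorts before every other character.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    # BWT[i] is the character that cyclically precedes the i-th smallest suffix.
    return "".join(text[i - 1] for i in sa)


def count_runs(bwt):
    # r = number of maximal runs of equal characters in the BWT.
    return sum(1 for i, c in enumerate(bwt) if i == 0 or c != bwt[i - 1])


def backward_search(bwt, pattern):
    # Returns the half-open suffix-array range [lo, hi) of suffixes
    # that start with `pattern`; the range is empty if there is no match.
    counts = {}
    for c in bwt:
        counts[c] = counts.get(c, 0) + 1
    # first[c] = number of text characters strictly smaller than c.
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def rank(c, i):
        # Occurrences of c in bwt[:i]; a linear scan stands in for a
        # real rank structure.
        return bwt.count(c, 0, i)

    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in first:
            return 0, 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(bwt, pattern):
    lo, hi = backward_search(bwt, pattern)
    return hi - lo


def locate(bwt, sa, pattern):
    # Reads positions directly from the full suffix array, which is
    # exactly the O(n)-space component an r-bounded index must avoid.
    lo, hi = backward_search(bwt, pattern)
    return sorted(sa[lo:hi])


text = "abracadabra$"
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
print(bwt, count_runs(bwt), count(bwt, "abra"), locate(bwt, sa, "abra"))
```

On repetitive inputs the run count r tends to be much smaller than n, which is why bounding the index size in terms of r is attractive; the sketch only measures r and does not exploit it.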

Citations

•Posted Content

Algorithms and Complexity on Indexing Founder Graphs.

Massimo Equi, +5 more

- 25 Feb 2021

- arXiv: Data Structures and Algorithms

TL;DR: This paper studies the problem of matching a string in a labeled graph and shows that, unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently.

4

•Proceedings Article•10.5220/0010834100003123

Lossy Compressor Preserving Variant Calling through Extended BWT

Veronica Guerrini, +2 more

- 17 Apr 2023

- Bioinformatics

TL;DR: This paper considers the novel problem of lossily compressing FASTQ data, in a reference-free way, by modifying both components at the same time while preserving the important information of the original FASTQ.

4

•Journal Article•10.1007/978-3-031-43980-3_26

Constant Time and Space Updates for the Sigma-Tau Problem

Zsuzsanna Lipták, +3 more

TL;DR: Researchers improve algorithms for constructing Hamiltonian paths and cycles in the Sigma-Tau graph, achieving constant time and space updates, and presenting the first combinatorial generation algorithm optimal in both time and space for n-permutations.

3

•Journal Article•10.1016/j.ic.2021.104749

Faster repetition-aware compressed suffix trees based on Block Trees

01 May 2022

TL;DR: This paper proposes the Block-Tree Compressed Topology (BT-CT), which represents the topology of trees with large repeated subtrees by augmenting the block tree nodes with data that speeds up tree navigation.

3

•Posted Content

Subpath Queries on Compressed Graphs: a Survey

Nicola Prezza

- 19 Nov 2020

- arXiv: Data Structures and Algorithms

TL;DR: This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today's compressed indexes for labeled graphs and regular languages.

3

References

•Journal Article•10.1038/NMETH.1923

Fast gapped-read alignment with Bowtie 2

Ben Langmead, +3 more

- 01 Apr 2012

- Nature Methods

TL;DR: Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

52.8K

•Journal Article•10.1186/GB-2009-10-3-R25

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

Ben Langmead, +3 more

- 04 Mar 2009

- Genome Biology

TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches; multiple processor cores can be used simultaneously to achieve even greater alignment speeds.

23.4K

A Block-sorting Lossless Data Compression Algorithm

Michael Burrows, +1 more

- 01 Jan 1994

TL;DR: This report describes a block-sorting, lossless data compression algorithm and an implementation of it, and compares the implementation's performance with widely available data compressors running on the same hardware.

3K

•Journal Article•10.1109/TIT.1976.1055501

On the Complexity of Finite Sequences

A. Lempel, +1 more

- 01 Jan 1976

- IEEE Transactions on Information Theory

TL;DR: A new approach to the problem of evaluating the complexity ("randomness") of finite sequences is presented, related to the number of steps in a self-delimiting production process by which a given sequence is presumed to be generated.

2.8K

Monograph•10.1017/CBO9780511574931

Algorithms on Strings, Trees, and Sequences: Suffix Trees and Their Uses

Dan Gusfield

- 01 Jan 1997

2.8K

Related Papers (5)

A Block-sorting Lossless Data Compression Algorithm

Compact Data Structures: A Practical Approach

Suffix arrays: a new method for on-line string searches

Storage and Retrieval of Highly Repetitive Sequence Collections

The smallest grammar problem