Engineering a lightweight suffix array construction algorithm: (Extended abstract)

Open Access•Proceedings Article

- 01 Jan 2002

- pp 698-710

2

TL;DR: In this article, the problem of computing the suffix array of a text T[1, n] is considered, which consists in sorting the suffixes of T in lexicographic order.

Abstract: We consider the problem of computing the suffix array of a text T[1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [4, 11]. Recently, the interest in this data structure has been revitalized by its use as a building block for three novel applications: (1) the Burrows-Wheeler compression algorithm [3], which is a provably [17] and practically [20] effective compression tool; (2) the construction of succinct [10, 19] and compressed [7, 8] indexes; the latter can store both the input text and its full-text index using roughly the same space used by traditional compressors for the text alone; and (3) algorithms for clustering and ranking the answers to user queries in web-search engines [22]. In all these applications the construction of the suffix array is the computational bottleneck both in time and space. This motivated our interest in designing yet another suffix array construction algorithm which is fast and “lightweight” in the sense that it uses small space. The suffix array consists of n integers in the range [1, n]. This means that in theory it uses Θ(n log n) bits of storage. However, in most applications the size of the text is smaller than 2^32 and it is customary to store each integer in a four-byte word; this yields a total space occupancy of 4n bytes. As for the cost of constructing the suffix array, the theoretically best algorithms run in Θ(n) time [5]. These algorithms work by first building the suffix tree and then obtaining the sorted suffixes via an in-order traversal of the tree. However, suffix tree construction algorithms are both complex and space consuming since they occupy at least 15n bytes of working space (or even more, depending on the
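
To make the definition in the abstract concrete, here is a minimal Python sketch that builds a suffix array by sorting suffix start positions directly. It is not the lightweight algorithm the paper engineers: this naive sort can take far longer than linear time on repetitive inputs and copies suffixes while comparing. The function names are hypothetical, positions are 0-based (the abstract uses T[1, n]), and the Burrows-Wheeler helper assumes the text ends with a unique sentinel character `$` that sorts before every other character.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Return the 0-based start positions of the suffixes of `text`,
    listed in lexicographic order of the suffixes (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: List[int]) -> str:
    """Derive the Burrows-Wheeler transform from a suffix array.

    Assumption: `text` ends with a unique sentinel (here '$') that is
    smaller than every other character. For each sorted suffix we emit
    the character that precedes it; index -1 wraps to the sentinel.
    """
    return "".join(text[i - 1] for i in sa)


def suffix_array_bytes(n: int, word_size: int = 4) -> int:
    """Space for n integers stored in fixed-size words (4n bytes for 4-byte words)."""
    return word_size * n


if __name__ == "__main__":
    t = "banana$"
    sa = suffix_array(t)
    print(sa)
    print(bwt_from_suffix_array(t, sa))
    print(suffix_array_bytes(len(t)))
```

The last helper only restates the arithmetic from the abstract: with four-byte words, a text of n characters needs 4n bytes for the suffix array alone, on top of the text itself.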

Citations

Patent

Identifying repeat subsequences by left and right contexts

Matthias Gallé

- 24 May 2013

TL;DR: In this article, a system and method are disclosed for identifying repeat subsequences of an input sequence that meet a threshold x on the number of different left contexts and a threshold y on the number of different right contexts.

8

Book Chapter•10.1007/978-3-642-00399-8_2

An Adaptive Algorithm for Splitting Large Sets of Strings and Its Application to Efficient External Sorting

Tatsuya Asai, +2 more

- 07 Feb 2009

TL;DR: A practical string sorting algorithm DistStrSort is presented, which is suitable for sorting string collections of large size in external memory, and also suitable for more complex string processing problems in text and semi-structured databases such as counting, aggregation, and statistics.

1

References

Journal Article•10.2307/2670026

Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology

Susan Holmes, +1 more

- 01 Sep 1999

- Journal of the American Statistical Asso...

TL;DR: The author examines the importance of (sub)sequence comparison in molecular biology, core string edits, alignments and dynamic programming, and takes a deeper look at classical methods for exact string matching.

3.1K

A Block-sorting Lossless Data Compression Algorithm

Michael Burrows, +1 more

- 01 Jan 1994

TL;DR: A block-sorting, lossless data compression algorithm is described together with its implementation, and the performance of the implementation is compared with widely available data compressors running on the same hardware.

3K

Journal Article•10.1137/0222058

Suffix arrays: a new method for on-line string searches

Udi Manber, +1 more

- 01 Oct 1993

- SIAM Journal on Computing

TL;DR: A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper, and it is believed that suffix arrays will prove to be better in practice than suffix trees for many applications.

2.4K

Proceedings Article•10.1109/SFCS.2000.892127

Opportunistic data structures with applications

Paolo Ferragina, +1 more

- 12 Nov 2000

TL;DR: A data structure whose space occupancy is a function of the entropy of the underlying data set is devised, which achieves sublinear space and sublinear query time complexity, and it is shown how to plug it into the Glimpse tool.

1.3K

Proceedings Article•10.5555/314161.314321

Fast algorithms for sorting and searching strings

Jon Louis Bentley, +1 more

- 05 Jan 1997

TL;DR: This work presents theoretical algorithms for sorting and searching multikey data, derives from them practical C implementations for applications in which keys are character strings, and presents extensions to more complex string problems, such as partial-match searching.

516

Related Papers (5)

Linear-time Suffix Sorting - A New Approach for Suffix Array Construction.

A Partition-Based Suffix Tree Construction and Its Applications

Sparse Suffix Trees

Efficient Discovery of Proximity Patterns with Suffix Arrays (Extended Abstract)

On the sorting-complexity of suffix tree construction