Open Access Proceedings Article
Engineering a lightweight suffix array construction algorithm: (Extended abstract)
Giovanni Manzini, Paolo Ferragina +1 more
- 01 Jan 2002
- pp 698-710
TL;DR: In this article, the problem of computing the suffix array of a text T [1, n] is considered, which consists in sorting the suffixes of T in lexicographic order.
Abstract: We consider the problem of computing the suffix array of a text T[1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [4, 11]. Recently, interest in this data structure has been revitalized by its use as a building block for three novel applications: (1) the Burrows-Wheeler compression algorithm [3], which is a provably [17] and practically [20] effective compression tool; (2) the construction of succinct [10, 19] and compressed [7, 8] indexes; the latter can store both the input text and its full-text index using roughly the same space used by traditional compressors for the text alone; and (3) algorithms for clustering and ranking the answers to user queries in web-search engines [22]. In all these applications the construction of the suffix array is the computational bottleneck in both time and space. This motivated our interest in designing yet another suffix array construction algorithm which is fast and "lightweight" in the sense that it uses small space.

The suffix array consists of n integers in the range [1, n]. This means that in theory it uses Θ(n log n) bits of storage. However, in most applications the size of the text is smaller than 2^32 and it is customary to store each integer in a four-byte word; this yields a total space occupancy of 4n bytes.

As for the cost of constructing the suffix array, the theoretically best algorithms run in Θ(n) time [5]. These algorithms work by first building the suffix tree and then obtaining the sorted suffixes via an in-order traversal of the tree. However, suffix tree construction algorithms are both complex and space consuming since they occupy at least 15n bytes of working space (or even more, depending on the
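To make the definition in the abstract concrete, the following is a minimal sketch of a suffix array built by directly sorting the suffixes. It is a naive baseline for illustration only, not the lightweight algorithm the paper engineers; the function names `naive_suffix_array` and `suffix_array_bytes` are hypothetical, and the 1-based positions follow the T[1, n] convention used above.

```python
def naive_suffix_array(text: str) -> list:
    """Return the 1-based start positions of the suffixes of `text`,
    listed in lexicographic order of the suffixes.

    Illustrative baseline only: each comparison may scan a whole suffix,
    so this can take O(n^2 log n) time and it copies every suffix.
    """
    n = len(text)
    # Sort the positions 1..n using the suffix starting at each position as key.
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


def suffix_array_bytes(n: int, bytes_per_entry: int = 4) -> int:
    """Space for a suffix array of n entries stored in fixed-size words.

    With four-byte words (valid while n < 2**32) this is the 4n bytes
    mentioned in the abstract.
    """
    return n * bytes_per_entry


if __name__ == "__main__":
    text = "banana"
    sa = naive_suffix_array(text)
    for pos in sa:
        # Print each position next to the suffix it denotes.
        print(pos, text[pos - 1:])
    print("bytes for the array:", suffix_array_bytes(len(text)))
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so the array lists positions 6, 4, 2, 1, 5, 3. A practical construction algorithm avoids materializing the suffixes, which is where the time and space concerns raised in the abstract come from.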
Citations
Patent
Identifying repeat subsequences by left and right contexts
Matthias Gallé
- 24 May 2013
TL;DR: A system and method are disclosed for identifying, in an input sequence, repeat subsequences that have at least a threshold value x of different left contexts and at least a threshold value y of different right contexts.
8
An Adaptive Algorithm for Splitting Large Sets of Strings and Its Application to Efficient External Sorting
Tatsuya Asai, Seishi Okamoto, Hiroki Arimura +2 more
- 07 Feb 2009
TL;DR: A practical string sorting algorithm DistStrSort is presented, which is suitable for sorting string collections of large size in external memory, and also suitable for more complex string processing problems in text and semi-structured databases such as counting, aggregation, and statistics.
References
Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology
Susan Holmes, Dan Gusfield +1 more
TL;DR: The author examines the importance of (sub)sequence comparison in molecular biology, core string edits, alignments and dynamic programming, and a deeper look at classical methods for exact string matching.
3.1K
A Block-sorting Lossless Data Compression Algorithm
Michael Burrows, David Wheeler +1 more
- 01 Jan 1994
TL;DR: A block-sorting, lossless data compression algorithm and its implementation are described, and the performance of the implementation is compared with that of widely available data compressors running on the same hardware.
Suffix arrays: a new method for on-line string searches
Udi Manber, Gene Myers +1 more
TL;DR: A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper, and it is believed that suffix arrays will prove to be better in practice than suffix trees for many applications.
2.4K
Opportunistic data structures with applications
Paolo Ferragina, Giovanni Manzini +1 more
- 12 Nov 2000
TL;DR: A data structure whose space occupancy is a function of the entropy of the underlying data set is devised; it achieves sublinear space and sublinear query time complexity, and it is shown how to plug it into the Glimpse tool.
Fast algorithms for sorting and searching strings
Jon Louis Bentley, Robert Sedgewick +1 more
- 05 Jan 1997
TL;DR: This work presents theoretical algorithms for sorting and searching multikey data, derives from them practical C implementations for applications in which keys are character strings, and presents extensions to more complex string problems, such as partial-match searching.