Open Access
Reorganizing Compressed Text.
Nieves R. Brisaboa, Antonio Fariña, Susana Ladra, Gonzalo Navarro
- 01 Jan 2009
pp 261-261
30
TL;DR: This paper proposes a simple reordering of the target symbols in the compressed text to improve the search capabilities of word-based statistical semistatic compression. The reordered compressed text becomes an implicitly indexed representation of the text, which can be searched for words in time independent of the original text length.
Abstract: Recent research has demonstrated beyond doubt the benefits of compressing natural language texts using word-based statistical semistatic compression. Not only does it achieve extremely competitive compression rates, but direct search on the compressed text can also be carried out faster than on the original text; indexing based on inverted lists benefits from compression as well.
Such compression methods assign a variable-length codeword to each different text word. Some coding methods (Plain Huffman and Restricted Prefix Byte Codes) do not clearly mark codeword boundaries, and hence cannot be accessed at random positions nor searched with the fastest text search algorithms. Other coding methods (Tagged Huffman, End-Tagged Dense Code, or (s, c)-Dense Code) do mark codeword boundaries, achieving a self-synchronization property that enables fast search and random access, in exchange for some loss in compression effectiveness.
In this paper, we show that by just performing a simple reordering of the target symbols in the compressed text (more precisely, reorganizing the bytes into a wavelet-tree-like shape) and using little additional space, searching capabilities are greatly improved without a drastic impact on compression and decompression times. With this approach, all the codes achieve synchronism and can be searched fast and accessed at arbitrary points. Moreover, the reordered compressed text becomes an implicitly indexed representation of the text, which can be searched for words in time independent of the text length. That is, we achieve not only fast sequential search time, but indexed search time, for almost no extra space cost.
We experiment with three well-known word-based compression techniques with different characteristics (Plain Huffman, End-Tagged Dense Code and Restricted Prefix Byte Codes), and show the searching capabilities achieved by reordering the compressed representation on several corpora. We show that the reordered versions are not only much more efficient than their classical counterparts, but also more efficient than explicit inverted indexes built on the collection, when using the same amount of space.
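The reorganization described in the abstract can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the class name `ReorganizedText`, its methods, and the four-word codebook are hypothetical, and rank and select are done by naive linear scans, whereas the paper relies on small auxiliary structures to make them fast. The sketch places the first byte of every codeword in a root node and each later byte in a child node identified by the preceding bytes, then uses rank to decode the word at a given position and select to locate occurrences of a word.

```python
class ReorganizedText:
    """Toy wavelet-tree-like layout of a sequence of byte codewords.

    Assumes a prefix-free code (no codeword is a prefix of another).
    """

    def __init__(self, codewords):
        # nodes[prefix] holds, in text order, the next byte of every
        # codeword that starts with `prefix`; nodes[b""] is the root.
        self.nodes = {}
        for cw in codewords:
            for depth in range(len(cw)):
                self.nodes.setdefault(cw[:depth], []).append(cw[depth])

    def _rank(self, prefix, byte, i):
        """Number of occurrences of `byte` in nodes[prefix][:i] (naive scan)."""
        return self.nodes[prefix][:i].count(byte)

    def _select(self, prefix, byte, j):
        """0-based position of the j-th (1-based) `byte` in nodes[prefix]."""
        seen = 0
        for pos, b in enumerate(self.nodes[prefix]):
            if b == byte:
                seen += 1
                if seen == j:
                    return pos
        raise IndexError("fewer than j occurrences")

    def access(self, i, codebook):
        """Return the codeword at text position i (0-based), walking down."""
        prefix, pos = b"", i
        while prefix not in codebook:
            byte = self.nodes[prefix][pos]
            pos = self._rank(prefix, byte, pos)  # index inside the child node
            prefix += bytes([byte])
        return prefix

    def count(self, cw):
        """Occurrences of codeword `cw`, read off a single node."""
        return self.nodes.get(cw[:-1], []).count(cw[-1])

    def locate(self, cw, j):
        """Text position (0-based) of the j-th occurrence of `cw`, walking up."""
        pos = self._select(cw[:-1], cw[-1], j)
        for depth in range(len(cw) - 1, 0, -1):
            pos = self._select(cw[:depth - 1], cw[depth - 1], pos + 1)
        return pos


# Hypothetical prefix-free byte code for a four-word vocabulary.
code = {"the": b"\x01", "cat": b"\x02", "sat": b"\x03\x01", "mat": b"\x03\x02"}
text = "the cat sat the mat".split()
rt = ReorganizedText([code[w] for w in text])
codebook = set(code.values())
```

In this layout, counting or locating a word touches only the nodes along that word's codeword, which is the intuition behind search time that does not grow with the text length once rank and select are supported efficiently.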
Citations
Wavelet trees for all
Gonzalo Navarro
- 03 Jul 2012
TL;DR: This survey gives an overview of wavelet trees and the surprising number of applications in which they are useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.
183
Implicit indexing of natural language text by reorganizing bytecodes
TL;DR: This work shows that by just rearranging the target symbols of the compressed text into a tree-shaped structure, and using negligible additional space, one obtains a new implicitly indexed representation of the compressed text, where search times are drastically improved.
Random access to Fibonacci encoded files
Shmuel T. Klein, Dana Shapira
TL;DR: The wavelet tree is adapted, in this paper, to Fibonacci codes, so that in addition to supporting direct access to the Fibonacci-encoded file, it also increases the compression savings when compared to the original Fibonacci-compressed file.
33
Compressed self-indices supporting conjunctive queries on document collections
Diego Arroyuelo, Senén González, Mauricio Oyarzún
- 11 Oct 2010
TL;DR: It is shown that an inverted index can be replaced by a more space-efficient in-memory encoding, outperforming the query performance of inverted indices when the ratio nM/δ is ω(log |Σ|).
Exploiting SIMD instructions in current processors to improve classical string algorithms
Susana Ladra, Oscar Pedreira, José Duato, Nieves R. Brisaboa
- 18 Sep 2012
TL;DR: This paper highlights the benefits of SIMD instructions and encourages their use. An experimental evaluation that straightforwardly includes some of these complex instructions in basic string algorithms used for indexing and search obtains significant speedups.
24
References
A method for the construction of minimum-redundancy codes
TL;DR: A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
5.2K
A fast string searching algorithm
TL;DR: The algorithm searches for the location i of the first occurrence of a pattern in another string, and has the unusual property that, in most cases, not all of the first i characters of that string are inspected.
High-order entropy-compressed text indexes
Roberto Grossi, Ankur Gupta, Jeffrey Scott Vitter
- 12 Jan 2003
TL;DR: A novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg|σ| bits.
Wavelet trees for all
Gonzalo Navarro
- 03 Jul 2012
TL;DR: This survey gives an overview of wavelet trees and the surprising number of applications in which they are useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.
183