Reorganizing Compressed Text.

Open Access

- 01 Jan 2009

pp 261-261

30

TL;DR: This paper proposes a simple reordering of the target symbols in the compressed text to improve the search capabilities of word-based statistical semistatic compression. The reordered compressed text becomes an implicitly indexed representation of the text, which can be searched for words in time independent of the original text length.

Abstract: Recent research has demonstrated beyond doubt the benefits of compressing natural language texts using word-based statistical semistatic compression. Not only does it achieve extremely competitive compression rates, but direct search on the compressed text can also be carried out faster than on the original text; indexing based on inverted lists benefits from compression as well.

Such compression methods assign a variable-length codeword to each different text word. Some coding methods (Plain Huffman and Restricted Prefix Byte Codes) do not clearly mark codeword boundaries, and hence cannot be accessed at random positions nor searched with the fastest text search algorithms. Other coding methods (Tagged Huffman, End-Tagged Dense Code, or (s, c)-Dense Code) do mark codeword boundaries, achieving a self-synchronization property that enables fast search and random access, in exchange for some loss in compression effectiveness.

In this paper, we show that by just performing a simple reordering of the target symbols in the compressed text (more precisely, reorganizing the bytes into a wavelet-tree-like shape) and using little additional space, searching capabilities are greatly improved without a drastic impact on compression and decompression times. With this approach, all the codes achieve synchronism and can be searched fast and accessed at arbitrary points. Moreover, the reordered compressed text becomes an implicitly indexed representation of the text, which can be searched for words in time independent of the text length. That is, we achieve not only fast sequential search time, but indexed search time, for almost no extra space cost.

We experiment with three well-known word-based compression techniques with different characteristics (Plain Huffman, End-Tagged Dense Code and Restricted Prefix Byte Codes), and show the searching capabilities achieved by reordering the compressed representation on several corpora. We show that the reordered versions are not only much more efficient than their classical counterparts, but also more efficient than explicit inverted indexes built on the collection, when using the same amount of space.
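The byte reordering described in the abstract can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the codebook is a hand-made, hypothetical prefix-free byte code (not an actual Plain Huffman, End-Tagged Dense Code or Restricted Prefix Byte Code assignment), the names build, access and locate are invented here, and rank and select are naive linear scans. The "little additional space" mentioned in the abstract would presumably hold auxiliary structures that make these two operations fast. The root node stores the first byte of every codeword in text order, and the node for a prefix p stores the next byte of every codeword that starts with p.

```python
from collections import defaultdict

def build(words, codebook):
    """Spread the bytes of each codeword over nodes keyed by codeword prefix."""
    nodes = defaultdict(bytearray)
    for w in words:
        code = codebook[w]
        for depth in range(len(code)):
            nodes[code[:depth]].append(code[depth])
    return dict(nodes)

def rank(seq, byte, i):
    """Number of occurrences of byte in seq[:i] (naive linear scan)."""
    return seq[:i].count(byte)

def select(seq, byte, j):
    """Index of the j-th occurrence (1-based) of byte in seq (naive scan)."""
    pos = -1
    for _ in range(j):
        pos = seq.index(byte, pos + 1)
    return pos

def access(nodes, decode_table, i):
    """Decode the word at text position i by walking down from the root."""
    prefix, pos = b"", i
    while True:
        seq = nodes[prefix]
        byte = seq[pos]
        code = prefix + bytes([byte])
        if code in decode_table:
            return decode_table[code]
        # The child holds one entry per occurrence of byte in this node.
        pos = rank(seq, byte, pos)
        prefix = code

def locate(nodes, codebook, word):
    """Text positions of word, found by walking up from its last byte."""
    code = codebook[word]
    leaf = nodes.get(code[:-1], bytearray())
    positions = []
    for j in range(1, leaf.count(code[-1]) + 1):
        pos = select(leaf, code[-1], j)
        for depth in range(len(code) - 2, -1, -1):
            # Entry pos of a child is occurrence pos + 1 of the byte in its parent.
            pos = select(nodes[code[:depth]], code[depth], pos + 1)
        positions.append(pos)
    return positions

# Hypothetical toy codebook: 0x00 and 0x01 are complete one-byte codewords,
# 0x02 is the first byte of every two-byte codeword.
codebook = {"the": b"\x00", "cat": b"\x01", "sat": b"\x02\x00",
            "on": b"\x02\x01", "mat": b"\x02\x02"}
decode_table = {code: word for word, code in codebook.items()}
text = "the cat sat on the mat".split()
nodes = build(text, codebook)

print(list(nodes[b""]))               # first bytes of all codewords, in text order
print(access(nodes, decode_table, 2))
print(locate(nodes, codebook, "on"))
```

In this sketch, locating a word never scans the whole text: it starts at the node holding the word's last byte and maps each occurrence upward with select, which is the sense in which the reordered text behaves like an implicit index. With the naive scans used here the running time still depends on node length, so the sketch shows the navigation logic only, not the performance reported in the paper.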

Citations

Book Chapter•10.1007/978-3-642-31265-6_2

Wavelet trees for all

Gonzalo Navarro

- 03 Jul 2012

TL;DR: This survey gives an overview of wavelet trees and the surprising number of applications in which they are useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.

183

Journal Article•10.1007/S10791-012-9184-1

Implicit indexing of natural language text by reorganizing bytecodes

Nieves R. Brisaboa, +3 more

- 01 Dec 2012

- Information Retrieval

TL;DR: This work shows that by just rearranging the target symbols of the compressed text into a tree-shaped structure, and using negligible additional space, a new implicitly indexed representation of the compressed text is obtained, where search times are drastically improved.

37

Journal Article•10.1016/J.DAM.2015.11.003

Random access to Fibonacci encoded files

Shmuel T. Klein, +1 more

- 30 Oct 2016

- Discrete Applied Mathematics

TL;DR: The wavelet tree is adapted, in this paper, to Fibonacci codes, so that in addition to supporting direct access to the Fibonacci encoded file, it also increases the compression savings when compared to the original Fibonacci compressed file.

33

Book Chapter•10.1007/978-3-642-16321-0_5

Compressed self-indices supporting conjunctive queries on document collections

Diego Arroyuelo, +2 more

- 11 Oct 2010

TL;DR: It is shown that an inverted index can be replaced by a more space-efficient in-memory encoding, outperforming the query performance of inverted indices when the ratio nM/δ is ω(log |Σ|).

31

Book Chapter•10.1007/978-3-642-33074-2_19

Exploiting SIMD instructions in current processors to improve classical string algorithms

Susana Ladra, +3 more

- 18 Sep 2012

TL;DR: This paper proclaims the benefits of SIMD instructions and encourages their use, and performs an experimental evaluation by straightforwardly including some of these complex instructions in basic string algorithms used for indexing and search, obtaining significant speedups.

24

References

Book

Human behavior and the principle of least effort

George Kingsley Zipf

- 01 Jan 1949

7.7K

Journal Article•10.1007/BF02837279

A method for the construction of minimum-redundancy codes

David A. Huffman

- 01 Feb 2006

- Resonance

TL;DR: A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.

5.2K

Journal Article•10.1145/359842.359859

A fast string searching algorithm

Robert S. Boyer, +1 more

- 01 Oct 1977

- Communications of The ACM

TL;DR: The algorithm has the unusual property that, in most cases, not all of the first i characters of the searched string are inspected, where i is the position at which the pattern first occurs.

2.7K

Proceedings Article•10.5555/644108.644250

High-order entropy-compressed text indexes

Roberto Grossi, +2 more

- 12 Jan 2003

TL;DR: A novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg|σ| bits.

900

Book Chapter•10.1007/978-3-642-31265-6_2

Wavelet trees for all

Gonzalo Navarro

- 03 Jul 2012

TL;DR: This survey gives an overview of wavelet trees and the surprising number of applications in which they are useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.

183

Related Papers (5)

Reorganizing compressed text

Boosting Text Compression with Word-Based Statistical Encoding

Improving semistatic compression via pair-based coding

Direct pattern matching on compressed text

In-place length-restricted prefix coding