Boosting Text Compression with Word-Based Statistical Encoding
TL;DR: A new suffix-free Dense-Code-based compressor that compresses slightly better than Tagged Huffman is presented, and some self-indexes are shown to handle non-suffix-free codes, which allows indexed searches for both words and phrases.
Abstract: Semistatic word-based byte-oriented compressors are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30–35%, they allow fast direct searching of the compressed text. In this article, we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a dense-coding preprocessing step achieves better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% on typical large English texts, a level previously obtained only by the slow Prediction by Partial Matching (PPM) compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step: they achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.
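The two-stage idea in the abstract (a word-based byte-oriented code first, a general-purpose compressor second) can be illustrated with a minimal sketch. Several things here are assumptions and not taken from the article: End-Tagged Dense Code is used as one simple member of the dense-code family, Python's `lzma` module stands in for p7zip, the tokenizer treats separators as ordinary vocabulary symbols, and all function names are hypothetical. The vocabulary is kept in memory and not serialized, so the sketch says nothing about real compression ratios.

```python
import lzma
import re
from collections import Counter


def tokenize(text):
    # Alternating runs of word and non-word characters (a simplification);
    # concatenating the tokens restores the original text exactly.
    return re.findall(r"\w+|\W+", text)


def etdc_codeword(rank):
    # End-Tagged Dense Code: the last byte has its high bit set (128..255),
    # every earlier byte is in 0..127. Rank 0 is the most frequent symbol.
    out = [128 + rank % 128]
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def etdc_compress(text):
    # Semistatic scheme: first pass counts frequencies, second pass encodes.
    tokens = tokenize(text)
    vocab = [word for word, _ in Counter(tokens).most_common()]
    rank = {word: i for i, word in enumerate(vocab)}
    payload = b"".join(etdc_codeword(rank[t]) for t in tokens)
    return vocab, payload


def etdc_decompress(vocab, payload):
    out, acc = [], 0
    for b in payload:
        if b < 128:
            # Continuation byte: accumulate the dense (bijective) prefix.
            acc = acc * 128 + b + 1
        else:
            # Tagged byte closes the codeword.
            out.append(vocab[acc * 128 + (b - 128)])
            acc = 0
    return "".join(out)


def boosted_compress(text):
    # Backend compressor runs over the byte-oriented codewords,
    # not over the original characters.
    vocab, payload = etdc_compress(text)
    return vocab, lzma.compress(payload)


def boosted_decompress(vocab, blob):
    return etdc_decompress(vocab, lzma.decompress(blob))
```

Because each codeword is a whole number of bytes and one codeword corresponds to one word, the backend compressor sees a sequence in which byte-level repetitions reflect word-level repetitions, which is the property the abstract attributes to the preprocessing step.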
Figures

TABLE 1. k-order entropy using the CR corpus of 48.7 MB. The Hk values are relative to each text, and thus not comparable to each other. Compression ratios are comparable, but they do not include the size of the model. The impact of the latter can be estimated considering the number of contexts generated. 
FIGURE 1. Probability of byte values on CR corpus. In the right upper part, there is a magnified area of the values between the byte values 45 and 250. 
FIGURE 5. Space vs compression/decompression time trade-offs on corpus ALL. The x axis shows compression ratio (leftwards is better). The y axis (in logarithmic scale) shows compression and decompression speed (MB/sec), respectively (upper is better). 
TABLE 2. Comparison between byte-oriented codes and a fixed-length compressor with and without a backend compressor, on corpus CR. 
FIGURE 4. In-memory data structures used to hold the vocabulary. 
FIGURE 3. Suffixes and prefixes in SCDC and SCBDC.
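The k-order entropy named in the Table 1 caption can be computed for any byte sequence with the standard empirical definition, H_k = (1/n) Σ_c |T_c| · H_0(T_c), where T_c collects the symbols that follow context c. The sketch below is an illustration under that textbook definition only; the function names are hypothetical and the article's exact measurement setup is not reproduced. It also returns the number of distinct contexts, which the caption suggests as a proxy for model size.

```python
import math
from collections import Counter, defaultdict


def h0(counts):
    # Zero-order empirical entropy (bits per symbol) of a frequency table.
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def hk(data, k):
    # Returns (H_k in bits per symbol, number of distinct contexts).
    n = len(data)
    if n == 0:
        return 0.0, 0
    if k == 0:
        return h0(Counter(data)), 1
    followers = defaultdict(Counter)
    for i in range(k, n):
        # Context = the k bytes preceding position i.
        followers[data[i - k:i]][data[i]] += 1
    bits = sum(sum(c.values()) * h0(c) for c in followers.values())
    return bits / n, len(followers)
```

For instance, `hk(data, 0)` and `hk(data, 3)` applied to the same `bytes` object show how much of the redundancy is captured by short contexts and how many contexts are needed to do so.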
Citations
Indexing text using the Ziv-Lempel trie
TL;DR: In this paper, a data structure based on the Ziv-Lempel trie is presented that takes $4n \log_2 n\,(1+o(1))$ bits of space and reports the $R$ occurrences of a pattern of length $m$ in worst-case time $O(m^2 \log(m\sigma) + (m+R)\log n)$.
Reorganizing Compressed Text.
Nieves R. Brisaboa, Antonio Fariña, Susana Ladra, Gonzalo Navarro
- 01 Jan 2009
TL;DR: In this paper, a simple reordering of the target symbols in the compressed text was proposed to improve the search capabilities of word-based statistical semistatic compression, and the reordered compressed text became an implicitly indexed representation of the text, which can be searched for words in time independent of the original text length.
To index or not to index: time-space trade-offs in search engines with positional ranking functions
Diego Arroyuelo, Senén González, Mauricio Marin, Mauricio Oyarzún, Torsten Suel
- 12 Aug 2012
TL;DR: This paper answers the question of whether one should index positional data or not and shows that there is a wide range of practical time-space trade-offs for search engines with positional ranking functions and text snippet generation.
FM-index for Dummies
Szymon Grabowski, Marcin Raniszewski, Sebastian Deorowicz
- 30 May 2017
TL;DR: In this article, the authors propose a cache-friendly implementation of the rank primitive and advocate for a very simple architecture of the FM-index, which trades compression ratio for speed, for the price of using typically 1.5-5 times more space.
On the Randomness of Compressed Data
Shmuel T. Klein, Dana Shapira
TL;DR: Evidence is presented here that arithmetic coding may produce an output that is identical to that of Huffman coding, and it is found that there is much variability in the randomness of the output of these techniques.