Boosting Text Compression with Word-Based Statistical Encoding

doi:10.1093/COMJNL/BXR096

Open Access Journal Article

Antonio Fariña, +2 more

- 01 Jan 2012

- The Computer Journal

- Vol. 55, Iss: 1, pp 111-131

TL;DR: A new suffix-free Dense-Code-based compressor that compresses slightly better than Tagged Huffman is presented, together with a way for some self-indexes to handle non-suffix-free codes. The resulting compressed/indexed text allows indexed searches for both words and phrases.

Abstract: Semistatic word-based byte-oriented compressors are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30–35%, they allow fast direct searching of the compressed text. In this article, we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a dense-coding preprocessing step achieves better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% on typical large English texts, a level previously obtained only by the slow prediction by partial matching (PPM) compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step: they achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.
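The two-stage idea in the abstract (a word-based byte-oriented code first, a general-purpose compressor second) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes End-Tagged Dense Code (ETDC) as the dense code, a simplistic tokenizer that treats runs of word and non-word characters as separate symbols, and Python's standard `lzma` module as a stand-in for p7zip. The helper names `etdc_encode`, `word_compress` and `word_decompress` are hypothetical, and the vocabulary is kept in memory rather than serialized.

```python
import lzma
import re
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """End-Tagged Dense Code for a 0-based frequency rank.

    The last byte of a codeword has its high bit set (value >= 128);
    all preceding bytes are below 128.
    """
    out = [128 + rank % 128]          # final, tagged byte
    rank = rank // 128 - 1
    while rank >= 0:                  # remaining bytes, least significant first
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def word_compress(text: str):
    """Semistatic pass: rank tokens by frequency, then emit one codeword per token."""
    # Assumption: words and separators are both modelled as symbols.
    tokens = re.findall(r"\w+|\W+", text)
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    rank = {tok: i for i, tok in enumerate(vocab)}
    body = b"".join(etdc_encode(rank[tok]) for tok in tokens)
    return vocab, body


def word_decompress(vocab, body: bytes) -> str:
    """Invert word_compress by walking the byte stream codeword by codeword."""
    out, acc = [], 0
    for b in body:
        if b >= 128:                  # tagged byte closes the codeword
            out.append(vocab[acc * 128 + (b - 128)])
            acc = 0
        else:
            acc = acc * 128 + b + 1
    return "".join(out)


text = "to be or not to be, that is the question. " * 50
vocab, body = word_compress(text)
boosted = lzma.compress(body)         # second stage over the byte-oriented codes
restored = word_decompress(vocab, lzma.decompress(boosted))
```

On a toy input like this the sizes say nothing about the ratios reported in the abstract; the sketch only shows that the backend compressor sees a sequence of byte codewords, one per word, instead of the raw characters.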

Figures

TABLE 1. k-order entropy using the CR corpus of 48.7 MB. The Hk values are relative to each text, and thus not comparable to each other. Compression ratios are comparable, but they do not include the size of the model. The impact of the latter can be estimated considering the number of contexts generated.

FIGURE 1. Probability of byte values on CR corpus. In the right upper part, there is a magnified area of the values between the byte values 45 and 250.

FIGURE 5. Space vs compression/decompression time trade-offs on corpus ALL. The x axis shows compression ratio (leftwards is better). The y axis (in logarithmic scale) shows compression and decompression speed (MB/sec), respectively (upper is better).

TABLE 2. Comparison between byte-oriented codes and a fixed-length compressor with and without a backend compressor, on corpus CR.

FIGURE 4. In-memory data structures used to hold the vocabulary.

FIGURE 3. Suffixes and prefixes in SCDC and SCBDC.

Citations

Journal Article

Indexing text using the Ziv-Lempel trie

Gonzalo Navarro

- 01 Jan 2002

- Lecture Notes in Computer Science

TL;DR: In this paper, a data structure based on the Ziv-Lempel trie is presented that takes $4n \log_2 n\,(1+o(1))$ bits of space and reports the $R$ occurrences of a pattern of length $m$ in worst-case time $O(m^2 \log(m\sigma) + (m+R) \log n)$.

Reorganizing Compressed Text.

Nieves R. Brisaboa, +3 more

- 01 Jan 2009

TL;DR: In this paper, a simple reordering of the target symbols in the compressed text is proposed to improve the search capabilities of word-based statistical semistatic compression. The reordered compressed text becomes an implicitly indexed representation of the text, which can be searched for words in time independent of the original text length.

Proceedings Article, doi:10.1145/2348283.2348320

To index or not to index: time-space trade-offs in search engines with positional ranking functions

Diego Arroyuelo, +4 more

- 12 Aug 2012

TL;DR: This paper answers the question of whether one should index positional data or not and shows that there is a wide range of practical time-space trade-offs for search engines with positional ranking functions and text snippet generation.

Book Chapter, doi:10.1007/978-3-319-58274-0_16

FM-index for Dummies

Szymon Grabowski, +2 more

- 30 May 2017

TL;DR: In this article, the authors propose a cache-friendly implementation of the rank primitive and advocate for a very simple architecture of the FM-index, which trades compression ratio for speed, for the price of using typically 1.5-5 times more space.

Journal Article, doi:10.3390/INFO11040196

On the Randomness of Compressed Data

Shmuel T. Klein, +1 more

- 07 Apr 2020

- Information-an International Interdiscip...

TL;DR: Evidence is presented here that arithmetic coding may produce an output that is identical to that of Huffman coding, and it is found that there is much variability in the randomness of the output of these techniques.
