Efficient Variable-to-Fixed Length Coding Algorithms for Text Compression

Open Access

- 01 Jan 2014

5

TL;DR: This thesis focuses on lossless compression for text data, that is, text compression, and Variable-to-Fixed-length coding, a coding scheme that segments an input text into a consecutive sequence of substrings and then assigns a fixed length codeword to each substring.

Abstract: Data compression is a technique for reducing storage space and the cost of transferring large amounts of data by exploiting redundancy hidden in the data. In this thesis, we focus on lossless compression for text data, that is, text compression. When reusing a huge amount of data stored in secondary storage, I/O speed is the bottleneck. Such a communication-speed problem can be relieved if we transfer only compressed data through the communication channel and, furthermore, can perform every necessary process, such as string search, on the compressed data itself without decompression. Therefore, a new criterion, “ease of processing the compressed data,” is required in the field of data compression. Development of compression algorithms is currently in the mainstream of the data compression field, but many of them do not meet that criterion. Algorithms employing variable-length codewords have achieved extremely good compression ratios, but the boundaries between codewords are not obvious without special processing. Such an “unclear boundary problem” prevents us from directly accessing the compressed data. In contrast, Variable-to-Fixed-length coding, referred to as VF coding, is promising for our purpose. VF coding is a coding scheme that segments an input text into a consecutive sequence of substrings (called phrases) and then assigns a fixed-length codeword to each substring. Boundaries between codewords in VF coding are obvious because all codewords have the same length. Therefore, we can realize “accessible data compression” with VF coding. Nevertheless, VF coding was not paid much attention
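The segmentation and fixed-length codeword assignment described in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the phrase dictionary is hand-picked, the parsing is a simple greedy longest match, and all function names (`build_codebook`, `vf_encode`, `vf_decode`, `pack`, `codeword_at`) are hypothetical names chosen for this example.

```python
def build_codebook(phrases, alphabet):
    """Map each phrase to an integer codeword of one shared bit width.

    Assumption: every single symbol of the alphabet is added to the
    dictionary so that any text over that alphabet can be parsed.
    """
    entries = sorted(set(phrases) | set(alphabet))
    width = max(1, (len(entries) - 1).bit_length())  # bits per codeword
    return {phrase: code for code, phrase in enumerate(entries)}, width


def vf_encode(text, codebook):
    """Segment text into phrases by greedy longest match (an assumption)."""
    max_len = max(len(p) for p in codebook)
    codes, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            phrase = text[i:i + length]
            if phrase in codebook:
                codes.append(codebook[phrase])
                i += length
                break
        else:
            raise ValueError(f"symbol {text[i]!r} is not in the codebook")
    return codes


def vf_decode(codes, codebook):
    """Invert the codebook and concatenate the phrases."""
    inverse = {code: phrase for phrase, code in codebook.items()}
    return "".join(inverse[c] for c in codes)


def pack(codes, width):
    """Write every codeword with exactly `width` bits (as a '0'/'1' string)."""
    return "".join(format(c, f"0{width}b") for c in codes)


def codeword_at(bits, width, k):
    """Read the k-th codeword directly: its boundaries are k*width and (k+1)*width."""
    return int(bits[k * width:(k + 1) * width], 2)


if __name__ == "__main__":
    text = "abracadabra"
    codebook, width = build_codebook(["ab", "ra", "abra", "ca", "da"], "abcdr")
    codes = vf_encode(text, codebook)
    bits = pack(codes, width)
    assert vf_decode(codes, codebook) == text
    assert codeword_at(bits, width, 2) == codes[2]
    print(codes, width, bits)
```

Because every codeword occupies the same number of bits, `codeword_at` can jump to the k-th codeword without scanning the preceding ones, which is the "obvious boundary" property the abstract relies on.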

Citations

Book Chapter•10.1142/9789812778222_0004

On-line construction of suffix trees

Maxime Crochemore, +1 more

- 01 Sep 2002

472

•Posted Content

Random Access to Grammar Compressed Strings

Philip Bille, +5 more

- 11 Jan 2010

- arXiv: Data Structures and Algorithms

TL;DR: Two representations of a string of length N compressed into a context-free grammar of size n are presented, achieving O(log N) random access time; several new techniques and data structures of independent interest are introduced, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy-paths in grammars.

77

Shift-And Approach to Pattern Matching in LZW Compressed Text

Takuya Kida, +7 more

- 01 Jan 1999

TL;DR: In this article, the Shift-And algorithm was used to solve the problem of pattern matching in LZW compressed text, where a pattern length is at most 32 or the word length.

58

•Journal Article•10.1007/978-3-540-89097-3-5

Context-sensitive grammar transform: Compression and pattern matching

Shirou Maruyama, +3 more

- 01 Jan 2008

- Lecture Notes in Computer Science

TL;DR: In this article, a greedy compression algorithm with the transform model is presented as well as a Knuth-Morris-Pratt (KMP)-type compressed pattern matching (CPM) algorithm.

8

•Journal Article

Fast $q$-gram Mining on SLP Compressed Strings

Keisuke Goto, +11 more

- 13 Jul 2011

- Lecture Notes in Computer Science

TL;DR: An O(qn) time and space algorithm that computes the occurrence frequencies of all q-grams in a string T given in compressed form as a straight line program (SLP) of size n; the algorithm is practical for small q.

1

References

Journal Article•10.1109/TIT.1977.1055714

A universal algorithm for sequential data compression

Jacob Ziv, +1 more

- 01 May 1977

- IEEE Transactions on Information Theory

TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.

6.3K

Journal Article•10.1109/TIT.1978.1055934

Compression of individual sequences via variable-rate coding

Jacob Ziv, +1 more

- 01 Sep 1978

- IEEE Transactions on Information Theory

TL;DR: The proposed concept of compressibility is shown to play a role analogous to that of entropy in classical information theory where one deals with probabilistic ensembles of sequences rather than with individual sequences.

4K

A Block-sorting Lossless Data Compression Algorithm

Michael Burrows, +1 more

- 01 Jan 1994

TL;DR: A block-sorting lossless data compression algorithm and its implementation are described, and the performance of the implementation is compared with that of widely available data compressors running on the same hardware.

3K

•Book

Introduction to data compression

Khalid Sayood

- 01 Jan 1996

TL;DR: The author explains the development of the Huffman Coding Algorithm and some of the techniques used in its implementation, as well as some of its applications, including Image Compression, which is based on the JBIG standard.

2.6K

Proceedings Article•10.1109/SWAT.1973.13

Linear pattern matching algorithms

Peter Weiner

- 15 Oct 1973

TL;DR: A linear-time algorithm for obtaining a compacted version of a bi-tree associated with a given string is presented, and it is indicated how to solve several pattern matching problems, including some from [4], in linear time.

2.1K