Efficiently compressing string columnar data using frequent pattern mining

Open Access Dissertation

- 20 Jun 2016

1

TL;DR: This thesis develops a compression algorithm using frequent string patterns directly mined from a sample of a string column, and develops a pruning method to address the cache inefficiencies in indexing the patterns.

Abstract: In modern column-oriented databases, compression is important for improving I/O throughput and overall database performance. Much string columnar data cannot be compressed by special-purpose algorithms such as run-length encoding or dictionary compression, and the typical choice for such data is an LZ77-based compression algorithm such as GZIP [16] or Snappy [13]. These algorithms treat data as a byte block and do not exploit the columnar nature of the data. In this thesis, we develop a compression algorithm using frequent string patterns directly mined from a sample of a string column. The patterns are used as the dictionary phrases for compression. We discuss some interesting properties of frequent patterns in the context of compression, and develop a pruning method to address the cache inefficiencies in indexing the patterns. Experiments show that our compression algorithm outperforms Snappy in compression ratio while retaining compression and decompression speed.
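The general idea in the abstract (mine frequent patterns from a sample, then use them as dictionary phrases) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function names, the brute-force substring counting, the savings-based ranking, the greedy longest-match encoder, and the token-list output format are all assumptions made for illustration, and the pruning method for cache-efficient indexing is not modelled.

```python
from collections import Counter


def mine_patterns(sample, min_len=2, max_len=8, min_support=2, max_patterns=255):
    """Return frequent substrings of the sampled values (hypothetical helper).

    Support is the number of sampled values that contain the substring.
    Brute-force enumeration is used for clarity; a real miner would be
    far more efficient.
    """
    counts = Counter()
    for value in sample:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(value) - n + 1):
                seen.add(value[i:i + n])
        counts.update(seen)
    frequent = [(p, c) for p, c in counts.items() if c >= min_support]
    # Assumed ranking: rough savings estimate, (length - 1) * support.
    frequent.sort(key=lambda pc: (-(len(pc[0]) - 1) * pc[1], pc[0]))
    return [p for p, _ in frequent[:max_patterns]]


def compress(value, patterns):
    """Greedy longest-match encoding into a list of tokens.

    An int token is an index into `patterns`; a str token is a literal
    character that no dictionary phrase covered.
    """
    index = {p: k for k, p in enumerate(patterns)}
    longest = max(map(len, patterns), default=0)
    tokens = []
    i = 0
    while i < len(value):
        for n in range(min(longest, len(value) - i), 0, -1):
            code = index.get(value[i:i + n])
            if code is not None:
                tokens.append(code)
                i += n
                break
        else:
            tokens.append(value[i])  # no phrase matched at position i
            i += 1
    return tokens


def decompress(tokens, patterns):
    """Invert compress(): expand phrase indices, copy literals."""
    return "".join(patterns[t] if isinstance(t, int) else t for t in tokens)


# Illustrative column sample (made-up data).
sample = ["http://example.com/a", "http://example.com/b", "http://example.org/c"]
patterns = mine_patterns(sample)
tokens = compress("http://example.com/z", patterns)
assert decompress(tokens, patterns) == "http://example.com/z"
```

Because the dictionary is built once from a sample and then shared by every value in the column, each value can be encoded and decoded independently, which is the property that block-oriented LZ77 compressors do not exploit.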

Citations

Proceedings Article•10.1145/2339530.2339606

The Long and the Short of It: Summarising Event Sequences with Serial Episodes

Nikolaj Tatti, +1 more

- 07 Feb 2019

- arXiv: Data Structures and Algorithms

TL;DR: This paper formalises how to encode sequential data using sets of serial episodes, and uses the encoded length as a quality score to identify the set of sequential patterns that summarises the data best.

References

Journal Article•10.1109/TIT.1978.1055934

Compression of individual sequences via variable-rate coding

Jacob Ziv, +1 more

- 01 Sep 1978

- IEEE Transactions on Information Theory

TL;DR: The proposed concept of compressibility is shown to play a role analogous to that of entropy in classical information theory where one deals with probabilistic ensembles of sequences rather than with individual sequences.

4K

Journal Article•10.1109/MC.1984.1659158

A Technique for High-Performance Data Compression

Welch

- 01 Jun 1984

- IEEE Computer

TL;DR: A new compression algorithm is introduced that is based on principles not found in existing commercial methods in that it dynamically adapts to the redundancy characteristics of the data being compressed, and serves to illustrate system problems inherent in using any compression scheme.

2.6K

Proceedings Article•10.1109/ICDE.2001.914830

PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth

Jian Pei, +6 more

- 02 Apr 2001

TL;DR: This work proposes a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining, and shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.

2.1K

DEFLATE Compressed Data Format Specification version 1.3

P. Deutsch

- 01 May 1996

TL;DR: This specification defines a lossless compressed data format that compresses data using a combination of the LZ77 algorithm and Huffman coding, with efficiency comparable to the best currently available general-purpose compression methods.

876

...

C-store: a column-oriented DBMS

Related Papers (5)

Graph-based storage pattern mining method

Frequent pattern mining method and storage media storing the same

Method of mining a frequent pattern apparatus performing the same and storage medium storing a program performing the same

A New Parallel Algorithm for Frequent Pattern Mining

Pattern Growth Method for Mining Embedded Frequent Trees