Efficiently compressing string columnar data using frequent pattern mining

Open Access • Dissertation

- 20 Jun 2016

TL;DR: This thesis develops a compression algorithm using frequent string patterns directly mined from a sample of a string column, and develops a pruning method to address the cache inefficiencies in indexing the patterns.

Abstract: In modern column-oriented databases, compression is important for improving I/O throughput and overall database performance. Much string columnar data cannot be compressed by special-purpose algorithms such as run-length encoding or dictionary compression, and the typical choice for such data is an LZ77-based compression algorithm such as GZIP [16] or Snappy [13]. These algorithms treat data as a byte block and do not exploit the columnar nature of the data. In this thesis, we develop a compression algorithm using frequent string patterns directly mined from a sample of a string column. The patterns are used as the dictionary phrases for compression. We discuss some interesting properties of frequent patterns in the context of compression, and develop a pruning method to address the cache inefficiencies in indexing the patterns. Experiments show that our compression algorithm outperforms Snappy in compression ratio while retaining compression and decompression speed.
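The general idea in the abstract (mine frequent patterns from a sample of the column, then use them as dictionary phrases) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function names, the substring-counting miner, the `(length - 1) * support` ranking, the greedy longest-match encoder, and all parameter defaults are assumptions made for illustration, and the sketch ignores the pattern pruning and cache-efficient indexing that the thesis addresses.

```python
from collections import Counter

# Hypothetical helper names and defaults; a toy stand-in for a real pattern miner.
def mine_patterns(sample, min_len=3, max_len=8, min_support=2, max_patterns=255):
    """Return substrings that occur in at least `min_support` sampled values."""
    support = Counter()
    for value in sample:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(value) - n + 1):
                seen.add(value[i:i + n])
        support.update(seen)  # each value contributes at most 1 per pattern
    frequent = [(p, s) for p, s in support.items() if s >= min_support]
    # Assumed ranking: rough estimate of characters saved by each pattern.
    frequent.sort(key=lambda ps: ((len(ps[0]) - 1) * ps[1], ps[0]), reverse=True)
    return [p for p, _ in frequent[:max_patterns]]

def encode(value, patterns):
    """Greedy longest-match encoding: int token = pattern index, str token = literal."""
    by_len = sorted(range(len(patterns)), key=lambda k: len(patterns[k]), reverse=True)
    tokens, i = [], 0
    while i < len(value):
        for k in by_len:
            if value.startswith(patterns[k], i):
                tokens.append(k)
                i += len(patterns[k])
                break
        else:  # no pattern matches at position i, so emit one literal character
            tokens.append(value[i])
            i += 1
    return tokens

def decode(tokens, patterns):
    """Invert `encode` by expanding pattern indices and copying literals."""
    return "".join(patterns[t] if isinstance(t, int) else t for t in tokens)

if __name__ == "__main__":
    column = ["http://example.com/a", "http://example.com/b", "http://example.org/c"]
    phrases = mine_patterns(column)
    for v in column:
        toks = encode(v, phrases)
        print(len(v), "chars ->", len(toks), "tokens")
```

The linear scan over all patterns at every position keeps the sketch short but is slow; a practical implementation would index the patterns (for example with a trie or hash table), which is where the indexing cost mentioned in the abstract comes in.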

Citations

Proceedings Article•10.1145/2339530.2339606

The Long and the Short of It: Summarising Event Sequences with Serial Episodes

Nikolaj Tatti, +1 more

- 07 Feb 2019

- arXiv: Data Structures and Algorithms

TL;DR: This paper formalises how to encode sequential data using sets of serial episodes, and uses the encoded length as a quality score to identify the set of sequential patterns that summarises the data best.

References

Journal Article•10.1145/335191.335372

Mining frequent patterns without candidate generation

Jiawei Han, +2 more

- 16 May 2000

TL;DR: This study proposes a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develops an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth.

7K

Journal Article•10.1109/TIT.1977.1055714

A universal algorithm for sequential data compression

Jacob Ziv, +1 more

- 01 May 1977

- IEEE Transactions on Information Theory

TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.

6.3K

Proceedings Article•10.1109/ICDE.1995.380415

Mining sequential patterns

Rakesh Agrawal, +1 more

- 06 Mar 1995

TL;DR: Three algorithms are presented to solve the problem of mining sequential patterns over databases of customer transactions, and empirically evaluating their performance using synthetic data shows that two of them have comparable performance.

6K

Journal Article•10.1007/BF02837279

A method for the construction of minimum-redundancy codes

David A. Huffman

- 01 Feb 2006

- Resonance

TL;DR: A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.

5.2K

SNAP Datasets: Stanford Large Network Dataset Collection

Jure Leskovec, +1 more

- 01 Jun 2014

TL;DR: A collection of more than 50 large network datasets, ranging from tens of thousands of nodes and edges to tens of millions of nodes and edges, that includes social networks, web graphs, road networks, internet networks, citation networks, collaboration networks, and communication networks.

4.2K

Related Papers (5)

Graph-based storage pattern mining method

Frequent pattern mining method and storage media storing the same

Method of mining a frequent pattern apparatus performing the same and storage medium storing a program performing the same

A New Parallel Algorithm for Frequent Pattern Mining

Pattern Growth Method for Mining Embedded Frequent Trees