Open Access Dissertation
Efficiently compressing string columnar data using frequent pattern mining
Xiaojian Wang
- 20 Jun 2016
TL;DR: This thesis develops a compression algorithm using frequent string patterns directly mined from a sample of a string column, and develops a pruning method to address the cache inefficiencies in indexing the patterns.
Abstract: In modern column-oriented databases, compression is important for improving I/O throughput and overall database performance. Many string columns cannot be compressed by special-purpose algorithms such as run-length encoding or dictionary compression, and the typical choice for them is an LZ77-based compression algorithm such as GZIP [16] or Snappy [13]. These algorithms treat data as a block of bytes and do not exploit its columnar nature. In this thesis, we develop a compression algorithm that uses frequent string patterns mined directly from a sample of a string column. The patterns are used as the dictionary phrases for compression. We discuss some interesting properties of frequent patterns in the context of compression, and develop a pruning method to address the cache inefficiencies in indexing the patterns. Experiments show that our compression algorithm outperforms Snappy in compression ratio while retaining compression and decompression speed.
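The general idea in the abstract (mine frequent substrings from a sample of a column, then use them as dictionary phrases) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function names, the length and support thresholds, the "bytes saved" ranking heuristic, and the greedy longest-match encoder are all assumptions made for illustration, and the pruning method for cache efficiency mentioned above is not modeled.

```python
from collections import Counter


def mine_frequent_substrings(sample, min_len=2, max_len=8,
                             min_support=2, max_patterns=255):
    """Return substrings that occur in at least `min_support` sampled values.

    All thresholds are hypothetical defaults, not values from the thesis.
    """
    counts = Counter()
    for value in sample:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(value) - n + 1):
                seen.add(value[i:i + n])
        counts.update(seen)  # support = number of values containing the pattern
    frequent = [(p, c) for p, c in counts.items() if c >= min_support]
    # Assumed heuristic: rank by a rough estimate of characters saved.
    frequent.sort(key=lambda pc: (-pc[1] * (len(pc[0]) - 1), pc[0]))
    return [p for p, _ in frequent[:max_patterns]]


def compress(value, patterns):
    """Greedy longest-match encoding: ints are pattern ids, strs are literals."""
    index = {p: i for i, p in enumerate(patterns)}
    lengths = sorted({len(p) for p in patterns}, reverse=True)
    out, i = [], 0
    while i < len(value):
        for n in lengths:
            if i + n > len(value):
                continue  # pattern of this length would run past the end
            code = index.get(value[i:i + n])
            if code is not None:
                out.append(code)
                i += n
                break
        else:
            out.append(value[i])  # no pattern matched: emit one literal character
            i += 1
    return out


def decompress(tokens, patterns):
    """Invert `compress` by expanding pattern ids back into their phrases."""
    return "".join(patterns[t] if isinstance(t, int) else t for t in tokens)


# Toy column sample; real columns and sampling strategy are outside this sketch.
sample = ["http://example.com/a", "http://example.com/b", "http://example.org/c"]
patterns = mine_frequent_substrings(sample)
tokens = compress("http://example.com/z", patterns)
assert decompress(tokens, patterns) == "http://example.com/z"
```

Because the dictionary is mined once from a sample and shared across the column, each value can be decoded independently of its neighbors, which is one way a column-aware scheme differs from block-oriented LZ77 compression.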
Citations
The Long and the Short of It: Summarising Event Sequences with Serial Episodes
Nikolaj Tatti, Jilles Vreeken
TL;DR: This paper formalises how to encode sequential data using sets of serial episodes, and uses the encoded length as a quality score to identify the set of sequential patterns that summarises the data best.
References
Mining frequent patterns without candidate generation
Jiawei Han, Jian Pei, Yiwen Yin
- 16 May 2000
TL;DR: This study proposes a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develops an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth.
A universal algorithm for sequential data compression
Jacob Ziv, A. Lempel
TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
Mining sequential patterns
Rakesh Agrawal, Ramakrishnan Srikant
- 06 Mar 1995
TL;DR: Three algorithms are presented to solve the problem of mining sequential patterns over databases of customer transactions, and empirically evaluating their performance using synthetic data shows that two of them have comparable performance.
A method for the construction of minimum-redundancy codes
TL;DR: A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
SNAP Datasets: Stanford Large Network Dataset Collection
Jure Leskovec, Andrej Krevl
- 01 Jun 2014
TL;DR: A collection of more than 50 large network datasets, ranging from tens of thousands of nodes and edges to tens of millions of nodes and edges, that includes social networks, web graphs, road networks, internet networks, citation networks, collaboration networks, and communication networks.