Open Access Dissertation
Efficiently compressing string columnar data using frequent pattern mining
Xiaojian Wang
- 20 Jun 2016
TL;DR: This thesis develops a compression algorithm using frequent string patterns directly mined from a sample of a string column, and develops a pruning method to address the cache inefficiencies in indexing the patterns.
Abstract: In modern column-oriented databases, compression is important for improving I/O throughput and overall database performance. Many string columns cannot be compressed by special-purpose algorithms such as run-length encoding or dictionary compression, and the typical choice for them is an LZ77-based compression algorithm such as GZIP [16] or Snappy [13]. These algorithms treat data as a byte block and do not exploit the columnar nature of the data. In this thesis, we develop a compression algorithm that uses frequent string patterns mined directly from a sample of a string column. The patterns are used as the dictionary phrases for compression. We discuss some interesting properties of frequent patterns in the context of compression, and develop a pruning method to address the cache inefficiencies in indexing the patterns. Experiments show that our compression algorithm outperforms Snappy in compression ratio while retaining compression and decompression speed.
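The idea of mining frequent patterns from a sample and using them as dictionary phrases can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the brute-force substring counting, the ranking heuristic, the token format, and every name and parameter (`mine_frequent_substrings`, `min_support`, `max_patterns`, and so on) are assumptions made purely for illustration, and the sketch ignores the indexing and pruning concerns mentioned in the abstract.

```python
from collections import Counter


def mine_frequent_substrings(sample, min_len=3, max_len=8,
                             min_support=2, max_patterns=255):
    """Brute-force stand-in for a frequent pattern miner (illustrative only).

    Counts, for each substring of bounded length, how many sample values
    contain it, and keeps those contained in at least min_support values.
    """
    support = Counter()
    for value in sample:
        seen = set()  # count each substring once per value
        for n in range(min_len, max_len + 1):
            for i in range(len(value) - n + 1):
                seen.add(value[i:i + n])
        support.update(seen)
    frequent = [p for p, c in support.items() if c >= min_support]
    # Hypothetical ranking: longer and more widely shared patterns first.
    frequent.sort(key=lambda p: (-(len(p) - 1) * support[p], p))
    return frequent[:max_patterns]


def compress(value, patterns):
    """Greedy longest-match encoding.

    Emits an int (index into patterns) for a matched phrase, or a
    one-character str for a literal.
    """
    by_length = sorted(range(len(patterns)), key=lambda k: -len(patterns[k]))
    tokens, i = [], 0
    while i < len(value):
        for k in by_length:
            if value.startswith(patterns[k], i):
                tokens.append(k)
                i += len(patterns[k])
                break
        else:
            # No dictionary phrase matches here: emit a literal character.
            tokens.append(value[i])
            i += 1
    return tokens


def decompress(tokens, patterns):
    """Invert compress(): expand pattern indices, copy literals through."""
    return "".join(patterns[t] if isinstance(t, int) else t for t in tokens)
```

Because the dictionary is built once from a sample of the column, values that were not in the sample still round-trip correctly; they simply fall back to literals wherever no mined phrase matches.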
Citations
The Long and the Short of It: Summarising Event Sequences with Serial Episodes
Nikolaj Tatti, Jilles Vreeken
TL;DR: This paper formalises how to encode sequential data using sets of serial episodes, and uses the encoded length as a quality score to identify the set of sequential patterns that summarises the data best.
References
Compression of individual sequences via variable-rate coding
Jacob Ziv, A. Lempel
TL;DR: The proposed concept of compressibility is shown to play a role analogous to that of entropy in classical information theory where one deals with probabilistic ensembles of sequences rather than with individual sequences.
A Technique for High-Performance Data Compression
TL;DR: A new compression algorithm is introduced that is based on principles not found in existing commercial methods in that it dynamically adapts to the redundancy characteristics of the data being compressed, and serves to illustrate system problems inherent in using any compression scheme.
PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth
Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu
- 02 Apr 2001
TL;DR: This work proposes a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining, and shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.
C-store: a column-oriented DBMS
Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth O'Neil, Patrick O'Neil, Alexander Rasin, Nga Tran, Stan Zdonik
- 01 Dec 2018
TL;DR: Preliminary performance data on a subset of TPC-H is presented and it is shown that the system the team is building, C-Store, is substantially faster than popular commercial products.
DEFLATE Compressed Data Format Specification version 1.3
P. Deutsch
- 01 May 1996
TL;DR: This specification defines a lossless compressed data format that compresses data using a combination of the LZ77 algorithm and Huffman coding, with efficiency comparable to the best currently available general-purpose compression methods.