Journal Article · DOI: 10.1109/TC.2005.85
Universal text preprocessing for data compression
J. Abel, William J. Teahan
55
TL;DR: Several complementary preprocessing algorithms for text files are presented that run prior to the compression scheme; the compression gain and the cost in speed are compared for the BWT, PPM, and LZ compression schemes.
Abstract: Several preprocessing algorithms for text files are presented which complement each other and which are performed prior to the compression scheme. The algorithms need no external dictionary and are language independent. The compression gain is compared along with the costs of speed for the BWT, PPM, and LZ compression schemes. The average overall compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 and 9 percent for the text files of the large Canterbury Corpus.
Citations
A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications
TL;DR: Insight is gained into various open issues and research directions to explore the promising areas for future developments in data compression techniques and their applications.
247
MEAD: support for Real‐Time Fault‐Tolerant CORBA
Priya Narasimhan, Tudor Dumitras, Aaron Paulos, Soila Pertet, Carlos F. Reverte, Joseph G. Slember, Deepti Srivastava
TL;DR: The MEAD (Middleware for Embedded Adaptive Dependability) system attempts to identify and to reconcile the conflicts between real‐time and fault tolerance, in a resource‐aware manner, for distributed CORBA applications.
Revisiting dictionary‐based compression
TL;DR: This paper discusses several aspects of dictionary‐based compression, including compact dictionary representation, and presents a PPM/BWCA‐oriented scheme, word replacing transformation, achieving compression ratios higher by 2–6% than the state‐of‐the‐art StarNT (2003) text preprocessor.
•Dissertation
Adaptive models of Arabic text
Khaled M. Alhawiti
- 01 Jan 2014
TL;DR: Two new adaptive models, BS-PPM and CS-PPM, based on the Prediction by Partial Matching (PPM) compression scheme are introduced to improve the compression performance of the standard PPM model by using preprocessing techniques.
35
Analyzing tourism reviews using an LDA topic-based sentiment analysis approach
TL;DR: In this article, a combination of topic modeling and sentiment analysis, as well as human validation techniques of topic labels, was employed to extract valuable insights about Marrakech city from TripAdvisor reviews.
20
References
A universal algorithm for sequential data compression
Jacob Ziv, A. Lempel
TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
A Block-sorting Lossless Data Compression Algorithm
Michael Burrows, David Wheeler
- 01 Jan 1994
TL;DR: A block-sorting, lossless data compression algorithm and its implementation are described, and the performance of the implementation is compared with widely available data compressors running on the same hardware.
Universal codeword sets and representations of the integers
TL;DR: An application is the construction of a uniformly universal sequence of codes for countable memoryless sources, in which the nth code has a ratio of average codeword length to source rate bounded by a function of n for all sources with positive rate.
1.4K
Data Compression Using Adaptive Coding and Partial String Matching
John G. Cleary, Ian H. Witten
TL;DR: This paper describes how the conflict can be resolved with partial string matching, and reports experimental results which show that mixed-case English text can be coded in as little as 2.2 bits/character with no prior knowledge of the source.
1.4K