Finite-Context Models for DNA Coding
Armando J. Pinho,António J. R. Neves,Daniel A. Martins,Carlos A. C. Bastos,Paulo J. S. G. Ferreira +4 more
- 01 Mar 2010
TL;DR: This chapter proposes a three-state finite-context model for DNA protein-coding regions, i.e., for the parts of the DNA that carry information regarding how proteins are synthesized. The model proved to be better than a single-state model, giving additional evidence of a phenomenon that is common in these protein-coding regions, namely a periodicity of period three.
Abstract: The study of data compression algorithms usually has a twofold purpose. The need for efficient storage and transmission is often the main motivation, but underlying every compression technique there is a model that tries to reproduce, as closely as possible, the information source to be compressed. This model may be interesting in its own right, as it can shed light on the statistical properties of the source. DNA data are no exception. There is an urgent need for efficient methods that reduce the storage space taken by the impressive amount of genomic data that is continuously being generated. We also want to know how the code of life works and how it is structured. Creating good (compression) models for DNA is one of the ways to achieve these goals. Since the completion of the human genome sequencing, the development of efficient lossless compression methods for DNA sequences has gained considerable interest (Behzadi and Le Fessant, 2005; Cao et al., 2007; Chen et al., 2001; Grumbach and Tahi, 1993; Korodi and Tabus, 2005; 2007; Manzini and Rastero, 2004; Matsumoto et al., 2000; Pinho et al., 2006; 2009; 2008; Rivals et al., 1996). For example, the human genome comprises approximately 3 000 million base pairs (Rowen et al., 1997), whereas the genome of wheat has about 16 000 million (Dennis and Surridge, 2000). Since DNA is based on an alphabet of four different symbols (usually known as nucleotides or bases), namely Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), without compression it takes approximately 750 MBytes to store the human genome (using log2 4 = 2 bits per symbol) and 4 GBytes to store the genome of wheat. In this chapter, we address the problem of DNA data modeling and coding. We review the main approaches proposed in the literature over the last fifteen years and present some recent advances attained with finite-context models (Pinho et al., 2006; 2009; 2008).
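The storage figures quoted above follow directly from the 2 bits per symbol needed for a four-letter alphabet. A minimal sketch of that arithmetic is shown here; the function name `raw_size_bytes` is hypothetical, and the base-pair counts are the approximate values given in the abstract.

```python
import math


def raw_size_bytes(num_bases: int, alphabet_size: int = 4) -> float:
    """Uncompressed size in bytes when each symbol uses log2(alphabet_size) bits."""
    bits_per_symbol = math.log2(alphabet_size)  # 2 bits for A, C, G, T
    return num_bases * bits_per_symbol / 8


# Approximate genome lengths quoted in the abstract.
human_bases = 3_000_000_000
wheat_bases = 16_000_000_000

print(raw_size_bytes(human_bases) / 1e6, "MBytes")  # human genome
print(raw_size_bytes(wheat_bases) / 1e9, "GBytes")  # wheat genome
```

Here MBytes and GBytes are taken as powers of ten, which is the reading that reproduces the 750 MBytes and 4 GBytes stated in the text.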
Low-order finite-context models have been used for DNA compression as a secondary, fallback method. However, we have shown that models of orders higher than four are indeed able to attain significant compression performance. Initially, we proposed a three-state finite-context model for DNA protein-coding regions, i.e., for the parts of the DNA that carry information regarding how proteins are synthesized (Ferreira et al., 2006; Pinho et al., 2006). This three-state model proved to be better than a single-state model, giving additional evidence of a phenomenon that is common in these protein-coding regions: a periodicity of period three.
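To make the single-state versus three-state distinction concrete, the sketch below estimates the ideal adaptive code length of a sequence under an order-k finite-context model. It is only an illustration under stated assumptions, not the chapter's implementation: the abstract does not specify the probability estimator, so additive smoothing with a parameter `alpha` is assumed, and the "three-state" variant is assumed to mean one count table per position modulo three. The function name `code_length_bits` and the test sequence are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def code_length_bits(seq: str, order: int = 2, states: int = 1, alpha: float = 1.0) -> float:
    """Ideal adaptive code length (in bits) of seq under a finite-context model.

    states=1 uses a single table of counts; states=3 keeps one table per
    position modulo 3 (an assumed reading of "three-state").
    alpha is an assumed additive-smoothing parameter.
    """
    tables = [defaultdict(lambda: [0] * len(ALPHABET)) for _ in range(states)]
    total = 0.0
    for i, sym in enumerate(seq):
        context = seq[max(0, i - order):i]   # the previous `order` symbols
        counts = tables[i % states][context]
        s = ALPHABET.index(sym)
        # Smoothed probability estimate from the counts seen so far.
        p = (counts[s] + alpha) / (sum(counts) + alpha * len(ALPHABET))
        total += -math.log2(p)               # ideal cost of coding this symbol
        counts[s] += 1                       # adaptive update after coding
    return total


# Synthetic sequence with an exact period of three (not real DNA).
toy = "ACG" * 200
single = code_length_bits(toy, order=0, states=1)
triple = code_length_bits(toy, order=0, states=3)
print(f"single-state: {single:.1f} bits, three-state: {triple:.1f} bits")
```

On this artificial sequence, an order-0 single-state model cannot exploit the periodicity, whereas the three-state variant sees only one symbol per state and needs far fewer bits. Real protein-coding regions show a much weaker effect than this toy case.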
Citations
DNA Lossless Compression Algorithms: Review
Nour S. Bakr,Amr A. Sharawi +1 more
- 01 Jan 2013
TL;DR: This paper comparatively surveys the main ideas and results of lossless compression algorithms that have been developed for DNA sequences and their effects on data storage costs.
Compressing the Human Genome Using Exclusively Markov Models
Diogo Pratas,Armando J. Pinho +1 more
- 01 Jan 2011
TL;DR: Finite-context models that rely exclusively on the Markov property are investigated and some properties of these models are used in order to improve the compression.
DNA synthetic sequences generation using multiple competing Markov models
Diogo Pratas,Carlos A. C. Bastos,Armando J. Pinho,António J. R. Neves,Luís M. O. Matos +4 more
- 28 Jun 2011
TL;DR: This paper presents results regarding a synthetic DNA generator based on multiple competing finite-context models, and shows that DNA is better represented by multiple finite-context models.
AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data
Jorge Miguel Silva,Weihong Qi,Armando J. Pinho,Diogo Pratas +3 more
TL;DR: AlcoR is a novel method for identifying and visualizing low-complexity regions in biological sequences. It is alignment-free and enables the use of models with different memories to distinguish local from distant low-complexity patterns.
Lossy-to-Lossless Compression of Biomedical Images Based on Image Decomposition
Luís M. O. Matos,António J. R. Neves,Armando J. Pinho +2 more
- 28 Oct 2015
TL;DR: This chapter studies the performance of several compression methods developed by the authors, as well as of image coding standards, when used to compress medical images (computed radiography, computed tomography, magnetic resonance, and ultrasound), RNAi images, and microarray images.
References
A universal algorithm for sequential data compression
Jacob Ziv,A. Lempel +1 more
TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
An invariant form for the prior probability in estimation problems.
TL;DR: It is shown that a certain differential form depending on the values of the parameters in a law of chance is invariant for all transformations of the parameter when the law is differentiable with regard to all parameters.
•Book
Data Compression: The Complete Reference
David Salomon
- 01 Dec 2006
TL;DR: Detailed descriptions and explanations of the most well-known and frequently used compression methods are covered in a self-contained fashion, with an accessible style and technical level for specialists and nonspecialists.
The performance of universal encoding
R. Krichevsky,V. Trofimov +1 more
TL;DR: Universal coding theory is surveyed from the viewpoint of the interplay between delay and redundancy and the price for universality turns out to be acceptably small.