Finite-Context Models for DNA Coding

doi:10.5772/8539

Open Access•Book Chapter•10.5772/8539

- 01 Mar 2010

7

TL;DR: This chapter proposes a three-state finite-context model for DNA protein-coding regions, i.e., the parts of the DNA that carry information on how proteins are synthesized. The model outperforms a single-state model, giving additional evidence of a phenomenon that is common in protein-coding regions: the periodicity of period three.

Abstract: Usually, the purpose of studying data compression algorithms is twofold. The need for efficient storage and transmission is often the main motivation, but underlying every compression technique there is a model that tries to reproduce as closely as possible the information source to be compressed. This model may be interesting on its own, as it can shed light on the statistical properties of the source. DNA data are no exception. There is a pressing need for efficient methods to reduce the storage space taken by the impressive amount of genomic data that are continuously being generated. We also want to understand how the code of life works and how it is structured. Creating good (compression) models for DNA is one of the ways to achieve these goals. Recently, with the completion of the human genome sequencing, the development of efficient lossless compression methods for DNA sequences has gained considerable interest (Behzadi and Le Fessant, 2005; Cao et al., 2007; Chen et al., 2001; Grumbach and Tahi, 1993; Korodi and Tabus, 2005; 2007; Manzini and Rastero, 2004; Matsumoto et al., 2000; Pinho et al., 2006; 2009; 2008; Rivals et al., 1996). For example, the human genome consists of approximately 3 000 million base pairs (Rowen et al., 1997), whereas the genome of wheat has about 16 000 million (Dennis and Surridge, 2000). Since DNA is based on an alphabet of four different symbols (usually known as nucleotides or bases), namely, Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), without compression it takes approximately 750 MBytes to store the human genome (using log2 4 = 2 bits per symbol) and 4 GBytes to store the genome of wheat. In this chapter, we address the problem of DNA data modeling and coding. We review the main approaches proposed in the literature over the last fifteen years and we present some recent advances attained with finite-context models (Pinho et al., 2006; 2009; 2008).
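The uncompressed sizes quoted above follow directly from the figure of 2 bits per base. A minimal sketch of that arithmetic in Python, where the function name `raw_size_bytes` is illustrative and 1 MByte is assumed to mean 10^6 bytes:

```python
import math


def raw_size_bytes(num_bases: int, alphabet_size: int = 4) -> float:
    """Bytes needed to store `num_bases` symbols with a fixed-length code."""
    bits_per_symbol = math.log2(alphabet_size)  # log2(4) = 2 bits for A, C, G, T
    return num_bases * bits_per_symbol / 8


# Genome lengths as quoted in the abstract (approximate base-pair counts).
human_bases = 3_000_000_000
wheat_bases = 16_000_000_000

print(raw_size_bytes(human_bases) / 1e6, "MBytes")  # 750.0 MBytes
print(raw_size_bytes(wheat_bases) / 1e9, "GBytes")  # 4.0 GBytes
```

Any model that predicts the next base better than a uniform guess can, in principle, bring the cost below this 2 bits per base baseline.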
Low-order finite-context models have been used for DNA compression as a secondary, fallback method. However, we have shown that models of orders higher than four are indeed able to attain significant compression performance. Initially, we proposed a three-state finite-context model for DNA protein-coding regions, i.e., for the parts of the DNA that carry information regarding how proteins are synthesized (Ferreira et al., 2006; Pinho et al., 2006). This three-state model proved to be better than a single-state model, giving additional evidence of a phenomenon that is common in these protein-coding regions: the periodicity of period three.
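To make the idea concrete, the following is a rough sketch of an adaptive finite-context model and of a three-state variant that keeps one set of counts per position modulo 3. It is not the authors' implementation: the function name `fcm_bits_per_base`, the additive-smoothing estimator with parameter `alpha`, and the toy input are all assumptions made for illustration. The function reports the ideal code length (the sum of -log2 of the predicted probabilities) rather than running an actual arithmetic coder.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def fcm_bits_per_base(seq: str, order: int = 2, alpha: float = 1.0,
                      states: int = 1) -> float:
    """Average ideal code length (bits/base) of an adaptive finite-context model.

    order  -- number of preceding bases used as the context
    alpha  -- additive-smoothing constant (assumed estimator; alpha=1 is Laplace)
    states -- 1 for a single count table, 3 for one table per position mod 3
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    # One table per state; each context maps to a count vector over A, C, G, T.
    tables = [defaultdict(lambda: [0] * len(ALPHABET)) for _ in range(states)]
    total_bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]   # shorter contexts at the start
        counts = tables[i % states][context]
        sym = ALPHABET.index(base)
        prob = (counts[sym] + alpha) / (sum(counts) + alpha * len(ALPHABET))
        total_bits += -math.log2(prob)       # cost of coding this base
        counts[sym] += 1                     # update counts after coding
    return total_bits / len(seq)


# Toy input with an exact period of three (hypothetical, not real DNA).
toy = "ACG" * 100
print("single-state, order 0:", fcm_bits_per_base(toy, order=0, states=1))
print("three-state,  order 0:", fcm_bits_per_base(toy, order=0, states=3))
```

On this artificial sequence the three-state model needs far fewer bits per base than the single-state one, because each state sees only one symbol. Real protein-coding regions show a much weaker period-three effect, so the gain there would be smaller.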

Citations

DNA Lossless Compression Algorithms: Review

Nour S. Bakr, +1 more

- 01 Jan 2013

TL;DR: This paper comparatively surveys the main ideas and results of lossless compression algorithms that have been developed for DNA sequences and their effects on data storage costs.

40

Book Chapter•10.1007/978-3-642-19914-1_29

Compressing the Human Genome Using Exclusively Markov Models

Diogo Pratas, +1 more

- 01 Jan 2011

TL;DR: Finite-context models that rely exclusively on the Markov property are investigated and some properties of these models are used in order to improve the compression.

19

Proceedings Article•10.1109/SSP.2011.5967639

DNA synthetic sequences generation using multiple competing Markov models

Diogo Pratas, +4 more

- 28 Jun 2011

TL;DR: This paper presents results regarding a synthetic DNA generator based on multiple competing finite-context models, and shows that DNA is better represented by multiple finite-context models.

6

Journal Article•10.1093/gigascience/giad101

AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data

Jorge Miguel Silva, +3 more

- 28 Dec 2022

- GigaScience

TL;DR: AlcoR is a novel method for identifying and visualizing low-complexity regions in biological sequences. It is alignment-free and enables the use of models with different memories to distinguish local from distant low-complexity patterns.

2

Book Chapter•10.5772/60650

Lossy-to-Lossless Compression of Biomedical Images Based on Image Decomposition

Luís M. O. Matos, +2 more

- 28 Oct 2015

TL;DR: This chapter studies the performance of several compression methods developed by the authors, as well as of image coding standards, when used to compress medical images (computed radiography, computed tomography, magnetic resonance, and ultrasound), RNAi images, and microarray images.

1

References

Journal Article•10.1109/TIT.1977.1055714

A universal algorithm for sequential data compression

Jacob Ziv, +1 more

- 01 May 1977

- IEEE Transactions on Information Theory

TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.

6.3K

Journal Article•10.1098/RSPA.1946.0056

An invariant form for the prior probability in estimation problems.

Harold Jeffreys

- 24 Sep 1946

- Proceedings of The Royal Society A: Math...

TL;DR: It is shown that a certain differential form depending on the values of the parameters in a law of chance is invariant for all transformations of the parameter when the law is differentiable with regard to all parameters.

2.6K

Book

Data Compression: The Complete Reference

David Salomon

- 01 Dec 2006

TL;DR: Detailed descriptions and explanations of the most well-known and frequently used compression methods are covered in a self-contained fashion, with an accessible style and technical level for specialists and nonspecialists.

2K

Journal Article•10.1109/TIT.1981.1056331

The performance of universal encoding

R. Krichevsky, +1 more

- 01 Mar 1981

- IEEE Transactions on Information Theory

TL;DR: Universal coding theory is surveyed from the viewpoint of the interplay between delay and redundancy and the price for universality turns out to be acceptably small.

709

Book

Data Compression: The Complete Reference, 3rd Edition

David Salomon

- 01 Jan 2004

286
