LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.

doi:10.1101/GR.926603

Open AccessJournal Article10.1101/GR.926603

LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.

Michael Brudno, +8 more

- 01 Apr 2003

- Genome Research

- Vol. 13, Iss: 4, pp 721-731

1.2K

TL;DR: Both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu.

Abstract: Comparing genomic sequences across related species is a fruitful source of biological insight, because functional elements such as exons tend to exhibit significant sequence similarity, whereas regions that are not functional tend to be less conserved. The first step in comparing genomic sequences is to align them—that is, to map the letters of one sequence to those of the others. There are several categories of alignments: local alignments that identify local similarities between regions of each sequence, and global alignments that find a monotonically increasing map between the letters of each sequence; pairwise alignments that compare two sequences, and multiple alignments that compare several sequences. Local pairwise alignment methods such as Smith-Waterman (1981), BLAST (Altschul et al. 1990, 1997), BLASTZ (Schwartz et al. 2000), SSAHA (Ning et al. 2001), and BLAT (Kent 2002) are able to pinpoint locations of rearrangements between two sequences, and are suitable for aligning draft sequences or individual reads. Global alignments are important because they reveal the shared order of biological features in the compared species, and produce a more accurate alignment at the base-pair level when the features are in the same order. The best-known global alignment algorithm is Needleman-Wunsch (1970), which requires time proportional to the product of the lengths of the aligned sequences. Unfortunately this algorithm is too inefficient for comparing long genomic sequences. Faster methods have been developed recently: DIALIGN (Morgenstern et al. 1998, Brudno and Morgenstern 2002), MUMmer (Delcher et al. 1999, 2002), GLASS (Batzoglou et al. 2000), WABA (Kent and Zahler 2000), and AVID (Bray et al. 2003). Most of these methods have proven effective in aligning genomic sequences from two closely related organisms, such as human and mouse or Caenorhabditis elegans and C. briggsae, but have not been tested in alignments between distant relatives such as human and fugu. Multiple alignments, a natural extension of two-sequence comparisons, are a powerful way to study biological sequences. Even weak similarity across several sequences usually reveals an important conserved biological feature (Dubchak et al. 2000; Gottgens et al. 2002). Moreover, multiple alignments enable the computation of local rates of evolution, giving a quantitative measure of the strength of evolutionary constraints and the functional importance of local regions (Simon et al. 2002). Multiple alignments are considerably more difficult to compute than are pairwise alignments: the running time scales as the product of the lengths of all the sequences. Formally, the problem is NP-complete (Wang and Jiang 1994; Bonizzoni and Vedova 2001). For this reason heuristic approaches are usually applied, of which the most widely used is progressive alignment, which constructs a multiple alignment by successive applications of a pairwise alignment algorithm. The best-known system based on progressive alignment is perhaps CLUSTALW (Thompson et al. 1994). Some other systems include MULTALIGN (Barton and Sternberg 1987), MULTAL (Taylor 1988), YAMA (Hardison et al. 1993, 1994), and PRRP (Gotoh 1996). DIALIGN (Morgenstern 1999) does not use progressive alignment; instead it uses another heuristic approach to chain local conserved blocks between several sequences into a multiple alignment. These systems can effectively align proteins and relatively short genomic regions, but are not efficient enough to align entire genomes. MGA (Hohl et al. 2002) is a rapid multiple aligner suitable for comparing very close homologs, such as different strains of a bacterium, but is not designed to align distant homologs. Here we describe novel systems for pairwise and multiple alignment of genomic sequences: LAGAN (Limited Area Global Alignment of Nucleotides), an efficient and reliable pairwise aligner that is suitable for genomic comparison of distantly related organisms, and MLAGAN (Multi-LAGAN), a multiple aligner based on progressive alignment with LAGAN. We tested our systems on sequence from 12 species generated for the genomic segment harboring the cystic fibrosis transmembrane conductance regulator (CFTR) gene (J.W. Thomas, J.W. Touchman, R.W. Blakesley, G.G. Bouffard, S.M. Beckstrom-Sternberg, E.H. Margulies, M. Blanchette, A.C. Siepel, P.J. Thomas, J.C. McDowell, B. Maskeri, N.F. Hansen, M.S. Schwartz, R.J. Weber, W.J. Kent, D. Karolchik, T.C. Bruen, R. Bevan, D.J. Cutler, S. Schwartz, L. Elnitski, J.R. Idol, A.B. Prasad, S.-Q. Lee-Lin, V.V.B. Maduro, M.E. Portnoy, N.L. Dietrich, N. Akhter, K. Ayele, B. Benjamin, K. Cariaga, C.P. Brinkley, S.Y. Brooks, S. Granite, X. Guan, J. Gupta, P. Haghighi, S-L. Ho, M.C. Huang, E. Karlins, P.L. Laric, R. Legaspi, M.J. Lim, Q.L. Maduro, C.A. Masiello, S.D. Mastrian, J.C. McCloskey, R. Pearson, S. Stantripop, E.E. Tiongson, J.T. Tran, C. Tsurgeon, J.L. Vogt, M.A. Walker, K.D. Wetherby, L.S. Wiggins, A.C. Young, L-H. Zhang, K. Osoegawa, B. Zhu, B. Zhao, C.L. Shu, P.J. De Jong, C.E. Lawrence, A.F. Smit, A. Chakravarti, D. Haussler, P. Green, W. Miller, and E.D. Green, in prep.). Based on comparisons with other available alignment programs and benchmarking on standard desktop computer systems, we conclude that LAGAN and MLAGAN are practical and reliable methods for large-scale pairwise and multiple genomic alignment that should prove useful for obtaining alignments of the entire human, mouse, fugu, rat, and other genomes in the context of a whole-genome alignment pipeline.

Chat with Paper

AI Agents for this Paper

Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps

Citations

•Journal Article•10.1093/NAR/GKH340

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Robert C. Edgar

- 01 Mar 2004

- Nucleic Acids Research

TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.

...read moreread less

45.1K

•Journal Article•10.1186/GB-2004-5-2-R12

Versatile and open software for comparing large genomes

Stefan Kurtz, +6 more

- 30 Jan 2004

- Genome Biology

TL;DR: The newest version of MUMmer easily handles comparisons of large eukaryotic genomes at varying evolutionary distances, as demonstrated by applications to multiple genomes.

...read moreread less

5.7K

•Journal Article•10.1038/NATURE05874

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project

Ewan Birney, +320 more

- 14 Jun 2007

- Nature

TL;DR: Functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project are reported, providing convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts.

...read moreread less

5.6K

•Journal Article•10.1101/GR.2289704

Mauve: multiple alignment of conserved genomic sequence with rearrangements.

Aaron E. Darling, +3 more

- 01 Jul 2004

- Genome Research

TL;DR: This work presents methods for identification and alignment of conserved genomic DNA in the presence of rearrangements and horizontal transfer and evaluated the quality of Mauve alignments and drawn comparison to other methods through extensive simulations of genome evolution.

...read moreread less

4.7K

•Journal Article•10.1101/GR.3715005

Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes

Adam Siepel, +15 more

- 01 Aug 2005

- Genome Research

TL;DR: A comprehensive search for conserved elements in vertebrate genomes is conducted, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes), using a two-state phylogenetic hidden Markov model (phylo-HMM).

...read moreread less

4.4K

...

Expand

References

Journal Article•10.1016/S0022-2836(05)80360-2

Basic Local Alignment Search Tool

Stephen F. Altschul, +4 more

- 01 Oct 1990

- Journal of Molecular Biology

TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

...read moreread less

98.8K

•Journal Article•10.1093/NAR/25.17.3389

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Stephen F. Altschul, +6 more

- 01 Sep 1997

- Nucleic Acids Research

TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.

...read moreread less

76.5K

•Journal Article•10.1093/NAR/22.22.4673

Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice

Julie D. Thompson, +2 more

- 11 Nov 1994

- Nucleic Acids Research

TL;DR: The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved and modifications are incorporated into a new program, CLUSTAL W, which is freely available.

...read moreread less

67.3K

Journal Article•10.1016/0022-2836(70)90057-4

A general method applicable to the search for similarities in the amino acid sequence of two proteins

Saul B. Needleman, +1 more

- 28 Mar 1970

- Journal of Molecular Biology

TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

...read moreread less

13.2K

Journal Article•10.1016/0022-2836(81)90087-5

Identification of common molecular subsequences.

Temple F. Smith, +1 more

- 25 Mar 1981

- Journal of Molecular Biology

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other Pair of segments with greater similarity (homology).

...read moreread less

11.3K