LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.
Michael Brudno, Chuong B. Do, Gregory M. Cooper, Michael F. Kim, Eugene Davydov, NISC Comparative Sequencing Program, Eric D. Green, Arend Sidow, Serafim Batzoglou
TL;DR: Both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu.
Abstract: Comparing genomic sequences across related species is a fruitful source of biological insight, because functional elements such as exons tend to exhibit significant sequence similarity, whereas regions that are not functional tend to be less conserved. The first step in comparing genomic sequences is to align them—that is, to map the letters of one sequence to those of the others. There are several categories of alignments: local alignments that identify local similarities between regions of each sequence, and global alignments that find a monotonically increasing map between the letters of each sequence; pairwise alignments that compare two sequences, and multiple alignments that compare several sequences.
Local pairwise alignment methods such as Smith-Waterman (1981), BLAST (Altschul et al. 1990, 1997), BLASTZ (Schwartz et al. 2000), SSAHA (Ning et al. 2001), and BLAT (Kent 2002) are able to pinpoint locations of rearrangements between two sequences, and are suitable for aligning draft sequences or individual reads. Global alignments are important because they reveal the shared order of biological features in the compared species, and produce a more accurate alignment at the base-pair level when the features are in the same order. The best-known global alignment algorithm is Needleman-Wunsch (1970), which requires time proportional to the product of the lengths of the aligned sequences. Unfortunately this algorithm is too inefficient for comparing long genomic sequences. Faster methods have been developed recently: DIALIGN (Morgenstern et al. 1998, Brudno and Morgenstern 2002), MUMmer (Delcher et al. 1999, 2002), GLASS (Batzoglou et al. 2000), WABA (Kent and Zahler 2000), and AVID (Bray et al. 2003). Most of these methods have proven effective in aligning genomic sequences from two closely related organisms, such as human and mouse or Caenorhabditis elegans and C. briggsae, but have not been tested in alignments between distant relatives such as human and fugu.
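The quadratic cost of Needleman-Wunsch mentioned above can be seen in a minimal sketch of the dynamic program. The scoring values (`match`, `mismatch`, `gap`) and the linear gap penalty are illustrative assumptions, not parameters taken from the paper; the point is only that the table has (n+1) x (m+1) cells, so time and memory grow with the product of the sequence lengths.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative scores)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill every cell once: O(n * m) time and memory.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two letters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

For two sequences of a few hundred kilobases each, this table would need on the order of 10^10 to 10^11 cells, which is why the full dynamic program is impractical for long genomic sequences.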
Multiple alignments, a natural extension of two-sequence comparisons, are a powerful way to study biological sequences. Even weak similarity across several sequences usually reveals an important conserved biological feature (Dubchak et al. 2000; Gottgens et al. 2002). Moreover, multiple alignments enable the computation of local rates of evolution, giving a quantitative measure of the strength of evolutionary constraints and the functional importance of local regions (Simon et al. 2002). Multiple alignments are considerably more difficult to compute than are pairwise alignments: the running time scales as the product of the lengths of all the sequences. Formally, the problem is NP-complete (Wang and Jiang 1994; Bonizzoni and Vedova 2001). For this reason heuristic approaches are usually applied, of which the most widely used is progressive alignment, which constructs a multiple alignment by successive applications of a pairwise alignment algorithm. The best-known system based on progressive alignment is perhaps CLUSTALW (Thompson et al. 1994). Some other systems include MULTALIGN (Barton and Sternberg 1987), MULTAL (Taylor 1988), YAMA (Hardison et al. 1993, 1994), and PRRP (Gotoh 1996). DIALIGN (Morgenstern 1999) does not use progressive alignment; instead it uses another heuristic approach to chain local conserved blocks between several sequences into a multiple alignment. These systems can effectively align proteins and relatively short genomic regions, but are not efficient enough to align entire genomes. MGA (Hohl et al. 2002) is a rapid multiple aligner suitable for comparing very close homologs, such as different strains of a bacterium, but is not designed to align distant homologs.
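Progressive alignment, as described above, builds a multiple alignment by repeated pairwise steps. The sketch below is a simplified illustration and not the procedure used by CLUSTALW or MLAGAN: it adds sequences in input order rather than following a guide tree, scores a column against a letter by a plain sum of match/mismatch values, and uses a linear gap penalty. All function names and scoring values are hypothetical.

```python
def align_profile_to_sequence(profile, seq, match=1, mismatch=-1, gap=-2):
    """Align an existing alignment (list of equal-length gapped rows) to one sequence."""
    rows = len(profile)
    n, m = len(profile[0]), len(seq)
    gap_cost = gap * rows  # a gap is charged once per row of the profile

    def column_score(i, c):
        # Sum-of-pairs style score of profile column i against letter c;
        # a '-' in the column simply counts as a mismatch here.
        return sum(match if row[i] == c else mismatch for row in profile)

    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap_cost
    for j in range(1, m + 1):
        score[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(i - 1, seq[j - 1]),
                score[i - 1][j] + gap_cost,
                score[i][j - 1] + gap_cost,
            )

    # Traceback: rebuild every profile row plus the newly added sequence.
    new_rows = [[] for _ in profile]
    new_seq = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + column_score(i - 1, seq[j - 1])):
            for k, row in enumerate(profile):
                new_rows[k].append(row[i - 1])
            new_seq.append(seq[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap_cost:
            for k, row in enumerate(profile):
                new_rows[k].append(row[i - 1])
            new_seq.append("-")
            i -= 1
        else:
            for k in range(rows):
                new_rows[k].append("-")
            new_seq.append(seq[j - 1])
            j -= 1
    aligned = ["".join(reversed(r)) for r in new_rows]
    aligned.append("".join(reversed(new_seq)))
    return aligned


def progressive_align(seqs):
    """Fold sequences into a growing alignment, one pairwise step at a time."""
    profile = [seqs[0]]
    for s in seqs[1:]:
        profile = align_profile_to_sequence(profile, s)
    return profile


if __name__ == "__main__":
    for row in progressive_align(["ACGT", "AGT", "ACT"]):
        print(row)
```

Each step costs time proportional to the product of the current alignment length and the new sequence length, so k sequences need only k - 1 pairwise-style steps instead of one dynamic program over all k sequences at once. The price is that gaps placed early are never revisited, which is why this approach is a heuristic.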
Here we describe novel systems for pairwise and multiple alignment of genomic sequences: LAGAN (Limited Area Global Alignment of Nucleotides), an efficient and reliable pairwise aligner that is suitable for genomic comparison of distantly related organisms, and MLAGAN (Multi-LAGAN), a multiple aligner based on progressive alignment with LAGAN. We tested our systems on sequence from 12 species generated for the genomic segment harboring the cystic fibrosis transmembrane conductance regulator (CFTR) gene (J.W. Thomas, J.W. Touchman, R.W. Blakesley, G.G. Bouffard, S.M. Beckstrom-Sternberg, E.H. Margulies, M. Blanchette, A.C. Siepel, P.J. Thomas, J.C. McDowell, B. Maskeri, N.F. Hansen, M.S. Schwartz, R.J. Weber, W.J. Kent, D. Karolchik, T.C. Bruen, R. Bevan, D.J. Cutler, S. Schwartz, L. Elnitski, J.R. Idol, A.B. Prasad, S.-Q. Lee-Lin, V.V.B. Maduro, M.E. Portnoy, N.L. Dietrich, N. Akhter, K. Ayele, B. Benjamin, K. Cariaga, C.P. Brinkley, S.Y. Brooks, S. Granite, X. Guan, J. Gupta, P. Haghighi, S-L. Ho, M.C. Huang, E. Karlins, P.L. Laric, R. Legaspi, M.J. Lim, Q.L. Maduro, C.A. Masiello, S.D. Mastrian, J.C. McCloskey, R. Pearson, S. Stantripop, E.E. Tiongson, J.T. Tran, C. Tsurgeon, J.L. Vogt, M.A. Walker, K.D. Wetherby, L.S. Wiggins, A.C. Young, L-H. Zhang, K. Osoegawa, B. Zhu, B. Zhao, C.L. Shu, P.J. De Jong, C.E. Lawrence, A.F. Smit, A. Chakravarti, D. Haussler, P. Green, W. Miller, and E.D. Green, in prep.). Based on comparisons with other available alignment programs and benchmarking on standard desktop computer systems, we conclude that LAGAN and MLAGAN are practical and reliable methods for large-scale pairwise and multiple genomic alignment that should prove useful for obtaining alignments of the entire human, mouse, fugu, rat, and other genomes in the context of a whole-genome alignment pipeline.
Citations
MUSCLE: multiple sequence alignment with high accuracy and high throughput
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using k-mer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
45.1K
Versatile and open software for comparing large genomes
Stefan Kurtz, Adam M. Phillippy, Arthur L. Delcher, Michael E. Smoot, Martin Shumway, Corina Antonescu, Steven L. Salzberg
TL;DR: The newest version of MUMmer easily handles comparisons of large eukaryotic genomes at varying evolutionary distances, as demonstrated by applications to multiple genomes.
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project
Ewan Birney,John A. Stamatoyannopoulos,Anindya Dutta,Roderic Guigó,Thomas R. Gingeras,Elliott H. Margulies,Zhiping Weng,Michael Snyder,Emmanouil T. Dermitzakis,Robert E. Thurman,Michael S. Kuehn,Christopher M. Taylor,Shane Neph,Christoph M. Koch,Saurabh Asthana,Ankit Malhotra,Ivan Adzhubei,Jason A. Greenbaum,Robert M. Andrews,Paul Flicek,Patrick J. Boyle,Hua Cao,Nigel P. Carter,Gayle K. Clelland,Sean Davis,Nathan Day,Pawandeep Dhami,Shane C. Dillon,Michael O. Dorschner,Heike Fiegler,Paul G. Giresi,Jeff Goldy,Michael Hawrylycz,Andrew Haydock,Richard Humbert,Keith D. James,Brett E. Johnson,Ericka M. Johnson,Tristan Frum,Elizabeth Rosenzweig,Neerja Karnani,Kirsten Lee,Gregory Lefebvre,Patrick A. Navas,Fidencio Neri,Stephen C. J. Parker,Peter J. Sabo,Richard Sandstrom,Anthony Shafer,David Vetrie,Molly Weaver,Sarah Wilcox,Man Yu,Francis S. Collins,Job Dekker,Jason D. Lieb,Thomas D. Tullius,Gregory E. Crawford,Shamil R. Sunyaev,William Stafford Noble,Ian Dunham,Alexandre Reymond,Alexandre Reymond,Philipp Kapranov,Joel Rozowsky,Deyou Zheng,Robert Castelo,Adam Frankish,Jennifer Harrow,Srinka Ghosh,Albin Sandelin,Ivo L. Hofacker,Robert Baertsch,Damian Keefe,Sujit Dike,Jill Cheng,Heather A. Hirsch,Edward A. Sekinger,Julien Lagarde,Josep F. Abril,Josep F. Abril,Atif Shahab,Christoph Flamm,Christoph Flamm,Claudia Fried,Jörg Hackermüller,Jana Hertel,Manja Lindemeyer,Kristin Missal,Andrea Tanzer,Andrea Tanzer,Stefan Washietl,Jan O. Korbel,Olof Emanuelsson,Jakob Skou Pedersen,Nancy Holroyd,Ruth Taylor,David Swarbreck,Nicholas Matthews,Mark Dickson,Daryl J. Thomas,Matthew T. Weirauch,James G. R. Gilbert,Jorg Drenkow,Ian Bell,Xiaodong Zhao,Kandhadayar G. Srinivasan,Wing-Kin Sung,Hong Sain Ooi,Kuo Ping Chiu,Sylvain Foissac,Tyler Alioto,Michael R. Brent,Lior Pachter,Michael L. Tress,Alfonso Valencia,Siew Woh Choo,Chiou Yu Choo,Catherine Ucla,Caroline Manzano,Carine Wyss,Evelyn Cheung,Taane G. Clark,James B. 
Brown,Madhavan Ganesh,Sandeep Patel,Hari Tammana,Jacqueline Chrast,Charlotte N. Henrichsen,Chikatoshi Kai,Jun Kawai,Ugrappa Nagalakshmi,Jia Qian Wu,Zheng Lian,Jin Lian,Peter E. Newburger,Xueqing Zhang,Peter J. Bickel,John S. Mattick,Piero Carninci,Yoshihide Hayashizaki,Sherman M. Weissman,Tim Hubbard,Richard M. Myers,Jane Rogers,Peter F. Stadler,Peter F. Stadler,Peter F. Stadler,Todd M. Lowe,Chia-Lin Wei,Yijun Ruan,Kevin Struhl,Mark Gerstein,Stylianos E. Antonarakis,Yutao Fu,Eric D. Green,Ulas Karaoz,Adam Siepel,Adam Siepel,James Taylor,Laura A. Liefer,Kris A. Wetterstrand,Peter J. Good,Elise A. Feingold,Mark S. Guyer,Gregory M. Cooper,Gregory M. Cooper,George Asimenos,Colin N. Dewey,Minmei Hou,Sergey Nikolaev,Juan I. Montoya-Burgos,Ari Löytynoja,Simon Whelan,Fabio Pardi,Tim Massingham,Haiyan Huang,Nan Zhang,Nan Zhang,Ian Holmes,James C. Mullikin,Abel Ureta-Vidal,Benedict Paten,Michael Seringhaus,Deanna M. Church,Kate R. Rosenbloom,W. James Kent,Eric A. Stone,Serafim Batzoglou,Nick Goldman,Ross C. Hardison,David Haussler,Webb Miller,Arend Sidow,Nathan D. Trinklein,Zhengdong D. Zhang,Leah O. Barrera,Rhona K. Stuart,David C. King,Adam Ameur,Stefan Enroth,Mark Bieda,Jonghwan Kim,Akshay Bhinge,Nan Jiang,Jun Liu,Fei Yao,Vinsensius B. Vega,Charlie W.H. Lee,Patrick Ng,Annie Yang,Zarmik Moqtaderi,Zhou Zhu,Xiaoqin Xu,Sharon L. Squazzo,Matthew J. Oberley,David R. Inman,Michael A. Singer,Todd Richmond,Kyle J. Munn,Kyle J. Munn,Alvaro Rada-Iglesias,Ola Wallerman,Jan Komorowski,Joanna C. Fowler,Phillippe Couttet,Alexander W. Bruce,Oliver M. Dovey,Peter D. Ellis,Cordelia Langford,David A. Nix,Ghia Euskirchen,Stephen Hartman,Alexander E. Urban,Peter Kraus,Sara Van Calcar,Nate Heintzman,Tae Hoon Kim,Kun Wang,Chunxu Qu,Gary C. Hon,Rosa Luna,Christopher K. Glass,M. Geoff Rosenfeld,Shelley Force Aldred,Sara J. Cooper,Anason S. Halees,Jane M. Lin,Hennady P. Shulha,Xiaoling Zhang,Mousheng Xu,Jaafar N. Haidar,Yong Yu,Vishwanath R. Iyer,Roland Green,Claes Wadelius,Peggy J. 
Farnham,Bing Ren,Rachel A. Harte,Angie S. Hinrichs,Heather Trumbower,Hiram Clawson,Jennifer Hillman-Jackson,Ann S. Zweig,Kayla E. Smith,Archana Thakkapallayil,Galt P. Barber,Robert M. Kuhn,Donna Karolchik,Lluís Armengol,Christine P. Bird,Paul I.W. de Bakker,Andrew D. Kern,Nuria Lopez-Bigas,Joel D. Martin,Barbara E. Stranger,Abigail Woodroffe,Eugene Davydov,Antigone S. Dimas,Eduardo Eyras,Ingileif B. Hallgrímsdóttir,Julian L. Huppert,Michael C. Zody,Gonçalo R. Abecasis,Xavier Estivill,Gerard G. Bouffard,Xiaobin Guan,Nancy F. Hansen,Jacquelyn R. Idol,Valerie Maduro,Baishali Maskeri,Jennifer C. McDowell,Morgan Park,Pamela J. Thomas,Alice C. Young,Robert W. Blakesley,Donna M. Muzny,Erica Sodergren,David A. Wheeler,Kim C. Worley,Huaiyang Jiang,George M. Weinstock,Richard A. Gibbs,Tina Graves,Robert S. Fulton,Elaine R. Mardis,Richard K. Wilson,Michele Clamp,James Cuff,Sante Gnerre,David B. Jaffe,Jean L. Chang,Kerstin Lindblad-Toh,Eric S. Lander,Eric S. Lander,Maxim Koriabine,Mikhail Nefedov,Kazutoyo Osoegawa,Yuko Yoshinaga,Baoli Zhu,Pieter J. de Jong +320 more
TL;DR: Functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project are reported, providing convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts.
Mauve: multiple alignment of conserved genomic sequence with rearrangements.
TL;DR: This work presents methods for identification and alignment of conserved genomic DNA in the presence of rearrangements and horizontal transfer, and evaluates the quality of Mauve alignments against other methods through extensive simulations of genome evolution.
Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes
Adam Siepel, Gill Bejerano, Jakob Skou Pedersen, Angie S. Hinrichs, Minmei Hou, Kate R. Rosenbloom, Hiram Clawson, John Spieth, LaDeana W. Hillier, Stephen Richards, George M. Weinstock, Richard K. Wilson, Richard A. Gibbs, W. James Kent, Webb Miller, David Haussler
TL;DR: A comprehensive search for conserved elements in vertebrate genomes is conducted using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes) and a two-state phylogenetic hidden Markov model (phylo-HMM).
4.4K
References
Basic Local Alignment Search Tool
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.
98.8K
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, David J. Lipman
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
TL;DR: The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved and modifications are incorporated into a new program, CLUSTAL W, which is freely available.
A general method applicable to the search for similarities in the amino acid sequence of two proteins
TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed; with it, one can determine whether significant homology exists between the proteins and trace their possible evolutionary development.
13.2K
Identification of common molecular subsequences.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
11.3K