Top 67 papers published on the topic of alignment-free sequence analysis in 2003

Showing papers on "Alignment-free sequence analysis" published in 2003

Journal Article•10.1093/BIOINFORMATICS/BTG1005•

Glocal alignment: finding rearrangements during alignment

Michael Brudno¹, Sanket Malde, Alexander Poliakov, Chuong B. Do, Olivier Couronne, Inna Dubchak, Serafim Batzoglou¹•Institutions (1)

Stanford University¹

03 Jul 2003-Bioinformatics

TL;DR: Shuffle-LAGAN is presented, a glocal alignment algorithm that is based on the CHAOS local alignment algorithm and the LAGAN global aligner, and is able to align long genomic sequences.

Abstract: Motivation: To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. The two main classes of pairwise alignments are global alignment, where one string is transformed into the other, and local alignment, where all locations of similarity between the two strings are returned. Global alignments are less prone to demonstrating false homology as each letter of one sequence is constrained to being aligned to only one letter of the other. Local alignments, on the other hand, can cope with rearrangements between non-syntenic, orthologous sequences by identifying similar regions in sequences; this, however, comes at the expense of a higher false positive rate due to the inability of local aligners to take into account overall conservation maps. Results: In this paper we introduce the notion of glocal alignment, a combination of global and local methods, where one creates a map that transforms one sequence into the other while allowing for rearrangement events. We present Shuffle-LAGAN, a glocal alignment algorithm that is based on the CHAOS local alignment algorithm and the LAGAN global aligner, and is able to align long genomic sequences. To test Shuffle-LAGAN we split the mouse genome into BAC-sized pieces, and aligned these pieces to the human genome. We demonstrate that Shuffle-LAGAN compares favorably in terms of sensitivity and specificity with standard local and global aligners. From the alignments we conclude that about 9% of human/mouse homology may be attributed to small rearrangements, 63% of which are duplications. Availability: Our systems, supplemental information, and the alignment of the human and mouse genomes using ...

521 citations

Journal Article•10.1093/BIOINFORMATICS/BTG295•

A new sequence distance measure for phylogenetic tree construction

Hasan H. Otu¹, Khalid Sayood•Institutions (1)

University of Nebraska–Lincoln¹

01 Nov 2003-Bioinformatics

TL;DR: A new sequence distance measure based on the relative information between the sequences using Lempel-Ziv complexity is proposed, which can be used to construct phylogenetic trees.

Abstract: Motivation: Most existing approaches for phylogenetic inference use multiple alignment of sequences and assume some sort of an evolutionary model. The multiple alignment strategy does not work for all types of data, e.g. whole genome phylogeny, and the evolutionary models may not always be correct. We propose a new sequence distance measure based on the relative information between the sequences using Lempel-Ziv complexity. The distance matrix thus obtained can be used to construct phylogenetic trees. Results: The proposed approach does not require sequence alignment and is totally automatic. The algorithm has successfully constructed consistent phylogenies for real and simulated data sets. Availability: Available on request from the authors.
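
A minimal Python sketch of a Lempel-Ziv-based distance, added for illustration. It assumes the LZ76 "exhaustive history" definition of complexity and one normalized distance form, max(c(ST) - c(S), c(TS) - c(T)) / max(c(S), c(T)); the abstract does not spell out the exact formula, so both the function names and the normalization are assumptions rather than the authors' implementation.

```python
def lz_complexity(s: str) -> int:
    """Number of phrases in the LZ76 exhaustive parsing of s."""
    i, phrases, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text
        # (the source may overlap the phrase itself, except its last symbol).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def lz_distance(s: str, t: str) -> float:
    """Assumed normalized distance: extra phrases needed to describe one
    sequence after having seen the other, relative to the larger complexity."""
    cs, ct = lz_complexity(s), lz_complexity(t)
    extra = max(lz_complexity(s + t) - cs, lz_complexity(t + s) - ct)
    return extra / max(cs, ct)


if __name__ == "__main__":
    print(lz_distance("ACGTACGT", "ACGTACGA"))
    print(lz_distance("ACGTACGT", "TTTTGGGG"))
```

A matrix of such pairwise values could then be passed to any distance-based tree builder, such as neighbor joining.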

378 citations

Journal Article•10.1093/NAR/GKG522•

Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments

Olivier Poirot¹, Eamonn O'Toole, Cedric Notredame•Institutions (1)

Centre national de la recherche scientifique¹

01 Jul 2003-Nucleic Acids Research

TL;DR: Tcoffee@igs, a new server provided to the community by Hewlett Packard computers and the Centre National de la Recherche Scientifique, is a web-based tool dedicated to the computation, the evaluation and the combination of multiple sequence alignments.

Abstract: This paper presents Tcoffee@igs, a new server provided to the community by Hewlett Packard computers and the Centre National de la Recherche Scientifique. This server is a web-based tool dedicated to the computation, the evaluation and the combination of multiple sequence alignments. It uses the latest version of the T-Coffee package. Given a set of unaligned sequences, the server returns an evaluated multiple sequence alignment and the associated phylogenetic tree. This server also makes it possible to evaluate the local reliability of an existing alignment and to combine several alternative multiple alignments into a single new one. Tcoffee@igs can be used for aligning protein, RNA or DNA sequences. Datasets of up to 100 sequences (2000 residues long) can be processed. The server and its documentation are available from: http://igs-server.cnrs-mrs.fr/Tcoffee/.

261 citations

Journal Article•10.1093/BIOINFORMATICS/BTG193•

A hidden Markov model for progressive multiple alignment

Ari Löytynoja¹, Michel C. Milinkovitch¹•Institutions (1)

Free University of Brussels¹

12 Aug 2003-Bioinformatics

TL;DR: A new method for multiple sequence alignment is presented that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process, providing an efficient means by which unreliably aligned columns can be filtered out from a multiple alignment.

Abstract: Motivation: Progressive algorithms are widely used heuristics for the production of alignments among multiple nucleic-acid or protein sequences. Probabilistic approaches providing measures of global and/or local reliability of individual solutions would constitute valuable developments. Results: We present here a new method for multiple sequence alignment that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process. Our method works by iterating pairwise alignments according to a guide tree and defining each ancestral sequence from the pairwise alignment of its child nodes, thus progressively constructing a multiple alignment. Our method allows for the computation of each column's minimum posterior probability, and we show that this value correlates with the correctness of the result, hence providing an efficient means by which unreliably aligned columns can be filtered out from a multiple alignment. Availability: The software is freely available at http://www.ulb.ac.be/sciences/ueg/

180 citations

Journal Article•10.1093/BIOINFORMATICS/BTG200•

Comprehensive aligned sequence construction for automated design of effective probes (CASCADE-P) using 16S rDNA.

Todd Z. DeSantis¹, I. Dubosarskiy¹, S. R. Murray¹, Gary L. Andersen¹•Institutions (1)

Lawrence Berkeley National Laboratory¹

12 Aug 2003-Bioinformatics

TL;DR: The main focus of creating the prokMSA was to provide a comprehensive, categorized, updateable 16S rDNA collection useful as a foundation for any probe selection algorithm.

Abstract: Motivation: Prokaryotic organisms have been identified utilizing the sequence variation of the 16S rRNA gene. Variations steer the design of DNA probes for the detection of taxonomic groups or specific organisms. The long-term goal of our project is to create probe arrays capable of identifying 16S rDNA sequences in unknown samples. This necessitated the authentication, categorization and alignment of the >75 000 publicly available ‘16S’ sequences. Preferably, the entire process should be computationally administrated so the aligned collection could periodically absorb 16S rDNA sequences from the public records. A complete multiple sequence alignment would provide a foundation for computational probe selection and facilitate microbial taxonomy and phylogeny. Results: Here we report the alignment and similarity clustering of 62 662 16S rDNA sequences and an approach for designing effective probes for each cluster. A novel alignment compression algorithm, NAST (Nearest Alignment Space Termination), was designed to produce the uniform multiple sequence alignment referred to as the prokMSA. From the prokMSA, 9020 Operational Taxonomic Units (OTUs) were found based on transitive sequence similarities. An automated approach to probe design was straightforward using the prokMSA clustered into OTUs. As a test case, multiple probes were computationally picked for each of the 27 OTUs that were identified within the Staphylococcus Group. The probes were incorporated into a customized microarray and were able to correctly categorize Staphylococcus aureus and Bacillus anthracis into their correct OTUs. Although a successful probe picking strategy is outlined, the main focus of creating the prokMSA was to provide a comprehensive, categorized, updateable 16S rDNA collection useful as a foundation for any probe selection algorithm. Availability: http://greengenes.llnl.gov/16S/

133 citations

Journal Article•10.1104/PP.102.018101•

Refined annotation of the Arabidopsis genome by complete expressed sequence tag mapping.

Wei Zhu¹, Shannon D. Schlueter¹, Volker Brendel¹•Institutions (1)

Iowa State University¹

01 Jun 2003-Plant Physiology

TL;DR: The complete set of 176,915 publicly available Arabidopsis EST sequences is mapped using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring; the mapping provides verified sets of EST clusters for evaluation of EST clustering programs.

Abstract: Expressed sequence tags (ESTs) currently encompass more entries in the public databases than any other form of sequence data. Thus, EST data sets provide a vast resource for gene identification and expression profiling. We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-canonical pre-mRNA splicing. All results of this study were parsed into a database and are accessible via a flexible Web interface at http://www.plantgdb.org/AtGDB/.

124 citations

Journal Article•10.1093/BIOINFORMATICS/19.2.228•

A generalized global alignment algorithm

Xiaoqiu Huang¹, Kun-Mao Chao²•Institutions (2)

Iowa State University¹, National Taiwan University²

22 Jan 2003-Bioinformatics

TL;DR: A generalized global alignment algorithm is presented for comparing sequences with intermittent similarities (an ordered list of similar regions separated by different regions); it is implemented as a computer program named GAP3 (Global Alignment Program Version 3).

Abstract: Motivation: Homologous sequences are sometimes similar over some regions but different over other regions. Homologous sequences have a much lower global similarity if the different regions are much longer than the similar regions. Results: We present a generalized global alignment algorithm for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions. A generalized global alignment model is defined to handle sequences with intermittent similarities. A dynamic programming algorithm is designed to compute an optimal general alignment in time proportional to the product of sequence lengths and in space proportional to the sum of sequence lengths. The algorithm is implemented as a computer program named GAP3 (Global Alignment Program Version 3). The generalized global alignment model is validated by experimental results produced with GAP3 on both DNA and protein sequences. The GAP3 program extends the ability of standard global alignment programs to recognize homologous sequences of lower similarity. Availability: The GAP3 program is freely available for academic use at http://bioinformatics.iastate.edu/aat/align/align.html.
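
The time and space bounds quoted above can be illustrated with a standard global alignment score computed over two rolling rows. This sketch is not GAP3 and does not model difference blocks; it is ordinary Needleman-Wunsch scoring with hypothetical parameters (`match`, `mismatch`, `gap`), shown only to make the "quadratic time, linear space" idea concrete.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Optimal global alignment score in O(len(a) * len(b)) time and
    O(len(b)) extra space, keeping only the previous DP row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Recovering the alignment itself within linear space needs a divide-and-conquer step (Hirschberg's technique), which is omitted here for brevity.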

96 citations

Journal Article•10.1093/BIOINFORMATICS/BTG188•

Pandit: a database of protein and associated nucleotide domains with inferred trees.

Simon Whelan¹, Paul I.W. de Bakker, Nick Goldman•Institutions (1)

University of Cambridge¹

12 Aug 2003-Bioinformatics

TL;DR: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics and may provide inspiration for new models and methodology to study sequence evolution.

Abstract: Motivation: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution. Results: The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of the Pfam database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment, an alignment of DNA sequences that contain the coding sequence of the Pfam alignment when they can be recovered (overall, 82.9% of sequences taken from Pfam) and the alignment of amino acid sequences restricted to only those sequences for which a DNA sequence could be recovered. Each of the alignments has an estimate of the phylogenetic tree associated with it. The tree topologies were obtained using the neighbor joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach. Availability: The Pandit database is available for browsing and download via its home page at http://www.ebi.ac.uk/

54 citations

Journal Article•10.1142/S0219720003000095•

Constrained multiple sequence alignment tool development and its application to RNase family alignment.

Chuan Yi Tang¹, Chin Lung Lu¹, Margaret Dah-Tsyr Chang¹, Yin Te Tsai¹, Yuh-Ju Sun¹, Kun-Mao Chao¹, Jia-Ming Chang¹, Yu Han Chiou¹, Chia Mao Wu¹, Hao Teng Chang¹, Wei I. Chou¹•Institutions (1)

National Tsing Hua University¹

01 Jul 2003-Journal of Bioinformatics and Computational Biology

TL;DR: A heuristic algorithm is designed for computing a constrained multiple sequence alignment (CMSA for short), guaranteeing that the generated alignment satisfies the user-specified constraints that some particular residues should be aligned together.

Abstract: In this paper, we design a heuristic algorithm for computing a constrained multiple sequence alignment (CMSA for short), guaranteeing that the generated alignment satisfies the user-specified constraints that some particular residues should be aligned together. If the number of residues needed to be aligned together is a constant α, then the time complexity of our CMSA algorithm for aligning K sequences is O(αKn⁴), where n is the maximum of the lengths of sequences. In addition, we have built such a CMSA software system and run several experiments on the RNase sequences, which mainly function in catalyzing the degradation of RNA molecules. The resulting alignments illustrate the practicability of our method.

54 citations

Book Chapter•10.1007/978-3-540-39763-2_31•

Optimal Multiple Parsimony Alignment with Affine Gap Cost Using a Phylogenetic Tree

Bjarne Knudsen¹•Institutions (1)

University of Florida¹

15 Sep 2003

TL;DR: A new method for multiple parsimony alignment over a tree using an affine gap cost rather than a simple linear gap cost is presented, which should prove useful in the multiple alignment scenario.

Abstract: Many methods in bioinformatics rely on evolutionary relationships between protein, DNA, or RNA sequences. Alignment is a crucial first step in most analyses, since it yields information about which regions of the sequences are related to each other. Here, a new method for multiple parsimony alignment over a tree is presented. The novelty is that an affine gap cost is used rather than a simple linear gap cost. Affine gap costs have been used with great success for pairwise alignments and should prove useful in the multiple alignment scenario. The algorithmic challenge of using an affine gap cost in multiple alignment is the introduction of dependence between different columns in the alignment. The utility of the new method is illustrated by a number of protein sequences where increased alignment accuracy is obtained by using multiple sequences.
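
To make the affine gap cost concrete, here is a hedged Python sketch of the classic pairwise case (a Gotoh-style recurrence with three states), which is the setting the abstract says has worked well. It is not the multiple-alignment-over-a-tree method of the chapter; the cost parameters (`mismatch`, `gap_open`, `gap_extend`) are hypothetical, and a gap of length k is assumed to cost `gap_open + k * gap_extend`.

```python
import math


def affine_alignment_cost(a: str, b: str,
                          mismatch: int = 1,
                          gap_open: int = 2,
                          gap_extend: int = 1) -> float:
    """Minimum pairwise alignment cost under an affine gap cost."""
    n, m = len(a), len(b)
    inf = math.inf
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] against a gap;
    # Y: b[j-1] against a gap.
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + gap_extend * i
    for j in range(1, m + 1):
        Y[0][j] = gap_open + gap_extend * j
    start = gap_open + gap_extend  # cost of the first gap position
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = min(M[i - 1][j] + start,
                          X[i - 1][j] + gap_extend,  # extend an open gap
                          Y[i - 1][j] + start)
            Y[i][j] = min(M[i][j - 1] + start,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + start)
    return min(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # One gap of length 2 (cost 4) is cheaper than two gaps of length 1 (cost 6).
    print(affine_alignment_cost("ACGT", "AT"))
```

The three tables show the column dependence the abstract mentions: the cost of a gap column depends on whether the previous column was already part of the same gap.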

53 citations

Journal Article•10.1093/NAR/GKG082•

ZmDB, an integrated database for maize genome research

Qunfeng Dong¹, Laura M. Roy, Michael Freeling, Virginia Walbot, Volker Brendel•Institutions (1)

Iowa State University¹

01 Jan 2003-Nucleic Acids Research

TL;DR: Zea mays DataBase (ZmDB) seeks to provide a comprehensive view of maize (corn) genetics by linking genomic sequence data with gene expression analysis and phenotypes of mutant plants, and has broadened its scope to include all public maize ESTs, genome survey sequences, and protein sequences.

Abstract: Zea mays DataBase (ZmDB) seeks to provide a comprehensive view of maize (corn) genetics by linking genomic sequence data with gene expression analysis and phenotypes of mutant plants. ZmDB originated in 1999 as the Web portal for a large project of maize gene discovery, sequencing and phenotypic analysis using a transposon tagging strategy and expressed sequence tag (EST) sequencing. Recently, ZmDB has broadened its scope to include all public maize ESTs, genome survey sequences (GSSs), and protein sequences. More than 170 000 ESTs are currently clustered into approximately 20 000 contigs and about an equal number of apparent singlets. These clusters are continuously updated and annotated with respect to potential encoded protein products. More than 100 000 GSSs are similarly assembled and annotated by spliced alignment with EST and protein sequences. The ZmDB interface provides quick access to analytical tools for further sequence analysis. Every sequence record is linked to several display options and similarity search tools, including services for multiple sequence alignment, protein domain determination and spliced alignment. Furthermore, ZmDB provides web-based ordering of materials generated in the project, including ESTs, ordered collections of genomic sequences tagged with the RescueMu transposon and microarrays of amplified ESTs. ZmDB can be accessed at http://zmdb.iastate.edu/.

Journal Article•10.1093/BIOINFORMATICS/BTG073•

A segment alignment approach to protein comparison.

Yuzhen Ye¹, Lukasz Jaroszewski¹, Weizhong Li¹, Adam Godzik¹•Institutions (1)

Sanford-Burnham Institute for Medical Research¹

12 Apr 2003-Bioinformatics

TL;DR: It is shown that application of the SEA algorithm improves alignment quality as compared to FFAS profile-profile alignment, and in some cases SEA alignments can match the structural alignments, a feat previously impossible for any sequence-based alignment method.

Abstract: Motivation: Local structure segments (LSSs) are small structural units shared by unrelated proteins. They are extensively used in protein structure comparison, and predicted LSSs (PLSSs) are used very successfully in ab initio folding simulations. However, predicted or real LSSs are rarely exploited by protein sequence comparison programs that are based on position-by-position alignments. Results: We developed a SEgment Alignment algorithm (SEA) to compare proteins described as a collection of predicted local structure segments (PLSSs), which is equivalent to an unweighted graph (network). Any specific structure, real or predicted, corresponds to a specific path in this network. SEA then uses a network matching approach to find the two most similar paths in networks representing two proteins. SEA explores the uncertainty and diversity of predicted local structure information to search for a globally optimal solution. It simultaneously solves two related problems: the alignment of two proteins and the local structure prediction for each of them. On a benchmark of protein pairs with low sequence similarity, we show that application of the SEA algorithm improves alignment quality as compared to FFAS profile-profile alignment, and in some cases SEA alignments can match the structural alignments, a feat previously impossible for any sequence-based alignment method. Availability: SEA is freely available for academic users on a web server http://ffas.ljcrf.edu/sea.

Journal Article•10.1089/106652703322756096•

An Eulerian path approach to global multiple alignment for DNA sequences.

Yu Zhang¹, Michael S. Waterman¹•Institutions (1)

University of Southern California¹

01 Jan 2003-Journal of Computational Biology

TL;DR: This work introduces a novel approach to global multiple alignment of DNA sequences, motivated by the Eulerian method for fragment assembly in DNA sequencing, which transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to an Eulerian path problem.

Abstract: With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to an Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.
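
The fragment-assembly idea that motivates this paper can be sketched in a few lines of Python. The example below builds a de Bruijn graph from the distinct k-mers of some fragments and walks an Eulerian path with Hierholzer's algorithm. It illustrates only the motivating assembly technique, not the paper's alignment algorithm; the function names and the choice to merge duplicate k-mers are assumptions, and the sketch expects a non-empty graph that actually has an Eulerian path.

```python
from collections import defaultdict


def de_bruijn_graph(fragments, k):
    """Nodes are (k-1)-mers; each distinct k-mer contributes one edge."""
    kmers = {frag[i:i + k] for frag in fragments
             for i in range(len(frag) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the walk deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_edges = {node: list(targets) for node, targets in graph.items()}
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, targets in out_edges.items():
        balance[node] += len(targets)
        for target in targets:
            balance[target] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1),
                 next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate a node path back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    graph = de_bruijn_graph(["ACGT", "CGTT"], k=3)
    print(spell(eulerian_path(graph)))
```

Because graph construction and the walk are both linear in the number of k-mers, this style of method scales close to linearly with total input size, in line with the speed claim in the abstract.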

Journal Article•10.1093/NAR/GKG561•

MGAlignIt: A web service for the alignment of mRNA/EST and genomic sequences.

Bernett T. K. Lee¹, Tin Wee Tan¹, Shoba Ranganathan¹•Institutions (1)

National University of Singapore¹

01 Jul 2003-Nucleic Acids Research

TL;DR: This work presents here a freely available web service, MGAlignIt, which allows users to effectively visualize the alignment in a graphical manner and to perform limited analysis on the alignment output.

Abstract: Splicing is a biological phenomenon that removes the non-coding sequence from the transcripts to produce a mature transcript suitable for translation. To study this phenomenon, information on the intron–exon arrangement of a gene is essential, usually obtained by aligning mRNA/EST sequences to their cognate genomic sequences. MGAlign is a novel, rapid, memory efficient and practical method for aligning mRNA/EST and genome sequences. We present here a freely available web service, MGAlignIt (http://origin.bic.nus.edu.sg/mgalign/mgalignit), based on MGAlign. Besides the alignment itself, this web service allows users to effectively visualize the alignment in a graphical manner and to perform limited analysis on the alignment output. The server also permits the alignment to be saved in several forms, both graphical and text, suitable for further processing and analysis by other programs.

Journal Article•10.1093/BIOINFORMATICS/19.2.297•

SEGID: Identifying Interesting Segments in (Multiple) Sequence Alignments

Lusheng Wang¹, Ying Xu•Institutions (1)

City University of Hong Kong¹

22 Jan 2003-Bioinformatics

TL;DR: SEGID is a tool for finding conserved regions (regions of high scores) for a given (multiple) sequence alignment that converts the alignment into a sequence of numbers, where each number is the alignment score of a column.

Abstract: Summary: SEGID is a tool for finding conserved regions (regions of high scores) for a given (multiple) sequence alignment. It takes a (multiple) sequence alignment as its input and converts the alignment into a sequence of numbers, where each number is the alignment score of a column. Three algorithms are used to identify regions of high scores. A graphical interface is provided to present those identified regions. Availability: Free from http://www.cs.cityu.edu.hk/~lwang/segid/ subject to copyright restrictions.
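
The two steps described above (score each column, then find a high-scoring region) might look like the following Python sketch. The abstract does not name SEGID's three algorithms, so this uses a simple sum-of-pairs column score with hypothetical `match` and `mismatch` values and a maximum-sum segment search (Kadane's algorithm) as a stand-in.

```python
from itertools import combinations


def column_scores(alignment, match=1, mismatch=-1):
    """Sum-of-pairs score per column; any pair involving a gap ('-')
    is scored as a mismatch."""
    scores = []
    for column in zip(*alignment):
        scores.append(sum(match if x == y and x != "-" else mismatch
                          for x, y in combinations(column, 2)))
    return scores


def max_scoring_segment(scores):
    """Return (start, end, total) for the contiguous run of columns
    with the highest total score; end is exclusive."""
    best_sum, best_start, best_end = scores[0], 0, 1
    curr_sum, curr_start = 0, 0
    for i, s in enumerate(scores):
        if curr_sum <= 0:
            curr_sum, curr_start = s, i  # restart the running segment
        else:
            curr_sum += s
        if curr_sum > best_sum:
            best_sum, best_start, best_end = curr_sum, curr_start, i + 1
    return best_start, best_end, best_sum


if __name__ == "__main__":
    rows = ["ACGTA", "TCGTC", "GCGTT"]
    print(column_scores(rows))
    print(max_scoring_segment(column_scores(rows)))
```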

Proceedings Article•10.1142/9789812704856_0004•

Genome-wide detection of alternative splicing in expressed sequences using partial order multiple sequence alignment graphs.

Catherine S. Grasso¹, Barmak Modrek, Yi Xing, Christopher Lee¹•Institutions (1)

University of California, Los Angeles¹

1 Dec 2003

TL;DR: This method effectively copes with many of the problems inherent in making inferences about splicing and alternative splicing on the basis of EST sequences, which in addition to being fragmentary and full of sequencing errors, may also be chimeric, misoriented, or contaminated with genomic sequence.

Abstract: We present a method for high-throughput alternative splicing detection in expressed sequence data. This method effectively copes with many of the problems inherent in making inferences about splicing and alternative splicing on the basis of EST sequences, which in addition to being fragmentary and full of sequencing errors, may also be chimeric, misoriented, or contaminated with genomic sequence. Our method, which relies both on the Partial Order Alignment (POA) program for constructing multiple sequence alignments, and its Heaviest Bundling function for generating consensus sequences, accounts for the real complexity of expressed sequence data by building and analyzing a single multiple sequence alignment containing all of the expressed sequences in a particular cluster aligned to genomic sequence. We illustrate application of this method to human UniGene Cluster Hs.1162, which contains expressed sequences from the human HLA-DMB gene. We have used this method to generate databases, published elsewhere, of splices and alternative splicing relationships for the human, mouse and rat genomes. We present statistics from these calculations, as well as the CPU time for running our method on expressed sequence clusters of varying size, to verify that it truly scales to complete genomes.

Proceedings Article•10.1109/CSB.2003.1227338•

Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance

Bailin Hao¹, Ji Qi²•Institutions (2)

Fudan University¹, Academia Sinica²

11 Aug 2003

TL;DR: A new and essentially simple method to reconstruct prokaryotic phylogenetic trees from their complete genome data without using sequence alignment is proposed, based on the appearance frequency of oligopeptides of a fixed length in their proteomes.

Abstract: A new and essentially simple method to reconstruct prokaryotic phylogenetic trees from their complete genome data without using sequence alignment is proposed. It is based on the appearance frequency of oligopeptides of a fixed length (up to K=6) in their proteomes. This is a method without fine adjustment and choice of genes. It can incorporate the effect of lateral gene transfer to some extent and leads to results comparable with the bacteriologists' systematics as reflected in the latest 2001 edition of Bergey's Manual of Systematic Bacteriology. A key point in our approach is subtraction of a random background by using a Markovian model of order K-1 from the composition vectors to highlight the shaping role of natural selection.
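
A hedged Python sketch of the composition-vector idea follows. It assumes the background frequency of a K-string is predicted from its (K-1)- and (K-2)-substrings as p0 = p(prefix) * p(suffix) / p(middle), that each vector component is the relative deviation (p - p0) / p0, and that two vectors are compared through their cosine correlation C as (1 - C) / 2. These formulas are assumptions that the abstract does not state, and the sketch keeps only words actually observed in the sequence, which is a simplification.

```python
from collections import Counter
from math import sqrt


def kmer_freqs(seq, k):
    """Relative frequency of every k-string occurring in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def composition_vector(seq, k):
    """Background-subtracted composition vector (requires k >= 3)."""
    p_k = kmer_freqs(seq, k)
    p_k1 = kmer_freqs(seq, k - 1)
    p_k2 = kmer_freqs(seq, k - 2)
    vec = {}
    for word, p in p_k.items():
        # Markov-style prediction from the shorter overlapping words.
        p0 = p_k1[word[:-1]] * p_k1[word[1:]] / p_k2[word[1:-1]]
        vec[word] = (p - p0) / p0
    return vec


def cv_distance(u, v):
    """(1 - cosine correlation) / 2, a value between 0 and 1."""
    dot = sum(x * v.get(word, 0.0) for word, x in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return (1 - dot / (norm_u * norm_v)) / 2


if __name__ == "__main__":
    a = composition_vector("MKVLAAGIVGLLLAMKVLA", 3)
    b = composition_vector("MKTLLAGAVGILLAMRVLS", 3)
    print(cv_distance(a, b))
```

In practice the input would be a whole proteome rather than a short peptide, and the resulting distance matrix would feed a tree-building method.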

Journal Article•10.1186/GB-2003-4-12-122•

Multi-species sequence comparison: the next frontier in genome annotation

Inna Dubchak¹, Kelly A. Frazer•Institutions (1)

Lawrence Berkeley National Laboratory¹

27 Nov 2003-Genome Biology

TL;DR: Efficient extension of computational tools to multiple species will require knowledge of the ideal evolutionary distance to choose and the development of new algorithms for alignment, analysis of conservation, and visualization of results.

Abstract: Multi-species comparisons of DNA sequences are more powerful for discovering functional sequences than pairwise DNA sequence comparisons. Most current computational tools have been designed for pairwise comparisons, and efficient extension of these tools to multiple species will require knowledge of the ideal evolutionary distance to choose and the development of new algorithms for alignment, analysis of conservation, and visualization of results.

Proceedings Article•

Visualization and comparison of DNA sequences by use of three-dimensional trajectories

Hsuan T. Chang¹, Neng-Wen Lo², Wei C. Lu³, Chung J. Kuo³•Institutions (3)

National Yunlin University of Science and Technology¹, Tunghai University², National Chung Cheng University³

1 Jan 2003

TL;DR: The proposed visualization tool for deoxyribonucleic acid (DNA) sequences by the use of three-dimensional (3-D) trajectories (TDT) can easily discriminate the differences and similarities among various DNA sequences.

Abstract: In this paper, we propose a visualization tool for deoxyribonucleic acid (DNA) sequences by the use of three-dimensional (3-D) trajectories (TDT). In the proposed method, the four different nucleotides are assigned to four corresponding positions in the 3-D space, which are equally spaced and whose centroid is the origin. By accumulating the displacements between consecutive positions, a DNA sequence can be represented by a trajectory in the 3-D space. A global view of the DNA sequence can thus be obtained no matter how large the sequence is. The alignment of two DNA sequences can be determined by the use of a correlation operation on the trajectories. From our simulation results, the TDTs for DNA sequences with different functions vary a lot and are thus easy to distinguish. On the other hand, there exist some similarities between the trajectories for the same type of DNA sequences obtained from different kinds of creatures. Therefore, in addition to the low computational complexity, the proposed visualization tool can easily discriminate the differences and similarities among various DNA sequences.
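
One way to picture the trajectory construction is the Python sketch below. It assumes the four nucleotides sit at the vertices of a regular tetrahedron centred on the origin and that the trajectory is the running sum of those vectors; the specific nucleotide-to-vertex assignment in `STEP` is hypothetical, since the abstract does not give it.

```python
# Four equally spaced points whose centroid is the origin
# (alternating vertices of a cube form a regular tetrahedron).
STEP = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def trajectory(seq):
    """Cumulative 3-D path of a DNA sequence, starting at the origin."""
    x = y = z = 0
    points = [(0, 0, 0)]
    for base in seq.upper():
        dx, dy, dz = STEP[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    for point in trajectory("AACGT"):
        print(point)
```

Because the four step vectors sum to zero, a sequence with balanced base composition drifts back toward the origin, while compositional bias shows up as a steady drift in one direction.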


Journal Article•10.1093/BIOINFORMATICS/BTG1077•

Divide-and-conquer multiple alignment with segment-based constraints


Michael Sammeth¹, Burkhard Morgenstern, Jens Stoye•Institutions (1)

Bielefeld University¹

27 Sep 2003

TL;DR: A new algorithm for multiple sequence alignment is introduced that integrates the global divide-and-conquer approach with the local segment-based approach, thereby combining the strengths of those two strategies.


Abstract: A large number of methods for multiple sequence alignment are currently available. Recent benchmarking tests demonstrated that strengths and drawbacks of these methods differ substantially. Global strategies can be outperformed by approaches based on local similarities and vice versa, depending on the characteristics of the input sequences. In recent years, mixed approaches that include both global and local features have shown promising results. Herein, we introduce a new algorithm for multiple sequence alignment that integrates the global divide-and-conquer approach with the local segment-based approach, thereby combining the strengths of those two strategies.


Proceedings Article•10.1109/CEC.2003.1299591•

Improvement of clustal-derived sequence alignments with evolutionary algorithms


René Thomsen¹, Gary B. Fogel, Thiemo Krink•Institutions (1)

Aarhus University¹

8 Dec 2003

TL;DR: Previous efforts using evolutionary algorithms (EAs) for MSA were extended and three new alignment operators were introduced and tested within the framework of protein sequence alignment, showing the degree to which EAs can enhance the results of Clustal X.


Abstract: Multiple sequence alignment (MSA) is a central problem in bioinformatics. In this study, we extended previous efforts using evolutionary algorithms (EAs) for MSA. Candidate solutions in the initial population were derived from the well-known alignment program Clustal X. Evolutionary computation was then used to evolve increasingly appropriate solutions. Three new alignment operators were introduced and tested within the framework of protein sequence alignment. Statistics on alignment quality were generated with respect to selected alignment benchmarks from the BAliBASE database using the BLOSUM 62 substitution matrix. Our results indicate the degree to which EAs can enhance the results of Clustal X. Moreover, the experimental results show that the commonly used sum-of-pairs scoring scheme sometimes fails to correlate higher-scoring alignments with an increase in alignment quality in terms of the BAliBASE sum-of-pairs score.
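
For readers unfamiliar with the sum-of-pairs scoring scheme mentioned in this abstract, the sketch below scores an alignment by summing a pairwise score over every pair of rows in every column. It is an illustration only: the match, mismatch and gap values are placeholder assumptions standing in for the BLOSUM 62 matrix used in the paper, and `sum_of_pairs` is a hypothetical name.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    Scores are simple placeholders (an assumption), not a substitution
    matrix. A column pair of two gaps contributes nothing.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # gap against gap is ignored
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total
```

For example, `sum_of_pairs(["AC-T", "ACGT", "A-GT"])` scores the two fully matching columns at +3 each and the two gapped columns at -3 each.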


Journal Article•10.1016/S0022-2836(03)00858-1•

Potential for dramatic improvement in sequence alignment against structures of remote homologous proteins by extracting structural information from multiple structure alignment.


Ziding Zhang¹, Mats Lindstam¹, Johan Unge¹, Carsten Peterson¹, Guoguang Lu¹•Institutions (1)

Lund University¹

05 Sep 2003-Journal of Molecular Biology

TL;DR: The results indicate that the novel alignment strategy could be helpful for extending the application of highly reliable methods for fold identification and homology modeling to a huge number of homologous proteins of low sequence similarity.


Sequence Alignment Algorithms


Sérgio Anibal de Carvalho

1 Jan 2003

TL;DR: This work is concerned with efficient methods for practical biomolecular sequence comparison, focusing on global and local alignment algorithms. It analyses the classical approaches of Needleman & Wunsch and Smith & Waterman as well as efficient alternatives, in particular the algorithms recently designed by Crochemore, Landau and Ziv-Ukelson that use compression techniques to achieve sub-quadratic time complexity.


Abstract: Dedication: No one deserves my gratitude more than my parents, for all the love, support, strength and example of life they have always given me. Preface: The discovery of the DNA structure in 1953 has dramatically changed how biology is studied. It has opened a new frontier in the development of this exciting science. Biologists are working today to "decipher" the DNA of every form of life on earth, producing an extraordinary amount of data that needs to be analysed. No doubt this is why they are appealing to computer scientists and the expertise developed in the last decades on information storage, retrieval and analysis. This merging of biology and computer science has created a new interdisciplinary field known as computational biology that explores the capacities of computers to gain knowledge from biological data. In fact, researchers can learn a great deal about a biomolecular sequence by comparing it to already well-studied sequences. For this reason, sequence comparison is regarded as one of the most fundamental problems of computational biology, which is usually solved with a technique known as sequence alignment. This work is concerned with efficient methods for practical biomolecular sequence comparison, focusing on global and local alignment algorithms. It analyses the classical approaches of Needleman & Wunsch and Smith & Waterman as well as efficient alternatives; in particular, the algorithms recently designed by Crochemore, Landau and Ziv-Ukelson that use compression techniques to achieve sub-quadratic time complexity. Chapter 1 presents a brief introduction to the field of computational biology and the sequence comparison problem. Chapter 2 discusses how two sequences can be compared by finding the best alignment between them, and describes standard and alternative algorithms to compute an optimal alignment. Chapters 3, 4 and 5 are devoted to the design, implementation and evaluation of a library of computational biology algorithms developed as part of this work with the aim of studying the alignment algorithms described in Chapter 2. Acknowledgments: I would like to thank my supervisor, Professor Maxime Crochemore, for his guidance throughout the development of this project.
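
The classical global alignment approach named in this abstract can be sketched as the standard quadratic-time dynamic programme below. It is not taken from the thesis or its library: the function name and the linear scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration, and the sub-quadratic compression-based algorithms are not shown.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b.

    Classical O(len(a) * len(b)) dynamic programme with a linear gap
    penalty, keeping only two rows of the score table in memory.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Only the score is returned here; recovering the alignment itself would require keeping the full table (or traceback pointers) rather than two rows.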


Book Chapter•10.1016/S0065-3233(03)01026-X•

Proteomics and bioinformatics.


Carol S. Giometti¹•Institutions (1)

Argonne National Laboratory¹

01 Jan 2003-Advances in Protein Chemistry

TL;DR: This chapter focuses on the use of existing bioinformatics approaches in the analysis of global protein expression using two-dimensional electrophoresis and mass spectrometry and on the extension of bioinformatics to include the analysis and management of the proteome data being produced.


Abstract: Publisher Summary: The chapter focuses on the use of existing bioinformatics approaches in the analysis of global protein expression using two-dimensional electrophoresis and mass spectrometry and on the extension of bioinformatics to include the analysis and management of the proteome data being produced. Bioinformatics will, of necessity, come to include the acquisition, analysis, and management of such protein-expression data, as well as the integration of those data with genome and protein sequence databases. The chapter discusses several bioinformatics tools, such as (1) sequence databases, (2) sequence analysis, and (3) annotation and proteomics tools, including open reading frame (ORF) databases and proteomics, proteome (2DE) databases, and database integration. In the context of protein expression, the ORF sequences are used to select specific nucleotide sequences to include on the complementary DNA (cDNA) microarray chips used to quantify changes in specific messenger RNA (mRNA) abundance. The ORF sequences are also used as the starting material for protein arrays, providing the nucleic acid sequence that is incorporated into host cells to produce overexpression of the target proteins for the arrays. A comparative analysis of the metabolic processes of whole cells will be facilitated through the comparison of protein expression and genome sequence data from diverse cell types and the dream of computational cell modeling. Fully, and even partially, annotated genome databases have multiple uses in the broad context of proteomics, including protein structure and function prediction, as well as protein-expression analysis.


Proceedings Article•

Inferring an original sequence from erroneous copies: a Bayesian approach


Jonathan M. Keith¹, Peter Adams¹, Darryn Bryant¹, Keith Mitchelson¹, Duncan A. E. Cochran¹, Gita H. Lala¹•Institutions (1)

University of Queensland¹

1 Jan 2003

TL;DR: The implication is that high error levels need not be a barrier to the adoption of sequencing technologies that are in other respects promising, because most errors can be detected and corrected using a small number of reads.


Abstract: This paper considers the problem of inferring an original sequence from a number of erroneous copies. The problem arises in DNA sequencing, particularly in the context of emerging technologies that provide high throughput or other advantages, but at the cost of introducing many errors. We develop a Bayesian probabilistic model of the introduction of errors, and search for a sequence that has maximum posterior probability with respect to the model. We present results of extensive tests in which error-prone sequencing of real DNA was simulated. The results obtained using the new approach are compared to results obtained by deriving a consensus sequence from a multiple sequence alignment. We find that a significant improvement in accuracy is obtained using the new approach. The implication is that high error levels need not be a barrier to the adoption of sequencing technologies that are in other respects promising, because most errors can be detected and corrected using a small number of reads.
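
The baseline this abstract compares against, deriving a consensus sequence from a multiple sequence alignment, can be illustrated with a simple column-wise majority vote. This sketch is an assumption about what such a baseline might look like, not the authors' procedure, and it does not implement their Bayesian model; `consensus` is a hypothetical name.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Majority-vote consensus of aligned reads (equal-length strings).

    For each column the most frequent symbol is taken; columns where
    the gap symbol wins are dropped. Ties go to the symbol seen first
    in the column.
    """
    out = []
    for column in zip(*alignment):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            out.append(symbol)
    return "".join(out)
```

With three aligned reads `["ACGT", "ACCT", "A-GT"]`, each isolated error is outvoted and the consensus is `ACGT`.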


Journal Article•10.1023/A:1026145703834•

More for less in structural genomics.


Andreas Heger¹, Liisa Holm¹•Institutions (1)

University of Helsinki¹

01 Jan 2003-Journal of Structural and Functional Genomics

TL;DR: A new transitive alignment algorithm (MaxFlow) is developed, which generates accurate alignments between proteins deep in the twilight zone of sequence similarity, below 20% sequence identity, and proposes novel strategies for target prioritization using MaxFlow scores to predict the optimal templates in a superfamily.


Abstract: Structural genomics is the idea of covering protein space so that every protein sequence comes within model building distance of a protein of known structure. Unfortunately, reproducing the structural alignment of distantly related proteins is a difficult challenge to existing sequence alignment and motif search software. We have developed a new transitive alignment algorithm (MaxFlow), which generates accurate alignments between proteins deep in the twilight zone of sequence similarity, below 20% sequence identity. In particular, MaxFlow reliably identifies conserved core motifs between proteins which are only indirect PSI-Blast neighbours. Based on MaxFlow alignments, useful 3D models can be generated for all members of a superfamily from as few as a single structural template – despite hundreds of representatives at 40% sequence identity level and patchy detection of homology by PSI-Blast. We propose novel strategies for target prioritization using MaxFlow scores to predict the optimal templates in a superfamily. Our results support an increase in the granularity of covering protein space that has potentially enormous economic implications for planning the transition to the full production phase of structural genomics.


Journal Article•10.1142/S0219030303000284•

Inferring an Original Sequence from Erroneous Copies: Two Approaches


Jonathan M. Keith¹, Peter Adams¹, Darryn Bryant¹, Keith Mitchelson¹, Duncan A. E. Cochran¹, Gita H. Lala¹•Institutions (1)

University of Queensland¹

03 Feb 2003-Asia-pacific Biotech News

TL;DR: This paper considers the problem of inferring an original sequence from a number of erroneous copies of DNA sequences, and describes and compares two approaches that have recently been developed by the authors, concluding that the Steiner approach is better for this purpose because it is faster.


Abstract: This paper considers the problem of inferring an original sequence from a number of erroneous copies. The problem arises in DNA sequencing, particularly in the context of emerging technologies that provide high throughput or other advantages at the cost of an increased number of errors. We describe and compare two approaches that have recently been developed by the authors. The first approach searches for a sequence known as a Steiner string; the second searches for the most probable original sequence with respect to a simple Bayesian model of sequencing errors. We present the results of extensive tests in which erroneous copies of real DNA sequences were simulated and the algorithms were used to infer the original sequences. The results are used to compare the two approaches to each other and to a third, more conventional, approach based on multiple sequence alignment. We find that the Bayesian approach is superior to the Steiner approach, which in turn is superior to the alignment approach. The two new algorithms can also be used to construct multiple sequence alignments. We show that the two methods produce alignments of approximately equal quality, and conclude that the Steiner approach is better for this purpose because it is faster. Both methods produce better alignments than a well-known multiple sequence alignment package, for the cases tested.


Book Chapter•10.1007/3-540-45110-2_124•

Evolving consensus sequence for multiple sequence alignment with a genetic algorithm


Conrad Shyu¹, James A. Foster¹•Institutions (1)

University of Idaho¹

12 Jul 2003

TL;DR: An encoding scheme that evolves the consensus sequence for multiple sequence alignment (MSA) with a genetic algorithm (GA) such that the number of generations needed to find the optimal solution is approximately the same regardless of the number of sequences.


Abstract: In this paper we present an approach that evolves the consensus sequence [25] for multiple sequence alignment (MSA) with a genetic algorithm (GA). We have developed an encoding scheme such that the number of generations needed to find the optimal solution is approximately the same regardless of the number of sequences. Instead, it depends only on the length of the template and the similarity between sequences. The objective function gives a sum-of-pairs (SP) score as the fitness value. We conducted some preliminary studies and compared our approach with the commonly used heuristic alignment program Clustal W. Results have shown that the GA can indeed scale and perform well.


Proceedings Article•10.1109/CEC.2003.1299894•

Improved GA-based method for multiple protein sequence alignment


Hung Dinh Nguyen, Kunihito Yamamori, Ikuo Yoshihara, Moritoshi Yasunaga

8 Dec 2003

TL;DR: Local alignment information is added to the weighted sum-of-pairs objective function to achieve better alignment from the biological viewpoint and the PHGA is extended to run in parallel on a cluster of machines instead of a multi-processor machine to speed it up.


Abstract: In previous work, we proposed a parallel hybrid genetic algorithm (PHGA) that can find high-quality solutions, from the mathematical viewpoint, for multiple protein sequence alignment. We present new improvements to the PHGA. Local alignment information is added to the weighted sum-of-pairs objective function to achieve better alignment from the biological viewpoint. We also extend our method to run in parallel on a cluster of machines instead of a multi-processor machine to speed it up.


Proceedings Article•10.1109/CSB.2003.1227381•

Automatic recognition of regions of intrinsically poor multiple alignment using machine learning


Y. Shan¹, E.E. Milios¹, A.J. Roger, C. Blouin, E. Susko•Institutions (1)

Dalhousie University¹

11 Aug 2003

TL;DR: The results of a machine learning approach to detect regions of poor alignment automatically are presented, comparing Naive Bayes, C4.5 decision tree and support vector machine (SVM) approaches.


Abstract: Phylogenetic analysis requires alignment of gene or protein sequences. Some regions of genes evolve fast and suffer numerous insertion and deletion events and cannot be aligned reliably with automatic alignment algorithms. Such regions of intrinsically uncertain alignment are currently detected and deleted manually before performing phylogenetic analysis. We present the results of a machine learning approach to detect regions of poor alignment automatically. We compare the results obtained from Naive Bayes (NB), C4.5 decision tree (C4.5) and support vector machine (SVM) approaches.
