TL;DR: Shuffle-LAGAN is presented, a glocal alignment algorithm that is based on the CHAOS local alignment algorithm and the LAGAN global aligner, and is able to align long genomic sequences.
Abstract: Motivation: To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. The two main classes of pairwise alignments are global alignment, where one string is transformed into the other, and local alignment, where all locations of similarity between the two strings are returned. Global alignments are less prone to demonstrating false homology as each letter of one sequence is constrained to being aligned to only one letter of the other. Local alignments, on the other hand, can cope with rearrangements between non-syntenic, orthologous sequences by identifying similar regions in sequences; this, however, comes at the expense of a higher false positive rate due to the inability of local aligners to take into account overall conservation maps. Results: In this paper we introduce the notion of glocal alignment, a combination of global and local methods, where one creates a map that transforms one sequence into the other while allowing for rearrangement events. We present Shuffle-LAGAN, a glocal alignment algorithm that is based on the CHAOS local alignment algorithm and the LAGAN global aligner, and is able to align long genomic sequences. To test Shuffle-LAGAN we split the mouse genome into BAC-sized pieces, and aligned these pieces to the human genome. We demonstrate that Shuffle-LAGAN compares favorably in terms of sensitivity and specificity with standard local and global aligners. From the alignments we conclude that about 9% of human/mouse homology may be attributed to small rearrangements, 63% of which are duplications. Availability: Our systems, supplemental information, and the alignment of the human and mouse genomes using
TL;DR: A new sequence distance measure based on the relative information between the sequences using Lempel-Ziv complexity is proposed, which can be used to construct phylogenetic trees.
Abstract: Motivation Most existing approaches for phylogenetic inference use multiple alignment of sequences and assume some sort of an evolutionary model. The multiple alignment strategy does not work for all types of data, e.g. whole genome phylogeny, and the evolutionary models may not always be correct. We propose a new sequence distance measure based on the relative information between the sequences using Lempel-Ziv complexity. The distance matrix thus obtained can be used to construct phylogenetic trees. Results The proposed approach does not require sequence alignment and is totally automatic. The algorithm has successfully constructed consistent phylogenies for real and simulated data sets. Availability Available on request from the authors.
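The idea of an alignment-free distance built on Lempel-Ziv complexity can be sketched as follows. This is a minimal illustration, not the authors' implementation: `lz_complexity` counts the components of the Lempel-Ziv (1976) exhaustive parsing, and `lz_distance` uses one plausible normalization of the "relative information" between two sequences. The function names and the choice of normalization are assumptions made here, and both inputs are assumed to be non-empty.

```python
def lz_complexity(s):
    """Number of components in the Lempel-Ziv (1976) exhaustive parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the candidate while it can be copied from text that starts
        # before position i (overlapping copies are allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1          # one more component: copied part plus one new symbol
        i += length
    return count


def lz_distance(s, q):
    """Assumed normalization: extra components needed to describe one
    sequence given the other, divided by the larger complexity."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    extra = max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq)
    return extra / max(cs, cq)
```

A matrix of such pairwise distances could then be passed to any distance-based tree-building method.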
TL;DR: Tcoffee@igs, a new server provided to the community by Hewlett Packard computers and the Centre National de la Recherche Scientifique, is a web-based tool dedicated to the computation, the evaluation and the combination of multiple sequence alignments.
Abstract: This paper presents Tcoffee@igs, a new server provided to the community by Hewlett Packard computers and the Centre National de la Recherche Scientifique. This server is a web-based tool dedicated to the computation, the evaluation and the combination of multiple sequence alignments. It uses the latest version of the T-Coffee package. Given a set of unaligned sequences, the server returns an evaluated multiple sequence alignment and the associated phylogenetic tree. This server also makes it possible to evaluate the local reliability of an existing alignment and to combine several alternative multiple alignments into a single new one. Tcoffee@igs can be used for aligning protein, RNA or DNA sequences. Datasets of up to 100 sequences (2000 residues long) can be processed. The server and its documentation are available from: http://igs-server.cnrs-mrs.fr/Tcoffee/.
TL;DR: A new method for multiple sequence alignment is presented that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process, providing an efficient means by which unreliably aligned columns can be filtered out from a multiple alignment.
Abstract: Motivation: Progressive algorithms are widely used heuristics for the production of alignments among multiple nucleic-acid or protein sequences. Probabilistic approaches providing measures of global and/or local reliability of individual solutions would constitute valuable developments. Results: We present here a new method for multiple sequence alignment that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process. Our method works by iterating pairwise alignments according to a guide tree and defining each ancestral sequence from the pairwise alignment of its child nodes, thus progressively constructing a multiple alignment. Our method allows for the computation of each column's minimum posterior probability, and we show that this value correlates with the correctness of the result, hence providing an efficient means by which unreliably aligned columns can be filtered out from a multiple alignment. Availability: The software is freely available at http://www.ulb.ac.be/sciences/ueg/
TL;DR: The main focus of creating the prokMSA was to provide a comprehensive, categorized, updateable 16S rDNA collection useful as a foundation for any probe selection algorithm.
Abstract: Motivation: Prokaryotic organisms have been identified utilizing the sequence variation of the 16S rRNA gene. Variations steer the design of DNA probes for the detection of taxonomic groups or specific organisms. The long-term goal of our project is to create probe arrays capable of identifying 16S rDNA sequences in unknown samples. This necessitated the authentication, categorization and alignment of the >75 000 publicly available ‘16S’ sequences. Preferably, the entire process should be computationally administrated so the aligned collection could periodically absorb 16S rDNA sequences from the public records. A complete multiple sequence alignment would provide a foundation for computational probe selection and facilitate microbial taxonomy and phylogeny. Results: Here we report the alignment and similarity clustering of 62 662 16S rDNA sequences and an approach for designing effective probes for each cluster. A novel alignment compression algorithm, NAST (Nearest Alignment Space Termination), was designed to produce the uniform multiple sequence alignment referred to as the prokMSA. From the prokMSA, 9020 Operational Taxonomic Units (OTUs) were found based on transitive sequence similarities. An automated approach to probe design was straightforward using the prokMSA clustered into OTUs. As a test case, multiple probes were computationally picked for each of the 27 OTUs that were identified within the Staphylococcus Group. The probes were incorporated into a customized microarray and were able to correctly categorize Staphylococcus aureus and Bacillus anthracis into their correct OTUs. Although a successful probe picking strategy is outlined, the main focus of creating the prokMSA was to provide a comprehensive, categorized, updateable 16S rDNA collection useful as a foundation for any probe selection algorithm. Availability: http://greengenes.llnl.gov/16S/
TL;DR: The complete set of 176,915 publicly available Arabidopsis EST sequences is mapped using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring that provides verified sets of EST clusters for evaluation of EST clustering programs.
Abstract: Expressed sequence tags (ESTs) currently encompass more entries in the public databases than any other form of sequence data. Thus, EST data sets provide a vast resource for gene identification and expression profiling. We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-canonical pre-mRNA splicing. All results of this study were parsed into a database and are accessible via a flexible Web interface at http://www.plantgdb.org/AtGDB/
TL;DR: A generalized global alignment algorithm for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions, which is implemented as a computer program named GAP3 (Global Alignment Program Version 3).
Abstract: MOTIVATION Homologous sequences are sometimes similar over some regions but different over other regions. Homologous sequences have a much lower global similarity if the different regions are much longer than the similar regions. RESULTS We present a generalized global alignment algorithm for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions. A generalized global alignment model is defined to handle sequences with intermittent similarities. A dynamic programming algorithm is designed to compute an optimal general alignment in time proportional to the product of sequence lengths and in space proportional to the sum of sequence lengths. The algorithm is implemented as a computer program named GAP3 (Global Alignment Program Version 3). The generalized global alignment model is validated by experimental results produced with GAP3 on both DNA and protein sequences. The GAP3 program extends the ability of standard global alignment programs to recognize homologous sequences of lower similarity. AVAILABILITY The GAP3 program is freely available for academic use at http://bioinformatics.iastate.edu/aat/align/align.html.
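The resource bounds quoted above (time proportional to the product of the sequence lengths, space linear in their lengths) can be illustrated with a standard global alignment score that keeps only two rows of the dynamic programming table. This is a sketch of ordinary Needleman-Wunsch scoring with a linear gap penalty, not the generalized model used by GAP3. The function name and the scoring parameters are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score, using O(len(a) * len(b)) time
    and O(len(b)) memory (two rows of the DP table)."""
    prev = [j * gap for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # a[i-1] against a gap
                         cur[j - 1] + gap)  # b[j-1] against a gap
        prev = cur
    return prev[len(b)]
```

The two-row trick yields only the score. Recovering the alignment itself in linear space requires a divide-and-conquer scheme such as Hirschberg's.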
TL;DR: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics and may provide inspiration for new models and methodology to study sequence evolution.
Abstract: Motivation: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution. Results: The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of the Pfam database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment, an alignment of DNA sequences that contain the coding sequence of the Pfam alignment when they can be recovered (overall, 82.9% of sequences taken from Pfam) and the alignment of amino acid sequences restricted to only those sequences for which a DNA sequence could be recovered. Each of the alignments has an estimate of the phylogenetic tree associated with it. The tree topologies were obtained using the neighbor joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach. Availability: The Pandit database is available for browsing and download via its home page at http://www.ebi.ac.uk/
TL;DR: A heuristic algorithm for computing a constrained multiple sequence alignment (CMSA for short) that guarantees the generated alignment satisfies user-specified constraints that some particular residues should be aligned together.
Abstract: In this paper, we design a heuristic algorithm for computing a constrained multiple sequence alignment (CMSA for short) that guarantees the generated alignment satisfies the user-specified constraints that some particular residues should be aligned together. If the number of residues that need to be aligned together is a constant α, then the time complexity of our CMSA algorithm for aligning K sequences is O(αKn^4), where n is the maximum of the lengths of the sequences. In addition, we have built such a CMSA software system and run several experiments on the RNase sequences, which mainly function in catalyzing the degradation of RNA molecules. The resulting alignments illustrate the practicability of our method.
TL;DR: A new method for multiple parsimony alignment over a tree using an affine gap cost rather than a simple linear gap cost is presented, which should prove useful in the multiple alignment scenario.
Abstract: Many methods in bioinformatics rely on evolutionary relationships between protein, DNA, or RNA sequences. Alignment is a crucial first step in most analyses, since it yields information about which regions of the sequences are related to each other. Here, a new method for multiple parsimony alignment over a tree is presented. The novelty is that an affine gap cost is used rather than a simple linear gap cost. Affine gap costs have been used with great success for pairwise alignments and should prove useful in the multiple alignment scenario. The algorithmic challenge of using an affine gap cost in multiple alignment is the introduction of dependence between different columns in the alignment. The utility of the new method is illustrated by a number of protein sequences where increased alignment accuracy is obtained by using multiple sequences.
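To make the difference between affine and linear gap costs concrete, the sketch below scores a pairwise alignment with an affine gap penalty using the usual three-state recurrence (often attributed to Gotoh). It covers only the pairwise case mentioned above, not the tree-based multiple alignment method of the paper. The function name and the penalty values are illustrative assumptions: a gap of length k costs `gap_open + k * gap_extend`.

```python
NEG_INF = float("-inf")


def affine_gap_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Optimal global alignment score under an affine gap penalty."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap pays gap_open once; extending pays only gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open + gap_extend,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open + gap_extend,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

Because the cost of a gap position depends on whether the previous column was also a gap, adjacent columns are no longer independent, which is the dependence between columns that the abstract refers to.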
TL;DR: Zea mays DataBase (ZmDB) seeks to provide a comprehensive view of maize (corn) genetics by linking genomic sequence data with gene expression analysis and phenotypes of mutant plants by linking ESTs, genome survey sequences, and protein sequences.
Abstract: Zea mays DataBase (ZmDB) seeks to provide a comprehensive view of maize (corn) genetics by linking genomic sequence data with gene expression analysis and phenotypes of mutant plants. ZmDB originated in 1999 as the Web portal for a large project of maize gene discovery, sequencing and phenotypic analysis using a transposon tagging strategy and expressed sequence tag (EST) sequencing. Recently, ZmDB has broadened its scope to include all public maize ESTs, genome survey sequences (GSSs), and protein sequences. More than 170 000 ESTs are currently clustered into approximately 20 000 contigs and about an equal number of apparent singlets. These clusters are continuously updated and annotated with respect to potential encoded protein products. More than 100 000 GSSs are similarly assembled and annotated by spliced alignment with EST and protein sequences. The ZmDB interface provides quick access to analytical tools for further sequence analysis. Every sequence record is linked to several display options and similarity search tools, including services for multiple sequence alignment, protein domain determination and spliced alignment. Furthermore, ZmDB provides web-based ordering of materials generated in the project, including ESTs, ordered collections of genomic sequences tagged with the RescueMu transposon and microarrays of amplified ESTs. ZmDB can be accessed at http://zmdb.iastate.edu/.
TL;DR: It is shown that application of the SEA algorithm improves alignment quality as compared to FFAS profile-profile alignment, and in some cases SEA alignments can match the structural alignments, a feat previously impossible for any sequence-based alignment method.
Abstract: Motivation: Local structure segments (LSSs) are small structural units shared by unrelated proteins. They are extensively used in protein structure comparison, and predicted LSSs (PLSSs) are used very successfully in ab initio folding simulations. However, predicted or real LSSs are rarely exploited by protein sequence comparison programs that are based on position-by-position alignments. Results: We developed a SEgment Alignment algorithm (SEA) to compare proteins described as a collection of predicted local structure segments (PLSSs), which is equivalent to an unweighted graph (network). Any specific structure, real or predicted, corresponds to a specific path in this network. SEA then uses a network matching approach to find the two most similar paths in the networks representing two proteins. SEA explores the uncertainty and diversity of predicted local structure information to search for a globally optimal solution. It simultaneously solves two related problems: the alignment of two proteins and the local structure prediction for each of them. On a benchmark of protein pairs with low sequence similarity, we show that application of the SEA algorithm improves alignment quality as compared to FFAS profile-profile alignment, and in some cases SEA alignments can match the structural alignments, a feat previously impossible for any sequence-based alignment method. Availability: SEA is freely available for academic users on a web server http://ffas.ljcrf.edu/sea.
TL;DR: This work introduces a novel approach to multiple sequence alignment that is fundamentally different from all currently available methods. It is motivated by the Eulerian method for fragment assembly in DNA sequencing, which transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem.
Abstract: With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.
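The de Bruijn graph that motivates this approach can be built in a few lines. The sketch below covers only the graph construction step (nodes are (k-1)-mers, and each k-mer contributes one edge), not the alignment algorithm built on top of it. The function name and the edge-multiplicity representation are assumptions made for illustration.

```python
from collections import Counter


def de_bruijn_edges(sequences, k):
    """Return a Counter mapping (prefix, suffix) edges to their multiplicity.

    Each k-mer w in any input sequence contributes one edge from
    w[:-1] to w[1:], both of which are (k-1)-mers.
    """
    edges = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges
```

Edges shared by many input sequences get a high multiplicity, which gives a rough picture of why regions common to the sequences stand out in such a graph. Building the graph takes a single pass over the input, in time linear in the total number of letters for fixed k.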
TL;DR: This work presents here a freely available web service, MGAlignIt, which allows users to effectively visualize the alignment in a graphical manner and to perform limited analysis on the alignment output.
Abstract: Splicing is a biological phenomenon that removes the non-coding sequence from the transcripts to produce a mature transcript suitable for translation. To study this phenomenon, information on the intron–exon arrangement of a gene is essential, usually obtained by aligning mRNA/EST sequences to their cognate genomic sequences. MGAlign is a novel, rapid, memory efficient and practical method for aligning mRNA/EST and genome sequences. We present here a freely available web service, MGAlignIt (http://origin.bic.nus.edu.sg/mgalign/mgalignit), based on MGAlign. Besides the alignment itself, this web service allows users to effectively visualize the alignment in a graphical manner and to perform limited analysis on the alignment output. The server also permits the alignment to be saved in several forms, both graphical and text, suitable for further processing and analysis by other programs.
TL;DR: SEGID is a tool for finding conserved regions (regions of high scores) for a given (multiple) sequence alignment that converts the alignment into a sequence of numbers, where each number is the alignment score of a column.
Abstract: Summary: SEGID is a tool for finding conserved regions (regions of high scores) for a given (multiple) sequence alignment. It takes a (multiple) sequence alignment as its input and converts the alignment into a sequence of numbers, where each number is the alignment score of a column. Three algorithms are used to identify regions of high scores. A graphical interface is provided to present those identified regions. Availability: Free from http://www.cs.cityu.edu.hk/~lwang/segid/ subject to copyright restrictions.
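The two-step idea described above, scoring each column and then searching the score sequence for high-scoring regions, can be sketched as follows. The abstract does not say which column score or which three algorithms SEGID uses, so everything here is an assumption: columns score +1 when fully conserved and -1 otherwise, and the region search is a plain maximum-sum segment scan (Kadane's algorithm).

```python
def column_scores(alignment):
    """Assumed scoring: +1 for a gap-free, fully conserved column, else -1."""
    return [1 if len(set(col)) == 1 and col[0] != "-" else -1
            for col in zip(*alignment)]


def best_segment(scores):
    """Maximum-sum contiguous segment as (start, end_exclusive, total)."""
    best = (0, 0, 0)                 # the empty segment
    cur_sum, cur_start = 0, 0
    for i, x in enumerate(scores):
        if cur_sum <= 0:             # a non-positive prefix never helps; restart here
            cur_sum, cur_start = 0, i
        cur_sum += x
        if cur_sum > best[2]:
            best = (cur_start, i + 1, cur_sum)
    return best
```

For example, with the rows `"ACGTAC-T"`, `"ACGAACGT"` and `"ACGTACGT"` the column scores are `[1, 1, 1, -1, 1, 1, -1, 1]`, and the best segment spans columns 0 to 5.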
TL;DR: This method effectively copes with many of the problems inherent in making inferences about splicing and alternative splicing on the basis of EST sequences, which in addition to being fragmentary and full of sequencing errors, may also be chimeric, misoriented, or contaminated with genomic sequence.
Abstract: We present a method for high-throughput alternative splicing detection in expressed sequence data. This method effectively copes with many of the problems inherent in making inferences about splicing and alternative splicing on the basis of EST sequences, which in addition to being fragmentary and full of sequencing errors, may also be chimeric, misoriented, or contaminated with genomic sequence. Our method, which relies both on the Partial Order Alignment (POA) program for constructing multiple sequence alignments, and its Heaviest Bundling function for generating consensus sequences, accounts for the real complexity of expressed sequence data by building and analyzing a single multiple sequence alignment containing all of the expressed sequences in a particular cluster aligned to genomic sequence. We illustrate application of this method to human UniGene Cluster Hs.1162, which contains expressed sequences from the human HLA-DMB gene. We have used this method to generate databases, published elsewhere, of splices and alternative splicing relationships for the human, mouse and rat genomes. We present statistics from these calculations, as well as the CPU time for running our method on expressed sequence clusters of varying size, to verify that it truly scales to complete genomes.
TL;DR: A new and essentially simple method to reconstruct prokaryotic phylogenetic trees from their complete genome data without using sequence alignment is proposed, based on the appearance frequency of oligopeptides of a fixed length in their proteomes.
Abstract: A new and essentially simple method to reconstruct prokaryotic phylogenetic trees from their complete genome data without using sequence alignment is proposed. It is based on the appearance frequency of oligopeptides of a fixed length (up to K=6) in their proteomes. This is a method without fine adjustment and choice of genes. It can incorporate the effect of lateral gene transfer to some extent and leads to results comparable with the bacteriologists' systematics as reflected in the latest 2001 edition of Bergey's Manual of Systematic Bacteriology. A key point in our approach is subtraction of a random background by using a Markovian model of order K-1 from the composition vectors to highlight the shaping role of natural selection.
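One way to picture the background subtraction is sketched below: the frequency of each K-string is compared with a prediction made from the frequencies of shorter strings, and the relative difference becomes one component of the composition vector. The specific estimator used here (predicting a K-string from its two overlapping (K-1)-strings and the shared (K-2)-string) is an assumption for illustration and is not taken from the abstract. The function names are hypothetical, and K must be at least 3.

```python
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequency of every k-string observed in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {w: c / total for w, c in counts.items()}


def composition_vector(seq, k):
    """Relative deviation of observed k-string frequencies from a
    Markov-style background prediction (assumed estimator)."""
    p_k = kmer_freqs(seq, k)
    p_k1 = kmer_freqs(seq, k - 1)
    p_k2 = kmer_freqs(seq, k - 2)
    vec = {}
    for w, p in p_k.items():
        # Background: p(w[:-1]) * p(w[1:]) / p(w[1:-1])
        p0 = p_k1[w[:-1]] * p_k1[w[1:]] / p_k2[w[1:-1]]
        vec[w] = (p - p0) / p0
    return vec
```

This sketch keeps only K-strings that actually occur in the sequence. Strings predicted by the background but never observed are omitted for brevity.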
TL;DR: Efficient extension of computational tools to multiple species will require knowledge of the ideal evolutionary distance to choose and the development of new algorithms for alignment, analysis of conservation, and visualization of results.
Abstract: Multi-species comparisons of DNA sequences are more powerful for discovering functional sequences than pairwise DNA sequence comparisons. Most current computational tools have been designed for pairwise comparisons, and efficient extension of these tools to multiple species will require knowledge of the ideal evolutionary distance to choose and the development of new algorithms for alignment, analysis of conservation, and visualization of results.
TL;DR: The proposed visualization tool for deoxyribonucleic acid (DNA) sequences by the use of three-dimensional (3-D) trajectories (TDT) can easily discriminate the differences and similarities among various DNA sequences.
Abstract: In this paper, we propose a visualization tool for deoxyribonucleic acid (DNA) sequences by the use of three-dimensional (3-D) trajectories (TDT). In the proposed method, the four nucleotides are assigned to four corresponding positions in the 3-D space, which are equally spaced and whose centroid is the origin. By accumulating the displacements between consecutive positions, a DNA sequence can be represented by a trajectory in the 3-D space. A global view of the DNA sequence can thus be obtained no matter how large the sequence is. The alignment of two DNA sequences can be determined by applying a correlation operation to the trajectories. In our simulation results, the TDTs of DNA sequences with different functions vary considerably and are thus easy to distinguish. On the other hand, there exist some similarities between the trajectories for the same type of DNA sequences obtained from different kinds of creatures. Therefore, in addition to its low computational complexity, the proposed visualization tool can easily discriminate the differences and similarities among various DNA sequences.
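A minimal sketch of such a trajectory is given below. Four equally spaced points whose centroid is the origin are the vertices of a regular tetrahedron, and the trajectory is taken here to be the running sum of the vectors assigned to successive nucleotides. Which nucleotide goes to which vertex, and reading "accumulated distance" as a cumulative vector sum, are assumptions made for illustration.

```python
# Vertices of a regular tetrahedron centred on the origin (assumed assignment).
VERTICES = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def trajectory(seq):
    """List of 3-D points visited when walking through seq, starting at the origin."""
    x = y = z = 0
    points = [(0, 0, 0)]
    for base in seq:
        dx, dy, dz = VERTICES[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points
```

With this encoding a sequence with balanced base composition ends near the origin, while a compositional bias shows up as a steady drift in one direction.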
TL;DR: A new algorithm for multiple sequence alignment is introduced that integrates the global divide-and-conquer approach with the local segment-based approach, thereby combining the strengths of those two strategies.
Abstract: A large number of methods for multiple sequence alignment are currently available. Recent benchmarking tests demonstrated that strengths and drawbacks of these methods differ substantially. Global strategies can be outperformed by approaches based on local similarities and vice versa, depending on the characteristics of the input sequences. In recent years, mixed approaches that include both global and local features have shown promising results. Herein, we introduce a new algorithm for multiple sequence alignment that integrates the global divide-and-conquer approach with the local segment-based approach, thereby combining the strengths of those two strategies.
TL;DR: Previous efforts using evolutionary algorithms (EAs) for MSA were extended and three new alignment operators were introduced and tested within the framework of protein sequence alignment, showing the degree to which EAs can enhance the results of Clustal X.
Abstract: Multiple sequence alignment (MSA) is a central problem in bioinformatics. In this study, we extended previous efforts using evolutionary algorithms (EAs) for MSA. Candidate solutions in the initial population were derived from the well-known alignment program Clustal X. Evolutionary computation was then used to evolve increasingly appropriate solutions. Three new alignment operators were introduced and tested within the framework of protein sequence alignment. Statistics on alignment quality were generated with respect to selected alignment benchmarks from the BAliBASE database using the BLOSUM 62 substitution matrix. Our results indicate the degree to which EAs can enhance the results of Clustal X. Moreover, the experimental results show that the commonly used sum-of-pairs scoring scheme sometimes fails to correlate higher scoring alignments with increase in alignment quality in terms of the BAliBASE sum-of-pairs score.
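The sum-of-pairs objective mentioned above can be sketched as follows: every column is scored by summing a pairwise score over all pairs of rows, and the column scores are added up. The study uses the BLOSUM 62 matrix; to stay self-contained this sketch substitutes a simple match/mismatch/gap scheme, so the function name and all score values are illustrative assumptions.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length rows."""
    total = 0
    for col in zip(*alignment):
        for x, y in combinations(col, 2):
            if x == "-" and y == "-":
                continue                 # gap-gap pairs are ignored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total
```

An evolutionary algorithm would use a function like this as its fitness measure. The abstract's caveat is that a higher value of this objective does not always correspond to better agreement with the reference alignments.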
TL;DR: The results indicate that the novel alignment strategy could be helpful for extending the application of highly reliable methods for fold identification and homology modeling to a huge number of homologous proteins of low sequence similarity.
TL;DR: This work is concerned with efficient methods for practical biomolecular sequence comparison, focusing on global and local alignment algorithms and analyses the classical approaches of Needleman & Wunsch and Smith & Waterman as well as efficient alternatives; in particular, the algorithms recently designed by Crochemore, Landau and Ziv-Ukelson that use compression techniques to achieve sub-quadratic time complexity.
Abstract: No one deserves my gratitude more than my parents, for all the love, support, strength and example of life that they have always offered me. Preface: The discovery of the DNA structure in 1953 has dramatically changed how biology is studied. It has opened a new frontier in the development of this exciting science. Biologists are working today to "decipher" the DNA of every form of life on earth, producing an extraordinary amount of data that needs to be analysed. No doubt this is why they are appealing to computer scientists and the expertise developed in the last decades on information storage, retrieval and analysis. This merging of biology and computer science has created a new interdisciplinary field known as computational biology that explores the capacities of computers to gain knowledge from biological data. In fact, researchers can learn a great deal about a biomolecular sequence by comparing it to already well-studied sequences. For this reason, sequence comparison is regarded as one of the most fundamental problems of computational biology, which is usually solved with a technique known as sequence alignment. This work is concerned with efficient methods for practical biomolecular sequence comparison, focusing on global and local alignment algorithms. It analyses the classical approaches of Needleman & Wunsch and Smith & Waterman as well as efficient alternatives; in particular, the algorithms recently designed by Crochemore, Landau and Ziv-Ukelson that use compression techniques to achieve sub-quadratic time complexity. Chapter 1 presents a brief introduction to the field of computational biology and the sequence comparison problem. Chapter 2 discusses how two sequences can be compared by finding the best alignment between them, and describes standard and alternative algorithms to compute an optimal alignment.
Chapters 3, 4 and 5 are devoted to the design, implementation and evaluation of a library of computational biology algorithms developed as part of this work with the aim of studying the alignment algorithms described in Chapter 2. Acknowledgments: I would like to thank my supervisor, Professor Maxime Crochemore, for his guidance throughout the development of this project.
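As a point of reference for the classical quadratic-time approach named in this preface, here is a minimal sketch of Smith & Waterman local alignment scoring. It shows the baseline only, not the sub-quadratic compression-based algorithms, and it is not code from the library described in the thesis. The function name and the scoring parameters are assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets a local alignment start afresh at any cell.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

The only differences from the global (Needleman & Wunsch) recurrence are the floor at zero and taking the maximum over all cells rather than the final cell.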
TL;DR: This chapter focuses on the use of existing bioinformatics approaches in the analysis of global-protein expression using two-dimensional electrophoresis and mass spectrometry and on the extension of bioinformatics to include the analysis and management of the proteome data being produced.
Abstract: Publisher Summary The chapter focuses on the use of existing bioinformatics approaches in the analysis of global-protein expression using two-dimensional electrophoresis and mass spectrometry and on the extension of bioinformatics to include the analysis and management of the proteome data being produced. Bioinformatics will, of necessity, come to include the acquisition, analysis, and management of such protein-expression data, as well as the integration of those data with genome and protein sequence databases. The chapter discusses several bioinformatics tools, such as (1) sequence databases, (2) sequence analysis, and (3) annotation and proteomics tools—such as open reading frame (ORF) databases and proteomics, proteome (2DE) databases, and database integration. In the context of protein expression, the ORF sequences are used to select specific nucleotide sequences to include on the complementary DNA (cDNA) microarray chips used to quantify changes in specific messenger RNA (mRNA) abundance. The ORF sequences are also used as the starting material for protein arrays, providing the nucleic acid sequence that is incorporated into host cells to produce overexpression of the target proteins for the arrays. A comparative analysis of the metabolic processes of whole cells will be facilitated through the comparison of protein expression and genome sequence data from diverse cell types and the dream of computational cell modeling. Fully, and even partially, annotated genome databases have multiple uses in the broad context of proteomics, including protein structure and function prediction, as well as protein-expression analysis.
TL;DR: The implication is that high error levels need not be a barrier to the adoption of sequencing technologies that are in other respects promising, because most errors can be detected and corrected using a small number of reads.
Abstract: This paper considers the problem of inferring an original sequence from a number of erroneous copies. The problem arises in DNA sequencing, particularly in the context of emerging technologies that provide high throughput or other advantages, but at the cost of introducing many errors. We develop a Bayesian probabilistic model of the introduction of errors, and search for a sequence that has maximum posterior probability with respect to the model. We present results of extensive tests in which error-prone sequencing of real DNA was simulated. The results obtained using the new approach are compared to results obtained by deriving a consensus sequence from a multiple sequence alignment. We find that a significant improvement in accuracy is obtained using the new approach. The implication is that high error levels need not be a barrier to the adoption of sequencing technologies that are in other respects promising, because most errors can be detected and corrected using a small number of reads.
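As a rough illustration of the maximum-posterior idea described in this abstract, the Python sketch below infers a consensus from several erroneous copies. It makes strong simplifying assumptions that are not taken from the paper: all copies have the same length, errors are substitutions only, each base is miscalled independently with a fixed `error_rate`, and the prior over bases is uniform. The function name `map_consensus` and its parameters are hypothetical.

```python
from math import prod

ALPHABET = "ACGT"

def map_consensus(reads, error_rate=0.1):
    """Return (consensus, confidences) under a substitution-only error model.

    Assumptions (illustrative only): equal-length reads, independent
    substitution errors at a fixed rate, uniform prior over the four bases.
    """
    if not reads or len({len(r) for r in reads}) != 1:
        raise ValueError("reads must be non-empty and of equal length")

    consensus, confidences = [], []
    for column in zip(*reads):
        # Likelihood of the observed column if the true base were b:
        # a correct call has probability 1 - e, each specific miscall e / 3.
        likelihood = {
            b: prod((1 - error_rate) if c == b else error_rate / 3 for c in column)
            for b in ALPHABET
        }
        total = sum(likelihood.values())
        best = max(ALPHABET, key=lambda b: likelihood[b])
        consensus.append(best)
        # With a uniform prior, the posterior is the normalised likelihood.
        confidences.append(likelihood[best] / total)
    return "".join(consensus), confidences

sequence, confidence = map_consensus(["ACGT", "ACGA", "ACGT"])
```

Under these assumptions the most probable base in each column is simply the most frequent one; the posterior additionally indicates how strongly the copies support it. Handling insertions and deletions, as a realistic sequencing error model would require, is beyond this sketch.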
TL;DR: A new transitive alignment algorithm (MaxFlow) is developed, which generates accurate alignments between proteins deep in the twilight zone of sequence similarity, below 20% sequence identity, and proposes novel strategies for target prioritization using MaxFlow scores to predict the optimal templates in a superfamily.
Abstract: Structural genomics is the idea of covering protein space so that every protein sequence comes within model building distance of a protein of known structure. Unfortunately, reproducing the structural alignment of distantly related proteins is a difficult challenge to existing sequence alignment and motif search software. We have developed a new transitive alignment algorithm (MaxFlow), which generates accurate alignments between proteins deep in the twilight zone of sequence similarity, below 20% sequence identity. In particular, MaxFlow reliably identifies conserved core motifs between proteins which are only indirect PSI-Blast neighbours. Based on MaxFlow alignments, useful 3D models can be generated for all members of a superfamily from as few as a single structural template – despite hundreds of representatives at 40% sequence identity level and patchy detection of homology by PSI-Blast. We propose novel strategies for target prioritization using MaxFlow scores to predict the optimal templates in a superfamily. Our results support an increase in the granularity of covering protein space that has potentially enormous economic implications for planning the transition to the full production phase of structural genomics.
TL;DR: This paper considers the problem of inferring an original sequence from a number of erroneous copies of DNA sequences, and describes and compares two approaches that have recently been developed by the authors, finding that the Bayesian approach infers the original sequence more accurately, while the Steiner approach is preferable for constructing multiple sequence alignments because it is faster.
Abstract: This paper considers the problem of inferring an original sequence from a number of erroneous copies. The problem arises in DNA sequencing, particularly in the context of emerging technologies that provide high throughput or other advantages at the cost of an increased number of errors. We describe and compare two approaches that have recently been developed by the authors. The first approach searches for a sequence known as a Steiner string; the second searches for the most probable original sequence with respect to a simple Bayesian model of sequencing errors. We present the results of extensive tests in which erroneous copies of real DNA sequences were simulated and the algorithms were used to infer the original sequences. The results are used to compare the two approaches to each other and to a third, more conventional, approach based on multiple sequence alignment. We find that the Bayesian approach is superior to the Steiner approach, which in turn is superior to the alignment approach. The two new algorithms can also be used to construct multiple sequence alignments. We show that the two methods produce alignments of approximately equal quality, and conclude that the Steiner approach is better for this purpose because it is faster. Both methods produce better alignments than a well-known multiple sequence alignment package, for the cases tested.
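To make the Steiner-string idea concrete: a Steiner string is commonly defined as a string that minimises the total edit distance to all of the given copies. The Python sketch below is not the authors' algorithm. It restricts the search to the input copies themselves (a "set median"), which is a simplification chosen here to keep the example short. The names `edit_distance` and `set_median` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]

def set_median(copies):
    """Return the input copy with the smallest total edit distance to all copies.

    This only approximates a Steiner string, which may be a string that
    does not occur among the copies at all.
    """
    if not copies:
        raise ValueError("need at least one copy")
    return min(copies, key=lambda s: sum(edit_distance(s, t) for t in copies))

best = set_median(["ACGT", "ACGA", "ACGT", "AGT"])
```

In this small example the copy `ACGT` has total distance 2 to the others, against 4 for `ACGA` and `AGT`, so it is selected.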
TL;DR: An encoding scheme that evolves the consensus sequence for multiple sequence alignment (MSA) with a genetic algorithm (GA) such that the number of generations needed to find the optimal solution is approximately the same regardless of the number of sequences.
Abstract: In this paper we present an approach that evolves the consensus sequence [25] for multiple sequence alignment (MSA) with a genetic algorithm (GA). We have developed an encoding scheme such that the number of generations needed to find the optimal solution is approximately the same regardless of the number of sequences. Instead, it depends only on the length of the template and the similarity between sequences. The objective function gives a sum-of-pairs (SP) score as the fitness value. We conducted some preliminary studies and compared our approach with the commonly used heuristic alignment program Clustal W. The results show that the GA can indeed scale and perform well.
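The sum-of-pairs objective mentioned in this abstract can be sketched in a few lines of Python. The scoring values used here (match +1, mismatch -1, residue against gap -2, gap against gap 0) are illustrative assumptions, not the scheme used in the paper, and the function names are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment):
    """Sum pair_score over every pair of rows in every column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(
        pair_score(a, b)
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )

score = sum_of_pairs(["ACG-", "ACGT", "A-GT"])
```

A genetic algorithm of the kind described would evaluate such a score for each candidate alignment and use it as the fitness value; a weighted variant would multiply each row pair's contribution by a pair-specific weight.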
TL;DR: Local alignment information is added to the weighted sum-of-pairs objective function to achieve better alignment from the biological viewpoint, and the PHGA is extended to run in parallel on a cluster of machines, rather than on a single multi-processor machine, to speed it up.
Abstract: In previous work, we proposed a parallel hybrid genetic algorithm (PHGA) that can find high-quality solutions, from the mathematical viewpoint, for multiple protein sequence alignment. We present new improvements to the PHGA. Local alignment information is added to the weighted sum-of-pairs objective function to achieve better alignment from the biological viewpoint. We also extend our method to run in parallel on a cluster of machines, rather than on a single multi-processor machine, to speed it up.
TL;DR: The results of a machine learning approach to detect regions of poor alignment automatically are presented, and the results obtained from Naive Bayes, C4.5 decision tree and support vector machine (SVM) approaches are compared.
Abstract: Phylogenetic analysis requires alignment of gene or protein sequences. Some regions of genes evolve fast and suffer numerous insertion and deletion events and cannot be aligned reliably with automatic alignment algorithms. Such regions of intrinsically uncertain alignment are currently detected and deleted manually before performing phylogenetic analysis. We present the results of a machine learning approach to detect regions of poor alignment automatically. We compare the results obtained from Naive Bayes (NB), C4.5 decision tree (C4.5) and support vector machine (SVM) approaches.