TL;DR: The automated use of sequence information in both single-stranded and helical regions yields better sensitivity/specificity ratios than descriptor-based programs, and iterative searches can be conducted to enrich collections of homologous RNAs.
TL;DR: Normalized Local Alignment (NLA), as described in this paper, is based on fractional programming; its running time is O(n^2 log n), and in practice it is only 3-5 times slower than the standard Smith-Waterman algorithm.
Abstract: The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find local alignment with maximum degree of similarity (e.g., maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% of similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction as recently pointed out by Zhang et al. (Bioinformatics, 15, 1012-1019, 1999). In this paper we propose a new sequence comparison algorithm (normalized local alignment) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n^2 log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.
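For readers unfamiliar with the baseline that the abstract above criticizes, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration; they are not taken from the paper, and this is the standard algorithm, not the normalized variant.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    Illustrative scoring scheme only: linear gap cost, no substitution matrix.
    """
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment here
                H[i - 1][j - 1] + sub,  # extend with a match or mismatch
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 4
```

Because only the total score is maximized, a long alignment that bridges two strong regions through a weak one can outscore either strong region alone, which is the "mosaic" effect the abstract describes.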
TL;DR: This review intends to provide a guide to choosing the most efficient way to analyze a new sequence or to collect information on a gene or protein of interest by applying current publicly available databases and Web services.
Abstract: The development of efficient DNA sequencing methods has led to the achievement of the DNA sequence of entire genomes from (to date) 55 prokaryotes, 5 eukaryotic organisms and 10 eukaryotic chromosomes. Thus, an enormous amount of DNA sequence data is available and even more will be forthcoming in the near future. Analysis of this overwhelming amount of data requires bioinformatic tools in order to identify genes that encode functional proteins or RNA. This is an important task, considering that even in the well-studied Escherichia coli more than 30% of the identified open reading frames are hypothetical genes. Future challenges of genome sequence analysis will include the understanding of gene regulation and metabolic pathway reconstruction including DNA chip technology, which holds tremendous potential for biomedicine and the biotechnological production of valuable compounds. The overwhelming volume of information often confuses scientists. This review intends to provide a guide to choosing the most efficient way to analyze a new sequence or to collect information on a gene or protein of interest by applying current publicly available databases and Web services. Recently developed tools that allow functional assignment of genes, mainly based on sequence similarity of the deduced amino acid sequence, using the currently available and increasing biological databases will be discussed.
TL;DR: An algorithm designed to carry out multiple structure alignment and to detect recurring substructural motifs that is applicable to comparisons of RNA structures and to detection of a pharmacophore in a series of drug molecules is presented.
Abstract: Here we present an algorithm designed to carry out multiple structure alignment and to detect recurring substructural motifs. So far we have implemented it for comparison of protein structures. How...
TL;DR: Results with CASP4 targets show that, along with the correctness of sequence‐structure alignments, effective use of multiple template structures may significantly increase accuracy of the model structure.
TL;DR: A new sequence comparison algorithm (normalized local alignment) is proposed that reports the regions with maximum degree of similarity; it is based on fractional programming and in practice is only 3-5 times slower than the standard Smith-Waterman algorithm.
Abstract: The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find local alignment with maximum degree of similarity (e.g., maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% of similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction as recently pointed out by Zhang et al., 1999 [33]. In this paper we propose a new sequence comparison algorithm (normalized local alignment) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n^2 log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.
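As a reading aid, the fractional-programming idea can be written out as below. The notation is an assumption introduced here, since the abstract gives no formulas: S(I, J) denotes the ordinary local alignment score of substrings I and J, |I| and |J| their lengths, and L a positive constant that keeps very short fragments from dominating.

```latex
% Normalized objective (notation assumed for illustration):
\[
  \max_{I,\,J} \; \frac{S(I,J)}{|I| + |J| + L}
\]
% Standard fractional-programming reduction: for a fixed \lambda, solve the
% ordinary (non-fractional) problem
\[
  F(\lambda) \;=\; \max_{I,\,J} \; \bigl[\, S(I,J) - \lambda \,(|I| + |J| + L) \,\bigr],
\]
% and search for the value \lambda^{*} with F(\lambda^{*}) = 0, which equals
% the optimal normalized score.
```

Each evaluation of F(λ) is a Smith-Waterman-like computation with length-adjusted scores, which is consistent with the abstract's remark that the method costs only a small multiple of a standard Smith-Waterman run.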
TL;DR: The SFINX package is described, which allows many different sets of segmental or continuous‐curve sequence feature data, generated by individual external programs, to be viewed in combination alongside a sequence dot‐plot or a multiple alignment of database matches.
TL;DR: This newest version of Omiga™ allows for sequencing and polymerase chain reaction (PCR) primer prediction, a functionality missing in earlier versions, as well as rapid searches for putative coding regions and Basic Local Alignment Search Tool (BLAST) queries against public databases at the National Center for Biotechnology Information (NCBI).
Abstract: Computer-based sequence analysis, notation, and manipulation are a necessity for all molecular biologists working with any but the most simple DNA sequences. As sequence data become increasingly available, tools that can be used to manipulate and annotate individual sequences and sequence elements will become an even more vital implement in the molecular biologist's arsenal. The Omiga DNA and Protein Sequence Analysis Software tool, version 2.0 provides an effective and comprehensive tool for the analysis of both nucleic acid and protein sequences that runs on a standard PC available in every molecular biology laboratory. Omiga allows the import of sequences in several common formats. Upon importing sequences and assigning them to various projects, Omiga allows the user to produce, analyze, and edit sequence alignments. Sequences may also be queried for the presence of restriction sites, sequence motifs, and other sequence features, all of which can be added into the notations accompanying each sequence. This newest version of Omiga also allows for sequencing and polymerase chain reaction (PCR) primer prediction, a functionality missing in earlier versions. Finally, Omiga allows rapid searches for putative coding regions, and Basic Local Alignment Search Tool (BLAST) queries against public databases at the National Center for Biotechnology Information (NCBI).
TL;DR: An alternative approach to cluster related proteins without the need for an a priori threshold is described, through its use of dynamic programming, which is guaranteed to produce globally optimal solutions at all levels of partition granularity.
Abstract: Hierarchical classification is probably the most popular approach to group related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree showing a nested sequence of groups may not be the most suitable representation of the data. Another is that visual inspection is the most common method to decide the most appropriate number of subsets from a tree. In fact, classification of proteins in general is bedevilled with the need for subjective thresholds to define group membership (e.g., 'significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach to cluster related proteins without the need for an a priori threshold: one, through its use of dynamic programming, which is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a 'core-periphery' structure within families. The 'core' of a protein family comprises the most typical sequences while the 'periphery' consists of the atypical ones. Further, a new sequence weighting scheme that combines the information in all the multiply aligned positions of an alignment in a novel way is put forward. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. 
Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.
TL;DR: This paper describes the underlying mathematical model and the dynamic programming technique for validating a DNA sequence against a DNA map, an ordered restriction map that is obtained through an optical mapping process and augmented with statistical information used to place (or not) the sequence in the genome.
Abstract: This paper describes the underlying mathematical model and the dynamic programming technique for the validation of a (DNA) sequence against a (DNA) map. The sequence can be obtained from a variety of sources (e.g., GenBank, Sanger's Lab, or Celera P.E.) and it is assumed to be written out as a string of nucleotides. The map is an ordered restriction map obtained through an optical mapping process and is augmented with statistical information which will be used to place (or not) the sequence in the genome. Our approach has many other applications beyond validation: e.g., map-based sequence assembly, phasing sequence contigs, detecting and closing gaps and annotation of partially sequenced genomes to find open reading frames, genes and synteny groups. We tested our system by checking various maps against publicly available sequence data for Plasmodium falciparum.
TL;DR: By exploiting the rapid increase in available sequence data, the definition of medically relevant protein targets has been improved by a combination of differential genome analysis and analysis of individual proteins.
TL;DR: This paper presents a genetic algorithm for multiple sequence alignment; tests on several data sets, compared against other methods, indicate that the approach performs well on data sets with high similarity and long sequences.
Abstract: Multiple sequence alignment is an important tool in molecular sequence analysis. This paper presents genetic algorithms to solve multiple sequence alignments. Several data sets are tested and the experimental results are compared with other methods. We find our approach could obtain good performance in the data sets with high similarity and long sequences. The software can be found at http://rsdb.csie.ncu.edu.tw/tools/msa.htm.
TL;DR: This work outlines the approach to identifying the protein kinases of C. elegans from the genomic sequence, and describes new tools it has developed for analysis, management and visualization of genomic data.
Abstract: With the availability of the nearly complete genomic sequence of C. elegans, the first multicellular organism to be sequenced, molecular biology has definitely entered the postgenomic era. Annotation of the genomic sequence, which refers to identifying the genes and other biologically relevant sections of the genome, is an important and nontrivial next step. A first-pass annotation will be necessarily incomplete but will drive further biological experiments, which in turn will help to annotate the genome better. Given the scale of the genome sequence analysis, it is clear that the annotation should be automated as much as possible without sacrificing the quality of analysis. In this work, we outline our approach to identifying the protein kinases of C. elegans from the genomic sequence. We describe new tools we have developed for analysis, management and visualization of genomic data. By developing modular and scalable solutions, this study has provided a framework for future analysis of the Drosophila and human genomes.
TL;DR: The focus of this thesis is on algorithms for the optimal alignment of two or three sequences of biological data, particularly DNA sequences, with particular emphasis on space and time complexity.
Abstract: Sequence alignment is an important tool for describing relationships between sequences. Many sequence alignment algorithms exist, differing in efficiency, and in their models of the sequences and of the relationship between sequences. The focus of this thesis is on algorithms for the optimal alignment of two or three sequences of biological data, particularly DNA sequences. The algorithms are discussed with particular emphasis on space and time complexity. A divide-and-conquer method is presented for use with a number of different alignment algorithms. This method may be used to reduce the space complexity of an alignment algorithm with little or no effect on the time complexity. The advantages of this divide-and-conquer method include its simplicity and the ease with which it can be applied to many different alignment algorithms. These advantages are demonstrated by using the divide-and-conquer method in conjunction with several known alignment algorithms. An efficient alignment algorithm is presented for the important problem of optimally aligning three sequences using a linear function for costing gaps in the alignment. For sequences of length n, and a minimum edit cost of d, this new algorithm has a time complexity of O(d + n). The algorithm is further developed by using the aforementioned divide-and-conquer method to improve its space complexity. This combination results in a time and space efficient algorithm, while also illustrating the usefulness of the divide-and-conquer method. It is important when aligning sequences to correctly account for any non-randomness that is significant in the sequences. For example, if certain statistical patterns appear throughout sequences from a certain family, it is important to make use of this information when aligning sequences from this family. Common, unsurprising, patterns provide less evidence for the relatedness of sequences than more surprising regions provide. 
A new algorithm is presented to align optimally two non-random sequences. For a particular sequence model, this new algorithm apportions weight to every part of the alignment dependent on the importance of that part as determined by the sequence model. This algorithm is then developed further so that it can be used to infer whether two non-random sequences are related.
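The space-saving divide-and-conquer idea mentioned in the abstract above can be illustrated with a Hirschberg-style sketch for two sequences. This is not the thesis's algorithm (which also covers three sequences); the function names and the scoring values (match +1, mismatch -1, gap -1) are assumptions made for the example.

```python
def score_row(a, b, match=1, mismatch=-1, gap=-1):
    """Last row of the global-alignment table, keeping only two rows in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def align_one(ch, b, match=1, mismatch=-1, gap=-1):
    """Base case: optimally align a single character against a non-empty b."""
    best_k, best = None, gap * (len(b) + 1)  # option: ch is not paired at all
    for k, c in enumerate(b):
        s = (match if ch == c else mismatch) + gap * (len(b) - 1)
        if s > best:
            best_k, best = k, s
    if best_k is None:
        return ch + "-" * len(b), "-" + b
    return "-" * best_k + ch + "-" * (len(b) - best_k - 1), b


def hirschberg(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment of a and b using linear working space."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        return align_one(a, b, match, mismatch, gap)
    mid = len(a) // 2
    # Scores of a[:mid] against every prefix of b, and of a[mid:] against
    # every suffix of b (computed on the reversed strings).
    left = score_row(a[:mid], b, match, mismatch, gap)
    right = score_row(a[mid:][::-1], b[::-1], match, mismatch, gap)
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    la, lb = hirschberg(a[:mid], b[:split], match, mismatch, gap)
    ra, rb = hirschberg(a[mid:], b[split:], match, mismatch, gap)
    return la + ra, lb + rb


top, bottom = hirschberg("AGTACGCA", "TATGC")
print(top)
print(bottom)
```

The recursion roughly doubles the scoring work compared with a single full-table pass, but it never stores more than two rows of the table, which is the trade-off the abstract summarizes as little or no effect on time complexity.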
TL;DR: The simultaneous alignment of three or more nucleotide or amino acid sequences is among the most important tools for analyzing biological sequences and an essential pre-requisite to phylogenetic reconstruction.
Abstract: The simultaneous alignment of three or more nucleotide or amino acid sequences is among the most important tools for analyzing biological sequences. Multiple alignments are used to find characteristic motifs and conserved regions in protein families; to help demonstrate homology between new sequences and existing families; to improve the prediction of secondary and tertiary structure of new sequences; and as an essential pre-requisite to phylogenetic reconstruction. The fact that the multiple sequence alignment problem is of high complexity has led to the development of different algorithms. These algorithms fall into two categories, namely the greedy ones that rely on pairwise alignment and those that attempt to align all the sequences simultaneously.
TL;DR: In this article, DNA databases, homology search tools and sequence alignment methods are surveyed and the concept of distance between genes and how to calculate this measure using DNA or amino acid sequences and several commonly used techniques for phylogenetic analysis and tree evaluation are described.
Abstract: Recent advances in deoxyribonucleic acid (DNA) sequencing technology have produced a massive amount of nucleotide sequences, which are stored in DNA databanks and genomic data repositories. Furthermore, comprehensive analyses of transcriptional and genomic elements have uncovered an elaborate system of gene expression that broadens our understanding of fundamental biological phenomena. The analysis of DNA data has therefore become essential to predict gene function or detect regulatory motifs through comparative studies. In this article, DNA databases, homology search tools and sequence alignment methods are surveyed. The concept of distance between genes, how to calculate this measure using DNA or amino acid sequences, and several commonly used techniques for phylogenetic analysis and tree evaluation are also described.
Key concepts
Advances in DNA sequencing technology have produced an unprecedented amount of sequence data.
The DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) are the three major sequence data repositories. They exchange data periodically, and maintain various services for data search and retrieval.
Similarity searching, alignment of sequences, prediction of function and reconstruction of the evolutionary history (phylogenetic tree) of a group of species are among the most commonly used techniques for sequence analysis.
BLAST (similarity searching), ClustalW (sequence alignment), Pfam (protein domains) and TRANSFAC (transcription factors) are popular tools and resources.
The genetic distance, a measure of evolutionary similarity, is usually calculated as the number of nucleotide or amino acid differences (substitutions) among sequences. Nucleotide substitutions are synonymous (not affecting the codified amino acid) or nonsynonymous (triggering an amino acid change).
Distance- and character-based methods can be used to reconstruct phylogenetic trees. Distance-based methods reconstruct the tree from an estimation of the evolutionary distance among taxa. Character-based methods derive the phylogeny directly from the observable state of characters in the taxa.
The bootstrap method is commonly used to determine the quality of an inferred phylogeny.
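The genetic-distance concept listed above can be made concrete with a short sketch. It computes the proportion of differing sites between two pre-aligned, equal-length sequences (the p-distance) and applies the Jukes-Cantor correction for nucleotides; the choice of Jukes-Cantor and the function names are assumptions for illustration, as the article summary names no particular model.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    differences = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return differences / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a nucleotide p-distance."""
    if not 0 <= p < 0.75:
        raise ValueError("correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGT", "ACGA")
print(p)                 # 0.25
print(jukes_cantor(p))   # slightly larger than p
```

The corrected value exceeds the raw proportion because it accounts for multiple substitutions at the same site; a matrix of such pairwise distances is the input to the distance-based tree methods mentioned above.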
Keywords:
DNA databank;
genome projects;
similarity search;
evolutionary distance;
molecular phylogeny
TL;DR: A database search method based on phylogenetic trees, treesearch, is introduced; transferring the idea to the probabilistic context of phylogeny yields a generalization of established probabilistic methods such as pairwise sequence alignment, multiple sequence alignment, and profile searches.
Abstract: Database searching and phylogenetic tree reconstruction are two major fields of computational sequence analysis. This thesis introduces a combination of both: a database search method that is based on phylogenetic trees - treesearch. A given protein family is described by its multiple alignment and its phylogenetic tree. A database sequence that is tested for membership in the family is tentatively inserted into that tree. The result of this operation determines how well the sequence fits into the family. The idea is realized in the distance based context of phylogeny. To assess the performance of the method in terms of sensitivity and selectivity, it is compared to profiles (ISREC pfsearch), two implementations of hidden Markov models (HMMER hmmsearch and SAM hmmscore), and to the family pairwise search (FPS) method. The comparison is based on a novel evaluation tool, which was also developed during this work. All methods are presented in a new unified functional framework of database searching. The analysis is complemented by extensive simulations. The treesearch idea is also transferred to the probabilistic context of phylogeny, which results in a generalization of established probabilistic methods such as pairwise sequence alignment, multiple sequence alignment, and profile searches.
TL;DR: A comparison of sequence and 3D structure-based phylogenies for all the families suggests that only a few families have a radical difference in the two kinds of dendrograms.
Abstract: The database PALI (Phylogeny and ALIgnment of homologous protein structures) consists of families of protein domains of known three-dimensional (3D) structure. In a PALI family, every member has been structurally aligned with every other member (pairwise) and also simultaneous superposition (multiple) of all the members has been performed. The database also contains 3D structure-based and structure-dependent sequence similarity-based phylogenetic dendrograms for all the families. The PALI release used in the present analysis comprises 225 families derived largely from the HOMSTRAD and SCOP databases. The quality of the multiple rigid-body structural alignments in PALI was compared with that obtained from COMPARER, which encodes a procedure based on properties and relationships. The alignments from the two procedures agreed very well and variations are seen only in the low sequence similarity cases often in the loop regions. A validation of Direct Pairwise Alignment (DPA) between two proteins is provided by comparing it with Pairwise alignment extracted from Multiple Alignment of all the members in the family (PMA). In general, DPA and PMA are found to vary rarely. The ready availability of pairwise alignments allows the analysis of variations in structural distances as a function of sequence similarities and number of topologically equivalent $C\alpha$ atoms. The structural distance metric used in the analysis combines root mean square deviation (r.m.s.d.) and number of equivalences, and is shown to vary similarly to r.m.s.d. The correlation between sequence similarity and structural similarity is poor in pairs with low sequence similarities. A comparison of sequence and 3D structure-based phylogenies for all the families suggests that only a few families have a radical difference in the two kinds of dendrograms. 
The difference could occur when the sequence similarity among the homologues is low or when the structures are subjected to evolutionary pressure for the retention of function. The PALI database is expected to be useful in furthering our understanding of the relationship between sequences and structures of homologous proteins and their evolution.
TL;DR: The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation.
Abstract: MOTIVATION: The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. The calculation of a Z-score for the comparison gives a length and composition corrected measure of the similarity between the sequences. However, the Z-score alone does not indicate the likely biological significance of the similarity. In this paper, all pairs of domains from 250 sequences belonging to different SCOP folds were aligned and Z-scores calculated. The distribution of Z-scores was fitted with a peak distribution from which the probability of obtaining a given Z-score from the global alignment of two protein sequences of unrelated fold was calculated. A similar analysis was applied to subsequence pairs found by the Smith-Waterman algorithm. These analyses allow the probability that two protein sequences share the same fold to be estimated by global sequence alignment. RESULTS: The relationship between Z-score and probability varied little over the matrix/gap penalty combinations examined. However, an average shift of +4.7 was observed for Z-scores derived from global alignment of locally-aligned subsequences compared to global alignment of the full-length sequences. This shift was shown to be the result of pre-selection by local alignment, rather than any structural similarity in the subsequences. The search ability of both methods was benchmarked against the SCOP superfamily classification and showed that global alignment Z-scores generated from the entire sequence are as effective as SSEARCH at low error rates and more effective at higher error rates. However, global alignment Z-scores generated from the best locally-aligned subsequence were significantly less effective than SSEARCH. The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation. 
AVAILABILITY: Software to apply the statistics to global alignments is available from http://barton.ebi.ac.uk. CONTACT: geoff@ebi.ac.uk
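A Z-score of the kind discussed in this abstract is commonly obtained by comparing the real alignment score with scores from shuffled versions of one sequence. The sketch below assumes that shuffling procedure, a toy scoring scheme (match +1, mismatch -1, gap -2) and hypothetical function names; the paper's substitution matrices, gap penalties and fitting of the Z-score distribution are not reproduced.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def z_score(a, b, n_shuffles=100, seed=0):
    """Z-score of the real score against scores for shuffled copies of b."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    real = global_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves length and composition of b
        shuffled_scores.append(global_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        return 0.0  # no spread among shuffles: the Z-score is undefined
    return (real - mean) / sd


print(z_score("ACGTACGTAC", "ACGTACGTAC"))
```

Shuffling keeps the length and composition of the second sequence fixed, which is why the resulting value can be read as a length- and composition-corrected measure of similarity.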
TL;DR: An integrated multiple alignment system bringing together sequence data, knowledge-based systems and prediction methods with their inherent unreliability will provide an ideal workbench for the validation, propagation and presentation of this information in a format that is concise, clear and intuitive.
TL;DR: PALI (release 1.2) contains three-dimensional structure-dependent sequence alignments as well as structure-based phylogenetic trees of homologous protein domains in various families to help in analysing the relationship between sequence and structure variation at a given level of sequence similarity.
Abstract: PALI (release 1.2) contains three-dimensional (3-D) structure-dependent sequence alignments as well as structure-based phylogenetic trees of homologous protein domains in various families. The data set of homologous protein structures has been derived by consulting the SCOP database (release 1.50) and the data set comprises 604 families of homologous proteins involving 2739 protein domain structures with each family made up of at least two members. Each member in a family has been structurally aligned with every other member in the same family (pairwise alignment) and all the members in the family are also aligned using simultaneous superposition (multiple alignment). The structural alignments are performed largely automatically, with manual interventions especially in the cases of distantly related proteins, using the program STAMP (version 4.2). Every family is also associated with two dendrograms, calculated using PHYLIP (version 3.5), one based on a structural dissimilarity metric defined for every pairwise alignment and the other based on similarity of topologically equivalent residues. These dendrograms enable easy comparison of sequence and structure-based relationships among the members in a family. Structure-based alignments with the details of structural and sequence similarities, superposed coordinate sets and dendrograms can be accessed conveniently using a web interface. The database can be queried for protein pairs with sequence or structural similarities falling within a specified range. Thus PALI forms a useful resource to help in analysing the relationship between sequence and structure variation at a given level of sequence similarity. PALI also contains over 653 ‘orphans’ (single member families). 
Using the web interface involving PSI-BLAST and PHYLIP it is possible to associate the sequence of a new protein with one of the families in PALI and generate a phylogenetic tree combining the query sequence and proteins of known 3-D structure. The database with the web interfaced search and dendrogram generation tools can be accessed at http://pauling.mbu.iisc.ernet.in/~pali.
TL;DR: The MPAlign (Multiple Pairwise Alignment) web interface is a collection of Perl scripts that retrieves sequences from the Los Alamos HIV sequence database based on a number of search parameters and pairwise-aligns them to a model sequence using the Hidden Markov Model-based program HMMER.
Abstract: Motivation: The amount of HIV-1 sequence data generated (presently around 42 000 sequences, of which more than 22 000 are from the V3 region of the viral envelope) presents a challenge for anyone working on the analysis of these data. A major problem is obtaining the region of interest from the stored sequences, which often contain but are not limited to that region. In addition, multiple alignment programs generally cannot deal with the large numbers of sequences that are available for many HIV-1 regions. We set out to provide our users with a tool that will retrieve and create an initial alignment of the HIV sequences that are available for a given genomic region. Results: The MPAlign (Multiple Pairwise Alignment) web interface is a collection of Perl scripts that retrieves sequences from the Los Alamos HIV sequence database based on a number of search parameters. All sequences were pairwise-aligned to a model sequence using the Hidden Markov Model-based program HMMER. The HMMER model is general enough to accommodate virtually all HIV-1 sequences stored in the database. To create a multiple sequence alignment, gaps were inserted into the sequences during retrieval, so that they are aligned to one another. Retrieving and aligning the almost 560 gp120 sequences (∼1500 nt) stored in the database is at least 1500 times faster than a similar Clustal alignment. Availability: At http://www.hiv.lanl.gov/ Contact: Brian Gaschen at bkg@lanl.gov
TL;DR: NorMD combines the advantages of the column-scoring techniques with the sensitivity of methods incorporating residue similarity scores, and incorporates ab initio sequence information, such as the number, length and similarity of the sequences to be aligned.
TL;DR: This work extracts the probability of a gap occurrence and the resulting gap length distribution in distantly related proteins using alignments based on their common structures and observes a distribution of gaps that can be fitted with a multiexponential with four distinct components.
Abstract: Protein sequence alignment has become a widely used method in the study of newly sequenced proteins. Most sequence alignment methods use an affine gap penalty to assign scores to insertions and deletions. Although affine gap penalties represent the relative ease of extending a gap compared with initializing a gap, it is still an obvious oversimplification of the real processes that occur during sequence evolution. To improve the efficiency of sequence alignment methods and to obtain a better understanding of the process of sequence evolution, we wanted to find a more accurate model of insertions and deletions in homologous proteins. In this work, we extract the probability of a gap occurrence and the resulting gap length distribution in distantly related proteins (sequence identity < 25%) using alignments based on their common structures. We observe a distribution of gaps that can be fitted with a multiexponential with four distinct components. The results suggest new approaches to modeling insertions and deletions in sequence alignments.
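To make the contrast in this abstract concrete, the sketch below places an affine gap penalty next to a generic sum-of-exponentials gap-length model. All parameter values are placeholders chosen for illustration; the fitted amplitudes and decay constants from the paper are not given in the abstract and are not reproduced here, and the function names are hypothetical.

```python
import math


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine penalty: one opening cost plus a constant cost per extra residue."""
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def multiexponential(length, components):
    """Sum of exponential terms: sum of amplitude * exp(-length / decay).

    `components` is a list of (amplitude, decay) pairs; the abstract reports
    four components, but their values must come from the paper itself.
    """
    return sum(amp * math.exp(-length / decay) for amp, decay in components)


print(affine_gap_penalty(1))  # 10.0
print(affine_gap_penalty(3))  # 11.0

# Placeholder components, purely to show the call shape (not fitted values).
example_components = [(1.0, 2.0), (0.5, 4.0)]
print(multiexponential(0, example_components))  # 1.5
```

An affine penalty corresponds to a single exponential decay in gap-length probability, so a fit that needs several exponential components indicates that long gaps are more common than the affine model implies.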
TL;DR: It is shown that, despite their low overall sequence similarity, a sequence alignment manually adjusted to take into account all the local similarities and the insertions/deletions and duplications/rearrangements described in the literature for viroids and viroid-like satellite RNA constitutes a data set suitable for a phylogenetic reconstruction.
Abstract: The proposed monophyletic origin of a group of subviral plant pathogens (viroids and viroid-like satellite RNAs), as well as the phylogenetic relationships and the resulting taxonomy of these entities, has been recently questioned. The criticism comes from the (apparent) lack of sequence similarity among these RNAs necessary to reliably infer a phylogeny. Here we show that, despite their low overall sequence similarity, a sequence alignment manually adjusted to take into account all the local similarities and the insertions/deletions and duplications/rearrangements described in the literature for viroids and viroid-like satellite RNA, along with the use of an appropriate estimator of genetic distances, constitutes a data set suitable for a phylogenetic reconstruction. When the likelihood-mapping method was applied to this data set, the tree-likeness obtained was higher than that corresponding to a sequence alignment that does not take into consideration the local similarities. In addition, bootstrap analysis also supports the major groups previously proposed and the reconstruction is consistent with the biological properties of these RNAs.
TL;DR: Spidey is a computer program that aligns spliced sequences to genomic sequences, using local alignment algorithms and heuristics to put together a global spliced alignment; the paper shows how Spidey was used to align reference sequences to known genomic sequences and then to the draft human genome.
Abstract: We have developed a computer program that aligns spliced sequences to genomic sequences, using local alignment algorithms and heuristics to put together a global spliced alignment. Spidey can produce reliable alignments quickly, even when confronted with noise from alternative splicing, polymorphisms, sequencing errors, or evolutionary divergence. We show how Spidey was used to align reference sequences to known genomic sequences and then to the draft human genome, to align mRNAs to gene clusters, and to align mouse mRNAs to human genomic sequence. We compared Spidey to two other spliced alignment programs; Spidey generally performed quite well in a very reasonable amount of time.