Top 28 papers published in the topic of Alignment-free sequence analysis in 2001

Showing papers on "Alignment-free sequence analysis" published in 2001

Journal Article•10.1006/JMBI.2001.5102•

Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles

Daniel Gautheret¹, André Lambert²•Institutions (2)

French Institute of Health and Medical Research¹, Centre national de la recherche scientifique²

09 Nov 2001-Journal of Molecular Biology

TL;DR: The automated use of sequence information in both single-stranded and helical regions yields better sensitivity/specificity ratios than descriptor-based programs, and iterative searches can be conducted to enrich collections of homologous RNAs.

331 citations

Journal Article•10.1093/BIOINFORMATICS/17.4.327•

A new approach to sequence comparison: normalized sequence alignment.

Abdullah N. Arslan¹, Ömer Eğecioğlu, Pavel A. Pevzner•Institutions (1)

University of California, Santa Barbara¹

01 Apr 2001-Bioinformatics

TL;DR: Normalized local alignment, a sequence comparison algorithm that reports the regions with maximum degree of similarity, is based on fractional programming; its running time is O(n² log n), and in practice it is only 3-5 times slower than the standard Smith-Waterman algorithm.

Abstract: The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find local alignment with maximum degree of similarity (e.g. maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% of similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction as recently pointed out by Zhang et al. (Bioinformatics, 15, 1012-1019, 1999). In this paper we propose a new sequence comparison algorithm (normalized local alignment) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n² log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.
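
The contrast drawn in this abstract between maximal score and maximal percent of matches can be illustrated with a small sketch. The code below is a plain Smith-Waterman with a linear gap penalty, plus a helper that reports the percent of matching columns in the returned alignment. It is not the normalized local alignment algorithm of the paper; the function names and the scoring values (match=1, mismatch=-1, gap=-1) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    Illustrative scoring only: the default values are assumptions, not
    parameters taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_matches(x, y):
    """Percent of alignment columns in which both rows carry the same residue."""
    if not x:
        return 0.0
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)


score, top, bottom = smith_waterman("TTACGTTT", "GGACGTGG")
print(score, top, bottom, percent_matches(top, bottom))
```

Because the score is a sum over columns, a long alignment with many mismatches can outscore a short, cleaner one; reporting the percent of matches alongside the score makes that trade-off visible.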

96 citations

Journal Article•10.1007/S00253-001-0844-0•

Bioinformatic tools for DNA/protein sequence analysis, functional assignment of genes and protein classification.

Bernd H. A. Rehm¹•Institutions (1)

University of Münster¹

01 Dec 2001-Applied Microbiology and Biotechnology

TL;DR: This review intends to provide a guide to choosing the most efficient way to analyze a new sequence or to collect information on a gene or protein of interest by applying current publicly available databases and Web services.

Abstract: The development of efficient DNA sequencing methods has led to the achievement of the DNA sequence of entire genomes from (to date) 55 prokaryotes, 5 eukaryotic organisms and 10 eukaryotic chromosomes. Thus, an enormous amount of DNA sequence data is available and even more will be forthcoming in the near future. Analysis of this overwhelming amount of data requires bioinformatic tools in order to identify genes that encode functional proteins or RNA. This is an important task, considering that even in the well-studied Escherichia coli more than 30% of the identified open reading frames are hypothetical genes. Future challenges of genome sequence analysis will include the understanding of gene regulation and metabolic pathway reconstruction including DNA chip technology, which holds tremendous potential for biomedicine and the biotechnological production of valuable compounds. The overwhelming volume of information often confuses scientists. This review intends to provide a guide to choosing the most efficient way to analyze a new sequence or to collect information on a gene or protein of interest by applying current publicly available databases and Web services. Recently developed tools that allow functional assignment of genes, mainly based on sequence similarity of the deduced amino acid sequence, using the currently available and increasing biological databases will be discussed.

92 citations

Journal Article•10.1089/106652701300312896•

MUSTA - A General, Efficient, Automated Method for Multiple Structure Alignment and Detection of Common Motifs: Application to Proteins

Nathaniel Leibowitz¹, Ruth Nussinov, Haim J. Wolfson•Institutions (1)

Tel Aviv University¹

01 Jan 2001-Journal of Computational Biology

TL;DR: An algorithm designed to carry out multiple structure alignment and to detect recurring substructural motifs that is applicable to comparisons of RNA structures and to detection of a pharmacophore in a series of drug molecules is presented.

Abstract: Here we present an algorithm designed to carry out multiple structure alignment and to detect recurring substructural motifs. So far we have implemented it for comparison of protein structures. How...

77 citations

Journal Article•10.1002/PROT.10008•

Comparative modeling of CASP4 target proteins: Combining results of sequence search with three‐dimensional structure assessment

Česlovas Venclovas¹•Institutions (1)

Lawrence Livermore National Laboratory¹

01 Jan 2001-Proteins

TL;DR: Results with CASP4 targets show that, along with the correctness of sequence‐structure alignments, effective use of multiple template structures may significantly increase accuracy of the model structure.

Abstract: Comparative modeling aims at constructing molecular models for proteins of unknown structure, by using known structures of related proteins as templates. To test the comparative modeling approach reported here, predictions for 13 target proteins were submitted during the fourth round of “blind” protein structure prediction experiment (CASP4; http://PredictionCenter.llnl.gov/casp4). Sequence identity between these target proteins and the closest known structures ranged from 13 to 58%, indicating a broad spectrum of prediction difficulty. Although this broad difficulty range required addressing a variety of issues, the most important proved to be sequence-structure alignment for distant homology targets. The alignment step was based on structure-based evaluation of alignment variants produced mainly with PSI-BLAST intermediate sequence search procedure (PSI-BLAST-ISS). Although a fraction of correctly aligned residues in resulting models was markedly better than the average in all cases, for distant homology targets it was still considerably below the estimated achievable level. Results with CASP4 targets show that, along with the correctness of sequence-structure alignments, effective use of multiple template structures may significantly increase accuracy of the model structure. Improvement in this area should also result in more accurate loop modeling and side-chain prediction. Proteins 2001;Suppl 5:47–54. © 2002 Wiley-Liss, Inc.

48 citations

Proceedings Article•10.1145/369133.369146•

A new approach to sequence comparison: normalized sequence alignment

Abdullah N. Arslan¹, Ömer Eğecioğlu¹, Pavel A. Pevzner²•Institutions (2)

University of California, Santa Barbara¹, University of California, San Diego²

22 Apr 2001

TL;DR: A new sequence comparison algorithm (normalized local alignment) that reports the regions with maximum degree of similarity is proposed; it is based on fractional programming, runs in O(n² log n) time, and in practice is only 3-5 times slower than the standard Smith-Waterman algorithm.

Abstract: The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find local alignment with maximum degree of similarity (e.g., maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% of similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction as recently pointed out by Zhang et al., 1999 [33]. In this paper we propose a new sequence comparison algorithm (normalized local alignment) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n² log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.

39 citations

Journal Article•10.1002/PROT.1146•

Integrated graphical analysis of protein sequence features predicted from sequence composition.

Erik L. L. Sonnhammer¹, John C. Wootton²•Institutions (2)

Karolinska Institutet¹, National Institutes of Health²

15 Nov 2001-Proteins

TL;DR: The SFINX package is described, which allows many different sets of segmental or continuous‐curve sequence feature data, generated by individual external programs, to be viewed in combination alongside a sequence dot‐plot or a multiple alignment of database matches.

Abstract: Several protein sequence analysis algorithms are based on properties of amino acid composition and repetitiveness. These include methods for prediction of secondary structure elements, coiled-coils, transmembrane segments or signal peptides, and for assignment of low-complexity, nonglobular, or intrinsically unstructured regions. The quality of such analyses can be greatly enhanced by graphical software tools that present predicted sequence features together in context and allow judgment to be focused simultaneously on several different types of supporting information. For these purposes, we describe the SFINX package, which allows many different sets of segmental or continuous-curve sequence feature data, generated by individual external programs, to be viewed in combination alongside a sequence dot-plot or a multiple alignment of database matches. The implementation is currently based on extensions to the graphical viewers Dotter and Blixem and scripts that convert data from external programs to a simple generic data definition format called SFS. We describe applications in which dot-plots and flanking database matches provide valuable contextual information for analyses based on compositional and repetitive sequence features. The system is also useful for comparing results from algorithms run with a range of parameters to determine appropriate values for defaults or cutoffs for large-scale genomic analyses. Proteins 2001;45:262–273. © 2001 Wiley-Liss, Inc.

39 citations

Journal Article•10.1385/MB:19:1:097•

Omiga: a PC-based sequence analysis tool.

Jeffrey A. Kramer¹•Institutions (1)

Pharmacia¹

01 Sep 2001-Molecular Biotechnology

TL;DR: This newest version of Omiga™ allows for sequencing and polymerase chain reaction (PCR) primer prediction, a functionality missing in earlier versions, as well as rapid searches for putative coding regions and Basic Local Alignment Search Tool (BLAST) queries against public databases at the National Center for Biotechnology Information (NCBI).

Abstract: Computer-based sequence analysis, notation, and manipulation are a necessity for all molecular biologists working with any but the most simple DNA sequences. As sequence data become increasingly available, tools that can be used to manipulate and annotate individual sequences and sequence elements will become an even more vital implement in the molecular biologist's arsenal. The Omiga DNA and Protein Sequence Analysis Software tool, version 2.0 provides an effective and comprehensive tool for the analysis of both nucleic acid and protein sequences that runs on a standard PC available in every molecular biology laboratory. Omiga allows the import of sequences in several common formats. Upon importing sequences and assigning them to various projects, Omiga allows the user to produce, analyze, and edit sequence alignments. Sequences may also be queried for the presence of restriction sites, sequence motifs, and other sequence features, all of which can be added into the notations accompanying each sequence. This newest version of Omiga also allows for sequencing and polymerase chain reaction (PCR) primer prediction, a functionality missing in earlier versions. Finally, Omiga allows rapid searches for putative coding regions, and Basic Local Alignment Search Tool (BLAST) queries against public databases at the National Center for Biotechnology Information (NCBI).

23 citations

Journal Article•10.1093/PROTEIN/14.4.209•

Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural genomics

Alex C.W. May¹•Institutions (1)

National Institute for Medical Research¹

01 Apr 2001-Protein Engineering

TL;DR: An alternative approach to clustering related proteins without the need for an a priori threshold is described; through its use of dynamic programming, it is guaranteed to produce globally optimal solutions at all levels of partition granularity.

Abstract: Hierarchical classification is probably the most popular approach to group related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree showing a nested sequence of groups may not be the most suitable representation of the data. Another is that visual inspection is the most common method to decide the most appropriate number of subsets from a tree. In fact, classification of proteins in general is bedevilled with the need for subjective thresholds to define group membership (e.g., 'significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach to cluster related proteins without the need for an a priori threshold: one, through its use of dynamic programming, which is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a 'core-periphery' structure within families. The 'core' of a protein family comprises the most typical sequences while the 'periphery' consists of the atypical ones. Further, a new sequence weighting scheme that combines the information in all the multiply aligned positions of an alignment in a novel way is put forward. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. 
Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.

20 citations

Genomics via Optical Mapping IV: Sequence Validation via Optical Map Matching

Marco Antoniotti, Thomas Anantharaman, Salvatore Paxia, Bud Mishra

1 Mar 2001

TL;DR: This paper describes the underlying mathematical model and the dynamic programming technique for validating a DNA sequence against a DNA map, an ordered restriction map obtained through an optical mapping process and augmented with statistical information that is used to place (or not) the sequence in the genome.

Abstract: This paper describes the underlying mathematical model and the dynamic programming algorithm technique for the validation of a (DNA) sequence against a (DNA) map. The sequence can be obtained from a variety of sources (e.g. GenBank, Sanger's Lab, or Celera P.E.) and it is assumed to be written out as a string of nucleotides. The map is an ordered restriction map obtained through an optical mapping process and is augmented with statistical information which will be used to place (or not) the sequence in the genome. Our approach has many other applications beyond validation: e.g. map-based sequence assembly, phasing sequence contigs, detecting and closing gaps and annotation of partially sequenced genomes to find open reading frames, genes and synteny groups. We tested our system by checking various maps against publicly available sequence data for Plasmodium falciparum.

16 citations

Journal Article•10.1016/S0097-8485(01)00095-X•

Medical target prediction from genome sequence: combining different sequence analysis algorithms with expert knowledge and input from artificial intelligence approaches.

Thomas Dandekar¹,², Fuli Du², R. Heiner Schirmer², Steffen Schmidt²•Institutions (2)

University of Freiburg¹, Heidelberg University²

01 Dec 2001-Computational Biology and Chemistry

TL;DR: By exploiting the rapid increase in available sequence data, the definition of medically relevant protein targets has been improved by a combination of differential genome analysis and analysis of individual proteins.

Proceedings Article•

A Genetic Algorithm for Multiple Sequence Alignment.

Jorng-Tzong Horng, Ching-Mei Lin, Bing-He Yang, Cheng-Yan Kao

1 Jan 2001

TL;DR: A genetic algorithm for multiple sequence alignment is presented; several data sets are tested, the experimental results are compared with other methods, and the approach is found to perform well on data sets with high similarity and long sequences.

Abstract: Multiple sequence alignment is an important tool in molecular sequence analysis. This paper presents genetic algorithms to solve multiple sequence alignments. Several data sets are tested and the experimental results are compared with other methods. We find our approach could obtain good performance in the data sets with high similarity and long sequences. The software can be found at http://rsdb.csie.ncu.edu.tw/tools/msa.htm.

Journal Article•10.1002/1097-4644(20010201)80:2<181::AID-JCB30>3.0.CO;2-1•

Informatics issues in large‐scale sequence analysis: Elucidating the protein kinases of C. elegans

Jonathan Bingham, Greg D. Plowman, Sucha Sudarsanam

01 Feb 2001-Journal of Cellular Biochemistry

TL;DR: This work outlines the approach to identifying the protein kinases of C. elegans from the genomic sequence, and describes new tools it has developed for analysis, management and visualization of genomic data.

Abstract: With the availability of the nearly complete genomic sequence of C. elegans, the first multicellular organism to be sequenced, molecular biology has definitely entered the postgenomic era. Annotation of the genomic sequence, which refers to identifying the genes and other biologically relevant sections of the genome, is an important and nontrivial next step. A first-pass annotation will be necessarily incomplete but will drive further biological experiments, which in turn will help to annotate the genome better. Given the scale of the genome sequence analysis, it is clear that the annotation should be automated as much as possible without sacrificing the quality of analysis. In this work, we outline our approach to identifying the protein kinases of C. elegans from the genomic sequence. We describe new tools we have developed for analysis, management and visualization of genomic data. By developing modular and scalable solutions, this study has provided a framework for future analysis of the Drosophila and human genomes.

Dissertation•10.4225/03/59C9F6DB2A2D2•

Algorithms for Sequence Alignment

David Richard Powell

1 Jan 2001

TL;DR: The focus of this thesis is on algorithms for the optimal alignment of two or three sequences of biological data, particularly DNA sequences, with particular emphasis on space and time complexity.

Abstract: Sequence alignment is an important tool for describing relationships between sequences. Many sequence alignment algorithms exist, differing in efficiency, and in their models of the sequences and of the relationship between sequences. The focus of this thesis is on algorithms for the optimal alignment of two or three sequences of biological data, particularly DNA sequences. The algorithms are discussed with particular emphasis on space and time complexity. A divide-and-conquer method is presented for use with a number of different alignment algorithms. This method may be used to reduce the space complexity of an alignment algorithm with little or no effect to the time complexity. The advantages of this divide-and-conquer method include its simplicity and the ease with which it can be applied to many different alignment algorithms. These advantages are demonstrated by using the divide-and-conquer method in conjunction with several known alignment algorithms. An efficient alignment algorithm is presented for the important problem of optimally aligning three sequences using a linear function for costing gaps in the alignment. For sequences of length n, and a minimum edit cost of d, this new algorithm has a time complexity of O(d + n). The algorithm is further developed by using the aforementioned divide-and-conquer method to improve its space complexity. This combination results in a time and space efficient algorithm, while also illustrating the usefulness of the divide-and-conquer method. It is important when aligning sequences to correctly account for any non-randomness that is significant in the sequences. For example, if certain statistical patterns appear throughout sequences from a certain family, it is important to make use of this information when aligning sequences from this family. Common, unsurprising, patterns provide less evidence for the relatedness of sequences than more surprising regions provide.
A new algorithm is presented to align optimally two non-random sequences. For a particular sequence model, this new algorithm apportions weight to every part of the alignment dependent on the importance of that part as determined by the sequence model. This algorithm is then developed further so that it can be used to infer whether two non-random sequences are related.
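
The general idea of using divide and conquer to cut the memory needed for an optimal alignment can be sketched with Hirschberg's classic linear-space scheme. This is a stand-in chosen for illustration, not the method developed in the thesis: it handles only two sequences with unit edit costs, and all function names are hypothetical.

```python
def last_row(a, b):
    """Unit-cost edit distances between all of `a` and every prefix of `b`.

    Only one previous row is kept, so memory is O(len(b)).
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev


def hirschberg(a, b):
    """Return one optimal alignment (two gapped strings) in linear space."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Pair the single character with a matching position in b if one
        # exists, otherwise with the first character of b.
        k = max(b.find(a), 0)
        return "-" * k + a + "-" * (len(b) - k - 1), b
    mid = len(a) // 2
    left = last_row(a[:mid], b)                # forward pass over the first half
    right = last_row(a[mid:][::-1], b[::-1])   # backward pass over the second half
    n = len(b)
    # Pick the split of b that minimises the combined cost of both halves.
    split = min(range(n + 1), key=lambda j: left[j] + right[n - j])
    a1, b1 = hirschberg(a[:mid], b[:split])
    a2, b2 = hirschberg(a[mid:], b[split:])
    return a1 + a2, b1 + b2


def alignment_cost(x, y):
    """Number of columns that are a mismatch or contain a gap."""
    return sum(p != q for p, q in zip(x, y))


top, bottom = hirschberg("kitten", "sitting")
print(top, bottom, alignment_cost(top, bottom))
```

The recursion recomputes score rows instead of storing the full dynamic programming table, which roughly doubles the work while reducing memory from quadratic to linear in the sequence lengths.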

Proceedings Article•

Parallel genetic algorithm for performance-driven sequence alignment

L. A. Anbarasu¹, V. Sundararajan¹, P. Narayanasamy²•Institutions (2)

Centre for Development of Advanced Computing¹, Anna University²

7 Jul 2001

TL;DR: The simultaneous alignment of three or more nucleotide or amino acid sequences is among the most important tools for analyzing biological sequences and an essential prerequisite to phylogenetic reconstruction.

Abstract: The simultaneous alignment of three or more nucleotide or amino acid sequences is among the most important tools for analyzing biological sequences. Multiple alignments are used to find characteristic motifs and conserved regions in protein families; to help demonstrate homology between new sequences and existing families; to improve the prediction of secondary and tertiary structure of new sequences; and as an essential prerequisite to phylogenetic reconstruction. The fact that the multiple sequence alignment problem is of high complexity has led to the development of different algorithms. These algorithms fall into two categories, namely the greedy ones that rely on pairwise alignment and those that attempt to align all the sequences simultaneously.

Using a complex model of sequence evolution to evaluate and improve phylogenetic methods

Mark Travis Holder

1 Dec 2001

Reference Entry•10.1002/9780470015902.A0001798.PUB2•

DNA Sequence Analysis

Takashi Gojobori¹, So Nakagawa¹, Jose C. Clemente¹•Institutions (1)

National Institute of Genetics¹

25 Apr 2001

TL;DR: In this article, DNA databases, homology search tools and sequence alignment methods are surveyed and the concept of distance between genes and how to calculate this measure using DNA or amino acid sequences and several commonly used techniques for phylogenetic analysis and tree evaluation are described.

Abstract: Recent advances in deoxyribonucleic acid (DNA) sequencing technology have produced a massive amount of nucleotide sequences, which are stored in DNA databanks and genomic data repositories. Furthermore, comprehensive analyses of transcriptional and genomic elements have uncovered an elaborate system of gene expression that broadens our understanding of fundamental biological phenomena. The analysis of DNA data has therefore become essential to predict gene function or detect regulatory motifs through comparative studies. In this article, DNA databases, homology search tools and sequence alignment methods are surveyed. The concept of distance between genes, how to calculate this measure using DNA or amino acid sequences, and several commonly used techniques for phylogenetic analysis and tree evaluation are also described.

Key concepts: Advances in DNA sequencing technology have produced an unprecedented amount of sequence data. The DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) are the three major sequence data repositories. They exchange data periodically, and maintain various services for data search and retrieval. Similarity searching, alignment of sequences, prediction of function and reconstruction of the evolutionary history (phylogenetic tree) of a group of species are among the most commonly used techniques for sequence analysis. BLAST (similarity searching), ClustalW (sequence alignment), Pfam (protein domains) and TRANSFAC (transcription factors) are popular tools and resources. The genetic distance, a measure of evolutionary similarity, is usually calculated as the number of nucleotide or amino acid differences (substitutions) among sequences. Nucleotide substitutions are synonymous (not affecting the codified amino acid) or nonsynonymous (triggering an amino acid change). Distance- and character-based methods can be used to reconstruct phylogenetic trees. Distance-based methods reconstruct the tree from an estimation of the evolutionary distance among taxa. Character-based methods derive the phylogeny directly from the observable state of characters in the taxa. The bootstrap method is commonly used to determine the quality of an inferred phylogeny.

Keywords: DNA databank; genome projects; similarity search; evolutionary distance; molecular phylogeny
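
The entry above describes genetic distance as the number of differences between sequences. A minimal sketch of that calculation for two already-aligned nucleotide sequences follows. The function names are hypothetical, and the Jukes-Cantor correction is added here as one standard way to turn an observed proportion of differences into an estimated number of substitutions per site; it is not named in the entry itself.

```python
import math


def p_distance(seq1, seq2, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for x, y in zip(seq1, seq2):
        if x == gap or y == gap:
            continue
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3) for nucleotide data."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGT", "ACGAAC-T")
print(p, jukes_cantor(p))
```

A matrix of such pairwise distances is the usual input to the distance-based tree reconstruction methods that the entry mentions.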

Database searching with phylogenetic trees

Marc Rehmsmeier

1 Jan 2001

TL;DR: A database search method that is based on phylogenetic trees - treesearch is introduced, which results in a generalization of established probabilistic methods such as pairwise sequence alignment, multiple sequence alignments, and profile searches.

Abstract: Database searching and phylogenetic tree reconstruction are two major fields of computational sequence analysis. This thesis introduces a combination of both: a database search method that is based on phylogenetic trees - treesearch. A given protein family is described by its multiple alignment and its phylogenetic tree. A database sequence that is tested for membership in the family is tentatively inserted into that tree. The result of this operation determines how well the sequence fits into the family. The idea is realized in the distance based context of phylogeny. To assess the performance of the method in terms of sensitivity and selectivity, it is compared to profiles (ISREC pfsearch), two implementations of hidden Markov models (HMMER hmmsearch and SAM hmmscore), and to the family pairwise search (FPS) method. The comparison is based on a novel evaluation tool, which was also developed during this work. All methods are presented in a new unified functional framework of database searching. The analysis is complemented by extensive simulations. The treesearch idea is also transferred to the probabilistic context of phylogeny, which results in a generalization of established probabilistic methods such as pairwise sequence alignment, multiple sequence alignment, and profile searches.

Journal Article•10.1038/35066030•

Self-improving GBuilder

Jane Alfred

01 Apr 2001-Nature Reviews Genetics

Journal Article•10.1093/PROTEIN/14.4.219•

Use of a database of structural alignments and phylogenetic trees in investigating the relationship between sequence and structural variability among homologous proteins.

S. Balaji¹, Narayanaswamy Srinivasan•Institutions (1)

Indian Institute of Science¹

01 Apr 2001-Protein Engineering

TL;DR: A comparison of sequence and 3D structure-based phylogenies for all the families suggests that only a few families have a radical difference in the two kinds of dendrograms.

Abstract: The database PALI (Phylogeny and ALIgnment of homologous protein structures) consists of families of protein domains of known three-dimensional (3D) structure. In a PALI family, every member has been structurally aligned with every other member (pairwise) and also simultaneous superposition (multiple) of all the members has been performed. The database also contains 3D structure-based and structure-dependent sequence similarity-based phylogenetic dendrograms for all the families. The PALI release used in the present analysis comprises 225 families derived largely from the HOMSTRAD and SCOP databases. The quality of the multiple rigid-body structural alignments in PALI was compared with that obtained from COMPARER, which encodes a procedure based on properties and relationships. The alignments from the two procedures agreed very well and variations are seen only in the low sequence similarity cases often in the loop regions. A validation of Direct Pairwise Alignment (DPA) between two proteins is provided by comparing it with Pairwise alignment extracted from Multiple Alignment of all the members in the family (PMA). In general, DPA and PMA are found to vary rarely. The ready availability of pairwise alignments allows the analysis of variations in structural distances as a function of sequence similarities and number of topologically equivalent Cα atoms. The structural distance metric used in the analysis combines root mean square deviation (r.m.s.d.) and number of equivalences, and is shown to vary similarly to r.m.s.d. The correlation between sequence similarity and structural similarity is poor in pairs with low sequence similarities. A comparison of sequence and 3D structure-based phylogenies for all the families suggests that only a few families have a radical difference in the two kinds of dendrograms. The difference could occur when the sequence similarity among the homologues is low or when the structures are subjected to evolutionary pressure for the retention of function. The PALI database is expected to be useful in furthering our understanding of the relationship between sequences and structures of homologous proteins and their evolution.

Journal Article•10.1093/BIOINFORMATICS/17.12.1158•

Estimation of P-values for global alignments of protein sequences

Caleb Webber¹, Geoffrey J. Barton¹•Institutions (1)

European Bioinformatics Institute¹

01 Dec 2001-Bioinformatics

TL;DR: The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation.

Abstract: Motivation: The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. The calculation of a Z-score for the comparison gives a length and composition corrected measure of the similarity between the sequences. However, the Z-score alone does not indicate the likely biological significance of the similarity. In this paper, all pairs of domains from 250 sequences belonging to different SCOP folds were aligned and Z-scores calculated. The distribution of Z-scores was fitted with a peak distribution from which the probability of obtaining a given Z-score from the global alignment of two protein sequences of unrelated fold was calculated. A similar analysis was applied to subsequence pairs found by the Smith-Waterman algorithm. These analyses allow the probability that two protein sequences share the same fold to be estimated by global sequence alignment. Results: The relationship between Z-score and probability varied little over the matrix/gap penalty combinations examined. However, an average shift of +4.7 was observed for Z-scores derived from global alignment of locally-aligned subsequences compared to global alignment of the full-length sequences. This shift was shown to be the result of pre-selection by local alignment, rather than any structural similarity in the subsequences. The search ability of both methods was benchmarked against the SCOP superfamily classification and showed that global alignment Z-scores generated from the entire sequence are as effective as SSEARCH at low error rates and more effective at higher error rates. However, global alignment Z-scores generated from the best locally-aligned subsequence were significantly less effective than SSEARCH. The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation. Availability: Software to apply the statistics to global alignments is available from http://barton.ebi.ac.uk. Contact: geoff@ebi.ac.uk
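
Illustrative sketch (not the authors' software): the abstract above relies on a Z-score that corrects a global alignment score for length and composition. One common way to obtain such a score, assumed here, is to compare the observed score with scores for shuffled copies of one sequence. The function names, the match/mismatch/gap values and the number of shuffles are hypothetical choices for illustration; the paper uses substitution matrices and gap penalties that are not reproduced here.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are placeholders, not the matrices used in the paper.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def z_score(a, b, n_shuffles=100, seed=0):
    """Z-score of the observed score against shuffled versions of b.

    Shuffling keeps the length and composition of b, so the null scores
    reflect what unrelated sequences of the same makeup would achieve.
    """
    rng = random.Random(seed)
    observed = global_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(global_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    if sd == 0:
        raise ValueError("null scores have zero variance; Z-score undefined")
    return (observed - mean) / sd
```

Converting a Z-score into a probability, as the paper does, additionally requires a fitted null distribution; that step is not attempted in this sketch.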

Journal Article•10.1016/S0378-1119(01)00461-9•

Multiple alignment of complete sequences (MACS) in the post-genomic era

Odile Lecompte¹, Julie D. Thompson¹, Frédéric Plewniak¹, Jean-Claude Thierry¹, Olivier Poch¹•Institutions (1)

French Institute of Health and Medical Research¹

30 May 2001-Gene

TL;DR: An integrated multiple alignment system bringing together sequence data, knowledge-based systems and prediction methods with their inherent unreliability will provide an ideal workbench for the validation, propagation and presentation of this information in a format that is concise, clear and intuitive.

Journal Article•10.1093/NAR/29.1.61•

PALI—a database of Phylogeny and ALIgnment of homologous protein structures

S. Balaji¹, S. Sujatha, Sanjeev Kumar, Narayanaswamy Srinivasan•Institutions (1)

Indian Institute of Science¹

01 Jan 2001-Nucleic Acids Research

TL;DR: PALI (release 1.2) contains three-dimensional structure-dependent sequence alignments as well as structure-based phylogenetic trees of homologous protein domains in various families to help in analysing the relationship between sequence and structure variation at a given level of sequence similarity.

Abstract: PALI (release 1.2) contains three-dimensional (3-D) structure-dependent sequence alignments as well as structure-based phylogenetic trees of homologous protein domains in various families. The data set of homologous protein structures has been derived by consulting the SCOP database (release 1.50) and the data set comprises 604 families of homologous proteins involving 2739 protein domain structures with each family made up of at least two members. Each member in a family has been structurally aligned with every other member in the same family (pairwise alignment) and all the members in the family are also aligned using simultaneous superposition (multiple alignment). The structural alignments are performed largely automatically, with manual interventions especially in the cases of distantly related proteins, using the program STAMP (version 4.2). Every family is also associated with two dendrograms, calculated using PHYLIP (version 3.5), one based on a structural dissimilarity metric defined for every pairwise alignment and the other based on similarity of topologically equivalent residues. These dendrograms enable easy comparison of sequence and structure-based relationships among the members in a family. Structure-based alignments with the details of structural and sequence similarities, superposed coordinate sets and dendrograms can be accessed conveniently using a web interface. The database can be queried for protein pairs with sequence or structural similarities falling within a specified range. Thus PALI forms a useful resource to help in analysing the relationship between sequence and structure variation at a given level of sequence similarity. PALI also contains over 653 ‘orphans’ (single member families). Using the web interface involving PSI-BLAST and PHYLIP it is possible to associate the sequence of a new protein with one of the families in PALI and generate a phylogenetic tree combining the query sequence and proteins of known 3-D structure. The database with the web interfaced search and dendrogram generation tools can be accessed at http://pauling.mbu.iisc.ernet.in/~pali.

Journal Article•10.1093/BIOINFORMATICS/17.5.415•

Retrieval and on-the-fly alignment of sequence fragments from the HIV database.

Brian Gaschen¹, Carla Kuiken¹, Bette T. Korber¹, Brian T. Foley¹•Institutions (1)

Los Alamos National Laboratory¹

01 May 2001-Bioinformatics

TL;DR: The MPAlign (Multiple Pairwise Alignment) web interface is a collection of Perl scripts that retrieves sequences from the Los Alamos HIV sequence database based on a number of search parameters and pairwise-aligns them to a model sequence using the Hidden Markov Model-based program HMMER.

Abstract: Motivation: The amount of HIV-1 sequence data generated (presently around 42 000 sequences, of which more than 22 000 are from the V3 region of the viral envelope) presents a challenge for anyone working on the analysis of these data. A major problem is obtaining the region of interest from the stored sequences, which often contain but are not limited to that region. In addition, multiple alignment programs generally cannot deal with the large numbers of sequences that are available for many HIV-1 regions. We set out to provide our users with a tool that will retrieve and create an initial alignment of the HIV sequences that are available for a given genomic region. Results: The MPAlign (Multiple Pairwise Alignment) web interface is a collection of Perl scripts that retrieves sequences from the Los Alamos HIV sequence database based on a number of search parameters. All sequences are pairwise-aligned to a model sequence using the Hidden Markov Model-based program HMMER. The HMMER model is general enough to accommodate virtually all HIV-1 sequences stored in the database. To create a multiple sequence alignment, gaps are inserted into the sequences during retrieval, so that they are aligned to one another. Retrieving and aligning the almost 560 gp120 sequences (∼1500 nt) stored in the database is at least 1500 times faster than a similar Clustal alignment. Availability: At http://www.hiv.lanl.gov/ Contact: Brian Gaschen at bkg@lanl.gov
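
Illustrative sketch (not the MPAlign Perl code): the abstract above builds a multiple alignment from sequences that were each aligned pairwise to one model, by inserting gaps so that all rows line up. A minimal version of that idea is shown below, assuming each pairwise alignment is given as two gapped strings of equal length (model row, query row) with "-" as the gap character. The function name `merge_pairwise` and this input format are hypothetical; how HMMER output is parsed is not covered.

```python
def merge_pairwise(pairs, gap="-"):
    """Merge pairwise alignments against one shared model into equal-length rows.

    pairs: list of (model_aligned, query_aligned) gapped strings.
    Residues inserted relative to the model are left-justified in a shared
    insert block; they are not aligned to each other across sequences.
    """
    parsed = []
    for model_aln, query_aln in pairs:
        if len(model_aln) != len(query_aln):
            raise ValueError("each pairwise alignment must have equal-length rows")
        inserts = [""]  # inserts[i]: query residues placed before model position i
        columns = []    # columns[i]: query character aligned to model position i
        for m, q in zip(model_aln, query_aln):
            if m == gap:
                inserts[-1] += q
            else:
                columns.append(q)
                inserts.append("")
        parsed.append((inserts, columns))

    model_len = len(parsed[0][1])
    if any(len(cols) != model_len for _, cols in parsed):
        raise ValueError("all alignments must use the same ungapped model")

    # Widest insertion seen before each model position (and after the last one).
    widths = [max(len(ins[i]) for ins, _ in parsed) for i in range(model_len + 1)]

    rows = []
    for inserts, columns in parsed:
        parts = []
        for i in range(model_len):
            parts.append(inserts[i].ljust(widths[i], gap))
            parts.append(columns[i])
        parts.append(inserts[model_len].ljust(widths[model_len], gap))
        rows.append("".join(parts))
    return rows
```

For example, `merge_pairwise([("AC-GT", "ACTGT"), ("ACGT", "A-GT")])` pads the second row where the first has an insertion. Because every sequence is aligned only once, to the model, the cost grows linearly with the number of sequences, which is consistent with the speed advantage the abstract reports over a full multiple alignment.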

Journal Article•10.1006/JMBI.2001.5187•

Towards a reliable objective function for multiple sequence alignments.

Julie D. Thompson¹, Frédéric Plewniak¹, Raymond Ripp¹, Jean-Claude Thierry¹, Olivier Poch¹•Institutions (1)

French Institute of Health and Medical Research¹

07 Dec 2001-Journal of Molecular Biology

TL;DR: NorMD combines the advantages of the column-scoring techniques with the sensitivity of methods incorporating residue similarity scores, and incorporates ab initio sequence information, such as the number, length and similarity of the sequences to be aligned.

Journal Article•10.1002/PROT.1129•

Distribution of Indel lengths.

Bin Qian¹, Richard A. Goldstein¹•Institutions (1)

University of Michigan¹

01 Oct 2001-Proteins

TL;DR: This work extracts the probability of a gap occurrence and the resulting gap length distribution in distantly related proteins using alignments based on their common structures and observes a distribution of gaps that can be fitted with a multiexponential with four distinct components.

Abstract: Protein sequence alignment has become a widely used method in the study of newly sequenced proteins. Most sequence alignment methods use an affine gap penalty to assign scores to insertions and deletions. Although affine gap penalties represent the relative ease of extending a gap compared with initializing a gap, they are still an obvious oversimplification of the real processes that occur during sequence evolution. To improve the efficiency of sequence alignment methods and to obtain a better understanding of the process of sequence evolution, we wanted to find a more accurate model of insertions and deletions in homologous proteins. In this work, we extract the probability of a gap occurrence and the resulting gap length distribution in distantly related proteins (sequence identity < 25%) using alignments based on their common structures. We observe a distribution of gaps that can be fitted with a multiexponential with four distinct components. The results suggest new approaches to modeling insertions and deletions in sequence alignments.
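
Illustrative sketch (not from the paper): the abstract above contrasts the usual affine gap penalty with a multiexponential gap-length distribution. The functions below write both down so the difference is concrete. All parameter values are placeholders; the fitted amplitudes and decay lengths from the paper are not reproduced here, and the function names are hypothetical.

```python
import math


def affine_gap_penalty(length, gap_open=10.0, gap_extend=1.0):
    """Affine penalty: one opening cost plus a constant cost per extra position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def multiexponential(length, amplitudes, decay_lengths):
    """Unnormalised gap-length weight: sum_i a_i * exp(-length / tau_i)."""
    return sum(a * math.exp(-length / tau)
               for a, tau in zip(amplitudes, decay_lengths))


def implied_penalty(length, amplitudes, decay_lengths):
    """Negative log of the gap-length weight, read as a length-dependent penalty."""
    return -math.log(multiexponential(length, amplitudes, decay_lengths))
```

With a single exponential component the implied penalty is linear in gap length, which is exactly the affine form. With several components of different decay lengths the implied penalty grows more slowly for long gaps than any single affine line fitted to short gaps, which is one way to read the abstract's remark that affine penalties oversimplify.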

Journal Article•10.1007/S002390010203•

Phylogenetic analysis of viroid and viroid-like satellite RNAs from plants: a reassessment.

Santiago F. Elena¹, Joaquín Dopazo², Marcos de la Peña³, Ricardo Flores³, Theodor O. Diener⁴, Andrés Moya¹•Institutions (4)

University of Valencia¹, Carlos III Health Institute², Polytechnic University of Valencia³, University of Maryland Biotechnology Institute⁴

01 Aug 2001-Journal of Molecular Evolution

TL;DR: It is shown that, despite their low overall sequence similarity, a sequence alignment manually adjusted to take into account all the local similarities and the insertions/deletions and duplications/rearrangements described in the literature for viroids and viroid-like satellite RNA constitutes a data set suitable for a phylogenetic reconstruction.

Abstract: The proposed monophyletic origin of a group of subviral plant pathogens (viroids and viroid-like satellite RNAs), as well as the phylogenetic relationships and the resulting taxonomy of these entities, has been recently questioned. The criticism comes from the (apparent) lack of sequence similarity among these RNAs necessary to reliably infer a phylogeny. Here we show that, despite their low overall sequence similarity, a sequence alignment manually adjusted to take into account all the local similarities and the insertions/deletions and duplications/rearrangements described in the literature for viroids and viroid-like satellite RNA, along with the use of an appropriate estimator of genetic distances, constitutes a data set suitable for a phylogenetic reconstruction. When the likelihood-mapping method was applied to this data set, the tree-likeness obtained was higher than that corresponding to a sequence alignment that does not take into consideration the local similarities. In addition, bootstrap analysis also supports the major groups previously proposed and the reconstruction is consistent with the biological properties of these RNAs.

Journal Article•10.1101/GR.195301•

Spidey: a tool for mRNA-to-genomic alignments.

Sarah J. Wheelan¹, Deanna M. Church¹, James Ostell¹•Institutions (1)

National Institutes of Health¹

01 Nov 2001-Genome Research

TL;DR: Spidey is a computer program that aligns spliced sequences to genomic sequences, using local alignment algorithms and heuristics to put together a global spliced alignment; it was used to align reference sequences to known genomic sequences and then to the draft human genome.

Abstract: We have developed a computer program that aligns spliced sequences to genomic sequences, using local alignment algorithms and heuristics to put together a global spliced alignment. Spidey can produce reliable alignments quickly, even when confronted with noise from alternative splicing, polymorphisms, sequencing errors, or evolutionary divergence. We show how Spidey was used to align reference sequences to known genomic sequences and then to the draft human genome, to align mRNAs to gene clusters, and to align mouse mRNAs to human genomic sequence. We compared Spidey to two other spliced alignment programs; Spidey generally performed quite well in a very reasonable amount of time.
