Top 73 papers published on the topic of Alignment-free sequence analysis in 2004

Showing papers on "Alignment-free sequence analysis" published in 2004

MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment

Sudhir Kumar¹, Koichiro Tamura², Masatoshi Nei³•Institutions (3)

Biodesign Institute¹, Tokyo Metropolitan University², Pennsylvania State University³

01 Jun 2004-Briefings in Bioinformatics

TL;DR: An overview of the statistical methods, computational tools, and visual exploration modules for data input and the results obtainable in MEGA is provided.

Abstract: With its theoretical basis firmly established in molecular evolutionary and population genetics, the comparative DNA and protein sequence analysis plays a central role in reconstructing the evolutionary histories of species and multigene families, estimating rates of molecular evolution, and inferring the nature and extent of selective forces shaping the evolution of genes and genomes. The scope of these investigations has now expanded greatly owing to the development of high-throughput sequencing techniques and novel statistical and computational methods. These methods require easy-to-use computer programs. One such effort has been to produce Molecular Evolutionary Genetics Analysis (MEGA) software, with its focus on facilitating the exploration and analysis of the DNA and protein sequence variation from an evolutionary perspective. Currently in its third major release, MEGA3 contains facilities for automatic and manual sequence alignment, web-based mining of databases, inference of the phylogenetic trees, estimation of evolutionary distances and testing evolutionary hypotheses. This paper provides an overview of the statistical methods, computational tools, and visual exploration modules for data input and the results obtainable in MEGA.

12,730 citations

Journal Article•10.1002/PROT.20308•

Fold recognition by combining sequence profiles derived from evolution and from depth‐dependent structural alignment of fragments

Hongyi Zhou¹, Yaoqi Zhou¹•Institutions (1)

University at Buffalo¹

02 Nov 2004-Proteins

TL;DR: The resulting method, called SP3, is found to be the most sensitive and accurate single‐method server in all benchmarks tested where other methods are available for comparison and its accuracy rivals some of the consensus methods such as ShotGun‐INBGU, Pmodeller3, Pcons4, and ROBETTA.

Abstract: Recognizing structural similarity without significant sequence identity has proved to be a challenging task. Sequence-based and structure-based methods as well as their combinations have been developed. Here, we propose a fold-recognition method that incorporates structural information without the need of sequence-to-structure threading. This is accomplished by generating sequence profiles from protein structural fragments. The structure-derived sequence profiles allow a simple integration with evolution-derived sequence profiles and secondary-structural information for an optimized alignment by efficient dynamic programming. The resulting method (called SP³) is found to make a statistically significant improvement in both sensitivity of fold recognition and accuracy of alignment over the method based on evolution-derived sequence profiles alone (SP) and the method based on evolution-derived sequence profile and secondary structure profile (SP²). SP³ was tested in the SALIGN benchmark for alignment accuracy and the Lindahl, PROSPECTOR 3.0, and LiveBench 8.0 benchmarks for remote-homology detection and model accuracy. SP³ is found to be the most sensitive and accurate single-method server in all benchmarks tested where other methods are available for comparison (although its results are statistically indistinguishable from the next best in some cases and the comparison is subject to the limitation of the time-dependent sequence and/or structural libraries used by different methods). In LiveBench 8.0, its accuracy rivals some of the consensus methods such as ShotGun-INBGU, Pmodeller3, Pcons4, and ROBETTA. The SP³ fold-recognition server is available at http://theory.med.buffalo.edu.

272 citations

Journal Article•10.1101/GR.2657504•

A novel method for multiple alignment of sequences with repeated and shuffled elements

Benjamin J. Raphael¹, Degui Zhi¹, Haixu Tang¹, Pavel A. Pevzner¹•Institutions (1)

University of California, San Diego¹

01 Nov 2004-Genome Research

TL;DR: ABA (A-Bruijn alignment), a new method for multiple alignment of biological sequences, that represents an alignment as a directed graph, possibly containing cycles, provides more flexibility than does a traditional alignment matrix or the recently introduced partial order alignment (POA) graph.

Abstract: We describe ABA (A-Bruijn alignment), a new method for multiple alignment of biological sequences. The major difference between ABA and existing multiple alignment methods is that ABA represents an alignment as a directed graph, possibly containing cycles. This representation provides more flexibility than does a traditional alignment matrix or the recently introduced partial order alignment (POA) graph by allowing a larger class of evolutionary relationships between the aligned sequences. Our graph representation is particularly well-suited to the alignment of protein sequences with shuffled and/or repeated domain structure, and allows one to construct multiple alignments of proteins containing (1) domains that are not present in all proteins, (2) domains that are present in different orders in different proteins, and (3) domains that are present in multiple copies in some proteins. In addition, ABA is useful in the alignment of genomic sequences that contain duplications and inversions. We provide several examples illustrating the applications of ABA.

191 citations

Journal Article•10.1093/NAR/GKH382•

3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment

Olivier Poirot¹, Karsten Suhre, Chantal Abergel, Eamonn O'Toole, Cedric Notredame•Institutions (1)

Centre national de la recherche scientifique¹

01 Jul 2004-Nucleic Acids Research

TL;DR: 3DCoffee@igs is a web-based tool dedicated to the computation of high-quality multiple sequence alignments (MSAs) and makes it possible to mix protein sequences and structures in order to increase the accuracy of the alignments.

Abstract: This paper presents 3DCoffee@igs, a web-based tool dedicated to the computation of high-quality multiple sequence alignments (MSAs). 3D-Coffee makes it possible to mix protein sequences and structures in order to increase the accuracy of the alignments. Structures can be either provided as PDB identifiers or directly uploaded into the server. Given a set of sequences and structures, pairs of structures are aligned with SAP while sequence-structure pairs are aligned with Fugue. The resulting collection of pairwise alignments is then combined into an MSA with the T-Coffee algorithm. The server and its documentation are available from http://igs-server.cnrs-mrs.fr/Tcoffee/.

178 citations

Journal Article•10.1093/BIOINFORMATICS/BTH116•

Align-m: a new algorithm for multiple alignment of highly divergent sequences

Ivo Van Walle¹, Ignace Lasters, Lode Wyns¹•Institutions (1)

Vrije Universiteit Brussel¹

12 Jun 2004-Bioinformatics

TL;DR: Align-m is a new program that uses a non-progressive local approach to guide a global alignment of highly divergent sequences and has comparable or slightly higher accuracy in terms of correctly aligned residues, especially for distantly related sequences.

Abstract: Motivation: Multiple alignment of highly divergent sequences is a challenging problem for which available programs tend to show poor performance. Generally, this is due to a scoring function that does not describe biological reality accurately enough or a heuristic that cannot explore solution space efficiently enough. In this respect, we present a new program, Align-m, that uses a non-progressive local approach to guide a global alignment. Results: Two large test sets were used that represent the entire SCOP classification and cover sequence similarities between 0 and 50% identity. Performance was compared with the publicly available algorithms ClustalW, T-Coffee and DiAlign. In general, Align-m has comparable or slightly higher accuracy in terms of correctly aligned residues, especially for distantly related sequences. Importantly, it aligns far fewer residues incorrectly, with average differences of over 15% compared with some of the other algorithms. Availability: Align-m and the test sets are available at http://bioinformatics.vub.ac.be.

172 citations

Journal Article•10.1021/AC035258X•

High-throughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results.

Brian C. Searle¹, Surendra Dasari², Mark Turner¹, Ashok P. Reddy², Dongseok Choi², Phillip A. Wilmarth², Ashley L. McCormack¹, Larry L. David², Srinivasa R. Nagalla¹•Institutions (2)

Oregon Health & Science University¹, Oregon National Primate Research Center²

10 Mar 2004-Analytical Chemistry

TL;DR: A novel, mass-based approach to sequence alignment, implemented as a program called OpenSea, which can identify more peptides and proteins than commonly used database-searching programs while accurately locating sequence variation sites and unanticipated posttranslational modifications in a high-throughput environment.

Abstract: With the increasing availability of de novo sequencing algorithms for interpreting high-mass accuracy tandem mass spectrometry (MS/MS) data, there is a growing need for programs that accurately identify proteins from de novo sequencing results. De novo sequences derived from tandem mass spectra of peptides often contain ambiguous regions where the exact amino acid order cannot be determined. One problem this poses for sequence alignment algorithms is the difficulty in distinguishing discrepancies due to de novo sequencing errors from actual genomic sequence variation and posttranslational modifications. We present a novel, mass-based approach to sequence alignment, implemented as a program called OpenSea, to resolve these problems. In this approach, de novo and database sequences are interpreted as masses of residues, and the masses, rather than the amino acid codes, are compared. To provide further flexibility, the masses can be aligned in groups, which can resolve many de novo sequencing errors. The per...
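
The idea of comparing residue masses rather than amino acid codes can be illustrated with a minimal Python sketch. This is not the OpenSea implementation: the function names and the default tolerance are hypothetical, and the table holds standard monoisotopic residue masses rounded to five decimals. It shows why grouping helps: two glycines weigh the same as one asparagine, and isoleucine and leucine are indistinguishable by mass.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def group_mass(residues: str) -> float:
    """Sum the residue masses of a group of one or more amino acids."""
    return sum(RESIDUE_MASS[r] for r in residues)


def masses_match(de_novo_group: str, database_group: str,
                 tolerance: float = 0.01) -> bool:
    """Return True if two residue groups have equal mass within a tolerance.

    The tolerance (in daltons) is an illustrative assumption, not a value
    taken from the paper.
    """
    return abs(group_mass(de_novo_group) - group_mass(database_group)) <= tolerance


if __name__ == "__main__":
    print(masses_match("GG", "N"))   # two glycines vs. one asparagine
    print(masses_match("I", "L"))    # isobaric residues
    print(masses_match("K", "Q"))    # differ by about 0.036 Da
```

With a tight tolerance, K and Q remain distinguishable; loosening it to 0.05 Da would merge them, which is one way a lower mass accuracy could translate into more ambiguous matches.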

153 citations

Journal Article•10.1093/BIOINFORMATICS/BTH137•

Development of joint application strategies for two microbial gene finders

Alice C. McHardy, Alexander Goesmann, Alfred Pühler¹, Folker Meyer•Institutions (1)

Bielefeld University¹

01 Jul 2004-Bioinformatics

TL;DR: Joint application strategies are developed that combine the strengths of two microbial gene finders to improve overall gene finding performance, resulting in a significant improvement in specificity while sensitivity remains similar to that of Glimmer.

Abstract: Motivation: As a starting point in annotation of bacterial genomes, gene finding programs are used for the prediction of functional elements in the DNA sequence. Due to the faster pace and increasing number of genome projects currently underway, it is becoming especially important to have performant methods for this task. Results: This study describes the development of joint application strategies that combine the strengths of two microbial gene finders to improve the overall gene finding performance. Critica is very specific in the detection of similarity-supported genes as it uses a comparative sequence analysis-based approach. Glimmer employs a very sophisticated model of genomic sequence properties and is sensitive also in the detection of organism-specific genes. Based on a data set of 113 microbial genome sequences, we optimized a combined application approach using different parameters with relevance to the gene finding problem. This results in a significant improvement in specificity, while sensitivity remains similar to that of Glimmer. The improvement is especially pronounced for GC-rich genomes. The method is currently being applied for the annotation of several microbial genomes. Availability: The methods described have been implemented within the gene prediction component of the GenDB genome annotation system.

83 citations

Journal Article•10.1101/GR.2648404•

An intermediate grade of finished genomic sequence suitable for comparative analyses.

01 Nov 2004-Genome Research

TL;DR: The generation and quality of an intermediate grade of finished genomic sequence (termed comparative-grade finished sequence), which is tailored for use in multispecies sequence comparisons, are described; the resulting sequence is of very high quality and reflects 99% of the total sequence.

Abstract: The strategy of “shotgun sequencing” (Sanger et al. 1982; Wilson and Mardis 1997b; Green 2001) has emerged as the most cost-effective approach for the de novo generation of large amounts of genomic sequence data. Whether applied on individual large-insert clones (C. elegans Sequencing Consortium 1998; International Human Genome Sequencing Consortium 2001), whole genomes (Adams et al. 2000; Venter et al. 2001; Aparicio et al. 2002; Mouse Genome Sequencing Consortium 2002), or a combination of both (Rat Genome Sequencing Project Consortium 2004), shotgun-sequencing strategies are typically performed in two broad phases. In the initial “shotgun” phase, highly redundant sequence data are obtained by generating sequence reads from one or both insert ends of randomly selected subclones derived from the starting DNA (large-insert clone or whole genome). This phase involves high-throughput methodologies and is responsible for generating the great majority of the actual sequence. In the second “finishing” phase, the assembled sequence emanating from the shotgun phase is analyzed and refined, with additional sequence data typically generated to attain long-range continuity and to improve accuracy. Sequence finishing is a low-throughput, craftsman-like process that involves highly skilled personnel performing both computational and experimental procedures in a customized fashion; as a result, it is also relatively expensive. For sequencing the human genome, the Human Genome Project appropriately set very high standards with respect to the quality of the finished sequence (Felsenfeld et al. 1999; International Human Genome Sequencing Consortium 2001; see www.genome.wustl.edu/Overview/finrulesname.php?G16=1). 
Specifically, there was a rigorous set of standards that ensured consistency among different sequencing centers and a well-defined quality specification that required a low error rate (less than one error per 10,000 bases), the absence of gaps, and confirmation of the final sequence by comparison with a restriction enzyme digest-based fingerprint of each clone. Implementation of these standards yielded a remarkably accurate human genome sequence (International Human Genome Sequencing Consortium 2004), which has provided a powerful foundation for subsequent annotation efforts (Stein 2001; Ashurst and Collins 2003), comparisons with other species' sequences (Aparicio et al. 2002; Mouse Genome Sequencing Consortium 2002; Rat Genome Sequencing Project Consortium 2004), and efforts to untangle complex genomic structures, such as segmental duplications (Bailey et al. 2002). However, achieving such high standards required a considerable investment in sequence finishing, estimated to have been 30%–40% of the total cost. At present and with the recent decline in the costs of producing shotgun-sequence data, the resources required to perform such high-quality sequence finishing now correspond to 40%–70% of the total cost (data not shown). It is well recognized that the quality of the sequence generated for the human genome, which we refer to as human-grade finished sequence, is substantially better than that available at the end of the shotgun phase. The latter full-shotgun draft sequence is simply derived from the automated assembly of the full collection of shotgun sequence reads (e.g., that providing greater than eightfold average sequence coverage). It is important to point out that in the progression from full-shotgun to human-grade finished sequence, there is not a linear relationship between the associated additional costs and the enhancement in sequence quality. 
Indeed, early in this progression, significant gains in quality can be achieved with even small amounts of additional effort (Wilson and Mardis 1997b; Gordon et al. 2001), whereas in later stages, large amounts of effort are often required to accomplish even small quality improvements. In contemplating the sequencing of additional vertebrate genomes beyond the first pair of high-quality reference sequences (i.e., those of the human [International Human Genome Sequencing Consortium 2001, 2004] and mouse [Mouse Genome Sequencing Consortium 2002] genomes), the relative value of sequence finishing is of great interest. Specifically, understanding the relationship between overall sequence quality and the ability to extract relevant information by comparative analyses becomes important, especially in the context of analyzing sequences from multiple species. Motivated to generate genomic sequence from multiple species suitable for comparative analyses (Margulies et al. 2003a,b; Thomas et al. 2003), we sought to investigate whether an intermediate grade of finished sequence could be produced that was both cost-effective and appropriate in terms of quality. Toward that end, we have established an approach for generating what we call comparative-grade finished sequence. Here we report details about comparative-grade finished sequence, as generated on a large scale for bacterial-artificial chromosome (BAC) clones (Shizuya et al. 1992; Birren et al. 1998). In addition, we assess the relative quality of this sequence and the effort and costs associated with producing it.

83 citations

Proceedings Article•10.1109/CSB.2004.120•

MUSCLE: multiple sequence alignment with improved accuracy and speed

R.C. Edgar¹•Institutions (1)

University of California, Berkeley¹

16 Aug 2004

TL;DR: MUSCLE is a new program for creating multiple alignments of protein sequences that gives average accuracy statistically indistinguishable from T-Coffee and is the fastest published method for large numbers of sequences, able to align 5,000 sequences of length 300 in 7 minutes on a desktop computer.

Abstract: We present MUSCLE, a new program for creating multiple alignments of protein sequences. MUSCLE achieves the highest scores so far reported on four alignment benchmarks: Balibase, PREFAB, SABmark and SMART, achieving accuracy from 1% to 2.5% higher than T-Coffee and execution times that are generally lower than CLUSTALW for typical input data. With options designed for high-throughput applications, MUSCLE gives average accuracy statistically indistinguishable from T-Coffee and is the fastest published method for large numbers of sequences, able to align 5,000 sequences of length 300 in 7 minutes on a desktop computer. MUSCLE is freely available at http://www.drive5.com/muscle.

78 citations

Journal Article•10.1093/BIOINFORMATICS/BTH055•

Multiple sequence alignment in parallel on a workstation cluster

Justin Ebedes¹, Amitava Datta¹•Institutions (1)

University of Western Australia¹

01 May 2004-Bioinformatics

TL;DR: It is shown that parallelizing the ClustalW algorithm can result in significant speedup, and experimental results show that speedup of over 5.5 on six processors is obtainable for most inputs.

Abstract: Summary: Multiple sequence alignment is the NP-hard problem of aligning three or more DNA or amino acid sequences in an optimal way so as to match as many characters as possible from the set of sequences. The popular sequence alignment program ClustalW uses the classical method of approximating a sequence alignment, by first computing a distance matrix and then constructing a guide tree to show the evolutionary relationship of the sequences. We show that parallelizing the ClustalW algorithm can result in significant speedup. We used a cluster of workstations using C and message passing interface for our implementation. Experimental results show that speedup of over 5.5 on six processors is obtainable for most inputs. Availability: The software is available upon request from the second author.
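
To put the reported figure in context, the short Python sketch below works through what a speedup of 5.5 on six processors implies. Parallel efficiency follows directly from the numbers in the abstract. The serial-fraction estimate additionally assumes Amdahl's law, which the abstract does not mention, so that part should be read as a rough illustration only.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Efficiency = speedup divided by the number of processors."""
    return speedup / processors


def amdahl_serial_fraction(speedup: float, processors: int) -> float:
    """Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / p).

    Solving for f gives f = (p / S - 1) / (p - 1). Using Amdahl's law here
    is an assumption made for illustration, not a claim from the paper.
    """
    return (processors / speedup - 1) / (processors - 1)


if __name__ == "__main__":
    s, p = 5.5, 6
    print(f"efficiency: {parallel_efficiency(s, p):.3f}")
    print(f"implied serial fraction: {amdahl_serial_fraction(s, p):.4f}")
```

Because the abstract reports a speedup of "over 5.5", the efficiency computed here is a lower bound (roughly 92%) and the implied serial fraction an upper bound (roughly 1.8%) under that model.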

62 citations

Journal Article•10.1023/B:GENP.0000023684.05565.78•

Multiple Sequence Alignment with Evolutionary Computation

Conrad Shyu¹, Luke Sheneman¹, James A. Foster¹•Institutions (1)

University of Idaho¹

01 Jun 2004-Genetic Programming and Evolvable Machines

TL;DR: A brief review of current work in the area of multiple sequence alignment (MSA) for DNA and protein sequences using evolutionary computation (EC) and two novel approaches for inferring MSA using genetic algorithms are presented.

Abstract: In this paper we provide a brief review of current work in the area of multiple sequence alignment (MSA) for DNA and protein sequences using evolutionary computation (EC). We detail the strengths and weaknesses of EC techniques for MSA. In addition, we present two novel approaches for inferring MSA using genetic algorithms. Our first novel approach utilizes a GA to evolve an optimal guide tree in a progressive alignment algorithm and serves as an alternative to the more traditional heuristic techniques such as neighbor-joining. The second novel approach facilitates the optimization of a consensus sequence with a GA using a vertically scalable encoding scheme in which the number of iterations needed to find the optimal solution is approximately the same regardless of the number of sequences being aligned. We compare both of our novel approaches to the popular progressive alignment program Clustal W. Experiments have confirmed that EC constitutes an attractive and promising alternative to traditional heuristic algorithms for MSA.

Journal Article•10.1016/S1672-0229(04)02014-5•

Recent applications of Hidden Markov Models in computational biology.

Khar Heng Choo¹, Joo Chuan Tong¹, Louxin Zhang¹•Institutions (1)

National University of Singapore¹

01 May 2004-Genomics, Proteomics & Bioinformatics

TL;DR: This paper examines recent developments and applications of Hidden Markov Models to various problems in computational biology, including multiple sequence alignment, homology detection, protein sequences classification, and genomic annotation.

Journal Article•10.1142/S0219720004000818•

Local sequence-structure motifs in RNA

Rolf Backofen, Sebastian Will

01 Dec 2004-Journal of Bioinformatics and Computational Biology

TL;DR: A new general definition of locality for sequence-structure alignments that is biologically motivated and efficiently tractable is suggested and it is proved that the defined locality means connectivity by atomic and non-atomic bonds.

Abstract: Ribonucleic acid (RNA) enjoys increasing interest in molecular biology; despite this interest, fundamental algorithms are lacking, e.g. for identifying local motifs. Like proteins, RNA molecules have a distinctive structure. Therefore, in addition to sequence information, structure plays an important part in assessing the similarity of RNAs. Furthermore, common sequence-structure features in two or several RNA molecules are often only spatially local, where possibly large parts of the molecules are dissimilar. Consequently, we address the problem of comparing RNA molecules by computing an optimal local alignment with respect to sequence and structure information. While local alignment is superior to global alignment for identifying local similarities, no general local sequence-structure alignment algorithms are currently known. We suggest a new general definition of locality for sequence-structure alignments that is biologically motivated and efficiently tractable. To show the former, we discuss locality of RNA and prove that the defined locality means connectivity by atomic and non-atomic bonds. To show the latter, we present an efficient algorithm for the newly defined pairwise local sequence-structure alignment (lssa) problem for RNA. For molecules of lengths n and m, the algorithm has worst-case time complexity of O(n²·m²·max(n,m)) and a space complexity of only O(n·m). An implementation of our algorithm is available at . Its runtime is competitive with global sequence-structure alignment.

Journal Article•10.1261/RNA.5168504•

BayesFold: Rational 2° folds that combine thermodynamic, covariation, and chemical data for aligned RNA sequences

Rob Knight¹, Amanda Birmingham, Michael Yarus•Institutions (1)

University of Colorado Boulder¹

01 Sep 2004-RNA

TL;DR: On a gapped alignment of 86 tRNA Phe sequences each 77 bases long, BayesFold takes 31 sec to perform the calculations; the best structure contained 95% of the base pairs in the true structure, and the true structure was ranked second.

Abstract: BayesFold is a Web application that folds an alignment of closely related sequences and evaluates hypotheses about their shared structure. It uses Bayes’s Theorem to combine information from several sources, including chemical mapping (if available), thermodynamic folding, and observed sequence variations. Its method provides a rational basis for integrating results, even when these methods conflict. On a gapped alignment of 86 tRNA Phe sequences each 77 bases long, BayesFold takes 31 sec to perform the calculations; the best structure contained 95% of the base pairs in the true structure, and the true structure was ranked second. Notably, similar results come from random samples of only 10 sequences from the alignment (running time 3 sec), suggesting that remarkably few sequences are required for good results. In contrast, folding single sequences with BayesFold produced structures 9.6 bp different, or with the Vienna package, 13.4 bp different, from the true structure. Similar results were obtained for other families of tRNAs. We especially recommend BayesFold for alignments of 3–50 closely related sequences, such as the sequence families frequently found in SELEX. In addition to providing a convenient way to explore the effects of each of the criteria on the plausibility of different structures, BayesFold also makes it easy to produce publication-quality secondary-structure graphics. The Web interface, available at http://bayes.colorado.edu/fold/, includes the flexibility to thread any of the sequences (or the consensus sequence) through any of the structures, including the one judged most probable.
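
How Bayes's theorem can combine several evidence sources over a set of candidate structures can be sketched in a few lines of Python. This is not BayesFold's actual model: the function name, the structure labels, and all probabilities are hypothetical, and the sketch assumes the evidence sources are conditionally independent given the structure, which the abstract does not state.

```python
def posterior(prior: dict, *likelihoods: dict) -> dict:
    """Combine a prior with independent likelihoods via Bayes's theorem.

    prior:       maps each candidate structure to P(structure)
    likelihoods: each maps a structure to P(evidence | structure)
    Returns the normalized posterior P(structure | all evidence).
    """
    unnormalized = {}
    for structure, p in prior.items():
        weight = p
        for likelihood in likelihoods:
            weight *= likelihood[structure]
        unnormalized[structure] = weight
    total = sum(unnormalized.values())
    return {s: w / total for s, w in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical numbers for two candidate folds, A and B.
    prior = {"A": 0.5, "B": 0.5}
    thermodynamic = {"A": 0.6, "B": 0.4}   # favors A slightly
    covariation = {"A": 0.2, "B": 0.8}     # favors B strongly
    print(posterior(prior, thermodynamic, covariation))
```

In this toy case the two sources conflict, and the posterior weights them by how strongly each discriminates between the candidates, which is the sense in which a Bayesian combination gives a principled way to integrate conflicting results.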

Book Chapter•10.1016/S0091-679X(04)77012-0•

The zebrafish genome project: sequence analysis and annotation.

Kerstin Jekosch¹•Institutions (1)

Wellcome Trust Sanger Institute¹

01 Jan 2004-Methods in Cell Biology

TL;DR: This chapter focuses on the sequence analysis and annotation of the zebrafish genome project, which provides the basis for extensive comparative genomics and hence the improvement of the annotation of already existing genomes from other model organisms.

Abstract (publisher summary): This chapter focuses on the sequence analysis and annotation of the zebrafish genome project. Rapid advances in zebrafish genetics have led to an increasing need for a genome sequence to facilitate interpretation of data. An annotated zebrafish genome sequence is immensely informative for both forward and reverse genetics. It also provides the basis for extensive comparative genomics and hence the improvement of the annotation of already existing genomes from other model organisms, and is also a valuable tool for phylogenetic and evolutionary research. The strategy to generate finished genomic sequence is based on the construction of a physical map of bacterial clone inserts and subsequent identification of a minimal overlapping set from which clones are selected for sequencing. The physical map for the zebrafish genome has been built from four clone libraries, using restriction digest fingerprinting and alignment to mapped markers as described for the human and mouse genomes. To provide information about biological function, each annotated gene is given a type and a name.

Journal Article•10.1002/PROT.20299•

A generalized affine gap model significantly improves protein sequence alignment accuracy

Marcus A. Zachariah¹, Gavin E. Crooks¹, Stephen R. Holbrook², Steven E. Brenner¹,²•Institutions (2)

University of California, Berkeley¹, Lawrence Berkeley National Laboratory²

23 Nov 2004-Proteins

TL;DR: Evaluation of alignment quality showed that the generalized affine model aligns fewer residue pairs than the traditional affine model but achieves significantly higher per‐residue accuracy, and it is concluded that generalized affine gap costs should be used when alignment accuracy carries more importance than aligned sequence length.

Abstract: Sequence alignment underpins common tasks in molecular biology, including genome annotation, molecular phylogenetics, and homology modeling. Fundamental to sequence alignment is the placement of gaps, which represent character insertions or deletions. We assessed the ability of a generalized affine gap cost model to reliably detect remote protein homology and to produce high-quality alignments. Generalized affine gap alignment with optimal gap parameters performed as well as the traditional affine gap model in remote homology detection. Evaluation of alignment quality showed that the generalized affine model aligns fewer residue pairs than the traditional affine model but achieves significantly higher per-residue accuracy. We conclude that generalized affine gap costs should be used when alignment accuracy carries more importance than aligned sequence length.
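
The difference between the two gap models can be made concrete with a small Python sketch. The abstract does not give the cost formulas, so both are assumptions here: the traditional cost uses the common "open plus extension per residue" convention, and the generalized cost follows the form usually attributed to Altschul's generalized affine gap costs, in which a gap may also leave pairs of residues unaligned at a separate per-pair cost. The function names and all parameter values are hypothetical.

```python
def affine_gap_cost(length: int, gap_open: float = 11.0,
                    gap_extend: float = 1.0) -> float:
    """Traditional affine cost of a gap spanning `length` residues.

    Convention assumed here: cost = gap_open + gap_extend * length.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def generalized_affine_gap_cost(k1: int, k2: int, gap_open: float = 11.0,
                                gap_extend: float = 1.0,
                                unaligned_pair: float = 2.0) -> float:
    """Cost of a gap leaving k1 residues of one sequence and k2 of the other
    unaligned (assumed form): open + extend * |k1 - k2| + pair * min(k1, k2).
    """
    if k1 <= 0 and k2 <= 0:
        return 0.0
    return (gap_open
            + gap_extend * abs(k1 - k2)
            + unaligned_pair * min(k1, k2))


if __name__ == "__main__":
    print(affine_gap_cost(3))                 # ordinary 3-residue gap
    print(generalized_affine_gap_cost(3, 0))  # same gap, generalized model
    print(generalized_affine_gap_cost(3, 2))  # two residue pairs left unaligned
```

Under these assumptions the generalized cost reduces to the traditional one when only one sequence has unaligned residues. Allowing a poorly conserved stretch to stay unaligned in both sequences is consistent with the abstract's finding of fewer aligned residue pairs but higher per-residue accuracy.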

Journal Article•10.1007/S00114-004-0542-8•

Comparative genomics: methods and applications.

Bernhard Haubold, Thomas Wiehe¹•Institutions (1)

University of Cologne¹

25 Jun 2004-Naturwissenschaften

TL;DR: It is argued that the most fruitful method of understanding the functional content of genomes is to study them in the context of related genomic sequences, and such a study may reveal selection, a fundamental pointer to biological relevance.

Abstract: Interpreting the functional content of a given genomic sequence is one of the central challenges of biology today. Perhaps the most promising approach to this problem is based on the comparative method of classic biology in the modern guise of sequence comparison. For instance, protein-coding regions tend to be conserved between species. Hence, a simple method for distinguishing a functional exon from the chance absence of stop codons is to investigate its homologue from closely related species. Predicting regulatory elements is even more difficult than exon prediction, but again, comparisons pinpointing conserved sequence motifs upstream of translation start sites are helping to unravel gene regulatory networks. In addition to interspecific studies, intraspecific sequence comparison yields insights into the evolutionary forces that have acted on a species in the past. Of particular interest here is the identification of selection events such as selective sweeps. Both intra- and interspecific sequence comparisons are based on a variety of computational methods, including alignment, phylogenetic reconstruction, and coalescent theory. This article surveys the biology and the central computational ideas applied in recent comparative genomics projects. We argue that the most fruitful method of understanding the functional content of genomes is to study them in the context of related genomic sequences. In particular, such a study may reveal selection, a fundamental pointer to biological relevance.

Journal Article•10.2165/00822942-200403020-00008•

A Sequence Alignment-Independent Method for Protein Classification

John K. Vries¹, Rajan Munshi¹, Dror Tobi¹, Judith Klein-Seetharaman¹,², Panayiotis V. Benos¹, Ivet Bahar¹•Institutions (2)

University of Pittsburgh¹, Carnegie Mellon University²

01 Jan 2004-Applied Bioinformatics

TL;DR: An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed and showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.

Abstract: homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams (20⁴ = 160,000) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4-grams in common between an unknown and the best matching probe correlated with functional motifs from PRINTS. The results showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.
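The 4-gram idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' Bayesian model: the probe here is simply the set of 4-grams found in the most family members, and the match score is a plain count of shared 4-grams. The function names (`four_grams`, `build_probe`, `score`, `classify`), the `top_n` cutoff, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def four_grams(seq, k=4):
    """Return all overlapping k-grams (default: 4-grams) of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_probe(family, top_n=50):
    """Build a toy 'probe': the top_n 4-grams seen in the most family members.

    Assumption: membership counts stand in for the paper's Bayesian selection
    of the 4-grams that best characterise a family.
    """
    counts = Counter()
    for seq in family:
        counts.update(set(four_grams(seq)))  # count each 4-gram once per member
    return {gram for gram, _ in counts.most_common(top_n)}


def score(seq, probe):
    """Number of distinct 4-grams shared between a sequence and a probe."""
    return len(set(four_grams(seq)) & probe)


def classify(seq, probes):
    """Return the name of the family whose probe shares the most 4-grams."""
    return max(probes, key=lambda name: score(seq, probes[name]))


# Hypothetical toy families, not real Pfam-A or PIR-PSD data.
probes = {
    "A": build_probe(["MKVLAAGIV", "MKVLAAGLV"]),
    "B": build_probe(["GHTWYPQRS", "GHTWYPQKS"]),
}
best_family = classify("AMKVLAAG", probes)
```

Because no alignment is computed, the shared 4-grams may sit at different positions in the query and in the family members, which is the property the abstract contrasts with alignment-based methods.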

Journal Article•10.1101/GR.3152604•

EAnnot: a genome annotation tool using experimental evidence.

Li Ding¹, Aniko Sabo¹, Nicolas Berkowicz¹, Rekha Meyer¹, Yoram Shotland¹, Mark R. Johnson¹, Kymberlie H. Pepin¹, Richard K. Wilson¹, John Spieth¹•Institutions (1)

Washington University in St. Louis¹

01 Dec 2004-Genome Research

TL;DR: Manual annotation of human chromosome 6 is compared with annotation performed by EAnnot in order to assess the latter's accuracy; EAnnot can also be used to rapidly obtain an automated gene set.

Abstract: The sequence of any genome becomes most useful for biological experimentation when a complete and accurate gene set is available. Gene prediction programs offer an efficient way to generate an automated gene set. Manual annotation, when performed by experienced annotators, is more accurate and complete than automated annotation. However, it is a laborious and expensive process, and by its nature, introduces a degree of variability not found with automated annotation. EAnnot (Electronic Annotation) is a program originally developed for manually annotating the human genome. It combines the latest bioinformatics tools to extract and analyze a wide range of publicly available data in order to achieve fast and reliable automatic gene prediction and annotation. EAnnot builds gene models based on mRNA, EST, and protein alignments to genomic sequence, attaches supporting evidence to the corresponding genes, identifies pseudogenes, and locates poly(A) sites and signals. Here, we compare manual annotation of human chromosome 6 with annotation performed by EAnnot in order to assess the latter's accuracy. EAnnot can readily be applied to manual annotation of other eukaryotic genomes and can be used to rapidly obtain an automated gene set.

Journal Article•10.1093/NAR/GKI076•

Polymorphix: a sequence polymorphism database

Eric Bazin¹, Laurent Duret, Simon Penel, Nicolas Galtier•Institutions (1)

University of Montpellier¹

17 Dec 2004-Nucleic Acids Research

TL;DR: Polymorphix is an ACNUC structured database allowing both simple and complex queries for population genomic studies, and contains within-species homologous sequence families built using EMBL/GenBank under suitable similarity and bibliographic criteria.

Abstract: Within-species sequence variation data are of special interest since they contain information about recent population/species history, and the molecular evolutionary forces currently in action in natural populations. These data, however, are presently dispersed within generalist databases, and are difficult to access. To solve this problem, we have developed Polymorphix, a database dedicated to sequence polymorphism. It contains within-species homologous sequence families built using EMBL/GenBank under suitable similarity and bibliographic criteria. Polymorphix is an ACNUC structured database allowing both simple and complex queries for population genomic studies. Alignments within families as well as phylogenetic trees can be downloaded. When available, outgroups are included in the alignment. Polymorphix contains sequences from the nuclear, mitochondrial and chloroplastic genomes of every eukaryote species represented in EMBL. It can be accessed by a web interface (http://pbil.univ-lyon1.fr/polymorphix/query.php).

Journal Article•10.1093/NAR/GKI043•

S4: structure-based sequence alignments of SCOP superfamilies

James Casbon¹, Mansoor A. S. Saqi¹•Institutions (1)

University of London¹

17 Dec 2004-Nucleic Acids Research

TL;DR: The database is described and examples showing how the automatically generated S4 alignments compare favourably to hand-crafted alignments are given.

Abstract: S4 is an automatically generated database of multiple structure-based sequence alignments of protein superfamilies in the SCOP database. All structural domains that do not share more than 40% sequence identity as defined by the ASTRAL compendium of protein structures are included. The alignments are constructed using pairwise structural alignments to generate residue equivalences that are then integrated into multiple alignments using sequence alignment tools. We describe the database and give examples showing how the automatically generated S4 alignments compare favourably to hand-crafted alignments. Available at: http://compbio.mds.qmw.ac.uk/S4.html.

Journal Article•10.1016/S1570-8667(03)00078-9•

Parametric multiple sequence alignment and phylogeny construction

David Fernández-Baca¹, Timo Seppäläinen², Giora Slutzki¹•Institutions (2)

Iowa State University¹, University of Wisconsin-Madison²

01 Jun 2004-Journal of Discrete Algorithms

TL;DR: It is shown that many of the usual formulations of these problems fall within the same integer parametric framework, implying that the number of distinct optima obtained as the parameters are varied across their ranges is polynomially bounded in the length and number of sequences.

Journal Article•10.1186/1471-2105-5-192•

ABC: software for interactive browsing of genomic multiple sequence alignment data

Gregory M. Cooper¹, Senthil A. G. Singaravelu¹, Arend Sidow¹•Institutions (1)

Stanford University¹

08 Dec 2004-BMC Bioinformatics

TL;DR: The ABC is a lightweight, stand-alone, and flexible graphical user interface for browsing genomic multiple sequence alignments of specific loci, up to hundreds of kilobases or a few megabases in length.

Abstract: Background: Alignment and comparison of related genome sequences is a powerful method to identify regions likely to contain functional elements. Such analyses are data intensive, requiring the inclusion of genomic multiple sequence alignments, sequence annotations, and scores describing regional attributes of columns in the alignment. Visualization and browsing of results can be difficult, and there are currently limited software options for performing this task.

Journal Article•10.1093/NAR/GKH386•

AGenDA: gene prediction by cross-species sequence comparison

Leila Taher, Oliver Rinner, Saurabh Garg, Alexander Sczyrba, Burkhard Morgenstern

01 Jul 2004-Nucleic Acids Research

TL;DR: A WWW-based software program for homology-based gene prediction at BiBiServ (Bielefeld Bioinformatics Server), which takes pairs of evolutionarily related genomic sequences as input data and searches for conserved splicing signals and start/stop codons near regions of local sequence conservation.

Abstract: Automatic gene prediction is one of the major challenges in computational sequence analysis. Traditional approaches to gene finding rely on statistical models derived from previously known genes. By contrast, a new class of comparative methods relies on comparing genomic sequences from evolutionarily related organisms to each other. These methods are based on the concept of phylogenetic footprinting: they exploit the fact that functionally important regions in genomic sequences are usually more conserved than non-functional regions. We created a WWW-based software program for homology-based gene prediction at BiBiServ (Bielefeld Bioinformatics Server). Our tool takes pairs of evolutionarily related genomic sequences as input data, e.g. from human and mouse. The server runs CHAOS and DIALIGN to create an alignment of the input sequences and subsequently searches for conserved splicing signals and start/stop codons near regions of local sequence conservation. Genes are predicted based on local homology information and splice signals. The server returns predicted genes together with a graphical representation of the underlying alignment. The program is available at http://bibiserv.TechFak.Uni-Bielefeld.DE/agenda/.

Journal Article•10.1186/1471-2105-5-167•

Improvement of alignment accuracy utilizing sequentially conserved motifs.

Saikat Chakrabarti¹, Nitin Bhardwaj², Prem A. Anand³, Ramanathan Sowdhamini¹•Institutions (3)

National Centre for Biological Sciences¹, Indian Institutes of Technology², International Institute of Information Technology³

28 Oct 2004-BMC Bioinformatics

TL;DR: An alignment algorithm that combines progressive dynamic algorithm, local substructure alignment and iterative refinement to achieve an improved, user-interactive tool that allows the user to fix conserved regions in equivalent position in the alignment thereby reducing the chance of global misalignment to a great extent.

Abstract: Multiple sequence alignment algorithms are very important tools in molecular biology today. Accurate alignment of proteins is central to several areas such as homology modelling, docking studies, understanding evolutionary trends and study of structure-function relationships. In recent times, improvement of existing progressive programs and implementation of new iterative algorithms have made a significant change in this field. We report an alignment algorithm that combines progressive dynamic algorithm, local substructure alignment and iterative refinement to achieve an improved, user-interactive tool. Large-scale benchmarking studies show that this FMALIGN server produces alignments that, aside from preservation of functional and structural conservation, have accuracy comparable to other popular multiple alignment programs. The FMALIGN server allows the user to fix conserved regions in equivalent position in the alignment, thereby reducing the chance of global misalignment to a great extent. FMALIGN is available at http://caps.ncbs.res.in/FMALIGN/Home.html

Book Chapter•10.1007/978-3-540-30549-1_19•

Modelling-Alignment for non-random sequences

David R. Powell¹, Lloyd Allison¹, Trevor I. Dix¹•Institutions (1)

Monash University¹

04 Dec 2004

TL;DR: A new and general method, modelling-alignment, is described; it incorporates population models into the alignment process, which can change the rank-order of matches between a query sequence and a collection of sequences, compared to results from standard algorithms.

Abstract: Populations of biased, non-random sequences may cause standard alignment algorithms to yield false-positive matches and false-negative misses. A standard significance test based on the shuffling of sequences is a partial solution, applicable to populations that can be described by simple models. Masking-out low information content intervals throws information away. We describe a new and general method, modelling-alignment: population models are incorporated into the alignment process, which can (and should) lead to changes in the rank-order of matches between a query sequence and a collection of sequences, compared to results from standard algorithms. The new method is general and places very few conditions on the nature of the models that can be used with it. We apply modelling-alignment to local alignment, global alignment, optimal alignment, and the relatedness problem.

Journal Article•10.1093/NAR/GKI109•

PartiGeneDB--collating partial genomes.

José M. Peregrín-Alvarez¹, Andrew Yam, Gaya Sivakumar, John Parkinson•Institutions (1)

University of Toronto¹

04 Mar 2004-Nucleic Acids Research

TL;DR: PartiGeneDB facilitates regular incremental updates of new sequence datasets associated with both new and existing species, and contains the assembled partial genomes derived from 1.83 million sequences associated with 247 different eukaryotes.

Abstract: Owing to the high costs involved, only 28 eukaryotic genomes have been fully sequenced to date. On the other hand, an increasing number of projects have been initiated to generate survey sequence data for a large number of other eukaryotic organisms. For the most part, these data are poorly organized and difficult to analyse. Here, we present PartiGeneDB (http://www.partigenedb.org), a publicly available database resource, which collates and processes these sequence datasets on a species-specific basis to form non-redundant sets of gene objects, which we term partial genomes. Users may query the database to identify particular genes of interest either on the basis of sequence similarity or via the use of simple text searches for specific patterns of BLAST annotation. Alternatively, users can examine entire partial genome datasets on the basis of relative expression of gene objects or by the use of an interactive Java-based tool (SimiTri), which displays sequence similarity relationships for a large number of sequence objects in a single graphic. PartiGeneDB facilitates regular incremental updates of new sequence datasets associated with both new and existing species. PartiGeneDB currently contains the assembled partial genomes derived from 1.83 million sequences associated with 247 different eukaryotes.

Journal Article•10.1093/BIOINFORMATICS/BTH258•

Algorithms for sequence analysis via mutagenesis

Jonathan M. Keith, Peter Adams, Darryn Bryant, Duncan A. E. Cochran, Gita H. Lala, Keith Mitchelson

12 Oct 2004-Bioinformatics

TL;DR: A number of algorithms for analysing and interpreting data generated by sequence analysis via mutagenesis, a technique that renders a wide range of problematic DNAs amenable to sequencing.

Abstract: Motivation: Despite many successes of conventional DNA sequencing methods, some DNAs remain difficult or impossible to sequence. Unsequenceable regions occur in the genomes of many biologically important organisms, including the human genome. Such regions range in length from tens to millions of bases, and may contain valuable information such as the sequences of important genes. The authors have recently developed a technique that renders a wide range of problematic DNAs amenable to sequencing. The technique is known as sequence analysis via mutagenesis (SAM). This paper presents a number of algorithms for analysing and interpreting data generated by this technique. Results: The essential idea of SAM is to infer the target sequence using the sequences of mutants derived from the target. We describe three algorithms used in this process. The first algorithm predicts the number of mutants that will be required to infer the target sequence with a desired level of accuracy. The second algorithm infers the target sequence itself, using the mutant sequences. The third algorithm assigns quality values to each inferred base. The algorithms are illustrated using mutant sequences generated in the laboratory. Availability: Software will be made available upon request.
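The abstract does not describe how the second and third SAM algorithms work, so the following Python sketch substitutes a simple per-position majority vote purely to illustrate the general idea of inferring a target from its mutants and attaching a confidence to each base. It assumes the mutant sequences are already aligned and of equal length; the function name `infer_target` and the agreement-fraction "quality" are hypothetical and are not the authors' quality values.

```python
from collections import Counter


def infer_target(mutants):
    """Infer a consensus sequence from pre-aligned, equal-length mutants.

    Assumption: a majority vote per column stands in for the paper's
    inference algorithm, which the abstract does not specify.
    Returns the consensus string and, per position, the fraction of
    mutants that agree with the chosen base.
    """
    if len({len(m) for m in mutants}) != 1:
        raise ValueError("mutants must be non-empty and of equal length")
    bases, agreement = [], []
    for column in zip(*mutants):  # iterate over alignment columns
        base, votes = Counter(column).most_common(1)[0]
        bases.append(base)
        agreement.append(votes / len(mutants))
    return "".join(bases), agreement


# Hypothetical toy data: three mutants, each carrying at most one substitution.
target, agreement = infer_target(["ACGT", "ACGA", "TCGT"])
```

Positions where every mutant agrees receive an agreement of 1.0, while positions hit by a mutation in one sequence receive a lower value, which conveys why more mutants are needed for higher accuracy.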

Journal Article•10.1002/PROT.20359•

NdPASA: a novel pairwise protein sequence alignment algorithm that incorporates neighbor-dependent amino acid propensities.

Junwen Wang¹, Jin An Feng¹•Institutions (1)

Temple University¹

22 Dec 2004-Proteins

TL;DR: In this article, neighbor-dependent propensity of amino acids is used as a unique parameter for pairwise sequence alignment, which can be used to measure the preference of an amino acid pair adopting a particular secondary structure conformation.

Abstract: Sequence alignment has become one of the essential bioinformatics tools in biomedical research. Existing sequence alignment methods can produce reliable alignments for homologous proteins sharing a high percentage of sequence identity. The performance of these methods deteriorates sharply for the sequence pairs sharing less than 25% sequence identity. We report here a new method, NdPASA, for pairwise sequence alignment. This method employs neighbor-dependent propensities of amino acids as a unique parameter for alignment. The values of neighbor-dependent propensity measure the preference of an amino acid pair adopting a particular secondary structure conformation. NdPASA optimizes alignment by evaluating the likelihood of a residue pair in the query sequence matching against a corresponding residue pair adopting a particular secondary structure in the template sequence. Using superpositions of homologous proteins derived from the PSI-BLAST analysis and the Structural Classification of Proteins (SCOP) classification of a nonredundant Protein Data Bank (PDB) database as a gold standard, we show that NdPASA has improved pairwise alignment. Statistical analyses of the performance of NdPASA indicate that the introduction of sequence patterns of secondary structure derived from neighbor-dependent sequence analysis clearly improves alignment performance for sequence pairs sharing less than 20% sequence identity. For sequence pairs sharing 13-21% sequence identity, NdPASA improves the accuracy of alignment over the conventional global alignment (GA) algorithm using the BLOSUM62 by an average of 8.6%. NdPASA is most effective for aligning query sequences with template sequences whose structure is known. NdPASA can be accessed online at http://astro.temple.edu/feng/Servers/BioinformaticServers.htm.

Journal Article•10.1093/BIOINFORMATICS/BTH380•

The UniMarker (UM) method for synteny mapping of large genomes

Ben-Yang Liao, Yu-Jung Chang¹, Jan-Ming Ho¹, Ming-Jing Hwang•Institutions (1)

Academia Sinica¹

22 Nov 2004-Bioinformatics

TL;DR: A novel method is developed that does not require sequence alignment for synteny mapping of two large genomes, such as the human and mouse, and is very fast and able to map the two mammalian genomes in one day of computing time on a single Pentium IV personal computer.

Abstract: Motivation: Synteny mapping, or detecting regions that are orthologous between two genomes, is a key step in studies of comparative genomics. For completely sequenced genomes, this is increasingly accomplished by whole-genome sequence alignment. However, such methods are computationally expensive, especially for large genomes, and require rather complicated post-processing procedures to filter out non-orthologous sequence matches. Results: We have developed a novel method that does not require sequence alignment for synteny mapping of two large genomes, such as the human and mouse. In this method, the occurrence spectra of genome-wide unique 16mer sequences present in both the human and mouse genome are used to directly detect orthologous genomic segments. Being sequence alignment-free, the method is very fast and able to map the two mammalian genomes in one day of computing time on a single Pentium IV personal computer. The resulting human-mouse synteny map was shown to be in excellent agreement with those produced by the Mouse Genome Sequencing Consortium (MGSC) and by the Ensembl team; furthermore, the syntenic relationship of segments found only by our method was supported by BLASTZ sequence alignment. Availability: The source code of our method and the resulting human-mouse synteny maps have been placed at http://synteny.ibms.sinica.edu.tw/ for free access. Supplementary information: Seven supplementary figures can be found at the same website.
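The core notion of a genome-wide unique k-mer shared by two genomes can be sketched in a few lines of Python. This is a simplified illustration rather than the UniMarker implementation: it ignores reverse-complement strands, holds everything in memory, and stops at listing marker positions instead of analysing occurrence spectra. The function names `unique_kmers` and `shared_markers` and the toy sequences are assumptions.

```python
from collections import Counter


def unique_kmers(genome, k):
    """Map each k-mer that occurs exactly once in the genome to its position."""
    n = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(n))
    return {genome[i:i + k]: i for i in range(n) if counts[genome[i:i + k]] == 1}


def shared_markers(genome_a, genome_b, k=16):
    """Return sorted (position_in_a, position_in_b) pairs for k-mers that are
    unique within each genome and present in both."""
    unique_a = unique_kmers(genome_a, k)
    unique_b = unique_kmers(genome_b, k)
    common = unique_a.keys() & unique_b.keys()
    return sorted((unique_a[m], unique_b[m]) for m in common)


# Hypothetical toy sequences with k=4 so the example stays readable;
# the abstract uses 16-mers on whole mammalian genomes.
markers = shared_markers("ACGTTGCA", "GGACGTTG", k=4)
```

Runs of markers whose positions increase together in both genomes, as in the toy example, are the kind of signal that suggests a conserved segment without computing any alignment.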
