Multiple alignment-free sequence comparison
TL;DR: On real data all of the statistics perform similarly, but on simulated data the Shepp-type statistics are in some instances outperformed by the star-type statistics.
Abstract:
Motivation: Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, $C^{*}_{\ell}$ and $C^{S}_{\ell}$, extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and second, $\bar{C}^{*}_{2}$, $\bar{C}^{S}_{2}$ and $\bar{C}^{geo}_{2}$, averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences.
Results: Our investigation uses both simulated data and cis-regulatory module data, where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that although all of our statistics perform similarly on real data, on simulated data the Shepp-type statistics are in some instances outperformed by the star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics.
Availability: Our implementation of the five statistics is available as an R package named 'multiAlignFree' at http://www-rcf.usc.edu/~fsun/Programs/multiAlignFree/multiAlignFreemain.html.
Contact: reinert@stats.ox.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
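To make the idea of "averages of sums of pairwise comparison statistics" concrete, the following is a minimal Python sketch, not the authors' method and not the API of the 'multiAlignFree' R package. It assumes the simplest word-count statistic, the plain D2 inner product of k-tuple count vectors, in place of the normalized star-type and Shepp-type statistics studied in the paper. The function names `ktuple_counts`, `d2` and `average_pairwise_d2` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ktuple_counts(seq, k):
    """Count overlapping k-tuples (words of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(counts_a, counts_b):
    """Plain D2: inner product of two k-tuple count vectors.

    Assumption: this unnormalized statistic stands in for the
    star-type and Shepp-type statistics, which additionally centre
    and scale the counts under a background model.
    """
    return sum(n * counts_b[word] for word, n in counts_a.items())


def average_pairwise_d2(seqs, k):
    """Average the pairwise statistic over all unordered sequence pairs."""
    counts = [ktuple_counts(s, k) for s in seqs]
    pairs = list(combinations(counts, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(d2(a, b) for a, b in pairs) / len(pairs)
```

A set of sequences sharing many k-tuples yields a larger average than a set containing an unrelated sequence, which is the intuition behind measuring similarity within a set of sequences.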
Citations
Alignment-free sequence comparison: benefits, applications, and tools
TL;DR: This work provides a guide to the currently available alignment-free sequence analysis tools and addresses how these methods work, how they compare to alignment-based methods, and what their potential is for use in research.
539
Alignment-Free Sequence Analysis and Applications
Jie Ren, Xin Bai, Yang Young Lu, Kujin Tang, Ying Wang, Gesine Reinert, Fengzhu Sun
TL;DR: This article provides an updated review of word-count-based approaches for alignment-free sequence analysis, their applications, and related developments.
Molecular homology and multiple-sequence alignment: an analysis of concepts and practice
TL;DR: This work presents examples of molecular-data levels at which homology might be considered, and proposes terminology with which to better describe and discuss molecular homology at these levels, and sheds light on the multitude of automated procedures that have been created for multiple-sequence alignment.
Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data
Saulo Alves Aflitos, Edouard Severing, Gabino F. Sanchez-Perez, Sander Peters, Hans de Jong, Dick de Ridder
TL;DR: Cnidaria is a tool for clustering genomic and transcriptomic data with no limitation on genome size or phylogenetic distances, achieving 100% identification accuracy at supra-species level and 78% accuracy at species level.
On the comparison of regulatory sequences with multiple resolution Entropic Profiles.
Matteo Comin, Morris Antonello
TL;DR: An alignment-free statistic, called $EP^{*}_{2}$, is proposed that is based on multiple resolution patterns derived from the Entropic Profiles (EPs). It is highly successful in discriminating functionally related enhancers and, in almost all experiments, outperforms fixed-resolution methods.
References
A general method applicable to the search for similarities in the amino acid sequence of two proteins
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.
13.2K
Identification of common molecular subsequences.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
11.3K
ROCR: visualizing classifier performance in R
TL;DR: ROCR is a package for evaluating and visualizing the performance of scoring classifiers in the statistical language R. It features over 25 performance measures that can be freely combined to create two-dimensional performance curves.
3.3K
•Book
The Regulatory Genome: Gene Regulatory Networks In Development And Evolution
Eric H. Davidson
- 30 May 2006
TL;DR: The "Regulatory Genome" for Animal Development and Gene Regulatory Networks: The Roots of Causality and Diversity in Animal Evolution are presented.
1.1K
•Journal Article
ChIP-seq Identification of Weakly Conserved Heart Enhancers
TL;DR: This paper used ChIP-seq with the enhancer-associated protein p300 from mouse embryonic day 11.5 heart tissue to identify over three thousand candidate heart enhancers genome-wide.
407