BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments
TL;DR: Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially with alignments including distantly-related sequences.
Abstract: The quality of multiple sequence alignments plays an important role in the accuracy of phylogenetic inference. Removing ambiguously aligned regions, as well as other sources of bias such as highly variable (saturated) characters, has been shown to improve the overall performance of many phylogenetic reconstruction methods. A current scientific trend is to build phylogenetic trees from a large number of sequence datasets (semi-)automatically extracted from numerous complete genomes. Because these approaches do not allow precise manual curation of each dataset, there is a real need for efficient bioinformatic tools dedicated to this alignment character trimming step. Here we present BMGE (Block Mapping and Gathering with Entropy), new software designed to select regions in a multiple sequence alignment that are suited for phylogenetic inference. For each character, BMGE computes a score closely related to an entropy value. Calculation of these entropy-like scores is weighted with BLOSUM or PAM similarity matrices in order to distinguish between biologically expected and unexpected variability for each aligned character. Sets of contiguous characters with a score above a given threshold are considered unsuited for phylogenetic inference and are removed. Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially with alignments including distantly-related sequences. BMGE also implements trimming and recoding methods aimed at minimizing phylogeny reconstruction artefacts due to compositional heterogeneity. BMGE is able to perform biologically relevant trimming on a multiple alignment of DNA, codon or amino acid sequences. Java source code and executable are freely available at ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/.
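The trimming idea described in the abstract (score each aligned character, then drop runs of characters scoring above a threshold) can be illustrated with a minimal sketch. This is not BMGE's implementation: it uses plain normalized Shannon entropy rather than the similarity-matrix-weighted score the abstract describes, it ignores gaps when scoring, and the names `column_entropy` and `trim_alignment`, the `min_block` parameter, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter

GAP = "-"


def column_entropy(column, alphabet_size=4):
    """Normalized Shannon entropy (0 = conserved, 1 = maximally variable).

    Simplification: unweighted entropy over non-gap characters. BMGE's
    actual score is weighted by a BLOSUM or PAM similarity matrix.
    """
    residues = [c for c in column if c != GAP]
    if not residues:
        return 1.0  # assumption: an all-gap column is treated as uninformative
    total = len(residues)
    counts = Counter(residues)
    bits = sum((n / total) * math.log2(total / n) for n in counts.values())
    return bits / math.log2(alphabet_size)


def trim_alignment(seqs, threshold=0.5, min_block=1, alphabet_size=4):
    """Drop columns scoring above `threshold`.

    Runs of retained columns shorter than `min_block` are also dropped
    (a hypothetical parameter for this sketch).
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    keep = [
        column_entropy([s[i] for s in seqs], alphabet_size) <= threshold
        for i in range(length)
    ]
    kept = []
    i = 0
    while i < length:
        if not keep[i]:
            i += 1
            continue
        j = i
        while j < length and keep[j]:
            j += 1  # extend the current run of retained columns
        if j - i >= min_block:
            kept.extend(range(i, j))
        i = j
    return ["".join(s[k] for k in kept) for s in seqs]


if __name__ == "__main__":
    alignment = ["ACGTAC", "ACGAAC", "ACGCAC", "ACGGAC"]
    # The fourth column (T/A/C/G) is maximally variable and is removed.
    print(trim_alignment(alignment))  # ['ACGAC', 'ACGAC', 'ACGAC', 'ACGAC']
```

For amino acid alignments the same sketch would be called with `alphabet_size=20`; a faithful version would replace the raw frequencies with similarity-weighted ones.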
Citations
Physiological and evolutionary implications of tetrameric photosystem I in cyanobacteria.
Meng Li, Alexandra Calteau, Dmitry A. Semchonok, Thomas A. Witt, Jonathan Nguyen, Nathalie Sassoon, Egbert J. Boekema, Julian P. Whitelegge, Muriel Gugger, Barry D. Bruce
TL;DR: It is suggested that tetrameric PSI is an adaptation to high light intensity, and that change in PsaL leads to monomerization of trimeric PSI, supporting the hypothesis of tetrameric PSI being the evolutionary intermediate in the transition from cyanobacterial trimeric PSI to monomers in plants and algae.
Metagenomic insights into the effect of thermal hydrolysis pre-treatment on microbial community of an anaerobic digestion system
TL;DR: The effect of thermal hydrolysis pre-treatment (THP) on microbial diversity, interspecies interactions, and metabolism in anaerobic digestion (AD) systems was largely unknown; the authors show that THP sludge significantly reduced microbial diversity, shaped the microbial community structure, and resulted in more intense microbial interactions.
The Oxymonad Genome Displays Canonical Eukaryotic Complexity in the Absence of a Mitochondrion
Anna Karnkowska, Sebastian C. Treitli, Ondřej Brzoň, Lukáš Novák, Vojtěch Vacek, Petr Soukal, Lael D. Barlow, Emily K. Herman, Shweta V. Pipaliya, Tomáš Pánek, David Žihala, Romana Petrželková, Anzhelika Butenko, Laura Eme, Courtney W. Stairs, Andrew J. Roger, Marek Eliáš, Joel B. Dacks, Vladimír Hampl
TL;DR: An extensive analysis of the M. exilis genome finds that the genome structure and content is similar in complexity to other eukaryotes and less 'reduced' than genomes of some other protists from the Metamonada group to which it belongs.
Ancient origins of arthropod moulting pathway components.
TL;DR: Key elements of the arthropod moulting pathway evolved much earlier than previously thought and are present in non-moulting lophotrochozoans and deuterostomes and in the non-bilaterian ctenophore Mnemiopsis leidyi.
A Pan-Genomic Approach to Understand the Basis of Host Adaptation in Achromobacter.
Julie Jeukens, Luca Freschi, Antony T. Vincent, Jean-Guillaume Emond-Rheault, Irena Kukavica-Ibrulj, Steve J. Charette, Roger C. Levesque
TL;DR: The goals of this first genus-wide comparative genomics study were to clarify the taxonomy of this genus and identify genomic features associated with pathogenicity and host adaptation, and to contribute to the understanding of opportunistic pathogen evolution.