BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments
TL;DR: Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially with alignments including distantly-related sequences.
Abstract: The quality of multiple sequence alignments plays an important role in the accuracy of phylogenetic inference. Removing ambiguously aligned regions, as well as other sources of bias such as highly variable (saturated) characters, has been shown to improve the overall performance of many phylogenetic reconstruction methods. A current scientific trend is to build phylogenetic trees from a large number of sequence datasets (semi-)automatically extracted from numerous complete genomes. Because these approaches do not allow precise manual curation of each dataset, there is a real need for efficient bioinformatic tools dedicated to this alignment character trimming step. We present new software, named BMGE (Block Mapping and Gathering with Entropy), designed to select regions in a multiple sequence alignment that are suited for phylogenetic inference. For each character, BMGE computes a score closely related to an entropy value. The calculation of these entropy-like scores is weighted with BLOSUM or PAM similarity matrices in order to distinguish between biologically expected and unexpected variability for each aligned character. Sets of contiguous characters with a score above a given threshold are considered unsuited for phylogenetic inference and are removed. Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially with alignments including distantly-related sequences. BMGE also implements trimming and recoding methods aimed at minimizing phylogeny reconstruction artefacts due to compositional heterogeneity. BMGE is able to perform biologically relevant trimming on a multiple alignment of DNA, codon or amino acid sequences. Java source code and executable are freely available at ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/.
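The trimming idea described in the abstract (score each aligned character, then remove those scoring above a threshold) can be illustrated with a minimal Python sketch. This is not BMGE's implementation: it uses plain Shannon entropy without the BLOSUM/PAM similarity weighting, and the function names, the normalization by alphabet size, the gap handling and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropy(column, alphabet_size=4, gap="-"):
    """Shannon entropy of one alignment column, scaled to [0, 1].

    Assumption: gaps are ignored, and a column with no residues at all
    is given the maximal score so that it gets trimmed.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return 1.0
    n = len(residues)
    counts = Counter(residues)
    # H = sum over observed states of p * log2(1 / p)
    h = sum((k / n) * math.log2(n / k) for k in counts.values())
    return h / math.log2(alphabet_size)


def trim_alignment(seqs, threshold=0.5, alphabet_size=4):
    """Drop every column whose scaled entropy exceeds `threshold`.

    `threshold` is a hypothetical default, not a value from the paper.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    keep = [
        column_entropy([s[i] for s in seqs], alphabet_size) <= threshold
        for i in range(length)
    ]
    return ["".join(c for c, k in zip(s, keep) if k) for s in seqs]


if __name__ == "__main__":
    alignment = ["ACGTA", "ACGAA", "ACGCA", "ACGGA"]
    for row in trim_alignment(alignment):
        print(row)
```

In this toy DNA alignment only the fourth column varies (T, A, C, G), so it is the only one removed. A similarity-weighted score, as in BMGE, would additionally treat substitutions between biochemically similar residues as less disruptive than this unweighted version does.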
Citations
Globally distributed marine Gemmatimonadota have unique genomic potentials
Xianzhe Gong, Le Xu, Marguerite V. Langwig, Zhiyi Chen, Shujie Huang, Duo Zhao, Lei Su, Yan Zhang, Christopher Francis, Jihua Liu, Jiangtao Li, Brett J. Baker
TL;DR: Gemmatimonadota bacteria, globally distributed in marine environments, exhibit unique genomic potentials, but their metabolic capabilities and ecological roles remain poorly understood, warranting further research to elucidate their functions and significance.
The complete mitochondrial genome of Costapex baldwinae (Gastropoda: Neogastropoda: Turbinelloidea: Costellariidae) from the Caribbean Deep-Sea.
TL;DR: The complete mitochondrial genome sequence of Costapex baldwinae, a Caribbean representative of a predominantly Indo-Pacific genus of gastropods that occurs on sunken wood at bathyal depths, was reported in this article.
Genome Analysis of a Novel Clade b Betabaculovirus Isolated from the Legume Pest Matsumuraeses phaseoli (Lepidoptera: Tortricidae).
Ruihao Shu, Qian Meng, Lin Miao, Hongbin Liang, Jun Chen, Yuan Xu, Luqiang Cheng, Wenyi Jin, Qilian Qin, Huan Zhang
TL;DR: The complete genome of MaphGV, a novel granulovirus isolated from pathogenic M. phaseoli larvae that dwell in rolled leaves of Astragalus membranaceus, is reported and may be a potential biocontrol agent against the bean ravaging pest.
The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics
Luc Cornet, Benoit Durieu, F. Baert, Elizabet D’hooge, David Colignon, Loïc Meunier, Valeria Lupo, Ilse Cleenwerck, Heide-Marie Daniel, Leen Rigouts, Damien Sirjacobs, Stéphane Declerck, Peter Vandamme, Annick Wilmotte, Denis Baurain, Pierre Becker
TL;DR: The GEN-ERA toolbox can be used to infer completely reproducible comparative genomic and metabolic analyses on prokaryotes and small eukaryotes, and can be useful for other applications, as shown by a case study on Gloeobacterales.
A new method for rapid genome classification, clustering, visualization, and novel taxa discovery from metagenome
Zhong Wang, Harrison Ho, Rob Egan, Shijie Yao, Dongwan D. Kang, Jeff Froula, Volkan Sevim, Frederik Schulz, Jackie E. Shay, Derek N. Macklin, Kayla McCue, Rachel Orsini, Daniel Barich, Christopher J. Sedlacek, Wei Li, Rachael M. Morgan-Kiss, Tanja Woyke, Joan L. Slonczewski
TL;DR: An efficient software suite that estimates similarities between genomes based on their k-mer matches, and subsequently uses these similarities for classification, clustering, and visualization, and demonstrates that Genome Constellation can tackle the computational and algorithmic challenges in large-scale taxonomy analyses in metagenomics.