Knowledge-guided data mining on the standardized architecture of NRPS: subtypes, novel motifs, and sequence entanglements

doi:10.1101/2022.03.14.484258

Open Access • Posted Content

- 17 Mar 2022

TL;DR: This article proposes a motif-and-intermotif standardization of NRPS sequences to address the lack of a consistent standard for annotating NRPS domains and modules, a gap that has made data-driven discoveries challenging.

Abstract: Non-ribosomal peptide synthetases (NRPSs) are a diverse family of biosynthetic enzymes for the assembly of bioactive peptides. Despite advances in microbial sequencing, the lack of a consistent standard for annotating NRPS domains and modules has made data-driven discoveries challenging. To address this, we introduced a standardized architecture for NRPS, using known conserved motifs to partition typical domains. This motif-and-intermotif standardization allowed for systematic evaluations of sequence properties from a large number of NRPS pathways, resulting in the most comprehensive cross-kingdom C domain subtype classifications to date, as well as the discovery and experimental validation of novel conserved motifs with functional significance. Furthermore, our coevolution analysis revealed important barriers associated with reengineering NRPSs and uncovered the entanglement between phylogeny and substrate specificity in NRPS sequences. Our findings provide a comprehensive and statistically insightful analysis of NRPS sequences, opening avenues for future data-driven discoveries.

Author Summary: NRPS, a gigantic enzyme that produces diverse microbial secondary metabolites, provides a rich source of important medical products, including antibiotics. Despite the extensive knowledge gained about its structure and the large amount of sequencing data available, the frequent failure of NRPS reengineering in synthetic biology highlights the fact that much is still unknown. In this work, we applied existing knowledge to data mining of NRPS sequences, using well-known conserved motifs to partition NRPS sequences into motif-intermotif architectures. This standardization allows for integrating large amounts of sequences from different sources, providing a comprehensive overview of NRPSs across different kingdoms. Our findings included new C domain subtypes, novel conserved motifs with implications for structural flexibility, and insights into why NRPSs are so difficult to reengineer. To assist researchers in related fields, we constructed an online platform, “NRPS Motif Finder”, for parsing the motif-and-intermotif architecture and C domain subtype classification (http://www.bdainformatics.org/page?type=NRPSMotifFinder). We believe that this knowledge-guided approach not only advances our understanding of NRPSs but also provides a useful methodology for data mining in large-scale biological sequences.
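
The idea of partitioning a sequence into alternating motif and intermotif segments can be illustrated with a minimal Python sketch. This is not the authors' pipeline or the NRPS Motif Finder implementation: the motif names (`M1`, `M2`), the regular-expression patterns, the function name `partition_by_motifs`, and the toy sequence are all hypothetical placeholders chosen for illustration, and a real analysis would more plausibly rely on curated motif definitions and alignment-based detection.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified motif patterns (illustration only, not curated
# NRPS motif definitions). Order matters: motifs are searched left to right.
EXAMPLE_MOTIFS: List[Tuple[str, str]] = [
    ("M1", r"HH...DG"),
    ("M2", r"GG.S"),
]


def partition_by_motifs(
    seq: str, motifs: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Split `seq` into alternating intermotif and motif segments.

    Each motif is searched for downstream of the previous match. A motif
    that is not found is skipped, so its flanking regions merge into one
    intermotif segment. Concatenating the returned segments in order
    reproduces the input sequence.
    """
    segments: List[Tuple[str, str]] = []
    cursor = 0
    previous = "start"
    for name, pattern in motifs:
        match = re.compile(pattern).search(seq, cursor)
        if match is None:
            continue  # motif absent: leave the region for the next segment
        # Region between the previous landmark and this motif.
        segments.append((f"{previous}-{name}", seq[cursor:match.start()]))
        # The motif itself.
        segments.append((name, match.group()))
        cursor = match.end()
        previous = name
    # Trailing region after the last motif found.
    segments.append((f"{previous}-end", seq[cursor:]))
    return segments


# Toy sequence containing both placeholder motifs.
toy_sequence = "AAAHHKLMDGCCCGGTSWW"
toy_segments = partition_by_motifs(toy_sequence, EXAMPLE_MOTIFS)
```

Because every sequence is reduced to the same ordered list of named segments, corresponding segments from different sequences can be compared directly, which is the property the standardization is meant to provide.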


