Representative sequences

Papers

AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models.

Mihaly Varadi¹, Stephen Anyango¹, Mandar Deshpande¹, Sreenath Nair¹, Cindy Natassia¹, Galabina Yordanova¹, David Yu Yuan¹, Oana Stroe¹, Gemma Wood¹, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John M. Jumper, Ellen Clancy, Richard E. Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard J. Kleywegt¹, Ewan Birney¹, Demis Hassabis, Sameer Velankar¹

European Bioinformatics Institute¹

17 Nov 2021-Nucleic Acids Research

TL;DR: The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) is an openly accessible, extensive database of high-accuracy protein-structure predictions.

Abstract: The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) is an openly accessible, extensive database of high-accuracy protein-structure predictions. Powered by AlphaFold v2.0 of DeepMind, it has enabled an unprecedented expansion of the structural coverage of the known protein-sequence space. AlphaFold DB provides programmatic access to and interactive visualization of predicted atomic coordinates, per-residue and pairwise model-confidence estimates and predicted aligned errors. The initial release of AlphaFold DB contains over 360,000 predicted structures across 21 model-organism proteomes, which will soon be expanded to cover most of the (over 100 million) representative sequences from the UniRef90 data set.

5,715 citations
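
The per-residue confidence estimates mentioned in the abstract above can be read directly from a downloaded model. What follows is a minimal sketch, not an official client: it relies on the fact that AlphaFold DB PDB-format files store the per-residue confidence score (pLDDT) in the B-factor column, which the abstract itself does not state. The function name `plddt_per_residue` is hypothetical, and the column offsets follow the fixed-width PDB ATOM record layout.

```python
def plddt_per_residue(pdb_text):
    """Return {(chain, residue_number): pLDDT} from PDB-format text.

    Assumption: the file is an AlphaFold DB model, where the B-factor
    field (columns 61-66 of an ATOM record) holds the pLDDT score.
    Only C-alpha atoms are read, so each residue is counted once.
    """
    scores = {}
    for line in pdb_text.splitlines():
        # Atom name sits in columns 13-16 of the fixed-width record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]                # column 22: chain identifier
            resnum = int(line[22:26])       # columns 23-26: residue number
            scores[(chain, resnum)] = float(line[60:66])
    return scores
```

Reading only C-alpha atoms is a design choice for brevity; every atom of a residue carries the same value in these files, so any atom would do.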

Journal Article•10.1093/BIOINFORMATICS/17.3.282•

Clustering of highly homologous sequences to reduce the size of large protein databases

Weizhong Li¹, Lukasz Jaroszewski, Adam Godzik

San Diego Supercomputer Center¹

01 Mar 2001-Bioinformatics

TL;DR: A fast and flexible program for clustering large protein databases at different sequence identity levels takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC.

Abstract: Summary: We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560 000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches. Availability: The program is available from http://bioinformatics.burnham-inst.org/cd-hi

1,057 citations
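
The idea of clustering at a sequence identity level and keeping only representatives, as described in the abstract above, can be illustrated with a greedy incremental scheme: process sequences from longest to shortest, attach each one to the first existing representative it matches at or above the threshold, and otherwise make it a new representative. This is a toy sketch only. The `identity` function below is a deliberately simplified, ungapped stand-in and is an assumption of this example, not the comparison the published program performs; the names `identity` and `greedy_cluster` are hypothetical.

```python
def identity(a, b):
    """Toy identity: matching positions over the shorter sequence's length.

    Assumption: sequences are compared ungapped from position 0. A real
    tool would use an alignment-based measure instead.
    """
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0.0
    matches = sum(x == y for x, y in zip(shorter, longer))
    return matches / len(shorter)


def greedy_cluster(seqs, threshold=0.9):
    """Cluster {id: sequence} greedily; return {representative_id: [member ids]}."""
    clusters = {}
    # Longest first, so each representative is the longest member of its
    # cluster; ids break ties to keep the result deterministic.
    order = sorted(seqs, key=lambda k: (-len(seqs[k]), k))
    for sid in order:
        for rep in clusters:
            if identity(seqs[sid], seqs[rep]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters
```

The keys of the returned dictionary form the reduced, representatives-only database; raising the threshold yields more, smaller clusters.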

Journal Article•10.1016/J.JMB.2003.08.057•

How well is enzyme function conserved as a function of pairwise sequence identity

Weidong Tian¹, Jeffrey Skolnick¹

State University of New York System¹

31 Oct 2003-Journal of Molecular Biology

TL;DR: This work classifies enzyme families based not only on sequence similarity, but also on functional similarity, and shows that by employing an enzyme family-specific sequence identity threshold above which 100% functional conservation is required, functional inference of unknown sequences can be accurately accomplished.

444 citations

Journal Article•10.1128/MSPHEREDIRECT.00069-18•

A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection.

Norman Goodacre¹, Aisha A. AlJanahi¹, Subhiksha Nandakumar¹, Mike Mikailov², Arifa S. Khan¹

Center for Biologics Evaluation and Research¹, Center for Devices and Radiological Health²

25 Apr 2018

TL;DR: A new reference viral database (RVDB) is developed that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size.

Abstract: Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, and with overall reduced cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined, semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, sequence diversity, and a reduced cellular content; it includes full-length and partial sequences and endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2 with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publicly available for facilitating HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank.
IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection.

227 citations

Journal Article•10.1093/DATABASE/BAZ155•

PLANiTS: a curated sequence reference dataset for plant ITS DNA metabarcoding

Elisa Banchi¹, Claudio G. Ametrano¹, Samuele Greco¹, David Stanković¹, Lucia Muggia¹, Alberto Pallavicini¹,²,³

University of Trieste¹, National Institute of Oceanography, India², Stazione Zoologica Anton Dohrn³

01 Jan 2020-Database

TL;DR: In this paper, the authors developed a script called "better clustering for QIIME" (bc4q) to ensure that the representative sequences are chosen according to the composition of the cluster at a different taxonomic level.

Abstract: DNA metabarcoding combines DNA barcoding with high-throughput sequencing to identify different taxa within environmental communities. The ITS has already been proposed and widely used as universal barcode marker for plants, but a comprehensive, updated and accurate reference dataset of plant ITS sequences has not been available so far. Here, we constructed reference datasets of Viridiplantae ITS1, ITS2 and entire ITS sequences including both Chlorophyta and Streptophyta. The sequences were retrieved from NCBI, and the ITS region was extracted. The sequences underwent identity check to remove misidentified records and were clustered at 99% identity to reduce redundancy and computational effort. For this step, we developed a script called ‘better clustering for QIIME’ (bc4q) to ensure that the representative sequences are chosen according to the composition of the cluster at a different taxonomic level. The three datasets obtained with the bc4q script are PLANiTS1 (100 224 sequences), PLANiTS2 (96 771 sequences) and PLANiTS (97 550 sequences), and all are pre-formatted for QIIME, being this the most used bioinformatic pipeline for metabarcoding analysis. Being curated and updated reference databases, PLANiTS1, PLANiTS2 and PLANiTS are proposed as a reliable, pivotal first step for a general standardization of plant DNA metabarcoding studies. The bc4q script is presented as a new tool useful in each research dealing with sequences clustering. Database URL: https://github.com/apallavicini/bc4q; https://github.com/apallavicini/PLANiTS.

98 citations
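
The abstract above says that bc4q chooses representatives according to the taxonomic composition of each cluster, but it does not spell out the rule. One possible reading, offered purely as an illustrative assumption and not as the bc4q implementation, is to find the most common taxon in the cluster at a chosen rank and take the longest sequence carrying that taxon. The function name `pick_representative` and the tuple layout are hypothetical.

```python
from collections import Counter


def pick_representative(members):
    """Pick a representative id from [(seq_id, taxon, sequence), ...].

    Assumed rule (not taken from bc4q): use the majority taxon of the
    cluster, then the longest sequence with that taxon. Ties are broken
    alphabetically so the result is deterministic.
    """
    if not members:
        raise ValueError("cluster has no members")
    counts = Counter(taxon for _, taxon, _ in members)
    # Most frequent taxon; alphabetical order breaks ties.
    top_taxon = min(counts, key=lambda t: (-counts[t], t))
    candidates = [m for m in members if m[1] == top_taxon]
    # Longest candidate; smallest id breaks ties.
    best = min(candidates, key=lambda m: (-len(m[2]), m[0]))
    return best[0]
```

Under this rule a long sequence from a minority taxon cannot become the representative of a cluster dominated by another taxon, which is the kind of mislabeling a purely length-based choice could introduce.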

Performance Metrics

No. of papers in the topic in previous years
Year	Papers
2021	1
2020	3
2019	2
2018	4
2017	3
2016	2