Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences.

doi:10.7717/PEERJ.545

Open Access•Journal Article•10.7717/PEERJ.545


Jai Ram Rideout, +19 more

- 21 Aug 2014

- PeerJ

- Vol. 2, Iss: 1

552

TL;DR: A performance-optimized algorithm is presented for assigning marker gene sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. Comparisons on three well-studied datasets show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking.

Abstract: We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to "classic" open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, "classic" open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than one-fifth of the time that "classic" open-reference OTU picking would require. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by "classic" open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME's uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME's OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
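The abstract describes the algorithm only at a high level, so the following is a minimal Python sketch of the general idea, not the QIIME implementation. It assumes a four-step structure (closed-reference assignment, de novo clustering of a random subsample of the unmatched reads, a second closed-reference pass against the resulting centroids, and de novo clustering of whatever remains). The `identity` function is a toy stand-in for a real clustering tool such as uclust, and all function names, parameter names, and default values are hypothetical.

```python
import random


def identity(a, b):
    """Toy similarity: fraction of matching positions (stand-in for a real aligner)."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / n


def closed_reference(reads, refs, threshold):
    """Assign each read to the first reference it matches; return (otus, failures).

    Each read is handled independently, so this step could run in parallel.
    """
    otus, failures = {}, []
    for read in reads:
        for ref_id, ref_seq in refs.items():
            if identity(read, ref_seq) >= threshold:
                otus.setdefault(ref_id, []).append(read)
                break
        else:
            failures.append(read)
    return otus, failures


def de_novo(reads, threshold, prefix):
    """Greedy centroid clustering: inherently serial, since each read
    is compared against the centroids founded by earlier reads."""
    centroids, otus = {}, {}
    for read in reads:
        for cid, cseq in centroids.items():
            if identity(read, cseq) >= threshold:
                otus[cid].append(read)
                break
        else:
            cid = f"{prefix}{len(centroids)}"
            centroids[cid] = read
            otus[cid] = [read]
    return centroids, otus


def subsampled_open_reference(reads, refs, threshold=0.97, fraction=0.1, seed=0):
    """Sketch of the assumed four-step workflow; defaults are illustrative only."""
    rng = random.Random(seed)
    # Step 1: closed-reference pass against the existing reference set.
    otus, failures = closed_reference(reads, refs, threshold)
    # Step 2: cluster only a small random subsample of the failures de novo,
    # keeping the serial part of the workflow small.
    k = min(len(failures), max(1, round(fraction * len(failures)))) if failures else 0
    new_refs, _ = de_novo(rng.sample(failures, k), threshold, "New.")
    # Step 3: closed-reference pass of all failures against the new centroids.
    new_otus, still_failed = closed_reference(failures, new_refs, threshold)
    # Step 4: de novo clustering of the reads that still match nothing.
    _, leftover_otus = de_novo(still_failed, threshold, "Leftover.")
    otus.update(new_otus)
    otus.update(leftover_otus)
    return otus


if __name__ == "__main__":
    reads = ["AAAA", "AAAT", "CCCC", "CCCG", "GGGG"]
    result = subsampled_open_reference(reads, {"ref1": "AAAA"}, threshold=0.75)
    for otu_id, members in sorted(result.items()):
        print(otu_id, members)
```

In this sketch the only serial work is the de novo clustering in steps 2 and 4, which operates on a subsample and on the leftover reads respectively; the two closed-reference passes treat every read independently, which is the property the abstract credits for the improved parallelism. Every input read ends up in exactly one OTU, unlike a purely closed-reference approach.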


Citations

•Journal Article•10.1038/NATURE24644

Programmable base editing of A•T to G•C in genomic DNA without DNA cleavage

Nicole M. Gaudelli, +12 more

- 23 Nov 2017

- Nature

TL;DR: Adenine base editors (ABEs) that mediate the conversion of A•T to G•C in genomic DNA are described and a transfer RNA adenosine deaminase is evolved to operate on DNA when fused to a catalytically impaired CRISPR–Cas9 mutant.


4.6K

•Journal Article•10.1038/ISMEJ.2017.119

Exact sequence variants should replace operational taxonomic units in marker-gene data analysis.

Benjamin J. Callahan, +2 more

- 21 Jul 2017

- The ISME Journal

TL;DR: It is argued that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that ASVs should replace OTUs as the standard unit of marker-gene analysis and reporting.


2.8K

•Journal Article•10.1038/NATURE24621

A communal catalogue reveals Earth’s multiscale microbial diversity

Luke R. Thompson, +48 more

- 01 Nov 2017

- Nature

TL;DR: A meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project is presented, creating both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth’s microbial diversity.


2.2K

•Journal Article•10.1128/MSYSTEMS.00191-16

Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns

Amnon Amir, +10 more

- 21 Apr 2017

TL;DR: Deblur is a novel sub-operational-taxonomic-unit (sOTU) approach that uses error profiles to obtain putative error-free sequences from Illumina MiSeq and HiSeq sequencing platforms. It substantially reduces computational demands relative to similar sOTU methods and does so with similar or better sensitivity and specificity.


1.6K


References

•Journal Article•10.1038/NMETH.F.303

QIIME allows analysis of high-throughput community sequencing data.

J. Gregory Caporaso, +27 more

- 11 Apr 2010

- Nature Methods

TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.


34.1K

•Journal Article•10.1093/BIOINFORMATICS/BTQ461

Search and clustering orders of magnitude faster than BLAST

Robert C. Edgar

- 01 Oct 2010

- Bioinformatics

TL;DR: UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.


20.2K

•Journal Article

The Detection of Disease Clustering and a Generalized Regression Approach

Nathan Mantel

- 01 Feb 1967

- Cancer Research

TL;DR: The technic to be given below for imparting statistical validity to the procedures already in vogue can be viewed as a generalized form of regression with possible useful application to problems arising in quite different contexts.


12.4K

•Journal Article•10.1128/AEM.03006-05

Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB

Todd Z. DeSantis, +9 more

- 01 Jul 2006

- Applied and Environmental Microbiology

TL;DR: A 16S rRNA gene database (http://greengenes.lbl.gov) provides chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies.


10.7K

•Journal Article•10.1093/BIOINFORMATICS/BTL158

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Weizhong Li, +1 more

- 01 Jul 2006

- Bioinformatics

TL;DR: Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets.


10.7K