Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences.
Jai Ram Rideout, Yan He, Jose A. Navas-Molina, William A. Walters, Luke K. Ursell, Sean M. Gibbons, John Chase, Daniel McDonald, Antonio Gonzalez, Adam Robbins-Pianka, Jose C. Clemente, Jack A. Gilbert, Susan M. Huse, Hongwei Zhou, Rob Knight, J. Gregory Caporaso
TL;DR: A performance-optimized algorithm for assigning marker gene sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis is presented, and it is shown that subsampled open-reference OTU picking yields results that are highly correlated with those generated by "classic" open-reference OTU picking through comparisons on three well-studied datasets.
Abstract: We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to "classic" open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, "classic" open-reference OTU clustering is often faster). We illustrate this by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than one-fifth of the time that would be required for "classic" open-reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by "classic" open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME's uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME's OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
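The abstract names the approach without listing its steps. As a rough illustration only, the sketch below follows the four-step outline commonly attributed to this method: closed-reference picking, de novo clustering of a random subsample of the failures, closed-reference picking of all failures against the new centroids, and de novo clustering of whatever remains. This is not the QIIME implementation. The function names, the `fraction` and `seed` parameters, and the toy position-wise `identity` measure are all assumptions made for this example; QIIME uses uclust for the actual sequence comparisons.

```python
import random


def identity(a, b):
    """Toy similarity (assumption): fraction of matching positions."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest


def closed_reference(reads, refs, threshold):
    """Assign each read to the first reference it matches.

    Each read is handled independently, which is why this stage
    can be split across workers. Returns (clusters, failures).
    """
    clusters, failures = {}, []
    for read in reads:
        hit = next((r for r in refs if identity(read, r) >= threshold), None)
        if hit is None:
            failures.append(read)
        else:
            clusters.setdefault(hit, []).append(read)
    return clusters, failures


def de_novo(reads, threshold):
    """Greedy centroid clustering; inherently serial."""
    clusters = {}
    for read in reads:
        hit = next((c for c in clusters if identity(read, c) >= threshold), None)
        clusters.setdefault(read if hit is None else hit, []).append(read)
    return clusters


def subsampled_open_reference(reads, refs, threshold=0.97, fraction=0.1, seed=0):
    """Hypothetical sketch of the four steps; returns {centroid: [reads]}."""
    rng = random.Random(seed)
    # Step 1: closed-reference picking against the reference database.
    otus, failures = closed_reference(reads, refs, threshold)
    # Step 2: de novo cluster a random subsample of the failures;
    # the resulting centroids act as a new reference set.
    k = max(1, int(len(failures) * fraction)) if failures else 0
    new_refs = list(de_novo(rng.sample(failures, k), threshold))
    # Step 3: closed-reference picking of all failures against new_refs.
    new_otus, remaining = closed_reference(failures, new_refs, threshold)
    # Step 4: de novo cluster the reads that still have no match.
    leftover = de_novo(remaining, threshold)
    otus.update(new_otus)
    otus.update(leftover)
    return otus
```

In this sketch, only the small subsample in step 2 and the leftover reads in step 4 go through the serial de novo stage, while steps 1 and 3 treat each read independently. That is the property the abstract credits for the reduced runtime on very large data sets.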