Sequence clustering

Topic Tools

Papers published on a yearly basis

Papers

Journal Article•10.1093/BIOINFORMATICS/BTQ461•

Search and clustering orders of magnitude faster than BLAST

01 Oct 2010-Bioinformatics

TL;DR: UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.

Abstract: Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. Results: UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Availability: Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

20,273 citations
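
As a rough illustration of the cluster-assignment idea described in the UCLUST abstract above, the following Python sketch performs greedy centroid clustering. It is not the UCLUST algorithm: the position-by-position `identity` function is only a stand-in for the USEARCH search step, and the names `identity` and `greedy_cluster` and the toy reads are assumptions made for this example.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches, else open a new cluster."""
    clusters = []  # list of (centroid, members) pairs, in order of creation
    for s in seqs:
        for centroid, members in clusters:
            if identity(s, centroid) >= threshold:
                members.append(s)
                break
        else:
            # No existing centroid was similar enough: s becomes a new centroid.
            clusters.append((s, [s]))
    return clusters


# Toy input (hypothetical reads, not taken from any of the papers listed here).
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGTACGT"]
clusters = greedy_cluster(reads, threshold=0.85)
for centroid, members in clusters:
    print(centroid, members)
```

Because each sequence is compared only against centroids, the result depends on input order; a real tool would also replace the naive comparison with a fast database search.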

Journal Article•10.1093/BIOINFORMATICS/BTL158•

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Weizhong Li¹, Adam Godzik¹•Institutions (1)

Sanford-Burnham Institute for Medical Research¹

01 Jul 2006-Bioinformatics

TL;DR: Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets.

Abstract: Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282--283, Bioinformatics, 18, 77--82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to only protein sequences clustering, here we present several new programs using the same algorithm including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST. Availability: http://cd-hit.org Contact: [email protected]

10,768 citations

Journal Article•10.1111/2041-210X.12073•

Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data

Johan Bengtsson-Palme¹, Martin Ryberg², Martin Hartmann, Sara Branco³, Zheng Wang⁴, Anna Godhe¹, Pierre De Wit¹, Marisol Sánchez-García⁵, Ingo Ebersberger⁶, Filipe de Sousa¹, Anthony S. Amend, Ari Jumpponen⁷, Martin Unterseher⁸, Erik Kristiansson⁹, Kessy Abarenkov¹⁰, Yann J. K. Bertrand¹, Kemal Sanli¹, K. Martin Eriksson⁹, Unni Vik¹¹, Vilmar Veldre, R. Henrik Nilsson¹•Institutions (11)

University of Gothenburg¹, Uppsala University², University of California, Berkeley³, Yale University⁴, University of Tennessee⁵, Goethe University Frankfurt⁶, Kansas State University⁷, University of Greifswald⁸, Chalmers University of Technology⁹, American Museum of Natural History¹⁰, University of Oslo¹¹

01 Oct 2013-Methods in Ecology and Evolution

TL;DR: ITSx is introduced, a Perl‐based software tool to extract ITS1, 5.8S and ITS2 – as well as full‐length ITS sequences – from both Sanger and high‐throughput sequencing data sets, and is rich in features and written to be easily incorporated into automated sequence analysis pipelines.

Abstract: Summary 1. The nuclear ribosomal internal transcribed spacer (ITS) region is the primary choice for molecular identification of fungi. Its two highly variable spacers (ITS1 and ITS2) are usually species specific, whereas the intercalary 5.8S gene is highly conserved. For sequence clustering and BLAST searches, it is often advantageous to rely on either one of the variable spacers but not the conserved 5.8S gene. To identify and extract ITS1 and ITS2 from large taxonomic and environmental data sets is, however, often difficult, and many ITS sequences are incorrectly delimited in the public sequence databases. 2. We introduce ITSx, a Perl-based software tool to extract ITS1, 5.8S and ITS2 – as well as full-length ITS sequences – from both Sanger and high-throughput sequencing data sets. ITSx uses hidden Markov models computed from large alignments of a total of 20 groups of eukaryotes, including fungi, metazoans and plants, and the sequence extraction is based on the predicted positions of the ribosomal genes in the sequences. 3. ITSx has a very high proportion of true-positive extractions and a low proportion of false-positive extractions. Additionally, process parallelization permits expedient analyses of very large data sets, such as a one million sequence amplicon pyrosequencing data set. ITSx is rich in features and written to be easily incorporated into automated sequence analysis pipelines. 4. ITSx paves the way for more sensitive BLAST searches and sequence clustering operations for the ITS region in eukaryotes. The software also permits elimination of non-ITS sequences from any data set. This is particularly useful for amplicon-based next-generation sequencing data sets, where insidious non-target sequences are often found among the target sequences. Such non-target sequences are difficult to find by other means and would contribute noise to diversity estimates if left in the data set.

1,236 citations
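
The ITSx abstract above states that extraction is based on the predicted positions of the ribosomal genes in each sequence. The Python sketch below shows only that final slicing step under stated assumptions: the coordinates are hard-coded here (ITSx derives its positions from hidden Markov models), and `GenePositions`, `extract_its` and the toy sequence are hypothetical names and data for this example, not part of ITSx.

```python
from typing import NamedTuple


class GenePositions(NamedTuple):
    """Hypothetical 0-based, end-exclusive coordinates of the ribosomal genes."""
    ssu_end: int    # end of the small-subunit (18S) gene
    s58_start: int  # start of the 5.8S gene
    s58_end: int    # end of the 5.8S gene
    lsu_start: int  # start of the large-subunit (28S) gene


def extract_its(seq: str, pos: GenePositions) -> dict:
    """Slice ITS1, 5.8S, ITS2 and the full ITS region out of one sequence."""
    ordered = 0 <= pos.ssu_end <= pos.s58_start <= pos.s58_end <= pos.lsu_start <= len(seq)
    if not ordered:
        raise ValueError("gene positions are out of order or out of range")
    return {
        "ITS1": seq[pos.ssu_end:pos.s58_start],
        "5.8S": seq[pos.s58_start:pos.s58_end],
        "ITS2": seq[pos.s58_end:pos.lsu_start],
        "full": seq[pos.ssu_end:pos.lsu_start],
    }


# Toy sequence: the "ssss" and "llll" flanks stand in for the SSU and LSU genes.
toy = "ssss" + "AAAAA" + "ggg" + "CCCC" + "llll"
regions = extract_its(toy, GenePositions(ssu_end=4, s58_start=9, s58_end=12, lsu_start=16))
print(regions)
```

Clustering or searching with only `regions["ITS1"]` or `regions["ITS2"]` then leaves out the conserved 5.8S gene, which is the use case the abstract describes.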

Journal Article•10.1093/BIOINFORMATICS/BTT054•

RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next-generation sequence reads.

Petr Novák, Pavel Neumann, Jiří Pech, Jaroslav Steinhaisl, Jiří Macas

15 Mar 2013-Bioinformatics

TL;DR: RepeatExplorer is a collection of software tools for characterizing repetitive elements. It is accessible via a web interface and uses a graph-based sequence clustering algorithm to facilitate de novo repeat identification without the need for reference databases of known elements.

Abstract: Motivation: Repetitive DNA makes up large portions of plant and animal nuclear genomes, yet it remains the least characterized genome component in most species studied so far. Although the recent availability of high throughput sequencing data provides necessary resources for in-depth investigation of genomic repeats, its utility is hampered by the lack of specialized bioinformatics tools and appropriate computational resources that would enable large-scale repeat analysis to be run by biologically-oriented researchers. Results: Here we present RepeatExplorer, a collection of software tools for characterization of repetitive elements which is accessible via web interface. A key component of the server is the computational pipeline employing a graph-based sequence clustering algorithm to facilitate de novo repeat identification without the need for reference databases of known elements. Since the algorithm uses short sequences randomly sampled from the genome as input, it is ideal for analyzing next generation sequence reads. Additional tools are provided to aid in classification of identified repeats, investigate phylogenetic relationships of retroelements and perform comparative analysis of repeat composition between multiple species. The server allows to analyze several million sequence reads which typically results in identification of most high and medium copy repeats in higher plant genomes. Implementation and availability: RepeatExplorer was implemented within the Galaxy environment and set up on a public server at http://repeatexplorer.umbr.cas.cz/. Source code and instructions for local installation are available at http://w3lamc.umbr.cas.cz/lamc/resources.php.

694 citations
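
The RepeatExplorer abstract above says only that the pipeline uses a graph-based sequence clustering algorithm. As a simplified illustration of what graph-based clustering means, the Python sketch below treats reads as nodes and pairwise similarity hits as edges, then groups reads into connected components with a union-find structure. Connected components are the simplest possible choice and are an assumption of this example, not the method RepeatExplorer implements; the read IDs and hits are invented.

```python
def connected_components(nodes, edges):
    """Group nodes into clusters: two nodes share a cluster if a path of edges joins them."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    # Largest clusters first; high-copy repeats would form the biggest groups.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical read IDs and similarity hits (e.g. from an all-to-all comparison).
read_ids = ["r1", "r2", "r3", "r4", "r5", "r6"]
hits = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
repeat_clusters = connected_components(read_ids, hits)
print(repeat_clusters)
```

Reads sampled from a high-copy repeat overlap many other reads and so end up in large clusters, while single-copy reads such as `r6` remain on their own.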

Journal Article•10.1093/BIOINFORMATICS/18.1.77•

Tolerating some redundancy significantly speeds up clustering of large protein databases.

Weizhong Li¹, Lukasz Jaroszewski¹, Adam Godzik¹•Institutions (1)

Sanford-Burnham Institute for Medical Research¹

01 Jan 2002-Bioinformatics

TL;DR: This paper shows that tolerating some redundancy makes for more efficient use of short-word filters and increases the program's speed 100 times. The new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in approximately 5 days.

Abstract: Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI NonRedundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ∼1 h and at 75% identity in ∼1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however even faster clustering speed is needed because the size of protein databases are rapidly growing and many applications desire a lower attainable thresholds. Results: For our previous algorithm (CD-HI), we have employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program’s speed 100 times. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ∼5 days. Although some redundancy is present after clustering, our new program’s results only differ from our previous program’s by less than 0.4%. Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi Contact: liwz@burnham-inst.org; adam@burnhaminst.org

505 citations
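
The abstract above refers to short-word filters without spelling out the idea. A minimal Python sketch of the underlying counting argument follows, assuming an ungapped comparison of equal-length sequences: a sequence of length L contains L - k + 1 words of length k, and each mismatch can destroy at most k of them, so two sequences with at most m mismatches must share at least L - k + 1 - k·m words. Pairs that share fewer words can be rejected without computing an alignment. The function names and example sequences are hypothetical, and the thresholds used by the actual program are not given in the abstract.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def min_shared_kmers(length, k, identity):
    """Lower bound on shared k-mers for an ungapped match at the given identity."""
    # Small epsilon guards against floating-point error in length * (1 - identity).
    mismatches = math.floor(length * (1.0 - identity) + 1e-9)
    return max(0, length - k + 1 - k * mismatches)


def passes_filter(query, rep, k, identity):
    """True if query shares enough k-mers with rep to be worth aligning."""
    shared = sum((kmer_counts(query, k) & kmer_counts(rep, k)).values())
    return shared >= min_shared_kmers(len(query), k, identity)


representative = "ACGTACGTAC"
close_match = "ACGTTCGTAC"   # one mismatch against the representative
unrelated = "TTTTTTTTTT"

print(min_shared_kmers(10, 3, 0.9))
print(passes_filter(representative, close_match, k=3, identity=0.9))
print(passes_filter(representative, unrelated, k=3, identity=0.9))
```

The filter can only rule pairs out: a pair that passes still needs a full comparison to confirm that it meets the identity threshold.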

Year	Papers
2022	1
2021	23
2020	13
2019	20
2018	21
2017	18
