A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

doi:10.1093/BIOINFORMATICS/BTR011

Open Access • Journal Article

Guillaume Marçais, +1 more

- 01 Mar 2011

- Bioinformatics

- Vol. 27, Iss: 6, pp 764-770

TL;DR: This work proposes a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient, based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.

Abstract:

Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.

Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
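
To make the counting task in the abstract concrete, here is a minimal single-threaded sketch in Python. It is not Jellyfish's implementation: a plain `Counter` stands in for the multithreaded, lock-free hash table, and the names `BASE_CODE`, `count_kmers` and `decode_kmer` are hypothetical. The one assumption it illustrates is that each base can be packed into 2 bits, so a k-mer of up to 31 bases fits in 62 bits of a single 64-bit word. Skipping windows that contain a non-ACGT character is also a choice made for this sketch, not something the abstract specifies.

```python
from collections import Counter

# Assumed 2-bit code per base: 31 bases need 62 bits, so any k-mer
# with k <= 31 fits in one 64-bit machine word.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq: str, k: int) -> Counter:
    """Count every k-mer of `seq`, keyed by its 2-bit packed integer."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    counts = Counter()
    word = 0   # rolling packed k-mer
    valid = 0  # consecutive valid bases currently packed into `word`
    for ch in seq.upper():
        code = BASE_CODE.get(ch)
        if code is None:
            # Non-ACGT character (e.g. N): restart the window after it.
            word = 0
            valid = 0
            continue
        # Shift in the new base and drop the oldest one.
        word = ((word << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[word] += 1
    return counts


def decode_kmer(word: int, k: int) -> str:
    """Turn a packed integer back into its k-mer string."""
    return "".join("ACGT"[(word >> (2 * (k - 1 - i))) & 3] for i in range(k))
```

For example, `count_kmers("ACGTACGT", 3)` counts ACG and CGT twice each and GTA and TAC once each; `decode_kmer` recovers the strings from the packed keys. Because every key is a fixed-width integer, a table keyed this way can in principle be updated with atomic word-sized operations, which is the kind of property a lock-free design would rely on.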

Citations

•Journal Article•10.1038/s41597-023-02164-5

Chromosome-level genome assembly of bean flower thrips Megalurothrips usitatus (Thysanoptera: Thripidae)

Ling Ma, +8 more

- 03 May 2023

- Scientific Data

TL;DR: The authors generated a chromosome-level Megalurothrips usitatus genome using a combination of PacBio long-read and Hi-C technologies; the assembled genome was 238.14 Mb with a scaffold N50 of 13.85 Mb.

Journal Article•10.1016/J.SCIENTA.2015.12.025

The chromosomal distribution of repetitive DNA sequences in Chrysanthemum boreale revealed a characterization in its genome

Abigail Rubiato Cuyacot, +8 more

- 26 Jan 2016

- Scientia Horticulturae

TL;DR: FISH was performed on mitotic metaphase chromosomes of C. boreale, using well-identified repeats and C0t DNA as probes, to map the repetitive sequences of Chrysanthemum boreale.

•Journal Article•10.1093/jhered/esab077

OUP accepted manuscript

- 04 Jan 2022

- Journal of Heredity

TL;DR: The haplotype-aware draft genome assembly consisted of ~6.67 gigabases (Gb), close to the holoploid genome size estimate of 7.56 Gb (± 0.44 SD) determined by flow cytometry.

Journal Article•10.1016/j.molp.2023.08.013

Comparative and population genomics of buckwheat species reveal key determinants of flavor and fertility

Kaixuan Zhang, +25 more

- 01 Aug 2023

- Molecular Plant

TL;DR: The assembly of a chromosome-scale high-quality reference genome of F. esculentum var.

•Journal Article•10.7717/PEERJ.8111

From command-line bioinformatics to bioGUI

Markus Joppich, +1 more

- 21 Nov 2019

- PeerJ

TL;DR: The bioGUI framework addresses two main problems of using CL bioinformatics applications: first, many tools work on UNIX-systems only, while many scientists use Microsoft Windows, and second, scientists refrain from using CL tools which could well support them in their research.
