A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
Guillaume Marçais, Carl Kingsford
TL;DR: This work proposes a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient, based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.
Abstract: Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.
Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.
Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
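To make the idea in the abstract concrete, the following is a minimal sketch of lock-free k-mer counting, not Jellyfish's actual implementation. It assumes a fixed-size, open-addressing table with linear probing, 2-bit base encoding (A, C, G, T), and an all-ones sentinel for empty slots; the names `KmerCounter`, `add_sequence`, `pack`, and `count_parallel` are hypothetical. With 2 bits per base, a k-mer of up to 31 bases occupies at most 62 bits, so it fits in one 64-bit word and can be claimed with a single compare-and-swap.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Illustrative sketch only; class and function names are hypothetical.
class KmerCounter {
public:
    // capacity must be a power of two; k must be between 1 and 31.
    KmerCounter(std::size_t capacity, unsigned k)
        : k_(k), mask_(capacity - 1), keys_(capacity), counts_(capacity) {
        assert(k >= 1 && k <= 31);
        assert(capacity > 0 && (capacity & (capacity - 1)) == 0);
        for (auto& key : keys_) key.store(kEmpty, std::memory_order_relaxed);
        for (auto& c : counts_) c.store(0, std::memory_order_relaxed);
    }

    // Pack a k-mer string (A/C/G/T only, at most 31 bases) into 2 bits per base.
    static std::uint64_t pack(const std::string& kmer) {
        assert(kmer.size() <= 31);
        std::uint64_t bits = 0;
        for (char c : kmer) {
            int code = encode(c);
            assert(code >= 0);
            bits = (bits << 2) | static_cast<std::uint64_t>(code);
        }
        return bits;
    }

    // Increment the count of one packed k-mer. Returns false if the table is full.
    bool add(std::uint64_t kmer) {
        std::size_t slot = hash(kmer) & mask_;
        for (std::size_t probe = 0; probe <= mask_; ++probe) {
            std::uint64_t seen = keys_[slot].load(std::memory_order_acquire);
            if (seen == kEmpty) {
                // Try to claim the empty slot. If another thread wins the race,
                // 'seen' is updated to the key that thread stored.
                if (keys_[slot].compare_exchange_strong(
                        seen, kmer, std::memory_order_acq_rel,
                        std::memory_order_acquire))
                    seen = kmer;
            }
            if (seen == kmer) {
                counts_[slot].fetch_add(1, std::memory_order_relaxed);
                return true;
            }
            slot = (slot + 1) & mask_;  // linear probing
        }
        return false;
    }

    // Count every k-mer in a sequence; any non-ACGT character restarts the window.
    void add_sequence(const std::string& seq) {
        const std::uint64_t kmer_mask = (std::uint64_t{1} << (2 * k_)) - 1;
        std::uint64_t kmer = 0;
        unsigned valid = 0;
        for (char c : seq) {
            int code = encode(c);
            if (code < 0) { valid = 0; kmer = 0; continue; }
            kmer = ((kmer << 2) | static_cast<std::uint64_t>(code)) & kmer_mask;
            if (++valid >= k_) add(kmer);
        }
    }

    // Look up the count of a packed k-mer (0 if absent).
    std::uint64_t count(std::uint64_t kmer) const {
        std::size_t slot = hash(kmer) & mask_;
        for (std::size_t probe = 0; probe <= mask_; ++probe) {
            std::uint64_t seen = keys_[slot].load(std::memory_order_acquire);
            if (seen == kEmpty) return 0;
            if (seen == kmer) return counts_[slot].load(std::memory_order_relaxed);
            slot = (slot + 1) & mask_;
        }
        return 0;
    }

private:
    // All-ones never collides with a real key: k <= 31 uses at most 62 bits.
    static constexpr std::uint64_t kEmpty = ~std::uint64_t{0};

    static int encode(char c) {
        switch (c) {
            case 'A': return 0; case 'C': return 1;
            case 'G': return 2; case 'T': return 3;
            default:  return -1;
        }
    }
    static std::uint64_t hash(std::uint64_t x) {  // simple 64-bit mixer
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
        return x;
    }

    unsigned k_;
    std::size_t mask_;
    std::vector<std::atomic<std::uint64_t>> keys_;
    std::vector<std::atomic<std::uint64_t>> counts_;
};

// Split the reads across n_threads workers that all update the same table.
inline void count_parallel(KmerCounter& table,
                           const std::vector<std::string>& reads,
                           unsigned n_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t)
        workers.emplace_back([&table, &reads, t, n_threads] {
            for (std::size_t i = t; i < reads.size(); i += n_threads)
                table.add_sequence(reads[i]);
        });
    for (auto& w : workers) w.join();
}
```

A caller would construct a `KmerCounter` with a power-of-two capacity larger than the expected number of distinct k-mers, pass the reads to `count_parallel`, and then query individual k-mers with `count(KmerCounter::pack("ACG"))`. The sketch never resizes and stores keys and counts in separate full-width arrays, so it illustrates only the compare-and-swap insertion pattern, not the memory efficiency the paper reports.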
Citations
Chromosome-level genome assembly of bean flower thrips Megalurothrips usitatus (Thysanoptera: Thripidae)
Ling Ma, Qiao Yang Liu, Shujun Wei, Shanlin Liu, Lihua Tian, Fan Song, Yuange Duan, Wanzhi Cai, Hu Lin Li
TL;DR: The authors generated a chromosome-level Megalurothrips usitatus genome using a combination of PacBio long-read and Hi-C technologies; the assembled genome was 238.14 Mb with a scaffold N50 of 13.85 Mb.
The chromosomal distribution of repetitive DNA sequences in Chrysanthemum boreale revealed a characterization in its genome
Abigail Rubiato Cuyacot, So Youn Won, Sang Kun Park, Seong-Han Sohn, Jungho Lee, Jung Sun Kim, Hyun Hee Kim, Ki-Byung Lim, Yoon-Jung Hwang
TL;DR: FISH was performed on mitotic metaphase chromosomes of C. boreale, using well-identified repeats and C0t DNA as probes, to map the repetitive sequences of Chrysanthemum boreale.
OUP accepted manuscript
04 Jan 2022
TL;DR: The haplotype-aware draft genome assembly consisted of ~6.67 gigabases (Gb), close to the holoploid genome size estimate of 7.56 Gb (± 0.44 SD) determined by flow cytometry.
Comparative and population genomics of buckwheat species reveal key determinants of flavor and fertility
Kaixuan Zhang, Yuqi He, Xiang Lu, Yaliang Shi, Hui Zhao, Xiaobo Li, Jinlong Li, Yang Liu, Yinan Ouyang, Yu Tang, Xue Ren, Xuemei Zhang, Wei-Fei Yang, Zhaoxia Sun, Chunhua Zhang, Muriel Quinet, Zlata Luthar, Mateja Germ, Ivan Kreft, Dagmar Janovská, Vladimir Meglič, Barbara Pipan, Milen I. Georgiev, Bruno Studer, Mark A. Chapman, Meiliang Zhou
TL;DR: The assembly of a chromosome-scale high-quality reference genome of F. esculentum var.
From command-line bioinformatics to bioGUI
Markus Joppich, Ralf Zimmer
TL;DR: The bioGUI framework addresses two main problems of using CL bioinformatics applications: first, many tools work on UNIX-systems only, while many scientists use Microsoft Windows, and second, scientists refrain from using CL tools which could well support them in their research.