A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
Guillaume Marçais, Carl Kingsford
TL;DR: This work proposes a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient, based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.
Abstract: Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.
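To make the task concrete, here is a minimal sketch of k-mer counting as the abstract defines it: tallying every overlapping substring of length k. It is a naive, single-threaded baseline built on Python's standard `collections.Counter`, not the Jellyfish algorithm; the function name `count_kmers` is an illustrative assumption.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping length-k substring of `sequence`.

    Naive baseline for illustration only; `count_kmers` is a hypothetical
    name, not part of Jellyfish.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A string of length n has n - k + 1 windows of length k (none if n < k).
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


if __name__ == "__main__":
    for kmer, count in sorted(count_kmers("ACGTACGT", 4).items()):
        print(kmer, count)
```

Storing each k-mer as a separate string key is the kind of memory cost that becomes a problem at the sequencing depths the abstract mentions.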
Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.
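The abstract does not say how k-mers are stored. One common technique, assumed here rather than taken from the paper, is to pack each base into 2 bits so that a k-mer of up to 31 bases fits in a single 64-bit integer key. The sketch below shows that packing with a rolling update; all names (`pack_kmers`, `unpack_kmer`, `count_packed`) are hypothetical. It uses an ordinary Python dictionary and a single thread, so it does not demonstrate the lock-free hash table itself.

```python
# Assumed 2-bit encoding of DNA bases; not taken from the Jellyfish source.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack_kmers(sequence: str, k: int):
    """Yield each k-mer of `sequence` as an integer, 2 bits per base."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2k bits
    value = 0
    valid = 0  # consecutive valid bases seen since the last reset
    for base in sequence.upper():
        code = ENCODE.get(base)
        if code is None:
            # Any non-ACGT character (e.g. N) breaks the current window.
            value, valid = 0, 0
            continue
        # Rolling update: shift in the new base and drop the oldest one.
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield value


def unpack_kmer(value: int, k: int) -> str:
    """Decode a packed k-mer back into its string form."""
    return "".join(DECODE[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def count_packed(sequence: str, k: int) -> dict:
    """Count k-mers keyed by packed integer (plain dict, single-threaded)."""
    counts = {}
    for key in pack_kmers(sequence, k):
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    for key, count in sorted(count_packed("ACGTACGT", 4).items()):
        print(unpack_kmer(key, 4), count)
```

With integer keys of fixed width, each window costs a shift, an OR and a mask instead of a new string, and the key can be compared or hashed as one machine word in a lower-level language.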
Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations
CRISPR-Sirius: RNA scaffolds for signal amplification in genome imaging
Hanhui Ma, Li-Chun Tu, Ardalan Naseri, Yu-Chieh Chung, David Grunwald, Shaojie Zhang, Thoru Pederson
TL;DR: CRISPR-Sirius is introduced, based on octet arrays of aptamers conferring both enhanced guide RNA stability and brightness, and initial biological applications of this platform are provided.
Assembly of the draft genome of buckwheat and its applications in identifying agronomically useful genes.
Yasuo Yasui, Hideki Hirakawa, Mariko Ueno, Katsuhiro Matsui, Tomoyuki Katsube-Tanaka, Soo Jung Yang, Jotaro Aii, Shingo Sato, Masashi Mori
TL;DR: The database and draft genome sequence provide a valuable resource that can be used in efforts to develop buckwheat cultivars with superior agronomic traits and are used as a reference sequence for NGS-based markers.
Gene family expansions and contractions are associated with host range in plant pathogens of the genus Colletotrichum
Riccardo Baroncelli, Daniel Buchvaldt Amby, Antonio Zapparata, Sabrina Sarrocco, Giovanni Vannacci, Gaétan Le Floch, Richard J. Harrison, Eric B. Holub, Serenella A. Sukno, Surapareddy Sreenivasaprasad, Michael R. Thon
TL;DR: This study illustrates the plasticity of Colletotrichum genomes, and shows that major changes in host range are associated with relatively recent changes in gene content.
The biosynthetic pathway of the nonsugar, high-intensity sweetener mogroside V from Siraitia grosvenorii.
Maxim Itkin, Rachel Davidovich-Rikanati, Shahar Cohen, Vitaly Portnoy, Adi Doron-Faigenboim, Elad Oren, Shiri Freilich, Galil Tzuri, Nadine Baranes, Shmuel Shen, Marina Petreikov, Rotem Sertchook, Shifra Ben-Dor, Hugo E. Gottlieb, Alvaro G. Hernandez, David R. Nelson, Harry S. Paris, Yaakov Tadmor, Yosef Burger, Efraim Lewinsohn, Nurit Katzir, Arthur A. Schaffer
TL;DR: A comparison of the genomic organization and expression patterns of these Siraitia genes with the orthologs of other Cucurbitaceae implicates a strikingly coordinated expression of the pathway in the evolution of this species-specific and valuable metabolic pathway.
Red clover (Trifolium pratense L.) draft genome provides a platform for trait improvement
Jose J De Vega, Sarah Ayling, Matthew J. Hegarty, Dave Kudrna, Jose Luis Goicoechea, Åshild Ergon, Odd Arne Rognli, Charlotte Jones, Martin T. Swain, René Geurts, Chunting Lang, Klaus F. X. Mayer, Stephan Rössner, Steven Yates, K. J. Webb, Iain Donnison, Giles E. D. Oldroyd, Rod A. Wing, Mario Caccamo, Wayne Powell, Michael T. Abberton, Leif Skøt
TL;DR: Genotyping by sequencing of a synthetic population of 86 genotypes show that the number of markers required for genomics-based breeding approaches is tractable, making red clover a suitable candidate for association studies and genomic selection.