A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers

doi:10.1101/2022.11.20.517297

Open Access • Posted Content • 10.1101/2022.11.20.517297

Gagandeep Singh, +7 more

- 06 Nov 2022

- bioRxiv

9

TL;DR: This paper proposes RUBICON, a framework for developing hardware-optimized basecallers. It includes a quantization-aware basecalling neural architecture search (QABAS) that finds the best bit-width precision for each neural network layer.

Abstract: Nanopore sequencing is a widely used high-throughput genome sequencing technology that can sequence long fragments of a genome. Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The performance of basecalling has critical implications for all later steps in genome analysis. Many researchers adopt complex deep learning-based models from the speech recognition domain to perform basecalling without considering the compute demands of such models, which leads to slow, inefficient, and memory-hungry basecallers. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. However, developing a very fast basecaller that can provide high accuracy requires a deep understanding of genome sequencing, machine learning, and hardware design. Our goal is to develop a comprehensive framework for creating deep learning-based basecallers that provide high efficiency and performance. We introduce RUBICON, a framework to develop hardware-optimized basecallers. RUBICON consists of two novel machine-learning techniques that are specifically designed for basecalling. First, we introduce the first quantization-aware basecalling neural architecture search (QABAS) framework to specialize the basecalling neural network architecture for a given hardware acceleration platform while jointly exploring and finding the best bit-width precision for each neural network layer. Second, we develop SkipClip, the first technique to remove the skip connections present in modern basecallers to greatly reduce resource and storage requirements without any loss in basecalling accuracy. We demonstrate the benefits of RUBICON by developing RUBICALL, the first hardware-optimized basecaller that performs fast and accurate basecalling.
Our experimental results on state-of-the-art computing systems show that RUBICALL is a fast, memory-efficient, and hardware-friendly basecaller. Compared to the fastest state-of-the-art basecaller, RUBICALL provides a 3.96× speedup with 2.97% higher accuracy. Compared to an expert-designed basecaller, RUBICALL provides a 141.15× speedup without losing accuracy while also achieving a 6.88× and 2.94× reduction in neural network model size and the number of parameters, respectively. We show that RUBICON helps researchers develop hardware-optimized basecallers that are superior to expert-designed models and can inspire independent future ideas.
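To make the idea of a per-layer bit-width concrete, the following is a minimal sketch of symmetric uniform quantization applied with a different precision to each layer, together with the resulting storage cost. It is not the QABAS search or the RUBICALL model: the function names, the two-layer `layers` dictionary, and the chosen bit-widths are all hypothetical and exist only for illustration.

```python
from typing import Dict, List


def fake_quantize(weights: List[float], bits: int) -> List[float]:
    """Round weights onto a signed `bits`-bit uniform grid, then rescale to floats."""
    if bits < 2:
        raise ValueError("bits must be >= 2 for a signed grid")
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    qmax = 2 ** (bits - 1) - 1           # largest integer level, e.g. 127 for 8 bits
    scale = max_abs / qmax               # width of one quantization step
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def model_size_bits(layers: Dict[str, List[float]], bit_widths: Dict[str, int]) -> int:
    """Total weight storage when each layer is stored at its own bit-width."""
    return sum(len(w) * bit_widths[name] for name, w in layers.items())


# Hypothetical two-layer model; weights and bit-widths are illustrative assumptions.
layers = {"conv1": [0.5, -1.0, 0.25, 0.75], "conv2": [0.1, -0.2, 0.3, -0.4]}
bit_widths = {"conv1": 8, "conv2": 4}

quantized = {name: fake_quantize(w, bit_widths[name]) for name, w in layers.items()}
mixed_bits = model_size_bits(layers, bit_widths)                   # 4*8 + 4*4
fp32_bits = model_size_bits(layers, {name: 32 for name in layers})  # 8*32
```

In this toy setting the mixed-precision model needs 48 bits of weight storage against 256 bits at 32-bit precision, and every quantized weight stays within half a quantization step of its original value. A search such as the one the abstract describes would have to weigh that rounding error against the storage and compute savings layer by layer.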

Citations

•Journal Article•10.1093/bioinformatics/btad272

RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes

Can Fırtına, +6 more

- 22 Jan 2023

- Bioinformatics

TL;DR: The authors propose RawHash, a hash-based similarity search for real-time analysis of raw nanopore signals for large genomes, which matches signals by hash value regardless of slight variations in those signals.

22

•Proceedings Article•10.1145/3577193.3593719

SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation

Gagandeep Singh, +8 more

- 06 Mar 2023

TL;DR: SPARTA accelerates the horizontal diffusion weather stencil by designing the first scaled-out spatial accelerator for it, using the MLIR (Multi-Level Intermediate Representation) compiler framework.

9

•Posted Content•10.1101/2022.12.09.519749

TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

Meryem Banu Cavlak, +8 more

- 09 Dec 2022

- bioRxiv

TL;DR: TargetCall is the first fast and widely applicable pre-basecalling filter. It discards reads that will not match the target reference (i.e., off-target reads) prior to basecalling, eliminating the wasted computation in basecalling.

6

•Journal Article•10.48550/arxiv.2310.04366

Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors

Taha Shahroodi, +8 more

- 06 Oct 2023

- arXiv.org

TL;DR: This paper proposes Swordfish, a novel hardware/software co-design framework for evaluating deep neural network-based basecalling on computation-in-memory with non-ideal memristors. It leverages various hardware/software co-design solutions to mitigate the basecalling accuracy loss due to such non-idealities.

5

•Journal Article•10.1186/s13059-024-03181-2

RUBICON: a framework for designing efficient deep learning-based genomic basecallers

Gagandeep Singh, +7 more

- 06 Nov 2022

- Genome Biology

TL;DR: The authors present RUBICON, a framework to develop efficient hardware-optimized basecallers, and use it to develop RUBICALL, the first hardware-optimized mixed-precision basecaller, which performs efficient basecalling and outperforms state-of-the-art basecallers.

5

References

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

138.5K
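The "adaptive estimates of lower-order moments" in the summary above can be illustrated with a minimal sketch of one Adam update on a plain list of parameters. The function name `adam_step` and the toy objective f(x) = x² are assumptions made for illustration. The default hyperparameters are those suggested in the Adam paper.

```python
import math
from typing import List, Tuple


def adam_step(
    params: List[float],
    grads: List[float],
    m: List[float],
    v: List[float],
    t: int,
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float], List[float]]:
    """Apply one Adam update; `t` is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment (mean) estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for zero init
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Toy usage: take 50 steps on f(x) = x^2, whose gradient is 2x.
x, m_state, v_state = [1.0], [0.0], [0.0]
for step in range(1, 51):
    grad = [2.0 * x[0]]
    x, m_state, v_state = adam_step(x, grad, m_state, v_state, step, lr=0.01)
```

Because the update divides the mean gradient by the root of the mean squared gradient, the first step has a magnitude close to `lr` regardless of how large the raw gradient is.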

•Journal Article•10.1093/BIOINFORMATICS/BTP352

The Sequence Alignment/Map format and SAMtools

Heng Li, +8 more

- 01 Aug 2009

- Bioinformatics

TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as an indexer, a variant caller, and an alignment viewer, and thus provides universal tools for processing read alignments.

60.7K

•Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, +1 more

- 06 Jul 2015

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

43.7K

•Posted Content

Distilling the Knowledge in a Neural Network

Geoffrey E. Hinton, +2 more

- 09 Mar 2015

- arXiv: Machine Learning

TL;DR: This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.

21.2K
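The core mechanism behind distilling a large model into a small one can be sketched as a cross-entropy between temperature-softened teacher and student outputs. This is a minimal illustration, not the full training recipe of the distillation paper: the function names, the example logits, and the temperature of 2.0 are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                             # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """Cross-entropy of the student's soft predictions against the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Hypothetical logits over four classes (for example, the four nucleotide bases).
teacher = [4.0, 1.0, 0.5, 0.2]
good_student = [3.8, 1.1, 0.4, 0.3]
poor_student = [0.2, 0.5, 1.0, 4.0]

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

A student whose logits track the teacher's gets a lower loss than one that ranks the classes in the opposite order, which is the signal used to transfer the teacher's behavior to the smaller model.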

•Journal Article•10.1093/BIOINFORMATICS/BTY191

Minimap2: pairwise alignment for nucleotide sequences

Heng Li

- 15 Sep 2018

- Bioinformatics

TL;DR: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy and ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.

11.9K