Journal Article • DOI: 10.1109/TCSII.2020.3038897
A Memory-Efficient CNN Accelerator Using Segmented Logarithmic Quantization and Multi-Cluster Architecture
13
TL;DR: A segmented logarithmic (SegLog) quantization method is exploited to mitigate the on-chip memory and bandwidth requirements, thus accommodating more processing elements (PEs) in a given chip area to organize a reconfigurable multi-cluster architecture.
Abstract: This brief presents a memory-efficient CNN accelerator design for resource-constrained devices in Internet of Things (IoT) and autonomous systems. A segmented logarithmic (SegLog) quantization method is exploited to reduce the on-chip memory and bandwidth requirements, so that more processing elements (PEs) fit in a given chip area and can be organized into a reconfigurable multi-cluster architecture. The evaluation results show that SegLog quantization achieves $6.4\times$ model compression with less than 2.5% accuracy loss on various CNNs. An ASIC implementation configured with 168 PEs is validated in a 40-nm CMOS process, with a reported energy efficiency of 2.54 TOPs/W and a chip area of 0.8 mm$^2$. The accelerator has also been implemented on an FPGA configured with 1512 PEs and 468 kB of on-chip memory, achieving a memory efficiency of 1.29 GOPs/kB. Compared with state-of-the-art accelerators, our ASIC implementation improves area efficiency and arithmetic intensity by $1.94\times$ and $5.62\times$, respectively, while the FPGA implementation improves memory efficiency by $2.34\times$.
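The abstract does not spell out how SegLog segments the logarithmic range, so the following is only a minimal sketch of plain power-of-two (logarithmic) weight quantization, the general family that SegLog belongs to. It is not the authors' method. The function names, the 5-bit code width, the exponent range, and the choice to reserve one code for zero are all assumptions made for illustration. The compression helper simply divides bit widths; that 32/5 equals the reported 6.4 is suggestive, but the source does not state the bit widths used.

```python
import math


def log2_quantize(w, bits=5, max_exp=0):
    """Round w to sign(w) * 2**e with an integer exponent e.

    Assumed layout (hypothetical, not from the paper): 1 sign bit plus
    (bits - 1) exponent bits, with the lowest exponent code reserved
    to represent zero.
    """
    levels = 2 ** (bits - 1)           # number of exponent codes
    min_exp = max_exp - (levels - 2)   # one code is kept for zero
    if w == 0.0:
        return 0.0
    e = round(math.log2(abs(w)))       # nearest integer exponent
    if e < min_exp:                    # too small to represent: flush to zero
        return 0.0
    e = min(e, max_exp)                # clip large magnitudes
    return math.copysign(2.0 ** e, w)


def compression_ratio(baseline_bits, quantized_bits):
    """Ratio of storage per weight before and after quantization."""
    return baseline_bits / quantized_bits


if __name__ == "__main__":
    for w in (0.3, -0.7, 3.0, 1e-6, 0.0):
        print(w, "->", log2_quantize(w))
    # Assumes a 32-bit baseline and 5-bit codes; both are illustrative.
    print("ratio:", compression_ratio(32, 5))
```

Because every quantized weight is a signed power of two, multiplying an activation by it reduces to a bit shift and a sign flip in hardware, which is the usual motivation for logarithmic encodings in accelerators.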
Citations
An FPGA-Based Transformer Accelerator Using Output Block Stationary Dataflow for Object Recognition Applications
TL;DR: In this article, a transformer accelerator with an output block stationary (OBS) dataflow is proposed to minimize repeated memory access through block-level and vector-level broadcasting while preserving a high digital signal processor (DSP) utilization rate, leading to higher energy efficiency.
10
Hardware-Friendly Logarithmic Quantization with Mixed-Precision for MobileNetV2
13 Jun 2022
TL;DR: In this paper, the authors propose a novel logarithmic weight quantization that considers the characteristics of MobileNetV2, and a mixed-precision quantization that minimizes accuracy loss by training the distribution range using a trainable parameter.
9
FxP-QNet: A Post-Training Quantizer for the Design of Mixed Low-Precision DNNs With Dynamic Fixed-Point Representation
01 Jan 2022
TL;DR: FxP-QNet employs post-training self-distillation and network prediction error statistics to optimize the quantization of floating-point values into fixed-point numbers, and gradually adapts the quantization level for each data structure of each layer based on the trade-off between network accuracy and low-precision requirements.
Energy-Efficient High-Speed ASIC Implementation of Convolutional Neural Network Using Novel Reduced Critical-Path Design
01 Jan 2022
TL;DR: In this paper, a bit-level-multiply-accumulator (BLMAC) with a modified Booth encoder and a Wallace reduction tree is proposed to reduce the critical path of the overall architecture.
6
Energy-Efficient High-Speed ASIC Implementation of Convolutional Neural Network Using Novel Reduced Critical-Path Design
TL;DR: This paper proposes a hardware-efficient, high-speed convolution block for ASIC implementation of the CNN algorithm using a novel bit-level-multiply-accumulator (BLMAC) with a modified Booth encoder and a Wallace reduction tree.
5
References
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
TL;DR: Eyeriss v2 is a DNN accelerator architecture designed for running compact and sparse DNNs; it can process sparse data directly in the compressed domain for both weights and activations, and is therefore able to improve both processing speed and energy efficiency with sparse models.
876
•Posted Content
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
TL;DR: Eyeriss v2, a DNN accelerator architecture designed for running compact and sparse DNNs, is presented, which introduces a highly flexible on-chip network that can adapt to the different amounts of data reuse and bandwidth requirements of different data types, which improves the utilization of the computation resources.
628
•Proceedings Article
Post training 4-bit quantization of convolutional networks for rapid-deployment
Ron Banner, Yury Nahshan, Daniel Soudry +2 more
- 01 Jan 2019
TL;DR: This paper introduces the first practical 4-bit post-training quantization approach: it does not involve training the quantized model (fine-tuning), nor does it require the availability of the full dataset, and it achieves accuracy that is just a few percent less than the state-of-the-art baseline across a wide range of convolutional models.
LogNet: Energy-efficient neural networks using logarithmic computation
Edward H. Lee, Daisuke Miyashita, Elaina Chai, Boris Murmann, S. Simon Wong +4 more
- 05 Mar 2017
TL;DR: This work explores how logarithmic encoding of non-uniformly distributed weights and activations is preferred over linear encoding at resolutions of 4 bits and less; it enables networks to achieve higher classification accuracies than fixed-point at low resolutions and eliminates bulky digital multipliers.
196
•Posted Content
Post-training 4-bit quantization of convolution networks for rapid-deployment
TL;DR: This article proposed a 4-bit post-training quantization approach, which does not require training the quantized model (fine-tuning), nor does it require the availability of the full dataset.
185