DeLTA: GPU Performance Model for Deep Learning Applications with In-Depth Memory System Traffic Analysis
Sangkug Lym, Donghyuk Lee, Mike O'Connor, Niladrish Chatterjee, Mattan Erez
- 24 Mar 2019
- pp 293-303
TL;DR: DeLTA is presented as the first analytical model that accurately estimates the traffic at each level of the GPU memory hierarchy while accounting for the complex reuse patterns of a parallel convolution algorithm.
Abstract: Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. Convolution layers in particular account for the majority of CNN training execution time, and GPUs are commonly used to accelerate them. Optimizing GPU designs for efficient CNN training requires accurate modeling of how performance improves as compute and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each level of the GPU memory hierarchy while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust across different CNNs and GPU architectures. We then show how the model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.
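The point about balancing compute and memory scaling can be illustrated with a much simpler roofline-style bound. The sketch below is not DeLTA's model: it assumes only compulsory DRAM traffic (each tensor moved once), stride-1 convolution with an output of the same spatial size as the input, and a single memory level, whereas DeLTA estimates traffic at every hierarchy level with reuse taken into account. All function names and hardware numbers are hypothetical.

```python
def conv_flops(n, c, h, w, k, r, s):
    """FLOPs for a stride-1 convolution whose output is n x k x h x w.

    Counts 2 FLOPs per multiply-accumulate (assumption of this sketch).
    """
    return 2 * n * k * h * w * c * r * s


def conv_min_bytes(n, c, h, w, k, r, s, elem_bytes=4):
    """Compulsory DRAM traffic: read input and filters once, write output once."""
    inputs = n * c * h * w
    filters = k * c * r * s
    outputs = n * k * h * w
    return elem_bytes * (inputs + filters + outputs)


def layer_time(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline lower bound on execution time in seconds."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bw
    return max(compute_time, memory_time)


if __name__ == "__main__":
    # Hypothetical layer: batch 32, 64 -> 64 channels, 56x56 maps, 3x3 filters.
    shape = dict(n=32, c=64, h=56, w=56, k=64, r=3, s=3)
    f = conv_flops(**shape)
    b = conv_min_bytes(**shape)

    base_flops, base_bw = 10e12, 500e9  # hypothetical GPU: 10 TFLOP/s, 500 GB/s
    base = layer_time(f, b, base_flops, base_bw)
    for scale in (1, 2, 4, 8):
        t = layer_time(f, b, scale * base_flops, base_bw)
        print(f"compute x{scale}: speedup {base / t:.2f}")
```

Under these assumptions, scaling only the compute throughput stops paying off once the layer becomes memory-bound, which is why a model of memory traffic is needed to decide how much bandwidth to add alongside compute.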
Citations
A Review on Prognostics Methods for Engineering Systems
Jian Guo, Zhaojun Li, Meiyan Li
TL;DR: The reviewed papers are classified into three major areas based on whether the physics of failure knowledge is incorporated for prognostics, i.e., the data-driven, physics-based, and hybrid prognostic methods.
Buddy compression: enabling larger memory for deep learning and HPC workloads on GPUs
Esha Choukse, Michael J. Sullivan, Mike O'Connor, Mattan Erez, Jeff Pool, David Nellans, Stephen W. Keckler
- 30 May 2020
TL;DR: Buddy Compression is an architecture that uses compression to take advantage of a larger buddy memory in the host or in disaggregated memory, effectively increasing the memory capacity of the GPU.
Training Energy-Efficient Deep Spiking Neural Networks with Single-Spike Hybrid Input Encoding
Gourav Datta, Souvik Kundu, Peter A. Beerel
- 18 Jul 2021
TL;DR: Proposes a hybrid encoding scheme for low-latency, energy-efficient SNNs in which the analog pixel values of an image are applied directly during the first timestep and a novel variant of spike temporal coding is used during subsequent timesteps.
•Posted Content
FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads.
Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin
TL;DR: This work proposes FusionStitching, a deep learning compiler that fuses memory-intensive operators with varied data dependencies and non-homogeneous parallelism into large GPU kernels, automatically reducing global memory accesses and operation scheduling overhead. It tunes the stitching scheme just-in-time using a domain-specific cost model.
Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores
Hyeonjin Kim, Sungwoo Ahn, Yunho Oh, Bogil Kim, Won Woo Ro, William J. Song
- 01 Oct 2020
TL;DR: Duplo is a GPU architecture that minimizes redundant memory accesses of convolutions in deep neural networks (DNNs). It leverages compile-time information and microarchitectural support to detect and eliminate redundant memory accesses that repeatedly load duplicates of data in the workspace matrix.