DeLTA: GPU Performance Model for Deep Learning Applications with In-Depth Memory System Traffic Analysis

doi:10.1109/ISPASS.2019.00041

Open Access•Proceedings Article•10.1109/ISPASS.2019.00041

Sangkug Lym, +4 more

- 24 Mar 2019

- pp 293-303

34

TL;DR: DeLTA is presented, the first analytical model that accurately estimates the traffic at each GPU memory hierarchy level, while accounting for the complex reuse patterns of a parallel convolution algorithm.

Citations

•Journal Article•10.1109/TR.2019.2957965

A Review on Prognostics Methods for Engineering Systems

Jian Guo, +2 more

- 01 Sep 2020

- IEEE Transactions on Reliability

TL;DR: The reviewed papers are classified into three major areas based on whether the physics of failure knowledge is incorporated for prognostics, i.e., the data-driven, physics-based, and hybrid prognostic methods.

139

•Proceedings Article•10.1109/ISCA45697.2020.00080

Buddy compression: enabling larger memory for deep learning and HPC workloads on GPUs

Esha Choukse, +6 more

- 30 May 2020

TL;DR: Buddy Compression is an architecture that uses compression to utilize a larger buddy-memory from the host or disaggregated memory, effectively increasing the memory capacity of the GPU.

39

•Proceedings Article•10.1109/IJCNN52387.2021.9534306

Training Energy-Efficient Deep Spiking Neural Networks with Single-Spike Hybrid Input Encoding

Gourav Datta, +2 more

- 18 Jul 2021

TL;DR: In this article, a hybrid encoding scheme was proposed for low-latency energy-efficient SNNs, where the analog pixel values of an image were directly applied during the first timestep and a novel variant of spike temporal coding was used during subsequent timesteps.

38

•Posted Content

FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads.

Zhen Zheng, +8 more

- 23 Sep 2020

- arXiv: Distributed, Parallel, and Cluste...

TL;DR: This work proposes FusionStitching, a deep learning compiler capable of fusing memory-intensive operators, with varied data dependencies and non-homogeneous parallelism, into large GPU kernels to automatically reduce global memory access and operation scheduling overhead. It tunes the optimal stitching scheme just-in-time and efficiently with a domain-specific cost model.

29

•Proceedings Article•10.1109/MICRO50266.2020.00065

Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores

Hyeonjin Kim, +5 more

- 01 Oct 2020

TL;DR: A GPU architecture named Duplo is introduced that minimizes redundant memory accesses of convolutions in deep neural networks (DNNs) by leveraging compile-time information and microarchitectural support to detect and eliminate redundant memory accesses that repeatedly load duplicates of data in the workspace matrix.

21

References

•Proceedings Article•10.1109/CVPR.2016.90

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 27 Jun 2016

TL;DR: This work proposes a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place in the ILSVRC 2015 classification task.

198.7K

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 04 Sep 2014

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

102.6K

•Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

- 03 Dec 2012

TL;DR: A deep convolutional neural network achieved state-of-the-art image classification performance; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

88.4K

•Journal Article•10.1038/NATURE14539

Deep learning

Yann LeCun, +4 more

- 28 May 2015

- Nature

TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.

67K

•Journal Article•10.1109/TPAMI.2016.2577031

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, +3 more

- 01 Jun 2017

- IEEE Transactions on Pattern Analysis an...

TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

64.4K