NVIDIA Tensor Core Programmability, Performance & Precision
Stefano Markidis,Steven W. D. Chien,Erwin Laure,Ivy Bo Peng,Jeffrey S. Vetter +4 more
- 21 May 2018
- pp 522-531
355
TL;DR: In this article, the performance of Tensor Cores on the NVIDIA Tesla V100 accelerator has been investigated and the precision loss due to matrix multiplication with half precision input has been reduced.
read more
Abstract: The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called Tensor Core that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using of NVIDIA Tensor Cores.
read more
Chat with Paper
AI Agents for this Paper
Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps
Citations
Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey
Giang Nguyen,Stefan Dlugolinsky,Martin Bobák,Viet Tran,Álvaro López García,Ignacio Heredia,Peter Malik,Ladislav Hluchý +7 more
TL;DR: This survey presents a recent time-slide comprehensive overview with comparisons as well as trends in development and usage of cutting-edge Artificial Intelligence software that is capable of scaling computation effectively and efficiently in the era of Big Data.
759
Federated Learning in Edge Computing: A Systematic Survey
Haftay Gebreslasie Abreha,Mohammad Hayajneh,Mohamed Adel Serhani +2 more
TL;DR: A systematic survey of the literature on the implementation of FL in EC environments with a taxonomy to identify advanced solutions and other open problems is provided to help researchers better understand the connection between FL and EC enabling technologies and concepts.
A Siamese CNN for Image Steganalysis
TL;DR: This paper proposes an end-to-end, deep learning, novel solution for distinguishing steganography images from normal images that provides satisfying performance and adopts a Siamese, CNN-based architecture.
188
Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning
Weile Jia,Han Wang,Mohan Chen,Denghui Lu,Lin Lin,Roberto Car,Weinan E,Linfeng Zhang +7 more
- 01 Nov 2020
TL;DR: Deep Potential Molecular Dynamics (DPMD) as mentioned in this paper can simulate more than 1 nanosecond-long trajectory of over 100 million atoms per day using a highly optimized code (GPU DeePMD-kit) on the Summit supercomputer.
173
TResNet: High Performance GPU-Dedicated Architecture
Tal Ridnik,Hussam Lawen,Asaf Noy,Emanuel Ben,Baruch Gilad Sharir,Itamar Friedman +5 more
- 01 Jan 2021
TL;DR: TResNet as discussed by the authors introduces a series of architecture modifications that aim to boost neural networks' accuracy, while retaining their GPU training and inference efficiency, and achieves state-of-the-art results on a multi-label classification task, and perform well on object detection.
References
•Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky,Ilya Sutskever,Geoffrey E. Hinton +2 more
- 03 Dec 2012
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
•Posted Content
TensorFlow: A system for large-scale machine learning
Martín Abadi,Paul Barham,Jianmin Chen,Zhifeng Chen,Andy Davis,Jeffrey Dean,Matthieu Devin,Sanjay Ghemawat,Geoffrey Irving,Michael Isard,Manjunath Kudlur,Josh Levenberg,Rajat Monga,Sherry Moore,Derek G. Murray,Benoit Steiner,Paul A. Tucker,Vijay K. Vasudevan,Pete Warden,Martin Wicke,Yuan Yu,Xiaoqiang Zheng +21 more
TL;DR: The TensorFlow dataflow model is described and the compelling performance that Tensor Flow achieves for several real-world applications is demonstrated.
A million spiking-neuron integrated circuit with a scalable communication network and interface
Paul A. Merolla,John V. Arthur,Rodrigo Alvarez-Icaza,Andrew S. Cassidy,Jun Sawada,Filipp Akopyan,Bryan L. Jackson,Nabil Imam,Chen Guo,Yutaka Nakamura,Bernard Brezzo,Ivan Vo,Steven K. Esser,Rathinakumar Appuswamy,Brian Taba,Arnon Amir,Myron D. Flickner,William P. Risk,Rajit Manohar,Dharmendra S. Modha +19 more
TL;DR: Inspired by the brain’s structure, an efficient, scalable, and flexible non–von Neumann architecture is developed that leverages contemporary silicon technology and is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification.
4.2K
•Posted Content
In-Datacenter Performance Analysis of a Tensor Processing Unit
Norman P. Jouppi,Cliff Young,Nishant Patil,David A. Patterson,Gaurav Agrawal,Raminder Bajwa,Sarah Bates,Suresh Bhatia,Nan Boden,Albert T. Borchers,Rick Boyle,Pierre-luc Cantin,Clifford Chao,Christopher Aaron Clark,Jeremy Coriell,Michael J. Daley,Matt Dau,Jeffrey Dean,Ben Gelb,Tara Vazir Ghaemmaghami,Rajendra Gottipati,William John Gulland,Robert Hagmann,C. Richard Ho,Doug Hogberg,John Hu,Robert Hundt,D. Hurt,Julian Ibarz,Aaron Jaffey,Alek Jaworski,Alexander Kaplan,Khaitan Harshit,Andy Koch,Naveen Kumar,Steve Lacy,James Laudon,James Law,Diemthu Le,Chris Leary,Zhuyuan Liu,Kyle Lucke,Alan Lundin,Gordon MacKean,Adriana Maggiore,Maire Mahony,Kieran Miller,Rahul Nagarajan,Ravi Narayanaswami,Ray Ni,Kathy Nix,Thomas Norrie,Mark Omernick,Narayana Penukonda,Andrew Everett Phelps,Jonathan Ross,Matt Ross,Amir Salek,Emad Samadiani,Chris Severn,Gregory Sizikov,Matthew Snelham,Jed Souter,Dan Steinberg,Andy Swing,Mercedes Tan,Gregory Michael Thorson,Bo Tian,Horia Toma,Erick Tuttle,Vijay K. Vasudevan,Richard Walter,Walter Wang,Eric Wilcox,Doe Hyun Yoon +74 more
TL;DR: This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU)-deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the samedatacenters.
In-Datacenter Performance Analysis of a Tensor Processing Unit
Norman P. Jouppi,Cliff Young,Nishant Patil,David A. Patterson,Gaurav Agrawal,Raminder Bajwa,Sarah Bates,Suresh Bhatia,Nan Boden,Albert T. Borchers,Rick Boyle,Pierre-luc Cantin,Clifford Chao,Christopher Aaron Clark,Jeremy Coriell,Michael J. Daley,Matt Dau,Jeffrey Dean,Ben Gelb,Tara Vazir Ghaemmaghami,Rajendra Gottipati,William John Gulland,Robert Hagmann,C. Richard Ho,Doug Hogberg,John Hu,Robert Hundt,D. Hurt,Julian Ibarz,Aaron Jaffey,Alek Jaworski,Alexander Kaplan,Khaitan Harshit,Daniel Killebrew,Andy Koch,Naveen Kumar,Steve Lacy,James Laudon,James Law,Diemthu Le,Chris Leary,Zhuyuan Liu,Kyle Lucke,Alan Lundin,Gordon MacKean,Adriana Maggiore,Maire Mahony,Kieran Miller,Rahul Nagarajan,Ravi Narayanaswami,Ray Ni,Kathy Nix,Thomas Norrie,Mark Omernick,Narayana Penukonda,Andrew Everett Phelps,Jonathan Ross,Matt Ross,Amir Salek,Emad Samadiani,Chris Severn,Gregory Sizikov,Matthew Snelham,Jed Souter,Dan Steinberg,Andy Swing,Mercedes Tan,Gregory Michael Thorson,Bo Tian,Horia Toma,Erick Tuttle,Vijay K. Vasudevan,Richard Walter,Walter Wang,Eric Wilcox,Doe Hyun Yoon +75 more
- 24 Jun 2017
TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).
Related Papers (5)
Kaiming He,Xiangyu Zhang,Shaoqing Ren,Jian Sun +3 more
- 27 Jun 2016
Karen Simonyan,Andrew Zisserman +1 more
- 04 Sep 2014
[...]