NVIDIA Tensor Core Programmability, Performance & Precision

doi:10.1109/IPDPSW.2018.00091

Open Access•Proceedings Article•10.1109/IPDPSW.2018.00091

Stefano Markidis, +4 more

- 21 May 2018

- pp 522-531

355

TL;DR: This paper investigates how to program Tensor Cores on the NVIDIA Tesla V100 accelerator, the performance they deliver, and the precision loss caused by matrix multiplication with half-precision input, which can be considerably reduced at the cost of increased computation.

Abstract: The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflop/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflop/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflop/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
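To make the precision trade-off in the abstract concrete, the following is a minimal CPU-side sketch, not the authors' code and not run on a GPU. It emulates mixed precision (half-precision inputs, single-precision accumulation) with Python's `struct` module and then applies one possible refinement: each input matrix is split into a half-precision part plus a half-precision residual, and the extra residual products are added back. The splitting scheme and all function names (`to_half`, `matmul_mixed`, `matmul_refined`, `demo`) are assumptions made for illustration; the abstract states only that the loss can be reduced at the cost of more computation.

```python
import math
import random
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 half-precision (binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round x to the nearest IEEE 754 single-precision (binary32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_mixed(a, b):
    """Dot product of half-precision inputs, accumulated in single precision."""
    acc = 0.0
    for x, y in zip(a, b):
        # x * y is exact: two 11-bit significands fit easily in a double.
        # Rounding the running sum to single approximates an FP32 accumulator.
        acc = to_single(acc + x * y)
    return acc


def matmul_mixed(A, B):
    """Multiply matrices whose entries are already half-precision values."""
    cols = list(zip(*B))
    return [[dot_mixed(row, col) for col in cols] for row in A]


def split(M):
    """Split M into a half-precision part and a half-precision residual."""
    hi = [[to_half(x) for x in row] for row in M]
    lo = [[to_half(x - h) for x, h in zip(row, hrow)]
          for row, hrow in zip(M, hi)]
    return hi, lo


def matmul_refined(A, B):
    """Assumed refinement: four mixed-precision products instead of one."""
    A_hi, A_lo = split(A)
    B_hi, B_lo = split(B)
    # Smallest terms first; A_hi * B_hi is the plain mixed-precision product.
    terms = [matmul_mixed(P, Q) for P in (A_lo, A_hi) for Q in (B_lo, B_hi)]
    rows, cols = len(A), len(B[0])
    return [[to_single(sum(t[i][j] for t in terms)) for j in range(cols)]
            for i in range(rows)]


def max_error(C, A, B):
    """Largest absolute deviation of C from a double-precision reference."""
    cols = list(zip(*B))
    return max(
        abs(C[i][j] - math.fsum(x * y for x, y in zip(A[i], cols[j])))
        for i in range(len(A))
        for j in range(len(cols))
    )


def demo(n=16, seed=0):
    """Return (plain mixed-precision error, refined error) for random inputs."""
    rng = random.Random(seed)
    A = [[to_single(rng.uniform(-1, 1)) for _ in range(n)] for _ in range(n)]
    B = [[to_single(rng.uniform(-1, 1)) for _ in range(n)] for _ in range(n)]
    A_half, _ = split(A)
    B_half, _ = split(B)
    err_mixed = max_error(matmul_mixed(A_half, B_half), A, B)
    err_refined = max_error(matmul_refined(A, B), A, B)
    return err_mixed, err_refined


if __name__ == "__main__":
    err_mixed, err_refined = demo()
    print(f"max error, half inputs only:     {err_mixed:.3e}")
    print(f"max error, with residual terms:  {err_refined:.3e}")
```

In this sketch the refined product costs four matrix multiplications instead of one, which illustrates the "increased computation" side of the trade-off; dropping the residual-times-residual term would reduce that to three at a small cost in accuracy.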

Citations

•Journal Article•10.1007/S10462-018-09679-Z

Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey

Giang Nguyen, +7 more

- 19 Jan 2019

- Artificial Intelligence Review

TL;DR: This survey presents a recent time-slide comprehensive overview with comparisons as well as trends in development and usage of cutting-edge Artificial Intelligence software that is capable of scaling computation effectively and efficiently in the era of Big Data.

759

•Journal Article•10.3390/s22020450

Federated Learning in Edge Computing: A Systematic Survey

Haftay Gebreslasie Abreha, +2 more

- 01 Jan 2022

- Sensors

TL;DR: A systematic survey of the literature on the implementation of FL in EC environments with a taxonomy to identify advanced solutions and other open problems is provided to help researchers better understand the connection between FL and EC enabling technologies and concepts.

202

•Journal Article•10.1109/TIFS.2020.3013204

A Siamese CNN for Image Steganalysis

Weike You, +2 more

- 01 Jan 2021

- IEEE Transactions on Information Forensi...

TL;DR: This paper proposes an end-to-end, deep learning, novel solution for distinguishing steganography images from normal images that provides satisfying performance and adopts a Siamese, CNN-based architecture.

188

•Proceedings Article•10.1109/SC41405.2020.00009

Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning

Weile Jia, +7 more

- 01 Nov 2020

TL;DR: Using a highly optimized code (GPU DeePMD-kit) on the Summit supercomputer, Deep Potential Molecular Dynamics (DPMD) can simulate a trajectory of more than 1 nanosecond per day for a system of over 100 million atoms.

173

•Proceedings Article•10.1109/WACV48630.2021.00144

TResNet: High Performance GPU-Dedicated Architecture

Tal Ridnik, +5 more

- 01 Jan 2021

TL;DR: TResNet introduces a series of architecture modifications that aim to boost neural networks' accuracy while retaining their GPU training and inference efficiency; it achieves state-of-the-art results on a multi-label classification task and performs well on object detection.

170
