NVIDIA Tensor Core Programmability, Performance & Precision

doi:10.1109/IPDPSW.2018.00091

Open Access•Proceedings Article

Stefano Markidis, +4 more

- 11 Mar 2018

- arXiv: Distributed, Parallel, and Cluste...

146

TL;DR: In this article, the authors investigate the precision loss due to matrix multiplication with half-precision input and show that this loss can be considerably reduced at the cost of increased computation.

Abstract: The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the "Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
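The trade-off in the abstract between precision and extra computation can be illustrated with a small software emulation. The sketch below is not the authors' code and does not run on Tensor Cores: it assumes that "mixed precision" means half-precision inputs with single-precision accumulation, and it assumes a residual-splitting scheme (the rounding error of each input is stored in half precision and multiplied separately), which this page does not spell out. All function names are hypothetical.

```python
import random
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 (half precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 binary32 (single precision) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def matmul_reference(a, b):
    """Double-precision reference product used to measure errors."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matmul_mixed(a, b):
    """Emulated mixed precision: half inputs, single-precision accumulation."""
    result = []
    for row in a:
        out_row = []
        for j in range(len(b[0])):
            acc = 0.0
            for p in range(len(b)):
                # Inputs are rounded to half; the running sum is kept in single.
                acc = to_single(acc + to_half(row[p]) * to_half(b[p][j]))
            out_row.append(acc)
        result.append(out_row)
    return result


def half_residual(a):
    """Rounding error of each entry, itself stored in half precision."""
    return [[to_half(x - to_half(x)) for x in row] for row in a]


def matmul_refined(a, b):
    """Assumed refinement: A*B ~ Ah*Bh + Ra*Bh + Ah*Rb (three products)."""
    base = matmul_mixed(a, b)
    corr_a = matmul_mixed(half_residual(a), b)  # Ra * Bh
    corr_b = matmul_mixed(a, half_residual(b))  # Ah * Rb
    return [[to_single(base[i][j] + corr_a[i][j] + corr_b[i][j])
             for j in range(len(base[0]))] for i in range(len(base))]


def max_abs_error(c, ref):
    """Largest absolute entrywise difference between two matrices."""
    return max(abs(x - y) for rc, rr in zip(c, ref) for x, y in zip(rc, rr))


def compare_errors(n: int = 8, seed: int = 0):
    """Return (plain mixed-precision error, refined error) for random n x n inputs."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    ref = matmul_reference(a, b)
    return (max_abs_error(matmul_mixed(a, b), ref),
            max_abs_error(matmul_refined(a, b), ref))
```

Calling `compare_errors()` returns the maximum absolute error of the plain and the refined product against a double-precision reference. Under these assumptions the refined version performs three emulated mixed-precision multiplications instead of one, which is the kind of extra computation the abstract refers to.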

Citations

•Journal Article•10.1007/S10462-018-09679-Z

Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey

Giang Nguyen, +7 more

- 19 Jan 2019

- Artificial Intelligence Review

TL;DR: This survey presents a recent time-slide comprehensive overview with comparisons as well as trends in development and usage of cutting-edge Artificial Intelligence software that is capable of scaling computation effectively and efficiently in the era of Big Data.

759

•Posted Content

MLPerf Training Benchmark.

Peter Mattson, +36 more

- 02 Oct 2019

- arXiv: Learning

TL;DR: MLPerf as discussed by the authors is an ML benchmark that overcomes three unique benchmarking challenges absent from other domains: optimizations that improve training throughput can increase the time to solution, training is stochastic and time-to-solution exhibits high variance.

274

•Journal Article•10.3390/s22020450

Federated Learning in Edge Computing: A Systematic Survey

Haftay Gebreslasie Abreha, +2 more

- 01 Jan 2022

- Sensors

TL;DR: A systematic survey of the literature on the implementation of FL in EC environments with a taxonomy to identify advanced solutions and other open problems is provided to help researchers better understand the connection between FL and EC enabling technologies and concepts.

202

•Proceedings Article•10.1145/3357384.3358045

AIBox: CTR Prediction Model Training on a Single Node

Weijie Zhao, +5 more

- 03 Nov 2019

TL;DR: AIBox is presented, a centralized system to train CTR models with tens-of-terabytes-scale parameters by employing solid-state drives (SSDs) and GPUs, and a bi-level cache management system over SSDs to store the 10TB parameters while providing low-latency accesses.

124

•Proceedings Article•10.1109/ISCA45697.2020.00023

Think fast: a tensor streaming processor (TSP) for accelerating deep learning workloads

Dennis Abts, +30 more

- 30 May 2020

TL;DR: The TSP architecture is introduced, a functionally-sliced microarchitecture with memory units interleaved with vector and matrix deep learning functional units in order to take advantage of dataflow locality of deep learning operations.

93

References

•Journal Article•10.1109/MM.2015.10

Always-on Vision Processing Unit for Mobile Applications

Brendan Barry, +7 more

- 27 Jan 2015

- IEEE Micro

TL;DR: The vision processing unit incorporates parallelism, instruction set architecture, and microarchitectural features to provide highly sustainable performance efficiency across a range of computational-Imaging and computer vision applications, including those with low latency requirements on the order of milliseconds.

138

•Proceedings Article•10.1145/3148226.3148237

Investigating half precision arithmetic to accelerate dense linear system solvers

Azzam Haidar, +3 more

- 12 Nov 2017

TL;DR: This work shows for the first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32 or FP64-precision Ax = b solvers.

78

•Proceedings Article•10.1145/3126908.3126919

Low communication FMM-accelerated FFT on GPUs

Cris Cecka

- 12 Nov 2017

TL;DR: This work reformulates an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose, and presents a detailed and clear implementation strategy that relies heavily on existing library primitives.

•Book

Deep Learning

Ian Goodfellow, +2 more

- 18 Nov 2016

TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
