NVIDIA Tensor Core Programmability, Performance & Precision
TL;DR: The authors investigate how to program NVIDIA Tensor Cores, the performance they deliver, and the precision loss caused by matrix multiplication with half-precision input, and show that this loss can be considerably reduced at the cost of increased computation.
Abstract: The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the "Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision.
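The quoted peak figure can be roughly reproduced from the numbers in the abstract. The sketch below is a back-of-the-envelope check, not the authors' derivation: it assumes a boost clock of 1.53 GHz (not stated in the abstract) and counts each fused multiply-add as two floating-point operations. The names `peak_tflops` and `BOOST_CLOCK_HZ` are illustrative.

```python
# Back-of-the-envelope peak throughput of the V100 Tensor Cores.
TENSOR_CORES = 640                    # from the abstract
FMAS_PER_CORE_PER_CLOCK = 4 * 4 * 4   # one 4x4 by 4x4 multiply-accumulate per clock
FLOPS_PER_FMA = 2                     # one multiply plus one add
BOOST_CLOCK_HZ = 1.53e9               # ASSUMPTION: clock rate is not given in the abstract


def peak_tflops(cores, fmas_per_clock, flops_per_fma, clock_hz):
    """Return the theoretical peak in Tflops/s for the given unit counts and clock."""
    return cores * fmas_per_clock * flops_per_fma * clock_hz / 1e12


peak = peak_tflops(TENSOR_CORES, FMAS_PER_CORE_PER_CLOCK, FLOPS_PER_FMA, BOOST_CLOCK_HZ)
print(f"theoretical peak: {peak:.1f} Tflops/s")
```

Under these assumptions the result rounds to the 125 Tflops/s quoted above; a different clock rate would scale the figure proportionally.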
Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
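The trade-off between precision and extra computation can be illustrated without a GPU. The sketch below is an assumption about one way to do it, since the abstract does not say how the loss is reduced: each input is split into a half-precision value plus a half-precision residual, and the product is rebuilt from four half-input multiplications instead of one. Half-precision rounding is emulated with Python's `struct` format `e`, and accumulation in Python floats stands in for the higher-precision accumulator. All function names are hypothetical.

```python
import random
import struct


def to_half(x):
    """Round a float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round a float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_matrix(mat, rnd):
    return [[rnd(v) for v in row] for row in mat]


def matmul(a, b):
    """Triple-loop product; sums are accumulated in Python floats."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def add(x, y):
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def sub(x, y):
    return [[u - v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def max_abs_error(x, ref):
    return max(abs(u - v) for rx, rr in zip(x, ref) for u, v in zip(rx, rr))


def half_input_product(a, b):
    """One multiplication with both inputs rounded to half precision."""
    return matmul(round_matrix(a, to_half), round_matrix(b, to_half))


def refined_product(a, b):
    """ASSUMED refinement: A*B ~ Ah*Bh + Ra*Bh + Ah*Rb + Ra*Rb (four products)."""
    a_h, b_h = round_matrix(a, to_half), round_matrix(b, to_half)
    r_a = round_matrix(sub(a, a_h), to_half)  # residual lost when rounding A
    r_b = round_matrix(sub(b, b_h), to_half)  # residual lost when rounding B
    result = matmul(a_h, b_h)
    for term in (matmul(r_a, b_h), matmul(a_h, r_b), matmul(r_a, r_b)):
        result = add(result, term)
    return result


random.seed(0)
n = 16
a = [[to_single(random.uniform(-1, 1)) for _ in range(n)] for _ in range(n)]
b = [[to_single(random.uniform(-1, 1)) for _ in range(n)] for _ in range(n)]
reference = matmul(a, b)  # product of the unrounded single-precision inputs

err_plain = max_abs_error(half_input_product(a, b), reference)
err_refined = max_abs_error(refined_product(a, b), reference)
print(f"max error, half-precision input: {err_plain:.3e}")
print(f"max error, with residual terms:  {err_refined:.3e}")
```

The refined variant performs four times as many multiplications as the plain one, which is the kind of increased computation the abstract refers to; whether that cost is acceptable depends on how sensitive the application is to the input rounding error.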