Optimization Techniques for GPU Programming
54
TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.
Abstract: In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing, and they still advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization techniques found in 450 articles published in the last 14 years. We analyze the optimizations from different perspectives, which shows that the various optimizations are highly interrelated, explaining the need for techniques such as auto-tuning.
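The abstract's point that interrelated optimizations motivate auto-tuning can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the parameter names, the candidate values, and the synthetic cost model are hypothetical and do not come from the survey or from any measured kernel. The sketch compares an exhaustive search over two coupled parameters with tuning each parameter in isolation.

```python
from itertools import product

# Hypothetical tunable parameters for a GPU kernel (illustrative values only).
BLOCK_SIZES = [32, 64, 128, 256, 512]
TILE_FACTORS = [1, 2, 4, 8]


def synthetic_runtime_ms(block_size: int, tile_factor: int) -> float:
    """Toy cost model, NOT measured data: the two parameters interact
    because the penalty depends on their product."""
    work_per_block = block_size * tile_factor
    # Assumed sweet spot of 512 work items per block.
    imbalance = abs(work_per_block - 512) / 512
    # Assumed overhead that grows with the tile factor.
    tile_penalty = 0.05 * tile_factor
    return 1.0 + imbalance + tile_penalty


def exhaustive_tune() -> tuple[int, int]:
    """Evaluate every combination and return the fastest one."""
    return min(
        product(BLOCK_SIZES, TILE_FACTORS),
        key=lambda cfg: synthetic_runtime_ms(*cfg),
    )


def independent_tune(start: tuple[int, int] = (32, 1)) -> tuple[int, int]:
    """Tune each parameter alone, holding the other at its starting value,
    then combine the two individually best settings."""
    block, tile = start
    best_block = min(BLOCK_SIZES, key=lambda b: synthetic_runtime_ms(b, tile))
    best_tile = min(TILE_FACTORS, key=lambda t: synthetic_runtime_ms(block, t))
    return best_block, best_tile


if __name__ == "__main__":
    for name, cfg in [("exhaustive", exhaustive_tune()),
                      ("independent", independent_tune())]:
        print(name, cfg, f"{synthetic_runtime_ms(*cfg):.2f} ms")
```

Under this toy model, the individually best settings combine into a configuration that is far slower than the jointly tuned one, which is the kind of interaction that makes a joint search attractive. Real auto-tuners typically replace the exhaustive loop with a smarter search strategy and the cost model with actual kernel timings.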
Citations
GPU Cluster Dynamics: Insights from Alibaba's 2023 Trace Release
Ahmad Siavashi,Mahmoud Momtazpour +1 more
- 01 Nov 2023
TL;DR: The Alibaba 2023 Trace Release provides valuable insights into the dynamics of GPU clusters, highlighting the intricate interplay between node and pod configurations and their associated metrics. The analysis reveals diverse configurations, including a balanced CPU-RAM distribution across nodes and a significant presence of latency-sensitive pods. However, the presence of Failed and Pending pods underscores the challenges in scheduling efficiency and resource allocation.
1
Towards a Benchmarking Suite for Kernel Tuners
Jacob O. Tørring,Ben van Werkhoven,Filip Petrovič,Floris-Jan Willemsen,Jiří Filipovič,Anne C. Elster +5 more
- 01 May 2023
TL;DR: A new benchmark suite for evaluating autotuners targeting GPUs is presented. The suite includes tunable GPU kernels and benchmarks that allow for comparisons between optimization algorithms and the examination of code optimization, search space difficulty, and performance portability.
Early-Adaptor: An Adaptive Framework for Proactive UVM Memory Management
Hyunwuk Lee,Junsung Kim,Jiwon Lee,Myung Kuk Yoon,Won Woo Ro +4 more
- 01 Apr 2023
TL;DR: The Early-Adaptor (EA) framework is proposed to dynamically control prefetching aggressiveness based on page fault history and page thrashing risk.
Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications
Stijn Heldens,Ben van Werkhoven +1 more
TL;DR: Kernel Launcher is an easy-to-use C++ library that simplifies the creation of highly-tuned CUDA applications by capturing kernel launches, tuning the captured kernels for different setups, and integrating the tuning results back into applications using runtime compilation.
Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications
Stijn Heldens,Ben van Werkhoven +1 more
- 01 May 2023
TL;DR: Kernel Launcher is a C++ library that simplifies the creation of highly-tuned CUDA applications by automating kernel launch capture, tuning, and integration of results.
References
Deep learning
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
67K
A Proof for the Queuing Formula: L = λW
TL;DR: In this paper, it was shown that if the three means are finite and the corresponding stochastic processes strictly stationary, and if the arrival process is metrically transitive with nonzero mean, then L = λW.
2.7K
Roofline: an insightful visual performance model for multicore architectures
TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware for multicore architectures.
NVIDIA Tesla: A Unified Graphics and Computing Architecture
TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
Jonathan Ragan-Kelley,Connelly Barnes,Andrew Adams,Sylvain Paris,Frédo Durand,Saman Amarasinghe +5 more
- 16 Jun 2013
TL;DR: This paper presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
1.2K