GPU code optimization using abstract kernel emulation and sensitivity analysis
Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, Prashant Singh Rawat, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, P. Sadayappan
- 11 Jun 2018
- Vol. 53, Iss: 4, pp 736-751
TL;DR: An approach to GPU kernel optimization is developed by focusing on identification of bottleneck resources and determining optimization parameters that can alleviate the bottleneck.
Abstract: In this paper, we develop an approach to GPU kernel optimization by focusing on identification of bottleneck resources and determining optimization parameters that can alleviate the bottleneck. Performance modeling for GPUs is done by abstract kernel emulation along with latency/gap modeling of resources. Sensitivity analysis with respect to resource latency/gap parameters is used to predict the bottleneck resource for a given kernel's execution. The utility of the bottleneck analysis is demonstrated in two contexts: 1) Coupling the new bottleneck-driven optimization strategy with the OpenTuner auto-tuner: experimental results on all kernels from the Rodinia suite and GPU tensor contraction kernels from the NWChem computational chemistry suite demonstrate effectiveness. 2) Manual code optimization: two case studies illustrate the use of the bottleneck analysis to iteratively improve the performance of code from state-of-the-art domain-specific code generators.
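The sensitivity-analysis idea in the abstract can be illustrated with a small sketch. This is not the paper's emulator: the `Resource` class, the request counts, and the "slowest resource dominates" timing formula are all assumptions made for illustration. The sketch perturbs each resource's latency and gap parameter by 10% and reports the parameter whose perturbation changes the predicted time the most, which is the sense in which a resource is called the bottleneck.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource description (names are illustrative assumptions)."""
    latency: float   # cycles until the first request completes
    gap: float       # minimum cycles between successive requests
    requests: int    # number of requests the kernel issues to this resource


def predicted_time(resources):
    # Toy stand-in for kernel emulation: assume all resources overlap fully,
    # so the slowest one determines the predicted execution time.
    return max(r.latency + r.gap * (r.requests - 1) for r in resources.values())


def sensitivities(resources, bump=0.10):
    # Relative change in predicted time when one parameter grows by `bump`.
    base = predicted_time(resources)
    result = {}
    for name, res in resources.items():
        for param in ("latency", "gap"):
            perturbed = dict(resources)
            perturbed[name] = replace(res, **{param: getattr(res, param) * (1 + bump)})
            result[(name, param)] = (predicted_time(perturbed) - base) / base
    return result


def bottleneck(resources, bump=0.10):
    # The (resource, parameter) pair with the largest sensitivity.
    scores = sensitivities(resources, bump)
    return max(scores, key=scores.get)


# Made-up numbers, chosen only to exercise the functions above.
example = {
    "global_memory": Resource(latency=400.0, gap=4.0, requests=1000),
    "shared_memory": Resource(latency=30.0, gap=1.0, requests=2000),
    "alu": Resource(latency=10.0, gap=0.5, requests=3000),
}

if __name__ == "__main__":
    print(bottleneck(example))
```

With these made-up numbers, the global-memory gap has the highest sensitivity, so an optimizer following the bottleneck-driven strategy would favor parameters that reduce global-memory traffic.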
Citations
•Posted Content
AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators
TL;DR: An automated DSE framework, AutoDSE, is presented that leverages a bottleneck-guided gradient optimizer to systematically find a better design point: at each step it finds the bottleneck of the design and focuses on high-impact parameters to overcome it, much as an expert would.
72
Optimization Techniques for GPU Programming
TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes the optimizations from different perspectives, showing that the various optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.
54
Tools for top-down performance analysis of GPU-accelerated applications
Keren Zhou, Mark W. Krentel, John Mellor-Crummey
- 29 Jun 2020
TL;DR: Extensions to Rice University's HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications and to support fine-grain analysis and tuning are described.
22
Taming the "Monster": Overcoming Program Optimization Challenges on SW26010 Through Precise Performance Modeling
Shizhen Xu, Yuanchao Xu, Wei Xue, Xipeng Shen, Fang Zheng, Xiaomeng Huang, Guangwen Yang
- 21 May 2018
TL;DR: An effort to overcome the complexities of program optimization on SW26010, the heterogeneous many-core processor that powers Sunway TaihuLight, the world's top-ranked supercomputer, is presented. It shows that a precise, static performance model achieves high accuracy and speeds up the tuning process by as much as a factor of 43 while keeping the tuning quality loss below 6%.
12
AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators
TL;DR: An automated DSE framework, AutoDSE, is proposed that leverages a bottleneck-guided coordinate optimizer to systematically find a better design point; it detects the bottleneck of the design in each step and focuses on high-impact parameters to overcome it.
11
References
NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations
Marat Valiev, Eric J. Bylaska, Niranjan Govind, Karol Kowalski, T.P. Straatsma, H. J. J. van Dam, Dunyou Wang, Jarek Nieplocha, Edoardo Aprà, Theresa L. Windus, W. A. de Jong
TL;DR: An overview of NWChem is provided, focusing primarily on the core theoretical modules provided by the code and their parallel performance; scalable parallel implementations and modular software design enable efficient utilization of current computational architectures.
5.6K
OpenMP: an industry standard API for shared-memory programming
TL;DR: At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism, leaving the base language unspecified.
3.8K
Roofline: an insightful visual performance model for multicore architectures
TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware on multicore architectures.
A practical automatic polyhedral parallelizer and locality optimizer
Uday Bondhugula, Albert Hartono, J. Ramanujam, P. Sadayappan
- 07 Jun 2008
TL;DR: An automatic polyhedral source-to-source transformation framework that can optimize regular programs for parallelism and locality simultaneously, implemented in a tool that automatically generates OpenMP parallel code from C program sections.
1K
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
Sunpyo Hong, Hyesoon Kim
- 20 Jun 2009
TL;DR: A simple analytical model is proposed that estimates the execution time of massively parallel programs by considering the number of running threads and the memory bandwidth; it estimates the cost of memory requests and, from that, the overall execution time of a program.