GPU code optimization using abstract kernel emulation and sensitivity analysis

doi:10.1145/3192366.3192397

Open Access•Proceedings Article•10.1145/3192366.3192397

Changwan Hong, +7 more

- 11 Jun 2018

- Vol. 53, Iss: 4, pp 736-751

10

TL;DR: An approach to GPU kernel optimization is developed by focusing on identification of bottleneck resources and determining optimization parameters that can alleviate the bottleneck.

Citations

•Posted Content

AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators

Atefeh Sohrabizadeh, +3 more

- 30 Sep 2020

- arXiv: Hardware Architecture

TL;DR: AutoDSE, an automated DSE framework, leverages a bottleneck-guided gradient optimizer to systematically find a better design point: it finds the bottleneck of the design in each step and focuses on high-impact parameters to overcome it, as an expert would.

72

•Journal Article•10.1145/3570638

Optimization Techniques for GPU Programming

Pieter Hijma, +4 more

- 14 Nov 2022

- ACM Computing Surveys

TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes the optimizations from different perspectives, showing that the various optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.

54

Proceedings Article•10.1145/3392717.3392752

Tools for top-down performance analysis of GPU-accelerated applications

Keren Zhou, +2 more

- 29 Jun 2020

TL;DR: Extensions to Rice University's HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications and to support fine-grain analysis and tuning are described.

22

Proceedings Article•10.1109/IPDPS.2018.00086

Taming the "Monster": Overcoming Program Optimization Challenges on SW26010 Through Precise Performance Modeling

Shizhen Xu, +6 more

- 21 May 2018

TL;DR: An effort to overcome the complexities of program optimization on SW26010, the heterogeneous many-core processor that powers Sunway TaihuLight, the world's top-ranked supercomputer, is presented. It introduces a precise, static performance model that achieves high accuracy and speeds up the tuning process by as much as a factor of 43 while keeping the tuning quality loss below 6%.

12

•Journal Article•10.1145/3494534

AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators

Atefeh Sohrabizadeh, +3 more

- 12 Feb 2022

- ACM Transactions on Design Automation of...

TL;DR: AutoDSE, an automated DSE framework, is proposed; it leverages a bottleneck-guided coordinate optimizer to systematically find a better design point, detecting the bottleneck of the design in each step and focusing on high-impact parameters to overcome it.

11

References

•Journal Article•10.1016/J.CPC.2010.04.018

NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations

Marat Valiev, +10 more

- 01 Sep 2010

- Computer Physics Communications

TL;DR: An overview of NWChem is provided, focusing primarily on the core theoretical modules provided by the code and their parallel performance. Scalable parallel implementations and modular software design enable efficient utilization of current computational architectures.

5.6K

Journal Article•10.1109/99.660313

OpenMP: an industry standard API for shared-memory programming

L. Dagum, +1 more

- 01 Jan 1998

TL;DR: At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism; it leaves the base language unspecified.

3.8K

•Journal Article•10.1145/1498765.1498785

Roofline: an insightful visual performance model for multicore architectures

Samuel Williams, +2 more

- 01 Apr 2009

- Communications of The ACM

TL;DR: The Roofline model offers insight into how to improve the performance of software and hardware for multicore architectures.

2.6K

Proceedings Article•10.1145/1375581.1375595

A practical automatic polyhedral parallelizer and locality optimizer

Uday Bondhugula, +3 more

- 07 Jun 2008

TL;DR: An automatic polyhedral source-to-source transformation framework is presented that can optimize regular programs for parallelism and locality simultaneously; it is implemented in a tool that automatically generates OpenMP parallel code from C program sections.

1K

•Proceedings Article•10.1145/1555754.1555775

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Sunpyo Hong, +1 more

- 20 Jun 2009

TL;DR: A simple analytical model is proposed that estimates the execution time of massively parallel programs by considering the number of running threads and memory bandwidth; it estimates the cost of memory requests and, from that, the overall execution time of a program.

749