Optimization Techniques for GPU Programming

doi:10.1145/3570638

Open Access•Journal Article•10.1145/3570638

Pieter Hijma, +4 more

- 14 Nov 2022

- ACM Computing Surveys

- Vol. 55, Iss: 11, pp 1-81

54

TL;DR: This survey discusses optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.

Citations

Journal Article•10.1515/nanoph-2023-0759

Unleashing the potential: AI empowered advanced metasurface research

Yunlai Fu, +6 more

- 27 Feb 2024

- Nanophotonics

TL;DR: AI-powered advanced metasurface research explores the intersection of AI and metasurfaces, leveraging AI's computational power to design, analyze, and optimize metasurfaces for various applications.

11

Journal Article•10.1016/j.parco.2023.103019

GPU acceleration of Levenshtein distance computation between long strings

David Castells-Rufas

- 01 Jul 2023

- Parallel Computing

TL;DR: This paper presents a GPU implementation of the WFA algorithm together with a new optimization that can halve the number of elements to be computed, providing additional performance gains; the resulting performance is claimed to be the best reported.

7

Journal Article•10.48550/arxiv.2402.04286

Progress and Opportunities of Foundation Models in Bioinformatics

Qing Li, +7 more

- 06 Feb 2024

- arXiv.org

TL;DR: A systematic investigation and summary of FMs in bioinformatics, tracing their evolution, current research status, and the methodologies employed, aiming to guide the research community in choosing appropriate FMs for their research needs.

6

Journal Article•10.3390/su16041519

Sustainable Optimizing Performance and Energy Efficiency in Proof of Work Blockchain: A Multilinear Regression Approach

Meennapa Rukhiran, +2 more

- 10 Feb 2024

- Sustainability

TL;DR: The results reveal that strategically adjusting GPU hardware, software, and configuration can save substantial energy while preserving computational efficiency. The paper offers practical recommendations for optimizing the feature configurations of GPUs to reduce energy consumption, mitigate the environmental impacts of blockchain operations, and contribute to current research on performance in PoW blockchain applications.

6

Journal Article•10.2139/ssrn.4244720

GPU acceleration of Levenshtein distance computation between long strings

David Castells-Rufas

- 01 Apr 2023

- Parallel Computing

5

References

Proceedings Article•10.1109/SC.2014.68

Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format

Joseph L. Greathouse, +1 more

- 16 Nov 2014

TL;DR: This work proposes a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs, and achieves an average speedup of 14.7× over existing CSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.

238

Proceedings Article•10.1145/1964179.1964184

Reducing branch divergence in GPU programs

Tianyi David Han, +1 more

- 05 Mar 2011

TL;DR: This work proposes two novel software-based optimizations, called iteration delaying and branch distribution, that aim to reduce branch divergence, and shows that they improve the performance of the synthetic benchmarks and of the real-world application by 12% and 16%, respectively.

235

Journal Article•10.1109/TPDS.2010.107

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Byunghyun Jang, +3 more

- 01 Jan 2011

- IEEE Transactions on Parallel and Distri...

TL;DR: Techniques for enhancing the memory efficiency of applications on data-parallel architectures are presented, based on the analysis and characterization of memory access patterns in loop bodies; they target vectorization via data transformation to benefit vector-based architectures and algorithmic memory selection for scalar-based architectures.

231

Proceedings Article•10.1109/HPCA.2011.5749714

Thread block compaction for efficient SIMT control flow

Wilson W. L. Fung, +1 more

- 12 Feb 2011

TL;DR: This paper proposes and evaluates the benefits of extending the sharing of resources in a block of warps, already used for scratchpad memory, to exploit control flow locality among threads, and shows that this compaction mechanism provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism.

223

Journal Article•10.1142/S0129626411000187

High performance and scalable radix sorting: a case study of implementing dynamic parallelism for GPU computing

Duane Merrill, +1 more

- 21 Nov 2011

- Parallel Processing Letters

TL;DR: A family of very efficient parallel algorithms for radix sorting is presented, along with the authors' allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism.

220

Related Papers (5)

GPU Acceleration Using CUDA Framework

SkelCL: a high-level extension of OpenCL for multi-GPU systems

High Performance Matrix Multiplication on General Purpose Graphics Processing Units

Graphics Processing Units and Open Computing Language for parallel computing

GPU accelerated fast FEM deformation simulation