Optimization Techniques for GPU Programming
54
TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.
Abstract: In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing and they still advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization techniques found in 450 articles published in the last 14 years. We analyze the optimizations from different perspectives, which shows that the various optimizations are highly interrelated, explaining the need for techniques such as auto-tuning.
Citations
Unleashing the potential: AI empowered advanced metasurface research
Yunlai Fu, Xuxi Zhou, Yiwan Yu, Jiawang Chen, Shuming Wang, Shining Zhu, Zhenlin Wang
TL;DR: AI-powered advanced metasurface research explores the intersection of AI and metasurfaces, leveraging AI's computational power to design, analyze, and optimize metasurfaces for various applications.
GPU acceleration of Levenshtein distance computation between long strings
TL;DR: This paper presents a GPU implementation of the WFA algorithm and a new optimization that can halve the number of elements to be computed, providing additional performance gains; the result is described as the best reported to date.
7
Progress and Opportunities of Foundation Models in Bioinformatics
Qing Li, Zhihang Hu, Yixuan Wang, Lei Li, Yimin Fan, Irwin King, Le Song, Yu Li
TL;DR: A systematic investigation and summary of FMs in bioinformatics, tracing their evolution, current research status, and the methodologies employed, aiming to guide the research community in choosing appropriate FMs for their research needs.
6
Sustainable Optimizing Performance and Energy Efficiency in Proof of Work Blockchain: A Multilinear Regression Approach
Meennapa Rukhiran, Songwut Boonsong, Paniti Netinant
TL;DR: The results reveal that strategically adjusting GPU hardware, software, and configuration can save substantial energy while preserving computational efficiency. The paper offers practical recommendations for optimizing GPU feature configurations to reduce energy consumption and mitigate the environmental impacts of blockchain operations, and it contributes to current research on performance in PoW blockchain applications.
6
References
Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication
Pham Nguyen Quang Anh, Rui Fan, Yonggang Wen
- 01 Jun 2016
TL;DR: This paper proposes two low-cost methods to achieve perfect load balancing during the most expensive step in SpGEMM and shows how to eliminate nearly all random global memory accesses using hash tables stored in shared memory.
43
Load-balancing Sparse Matrix Vector Product Kernels on GPUs
Hartwig Anzt, Terry Cojean, Chen Yen-Chen, Jack Dongarra, Goran Flegar, Pratik Nayak, Stanimire Tomov, Yuhsiang M. Tsai, Weichung Wang
- 28 Mar 2020
TL;DR: A compressed sparse row (CSR) format suitable for unbalanced matrices is presented, and a load-balancing kernel for the coordinate (COO) matrix format is provided and extended to a hybrid algorithm that stores part of the matrix in the SIMD-friendly Ellpack (ELL) format.
Accelerating the dynamic programming for the optimal polygon triangulation on the GPU
Kazufumi Nishida, Koji Nakano, Yasuaki Ito
- 04 Sep 2012
TL;DR: An efficient parallel implementation of this O(n^3)-time algorithm on the GPU solves the optimal polygon triangulation problem for a convex 16384-gon in 69.1 seconds on the NVIDIA GeForce GTX 580, while a conventional CPU implementation runs in 17105.5 seconds.
42
GPUexplore 2.0: Unleashing GPU Explicit-State Model Checking
Anton Wijs, Thomas Neele, Dragan Bošnački
- 09 Nov 2016
TL;DR: A new version of the GPU model checker, GPUexplore, running on state-of-the-art hardware can be more than 100 times faster than a sequential implementation for large models and is on average eight times faster compared to the previous version of this tool running on the same hardware.
42
Efficient warp execution in presence of divergence with collaborative context collection
Farzad Khorasani, Rajiv Gupta, Laxmi N. Bhuyan
- 05 Dec 2015
TL;DR: This work presents a software technique named Collaborative Context Collection (CCC), which improves the warp execution efficiency of real-world benchmarks by up to 56% and achieves an average speedup of 1.69× (maximum 3.08×). It also proposes code transformations that make CCC applicable to a variety of program segments with thread divergence.