Optimization Techniques for GPU Programming
TL;DR: This survey discusses optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.
Abstract: In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing and they still advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization techniques found in 450 articles published in the last 14 years. We analyze the optimizations from different perspectives, which shows that the various optimizations are highly interrelated, explaining the need for techniques such as auto-tuning.
Citations
Unleashing the potential: AI empowered advanced metasurface research
Yunlai Fu, Xuxi Zhou, Yiwan Yu, Jiawang Chen, Shuming Wang, Shining Zhu, Zhenlin Wang
TL;DR: AI-powered advanced metasurface research explores the intersection of AI and metasurfaces, leveraging AI's computational power to design, analyze, and optimize metasurfaces for various applications.
GPU acceleration of Levenshtein distance computation between long strings
TL;DR: This paper presents a GPU implementation of the WFA algorithm and a new optimization that can halve the number of elements to be computed, providing additional performance gains; the authors describe the result as the best reported to date.
Progress and Opportunities of Foundation Models in Bioinformatics
Qing Li, Zhihang Hu, Yixuan Wang, Lei Li, Yimin Fan, Irwin King, Le Song, Yu Li
TL;DR: A systematic investigation and summary of FMs in bioinformatics, tracing their evolution, current research status, and the methodologies employed, aiming to guide the research community in choosing appropriate FMs for their research needs.
Sustainable Optimizing Performance and Energy Efficiency in Proof of Work Blockchain: A Multilinear Regression Approach
Meennapa Rukhiran, Songwut Boonsong, Paniti Netinant
TL;DR: The results reveal that strategically adjusting GPU hardware, software, and configuration can save substantial energy while preserving computational efficiency. The paper offers practical recommendations for optimizing GPU feature configurations to reduce energy consumption, mitigate the environmental impacts of blockchain operations, and contribute to current research on performance in PoW blockchain applications.
References
Accelerating Cost Aggregation for Real-Time Stereo Matching
Jianbin Fang, Ana Lucia Varbanescu, Jie Shen, Henk Sips, Gorkem Saygili, Laurens van der Maaten
- 17 Dec 2012
TL;DR: This paper presents a generic representation and suitable implementations for three commonly used cost aggregators on many-core processors, and performs typical optimizations on the kernels, which leads to significant performance improvement (up to two orders of magnitude).
On-GPU Thread-Data Remapping for Branch Divergence Reduction
TL;DR: This work proposes the first on-GPU thread-data remapping scheme to achieve runtime on-the-spot branch divergence reduction, and implements three GPGPU frontier benchmarks from areas including computer vision, algorithmic trading and data analytics.
Acceleration of Bilateral Filtering Algorithm for Manycore and Multicore Architectures
Dinesh Agarwal, Sami Wilf, Abinashi Dhungel, Sushil K. Prasad
- 10 Sep 2012
TL;DR: This work presents a novel pair-symmetric algorithm that avoids redundant calculations on manycore architectures, and proposes architecture-specific optimizations, such as exploiting the special registers available in modern multicore architectures and rearranging data access patterns to match the computations so that special-purpose instructions can be used.
Accelerated implementation of adaptive directional lifting-based discrete wavelet transform on GPU
TL;DR: The results show that the Slice method overcomes the limitation of high data dependency between the lifting steps and achieves more than 10 times speedup compared to the optimized CPU implementation for the ADL-based transform.
Reuse and Refactoring of GPU Kernels to Design Complex Applications
Santonu Sarkar, Sayantan Mitra, Ashok Srinivasan
- 10 Jul 2012
TL;DR: The contribution of this work lies in extending component-based design research in a new direction: it deals with the performance impact of refactoring an application composed of highly tuned kernels, and demonstrates that the techniques lead to over 50% improvement with some kernels.