Optimization techniques for OpenCL-based linear algebra routines

doi:10.1117/12.2050673

Proceedings Article

Stephen Kozacik, +5 more

- 13 Jun 2014

- Proceedings of SPIE

- Vol. 9095, pp 65-70

TL;DR: Preliminary results from this work confirm that optimizations are not portable from one device to the next, and show the benefits of automatic tuning.

Abstract: The OpenCL standard for general-purpose parallel programming allows a developer to target highly parallel computations towards graphics processing units (GPUs), CPUs, co-processing devices, and field programmable gate arrays (FPGAs). The computationally intense domains of linear algebra and image processing have shown significant speedups when implemented in the OpenCL environment. A major benefit of OpenCL is that a routine written for one device can be run across many different devices and architectures; however, a kernel optimized for one device may not exhibit high performance when executed on a different device. For this reason, kernels must typically be hand-optimized for every target device family. Because a large number of parameters can affect performance, hand tuning for every possible device is impractical and often produces suboptimal results. For this work, we focused on optimizing the general matrix multiplication routine. General matrix multiplication is used as a building block for many linear algebra routines and often comprises a large portion of the run-time. Prior work has shown this routine to be a good candidate for high-performance implementation in OpenCL. We selected several candidate algorithms from the literature that are suitable for parameterization. We then developed parameterized kernels implementing these algorithms using only portable OpenCL features. Our implementation queries device information supplied by the OpenCL runtime and uses this, together with user input, to generate a search space that satisfies device and algorithmic constraints. Preliminary results from our work confirm that optimizations are not portable from one device to the next, and show the benefits of automatic tuning. Using a standard set of tuning parameters seen in the literature for the NVIDIA Fermi architecture achieves a performance of 1.6 TFLOPS on an AMD 7970 device, while automatic tuning achieves a peak of 2.7 TFLOPS.
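
The search-space generation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `DeviceLimits` class, the `gemm_search_space` function, the candidate tile sizes, and the specific constraints are all assumptions chosen for illustration. The two device fields stand in for values an OpenCL runtime reports through `clGetDeviceInfo` (`CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_LOCAL_MEM_SIZE`); here they are supplied by hand so the example runs without an OpenCL installation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DeviceLimits:
    """Hypothetical subset of the limits an OpenCL runtime reports."""
    max_work_group_size: int  # stands in for CL_DEVICE_MAX_WORK_GROUP_SIZE
    local_mem_bytes: int      # stands in for CL_DEVICE_LOCAL_MEM_SIZE


def gemm_search_space(dev, m, n, k, elem_bytes=4,
                      tiles=(16, 32, 64, 128),
                      depths=(8, 16, 32),
                      work_per_thread=(1, 2, 4, 8)):
    """Return tuning configurations (tile_m, tile_n, tile_k, wpt_m, wpt_n)
    for a tiled C = A * B kernel that satisfy illustrative algorithmic
    and device constraints. All candidate values are assumptions."""
    configs = []
    for tm, tn, tk, wm, wn in product(tiles, tiles, depths,
                                      work_per_thread, work_per_thread):
        # Algorithmic constraint: tiles must evenly divide the problem.
        if m % tm or n % tn or k % tk:
            continue
        # Algorithmic constraint: per-thread work must divide the tile.
        if tm % wm or tn % wn:
            continue
        # Device constraint: one work-item per (wm x wn) block of the tile.
        threads = (tm // wm) * (tn // wn)
        if threads > dev.max_work_group_size:
            continue
        # Device constraint: one tile of A and one tile of B in local memory.
        local_bytes = (tm * tk + tk * tn) * elem_bytes
        if local_bytes > dev.local_mem_bytes:
            continue
        configs.append((tm, tn, tk, wm, wn))
    return configs


if __name__ == "__main__":
    # Made-up limits for two imaginary devices.
    large = DeviceLimits(max_work_group_size=256, local_mem_bytes=32 * 1024)
    small = DeviceLimits(max_work_group_size=64, local_mem_bytes=16 * 1024)
    print(len(gemm_search_space(large, 1024, 1024, 1024)))
    print(len(gemm_search_space(small, 1024, 1024, 1024)))
```

An auto-tuner would then compile and time a kernel for each surviving configuration and keep the fastest. Because the feasible set depends on the queried limits, two devices can end up with different candidate sets and different winners, which is consistent with the portability finding reported above.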

Citations

Journal Article, doi:10.1145/3570638

Optimization Techniques for GPU Programming

Pieter Hijma, +4 more

- 14 Nov 2022

- ACM Computing Surveys

TL;DR: This survey discusses optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.

Proceedings Article, doi:10.1117/12.2177410

Adaptive OpenCL libraries for platform portability

Paul Fox, +3 more

- 22 May 2015

- Proceedings of SPIE

TL;DR: The OpenCL API provides an abstract mechanism for massively parallel programming on a very wide range of hardware, including traditional CPUs, GPUs, accelerator devices, FPGAs, and more, but these different hardware architectures and platforms function quite differently.

References

Book Chapter, doi:10.1007/978-3-642-19328-6_10

Accelerating GPU kernels for dense linear algebra

Rajib Nath, +2 more

- 22 Jun 2010

TL;DR: Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized.

Proceedings Article, doi:10.1109/SC.COMPANION.2012.59

Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs

Kazuya Matsumoto, +2 more

- 10 Nov 2012

TL;DR: This paper has developed an auto-tuning system with a code generator for fast matrix multiply kernels in OpenCL that shows higher performance than the highly tuned vendor library while the implementations on the NVIDIA GPUs are comparable.

Proceedings Article, doi:10.1145/2063384.2063431

Fast implementation of DGEMM on Fermi GPU

Guangming Tan, +5 more

- 12 Nov 2011

TL;DR: This paper presents a thorough experience of tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture and chooses an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy.

Journal Article, doi:10.1109/TPDS.2011.311

Autotuning GEMM Kernels for the Fermi GPU

Jakub Kurzak, +2 more

- 01 Nov 2012

- IEEE Transactions on Parallel and Distri...

TL;DR: This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code variants and selecting the fastest ones through benchmarking.

Related Papers (5)

CLTune: A Generic Auto-Tuner for OpenCL Kernels

Generating Portable High-Performance Code via Multi-Dimensional Homomorphisms

Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems

Tuning a Finite Difference Computation for Parallel Vector Processors

Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs