What are the main reasons why GPUs are not performance portable?

Existing high-level approaches still rely on manually optimized code or on device-specific compiler heuristics, neither of which is performance portable.

What is the main motivation for writing a high performance program?

The difficulty of achieving high performance motivates the need for new compilation techniques that can automatically produce code close to manually optimized implementations from an easy-to-write high-level program.

How long does it take to produce high-performance code?

The authors argue that automatically producing high-performance code is possible if one starts from a high-level functional program representation and keeps it in the compiler pipeline for as long as possible.

What is the definition of matrix multiplication starting on line 6?

In the definition of matrix multiplication starting on line 6 of the paper's listing, the map function is applied to the two input matrices A and B on lines 7 and 8.
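The listing this answer refers to is not reproduced on this page. As a rough illustration only, the same structure can be sketched in Python: an outer map over the rows of A, an inner map over the columns of B, and a dot product built from pairwise multiplication followed by a reduction. The function names (`dot_product`, `transpose`, `matrix_multiplication`) are chosen for this sketch and are not taken from the paper.

```python
from functools import reduce
from operator import add, mul


def dot_product(xs, ys):
    # Multiply the two vectors element by element, then sum the products.
    return reduce(add, map(mul, xs, ys), 0.0)


def transpose(matrix):
    # Turn the columns of the matrix into rows.
    return [list(column) for column in zip(*matrix)]


def matrix_multiplication(a, b):
    # Outer map over the rows of A, inner map over the columns of B.
    columns_of_b = transpose(b)
    return list(
        map(
            lambda row_a: list(
                map(lambda col_b: dot_product(row_a, col_b), columns_of_b)
            ),
            a,
        )
    )


if __name__ == "__main__":
    print(matrix_multiplication([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The sketch makes no optimization decisions at all, which is the point of the high-level form: nothing in it says how the work is distributed or which memory is used.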

What is the way to achieve high performance code from high-level programs?

Programmers should write simple programs like the naïve version and automatically obtain the performance of the hand-tuned one.

How is code generated using the OpenCL-specific patterns?

Generating code from the OpenCL-specific patterns is easy, because optimization decisions are encoded explicitly and each primitive corresponds directly to a template of OpenCL code.
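To make the idea of one template per primitive concrete, here is a minimal Python sketch. The primitive names (`map_global`, `map_local`, `map_seq`), the template strings, and the `emit` helper are all hypothetical and are not the paper's code generator; only `get_global_id`, `get_global_size`, `get_local_id`, and `get_local_size` are real OpenCL C built-ins.

```python
# Hypothetical templates: each low-level primitive maps to one fixed loop shape.
TEMPLATES = {
    "map_global": (
        "for (int {i} = get_global_id(0); {i} < {n}; {i} += get_global_size(0)) {{\n"
        "{body}\n"
        "}}"
    ),
    "map_local": (
        "for (int {i} = get_local_id(0); {i} < {n}; {i} += get_local_size(0)) {{\n"
        "{body}\n"
        "}}"
    ),
    "map_seq": (
        "for (int {i} = 0; {i} < {n}; {i}++) {{\n"
        "{body}\n"
        "}}"
    ),
}


def emit(primitive, index, length, body):
    # Indent the body and drop it into the template chosen by the primitive.
    indented = "\n".join("  " + line for line in body.splitlines())
    return TEMPLATES[primitive].format(i=index, n=length, body=indented)


if __name__ == "__main__":
    inner = emit("map_seq", "j", "M", "out[i * M + j] = in[i * M + j] * 2.0f;")
    print(emit("map_global", "i", "N", inner))
```

Because the choice of primitive already fixes the parallelization strategy, the generator in this sketch needs no heuristics: it only fills in a template.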

What is the performance of the clBLAS library?

Figure 3 shows that the clBLAS library version performs 5× better than the naïve version, the tuned library version 8× better, and the hand-optimized version 10× better.

Open Access•Proceedings Article•10.1145/2884045.2884046

Performance portable GPU code generation for matrix multiplication

Toomas Remmelg, +3 more

- 12 Mar 2016

- pp 22-31

TL;DR: This paper builds on a functional data-parallel language, developed in a previous paper, which allows applications to be expressed in a device-neutral way, and shows how it produces high-performance OpenCL code for GPUs using a well-studied, well-understood application: matrix multiplication.

Abstract: Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to attempt to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that need to be re-optimized for every new device. Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem even for well-studied applications like matrix multiplication. We argue that what is needed is a way to describe applications at a high level without committing to particular implementations. To this end, we developed in a previous paper a functional data-parallel language which allows applications to be expressed in a device-neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device-specific forms, from which OpenCL code is generated. In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generates highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization like tiling and register blocking in a composable way. Using an exploration strategy our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized -- but provably correct -- implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia and even outperforms AMD's clBLAS library.
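The abstract does not spell out any individual rewrite rule. As an illustration of what a semantics-preserving rule can look like, the Python sketch below uses a split-join identity that is familiar from functional programming: mapping a function over a list gives the same result as splitting the list into chunks, mapping over each chunk, and joining the results. Whether the paper states the rule in exactly this form is an assumption, and the helper names (`split`, `join`, `map_direct`, `map_split_join`) are chosen for this sketch.

```python
def split(n, xs):
    # Cut xs into consecutive chunks of length n; n must divide len(xs).
    if n <= 0 or len(xs) % n != 0:
        raise ValueError("chunk size must be positive and divide the length")
    return [xs[i:i + n] for i in range(0, len(xs), n)]


def join(chunks):
    # Flatten one level of nesting; the inverse of split.
    return [x for chunk in chunks for x in chunk]


def map_direct(f, xs):
    # Left-hand side of the rule: map f over the whole list.
    return [f(x) for x in xs]


def map_split_join(f, n, xs):
    # Right-hand side: split, map f inside each chunk, then join.
    return join([[f(x) for x in chunk] for chunk in split(n, xs)])


if __name__ == "__main__":
    data = list(range(8))
    square = lambda x: x * x
    print(map_direct(square, data) == map_split_join(square, 4, data))
```

The right-hand side computes the same values but exposes a two-level structure, which is the kind of shape that optimizations such as tiling work on.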

Most frequently asked questions

1. What are the contributions in "Performance portable GPU code generation for matrix multiplication"?

Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications like matrix multiplication. The authors argue that what is needed is a way to describe applications at a high level without committing to particular implementations. To this end, the authors developed in a previous paper a functional data-parallel language that allows applications to be expressed in a device-neutral way. In this paper, they demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication.

2. What is the implementation of matrix multiplication in figure 2?

The implementation in figure 2 takes advantage of many hardware features such as vectorized loads and local memory, which involves the use of synchronization primitives.

3. What is the function that reorders the reads of the primitive?

The gather will reorder the memory reads of the following primitive while scatter will reorder the writes of the preceding primitive.
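The difference between the two can be sketched in Python. This is an illustration under assumptions, not the paper's implementation: the index-function signature `(i, n)` and the helper names are hypothetical, and the sketch materializes new lists, whereas the answer above describes the primitives as changing the access pattern of a neighbouring primitive.

```python
def gather(index_fn, xs):
    # Reorder reads: output position i is read from input position index_fn(i, n).
    n = len(xs)
    return [xs[index_fn(i, n)] for i in range(n)]


def scatter(index_fn, xs):
    # Reorder writes: input position i is written to output position index_fn(i, n).
    n = len(xs)
    out = [None] * n
    for i, x in enumerate(xs):
        out[index_fn(i, n)] = x
    return out


def rotate(i, n):
    # Example index function: shift every index by one, wrapping around.
    return (i + 1) % n


if __name__ == "__main__":
    print(gather(rotate, [10, 20, 30]))
    print(scatter(rotate, [10, 20, 30]))
```

With the same index function the two produce different orderings, because one permutes where values are read from and the other where they are written to.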

4. What does the IR say about the variable float4?

Declaring a variable of type float4, for instance, implies that the operations performed on this variable are executed by vector units.

Figure 2: Hand-optimized OpenCL kernel for fast matrix multiplication on an AMD GPU.

Figure 6: Example of two classical optimizations for matrix multiplication.

Figure 7: Transforming matrix multiplication by combining optimizations.

Figure 11: Performance evolution for randomized search of automatically generated OpenCL kernels. The dotted line represents 90% of the performance achieved by the best kernel. The dots mark how many kernels have to be tested to reach this performance with a confidence of 95%.
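The Figure 11 caption does not say how the 95% confidence figure is derived. One simple way to reason about it, assuming kernels are drawn independently and uniformly at random (with replacement) and that a fraction p of the generated kernels reaches the 90% threshold: the probability that n draws all miss is (1 - p)^n, so n must satisfy (1 - p)^n <= 0.05. The Python sketch below computes the smallest such n; the function name and the values of p are hypothetical and are not taken from the paper.

```python
import math


def samples_needed(good_fraction, confidence=0.95):
    # Smallest n such that 1 - (1 - good_fraction) ** n >= confidence.
    if not 0.0 < good_fraction <= 1.0:
        raise ValueError("good_fraction must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if good_fraction == 1.0:
        return 1  # every kernel is good, so one draw suffices
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - good_fraction))


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        print(p, samples_needed(p))
```

Under these assumptions, if 1% of the kernels are within 10% of the best, 299 random draws are enough to find one of them with 95% confidence.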

Figure 12: Performance of the best OpenCL kernels across platforms and data sizes.

Figure 10: Distribution of performance for generated kernels.

Citations

•Journal Article•10.1145/3570638

Optimization Techniques for GPU Programming

Pieter Hijma, +4 more

- 14 Nov 2022

- ACM Computing Surveys

TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes the optimizations from different perspectives, showing that the various optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.

•Proceedings Article•10.1145/2968455.2968521

Matrix multiplication beyond auto-tuning: rewrite-based GPU code generation

Michel Steuwer, +2 more

- 01 Oct 2016

TL;DR: This paper shows how performance portability for matrix multiplication is achieved using a compiler approach based on a recently developed generic technique that combines a high-level programming model with a system of rewrite rules, resulting in high-performance code for very different types of architectures such as desktop and mobile GPUs.

•Proceedings Article•10.1145/3092703.3092720

Compiler-assisted test acceleration on GPUs for embedded software

Vanya Yaneva, +2 more

- 10 Jul 2017

TL;DR: This paper demonstrates, for the first time, how test executions of embedded C programs can be automatically performed on a GPU, without involving the end user, through a compiler-assisted approach which automatically compiles the C program into GPU kernels for parallel execution of the input tests.

•Proceedings Article•10.1145/3033019.3033023

Optimization space pruning without regrets

Ulysse Beaugnon, +4 more

- 05 Feb 2017

TL;DR: A novel approach to automatically discover the best performing code from a given set of possible implementations, involving a branch and bound algorithm with two distinctive features: an analytic performance model of a lower bound on the execution time, and the ability to estimate such bounds on a partially-specified implementation.

•Proceedings Article•10.1145/3392717.3392746

BurstZ: a bandwidth-efficient scientific computing accelerator platform for large-scale data

Gongjin Sun, +2 more

- 29 Jun 2020

TL;DR: It is demonstrated that BurstZ can completely remove the communication bottleneck for accelerators, using a 3D stencil-code accelerator implemented on a prototype BurstZ implementation, and evaluated against hand-optimized implementations of stencil code accelerators of the same architecture.

References

•Book

OpenCL Programming Guide

Aaftab Munshi, +4 more

- 07 Jul 2011

TL;DR: This is the first comprehensive, authoritative, and practical guide to OpenCL 1.1 specifically for working developers and software architects and shows how OpenCL can express a wide range of parallel algorithms, and offers complete reference material on both the API and OpenCL C programming language.

434

•Journal Article•10.1145/2400682.2400713

Polyhedral parallel code generation for CUDA

Sven Verdoolaege, +5 more

- 20 Jan 2013

TL;DR: A novel source-to-source compiler called PPCG is presented, which introduces a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency according to the constraints of modern GPUs.

431

•Book Chapter•10.1007/978-3-642-11970-5_14

Automatic C-to-CUDA code generation for affine programs

Muthu Baskaran, +2 more

- 20 Mar 2010

TL;DR: An automatic code transformation system that generates parallel CUDA code from sequential C input code for regular (affine) programs; the generated code is quite close to hand-optimized CUDA code and considerably better than the benchmarks' performance on a multicore CPU.

238

•Journal Article•10.1145/362875.362879

Organizing matrices and matrix operations for paged memory systems

A. C. McKellar, +1 more

- 01 Mar 1969

- Communications of The ACM

TL;DR: It is shown that carefully designed matrix algorithms can lead to enormous savings in the number of page faults occurring when only a small part of the total matrix can be in main memory at one time.

190

•Journal Article•10.1145/2584665

Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages

Arvind K. Sujeeth, +6 more

- 01 Apr 2014

- ACM Transactions in Embedded Computing S...

TL;DR: An overview of the Delite compiler framework and the DSLs that have been developed with it is presented, and it is shown that they all achieve performance competitive with or exceeding C++ code.

181

Related Papers (5)

Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code

Matrix multiplication beyond auto-tuning: rewrite-based GPU code generation

Towards performance-portable, scalable, and convenient linear algebra

Towards Automatic Transformation of Legacy Scientific Code into OpenCL for Optimal Performance on FPGAs

High performance stencil code generation with Lift