1. What are the contributions in "Performance Portable GPU Code Generation for Matrix Multiplication"?
Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications like matrix multiplication. The authors argue that what is needed is a way to describe applications at a high level without committing to particular implementations. To this end, the authors developed in a previous paper a functional data-parallel language that allows applications to be expressed in a device-neutral way. In this paper, the authors demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication.
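To make the idea of a device-neutral, functional description concrete, here is a minimal sketch in plain Python. It is not the authors' language; it only assumes that matrix multiplication can be written as nested maps over rows and columns with a sum-reduction of pairwise products. The name `matmul` is hypothetical and chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows.

    The structure is purely functional: map over the rows of `a`,
    map over the columns of `b`, then reduce (sum) the zipped products.
    Nothing here says how the work is mapped onto hardware.
    """
    b_columns = list(zip(*b))  # transpose b so its columns can be traversed like rows
    return [
        [sum(x * y for x, y in zip(row, col)) for col in b_columns]
        for row in a
    ]


result = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

A description like this leaves every implementation decision open (tiling, vectorization, use of local memory), which is the kind of freedom a code generator would need in order to specialize the program for a given device.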
2. What is the implementation of matrix multiplication in figure 2?
The implementation in Figure 2 takes advantage of many hardware features, such as vectorized loads and local memory; using local memory in turn involves synchronization primitives.
3. What is the function that reorders the reads of the primitive?
The gather will reorder the memory reads of the following primitive while scatter will reorder the writes of the preceding primitive.
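The difference between the two primitives can be sketched in plain Python. This is an illustration, not the paper's implementation: it assumes each primitive takes an index function, with gather applying it on the read side and scatter on the write side. The helper names `gather`, `scatter`, and `rotate` are hypothetical.

```python
def gather(index_fn, xs):
    """Reorder reads: output position i is read from input position index_fn(i)."""
    return [xs[index_fn(i)] for i in range(len(xs))]


def scatter(index_fn, xs):
    """Reorder writes: input position i is written to output position index_fn(i).

    index_fn is assumed to be a permutation of the indices, so every
    output slot is written exactly once.
    """
    out = [None] * len(xs)
    for i, x in enumerate(xs):
        out[index_fn(i)] = x
    return out


data = [10, 20, 30, 40]


def rotate(i):
    # Illustrative index function: shift every index by one, wrapping around.
    return (i + 1) % len(data)


gathered = gather(rotate, data)
scattered = scatter(rotate, data)
print(gathered, scattered)
```

With the same index function, the two primitives undo each other: scattering the gathered list gives back the original order.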
4. What does the IR say about variables of type float4?
Declaring a variable of type float4, for instance, implies that the operations performed on this variable are executed by vector units.





