Optimizing memory-bound SYMV kernel on GPU hardware accelerators

doi:10.1007/978-3-642-38718-0_10

Open Access•Book Chapter•10.1007/978-3-642-38718-0_10

Ahmad Abdelfattah, +3 more

- 17 Jul 2012

- pp 72-79

12

TL;DR: This work presents an optimized numerical kernel for computing the symmetric matrix-vector product on NVIDIA Fermi GPUs, and proposes a novel design that addresses the kernel's irregular memory accesses by hiding latency and increasing bandwidth.

Citations

•Journal Article•10.1002/CPE.3152

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems

Ichitaro Yamazaki, +5 more

- 01 Nov 2014

- Concurrency and Computation: Practice an...

TL;DR: This work develops a static scheduling scheme for the tridiagonalization of a dense symmetric matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node, and integrates this kernel into computational kernels at higher levels of the software stack, which greatly improves the performance of these higher-level kernels.

39

•Journal Article•10.1145/3267101

Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

Ali Charara, +2 more

- 03 May 2019

- ACM Transactions on Mathematical Softwar...

TL;DR: This work describes the design and performance of a new class of batched triangular dense linear algebra kernels for very small data sizes (up to 256) on single and multiple GPUs; the kernels outperform existing state-of-the-art implementations.

24

•Posted Content

KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

Ahmad Abdelfattah, +2 more

- 07 Oct 2014

- arXiv: Mathematical Software

TL;DR: For symmetric and Hermitian matrices, KBLAS outperforms existing state-of-the-art implementations on all matrix sizes and achieves asymptotically up to 50% and 60% speedup over the best competitor on single-GPU and multi-GPU systems, respectively.

23

•Dissertation•10.25781/KAUST-5M8Z4

Exploiting Data Sparsity In Covariance Matrix Computations on Heterogeneous Systems

Ali Charara

- 24 May 2018

TL;DR: Exploiting Data Sparsity In Covariance Matrix Computations on Heterogeneous Systems shows the power of data sparsity to improve the quality of computations on heterogeneous systems.

1

•Dissertation•10.25781/KAUST-2QE21

Accelerating Scientific Applications using High Performance Dense and Sparse Linear Algebra Kernels on GPUs

Ahmad Abdelfattah

- 15 Jan 2015

TL;DR: This dissertation presents a uniform design strategy for optimizing critical memory-bound kernels on GPUs based on hierarchical register blocking, double buffering and latency hiding techniques, which leverages the performance of a wide range of standard numerical kernels found in dense and sparse linear algebra libraries.

1

References

•Journal Article•10.12694/SCPE.V11I3.654

Programming Massively Parallel Processors. A Hands-on Approach

Jie Cheng

- 01 Jan 2010

- Scalable Computing: Practice and Experie...

TL;DR: This comprehensive text/reference provides a foundation for understanding and implementing the parallel programming skills needed to achieve breakthrough results by developing parallel applications that perform well on certain classes of Graphics Processing Units (GPUs).

1.9K

•Book

Programming Massively Parallel Processors: A Hands-on Approach

David B. Kirk, +1 more

- 31 Dec 2012

TL;DR: Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture, and explores various techniques for constructing parallel programs in detail.

1.8K

•Proceedings Article•10.1109/SC.2008.5214359

Benchmarking GPUs to tune dense linear algebra

Volkov, +1 more

- 01 Jan 2008

TL;DR: It is argued that modern GPUs should be viewed as multithreaded multicore vector units; the work exploits blocking, as on vector computers, and the heterogeneity of the system by computing on both the GPU and the CPU.

711

•Proceedings Article•10.1145/2063384.2063392

Optimizing symmetric dense matrix-vector multiplication on GPUs

Rajib Nath, +3 more

- 12 Nov 2011

TL;DR: This work presents a new algorithm for optimizing the SYMV kernel on GPUs that brings up to a 7x speedup compared to the (latest) CUBLAS 4.0 and is general enough to be of interest for developing high-performance GPU kernels beyond the particular case of SYMV.

66