Optimizing memory-bound SYMV kernel on GPU hardware accelerators
Ahmad Abdelfattah, Jack Dongarra, David E. Keyes, Hatem Ltaief
- 17 Jul 2012
- pp 72-79
TL;DR: This work presents an optimized numerical kernel for computing the symmetric matrix-vector product on nVidia Fermi GPUs, and proposes a novel design to address the irregular memory accesses by hiding latency and increasing bandwidth.
Abstract: Hardware accelerators are becoming ubiquitous in high-performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product on nVidia Fermi GPUs. Because of its inherent memory-bound nature, this kernel is critical in the tridiagonalization of a symmetric dense matrix, which is a preprocessing step for calculating the eigenpairs. Using a novel design that addresses the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the equivalent CUBLAS 4.0 kernel, and 7-8% and 30% improvements over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, in single and double precision arithmetic, respectively.
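To make the operation concrete, the following is a minimal CPU reference sketch of the symmetric matrix-vector product (SYMV), y = alpha*A*x + beta*y, that reads only the lower triangle of A. It is not the GPU kernel described in the abstract; the function name `symv_lower` and the list-of-lists storage are assumptions chosen for illustration. It shows why the kernel is memory-bound: each stored matrix element is loaded once and contributes only a handful of floating-point operations.

```python
def symv_lower(n, alpha, a, x, beta, y):
    """Return alpha*A*x + beta*y for a symmetric n-by-n matrix A.

    Only the lower triangle of `a` (entries a[i][j] with i >= j) is read;
    the strictly upper part is never accessed. Hypothetical helper for
    illustration, not an API of CUBLAS, MAGMA, or KBLAS.
    """
    out = [beta * yi for yi in y]
    for j in range(n):
        # Diagonal element contributes once.
        out[j] += alpha * a[j][j] * x[j]
        for i in range(j + 1, n):
            aij = a[i][j]                 # loaded once from memory
            out[i] += alpha * aij * x[j]  # used as A[i][j]
            out[j] += alpha * aij * x[i]  # reused as A[j][i] by symmetry
    return out


if __name__ == "__main__":
    # The upper triangle holds placeholder values (-99) that are never read.
    a = [[2, -99, -99],
         [1,   3, -99],
         [4,   5,   6]]
    print(symv_lower(3, 1, a, [1, 2, 3], 0, [0, 0, 0]))  # [16, 22, 32]
```

The two updates per off-diagonal element are what make the access pattern irregular on a GPU: one update follows the row index and the other follows the column index, so a thread block that streams a tile of A has to write to two different regions of the result vector.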
Citations
Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems
Ichitaro Yamazaki, Tingxing Dong, Raffaele Solcà, Stanimire Tomov, Jack Dongarra, Thomas C. Schulthess
TL;DR: This work develops a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node, and integrates this kernel into computational kernels at higher levels of the software stack, which greatly improves the performance of these higher-level kernels.
Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs
TL;DR: This work describes the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs and outperforms existing state-of-the-art implementations.
•Posted Content
KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators
TL;DR: For symmetric and Hermitian matrices, KBLAS outperforms existing state-of-the-art implementations on all matrix sizes and achieves asymptotically up to 50% and 60% speedup against the best competitor on single-GPU and multi-GPU systems, respectively.
Exploiting Data Sparsity In Covariance Matrix Computations on Heterogeneous Systems
Ali Charara
- 24 May 2018
TL;DR: Exploiting Data Sparsity In Covariance Matrix Computations on Heterogeneous Systems shows the power of data sparsity to improve the quality of computations on heterogeneous systems.
Accelerating Scientific Applications using High Performance Dense and Sparse Linear Algebra Kernels on GPUs
Ahmad Abdelfattah
- 15 Jan 2015
TL;DR: This dissertation presents a uniform design strategy for optimizing critical memory-bound kernels on GPUs based on hierarchical register blocking, double buffering and latency hiding techniques, which leverages the performance of a wide range of standard numerical kernels found in dense and sparse linear algebra libraries.
References
Programming Massively Parallel Processors. A Hands-on Approach
TL;DR: This comprehensive text/reference provides a foundation for understanding and implementing the parallel programming skills needed to achieve breakthrough results by developing parallel applications that perform well on certain classes of Graphics Processing Units (GPUs).
•Book
Programming Massively Parallel Processors: A Hands-on Approach
David B. Kirk, Wen-mei W. Hwu
- 31 Dec 2012
TL;DR: Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture, and explores in detail various techniques for constructing parallel programs.
Benchmarking GPUs to tune dense linear algebra
TL;DR: It is argued that modern GPUs should be viewed as multithreaded multicore vector units; the work exploits blocking, as on vector computers, and the heterogeneity of the system by computing on both the GPU and the CPU.
Optimizing symmetric dense matrix-vector multiplication on GPUs
Rajib Nath, Stanimire Tomov, Tingxing "Tim" Dong, Jack Dongarra
- 12 Nov 2011
TL;DR: This work presents a new algorithm for optimizing the SYMV kernel on GPUs that brings up to a 7x speedup compared to the (latest) CUBLAS 4.0 and is general enough to be of interest for developing high-performance GPU kernels beyond the particular case of SYMV.
High Performance Computing for Computational Science - VECPAR 2010