Proceedings Article, DOI: 10.1145/2597652.2597678
An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs
Arash Ashari, Naser Sedaghati, John Eisenlohr, P. Sadayappan
- 10 Jun 2014
- pp 273-282
74
TL;DR: A new blocked row-column (BRC) storage format with a novel two-dimensional blocking mechanism addresses the challenges of SpMV on GPUs: it reduces thread divergence by reordering rows of the input matrix and grouping rows with nearly equal numbers of non-zero elements onto the same execution units (i.e., warps).
Abstract: Sparse matrix-vector multiplication (SpMV) is one of the key operations in linear algebra. Thread divergence, load imbalance, and non-coalesced, indirect memory accesses caused by sparsity and irregularity are challenges in optimizing SpMV on GPUs. In this paper we present a new blocked row-column (BRC) storage format with a novel two-dimensional blocking mechanism that effectively addresses these challenges: it reduces thread divergence by reordering rows of the input matrix and grouping rows with nearly equal numbers of non-zero elements onto the same execution units (i.e., warps). BRC improves load balance by partitioning rows into blocks with a constant number of non-zeros such that different warps perform the same amount of work. We also present an efficient auto-tuning technique that optimizes BRC performance by judicious selection of block size based on the sparsity characteristics of the matrix. A CUDA implementation of BRC outperforms the NVIDIA CUSP and cuSPARSE libraries and other state-of-the-art SpMV formats on a range of unstructured sparse matrices from multiple application domains. The BRC format has been integrated with PETSc, enabling its use in PETSc's solvers.
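The two ideas in the abstract, grouping rows of similar length and splitting each group into bounded blocks, can be illustrated with a small CPU-side sketch. This is not the authors' BRC layout or their CUDA kernel: the function names, the `warp_size` and `block_width` parameters, and the use of CSR arrays as input are all assumptions made for illustration, and a plain Python loop stands in for the work a warp would do.

```python
# Illustrative sketch only: NOT the BRC format from the paper.
# All names and parameters below are hypothetical.

def row_lengths(row_ptr):
    """Number of non-zeros in each row of a CSR matrix."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]

def group_rows_by_length(row_ptr, warp_size):
    """Sort rows by non-zero count (longest first) and cut the sorted
    order into groups of warp_size rows, so that rows in one group
    have similar lengths."""
    lengths = row_lengths(row_ptr)
    order = sorted(range(len(lengths)), key=lambda r: -lengths[r])
    return [order[i:i + warp_size] for i in range(0, len(order), warp_size)]

def split_group_into_blocks(group, lengths, block_width):
    """Split one row group column-wise into blocks that hold at most
    block_width non-zeros per row. Returns (rows, start, width) tuples."""
    longest = max(lengths[r] for r in group)
    blocks = []
    for start in range(0, longest, block_width):
        rows = [r for r in group if lengths[r] > start]
        blocks.append((rows, start, min(block_width, longest - start)))
    return blocks

def spmv_csr(row_ptr, col_idx, vals, x):
    """Reference row-by-row CSR SpMV, used to check the blocked version."""
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += vals[k] * x[col_idx[k]]
    return y

def spmv_blocked(row_ptr, col_idx, vals, x, warp_size, block_width):
    """SpMV that visits the matrix block by block; each block is the
    unit of work one warp would process in this sketch."""
    lengths = row_lengths(row_ptr)
    y = [0.0] * len(lengths)
    for group in group_rows_by_length(row_ptr, warp_size):
        for rows, start, width in split_group_into_blocks(group, lengths, block_width):
            for r in rows:
                lo = row_ptr[r] + start
                hi = min(lo + width, row_ptr[r + 1])
                for k in range(lo, hi):
                    y[r] += vals[k] * x[col_idx[k]]
    return y
```

Because every block is capped at `warp_size` rows by `block_width` non-zeros per row, a long row contributes several blocks instead of stalling a single group, which is the load-balancing intuition the abstract describes. The actual storage layout, padding, and block-size auto-tuning are defined in the paper itself and are not reproduced here.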
Citations
CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
Weifeng Liu, Brian Vinter
- 08 Jun 2015
TL;DR: CSR5 (Compressed Sparse Row 5) is a new storage format that offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi; its low format-conversion overhead makes it practical for real-world applications such as a solver with only tens of iterations.
Fast sparse matrix-vector multiplication on GPUs for graph applications
Arash Ashari, Naser Sedaghati, John Eisenlohr, Srinivasan Parthasarathy, P. Sadayappan
- 16 Nov 2014
TL;DR: ACSR is presented, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups which have a similar number of non-zero elements, and thus avoids significant preprocessing overheads.
174
Posted Content
CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
Weifeng Liu, Brian Vinter
TL;DR: In this article, the authors proposed CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi.
154
Automatic Selection of Sparse Matrix Representation on GPUs
Naser Sedaghati, Te Mu, Louis-Noël Pouchet, Srinivasan Parthasarathy, P. Sadayappan
- 08 Jun 2015
TL;DR: This paper performs extensive characterization of pertinent sparsity features of around 700 sparse matrices and their SpMV performance with a number of sparse representations implemented in the NVIDIA CUSP and cuSPARSE libraries, and builds a decision model using machine learning to automatically select the best representation to use for a given sparse matrix on a given target platform.
143
Evaluation Criteria for Sparse Matrix Storage Formats
Daniel Langr, Pavel Tvrdík
TL;DR: Ten evaluation criteria for sparse matrix storage formats are established, their advantages and disadvantages are discussed, and general suggestions for format authors/evaluators are provided to make their work more valuable for the HPC community.
141
References
Scalable parallel programming with CUDA
John R. Nickolls, Ian Buck, Michael Garland, Kevin Skadron
- 11 Aug 2008
TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
Scalable Parallel Programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?
TL;DR: In this article, the authors present a framework to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism on manycore GPUs with widely varying numbers of cores.
1.4K
Implementing sparse matrix-vector multiplication on throughput-oriented processors
Nathan Bell, Michael Garland
- 14 Nov 2009
TL;DR: This work explores SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes, including structured grid and unstructured mesh matrices.
SPARSKIT: A basic tool kit for sparse matrix computations
Youcef Saad
- 21 May 1990
TL;DR: The main features of a tool package for manipulating and working with sparse matrices, to provide basic tools to facilitate the exchange of software and data between researchers in sparse matrix computations, are presented.
Scan primitives for GPU computing
Shubhabrata Sengupta, Mark J. Harris, Yao Zhang, John D. Owens
- 04 Aug 2007
TL;DR: Using the scan primitives, this work shows novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyzes the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
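As a rough illustration of how a scan primitive can drive sparse matrix-vector multiply, the sketch below computes all element-wise products, runs a segmented inclusive scan with one segment per row, and reads each row's result from the last element of its segment. It assumes CSR input, and the function names are hypothetical. The serial loop stands in for a parallel scan, so this shows the data flow only, not the GPU implementation from the cited work.

```python
# Illustrative sketch only: a serial stand-in for a parallel segmented scan.
# Function names and the CSR input layout are assumptions.

def segmented_inclusive_scan(values, head_flags):
    """Running sum that restarts wherever head_flags is True."""
    out = []
    running = 0.0
    for v, is_head in zip(values, head_flags):
        running = v if is_head else running + v
        out.append(running)
    return out

def spmv_segmented_scan(row_ptr, col_idx, vals, x):
    """SpMV expressed as: multiply, segmented scan, gather row totals."""
    # Step 1: one product per stored non-zero.
    products = [v * x[c] for v, c in zip(vals, col_idx)]
    # Step 2: mark the first non-zero of every non-empty row as a segment head.
    n_rows = len(row_ptr) - 1
    flags = [False] * len(products)
    for r in range(n_rows):
        if row_ptr[r] < row_ptr[r + 1]:
            flags[row_ptr[r]] = True
    # Step 3: per-row running sums.
    scanned = segmented_inclusive_scan(products, flags)
    # Step 4: the last element of each segment is that row's dot product;
    # empty rows contribute 0.
    return [
        scanned[row_ptr[r + 1] - 1] if row_ptr[r + 1] > row_ptr[r] else 0.0
        for r in range(n_rows)
    ]
```

In this formulation the amount of work depends on the number of non-zeros, not on how they are distributed across rows, which is why scan-based approaches are attractive for matrices with irregular row lengths.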