1. What contributions have the authors mentioned in the paper "Efficient sparse matrix-vector multiplication on cache-based GPUs"?
This paper discusses efficient implementations of sparse matrix-vector multiplication on NVIDIA’s Fermi architecture, the first to introduce conventional L1 caches to GPUs. The authors present a parametrised algorithm, show the effects of parameter tuning on performance and introduce a method for determining the near-optimal set of parameters that incurs virtually no overhead. On a set of sparse matrices from the University of Florida Sparse Matrix Collection the authors show an average speed-up of 2.1 times over NVIDIA’s CUSPARSE 4.0 library in single precision and 1.4 times in double precision. Many algorithms require repeated evaluation of sparse matrix-vector products with the same matrix, so the authors introduce a dynamic run-time auto-tuning system which improves performance by 10-15% in seven iterations. The authors show how problem-specific knowledge can be used to improve performance by up to a factor of two.
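For readers unfamiliar with the operation being optimised, a minimal sequential sketch of sparse matrix-vector multiplication may help. It assumes the matrix is stored in compressed sparse row (CSR) form, which this summary does not state; the function name `csr_spmv` and its argument names are illustrative only, and this is a plain reference loop, not the authors' parametrised GPU algorithm.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed layout).

    row_ptr[i]:row_ptr[i + 1] is the slice of col_idx/values holding row i.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Walk the nonzeros of this row; x is read at irregular positions,
        # which is why caching of x matters on a GPU.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]] times [1, 2, 3]
example = csr_spmv([0, 2, 3, 5], [0, 2, 1, 0, 2],
                   [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
```

On a GPU the outer loop is what gets distributed across threads; how many threads cooperate on a row is the kind of choice a parametrised implementation would expose.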
2. How many threads can be allocated to a group of blocks?
Groups of warps called blocks are assigned to SMs. Each SM can hold up to 8 blocks at a time (depending on the amount of resources they require), but no more than 1536 threads (for compute capability 2.x).
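The two limits quoted above interact, and a short calculation makes that concrete. The sketch below uses only the figures from the answer (8 blocks and 1536 threads per SM for compute capability 2.x) plus the standard CUDA warp size of 32; the function name is illustrative, and the register and shared-memory limits that the answer alludes to are deliberately ignored.

```python
MAX_BLOCKS_PER_SM = 8       # from the answer above (compute capability 2.x)
MAX_THREADS_PER_SM = 1536   # from the answer above (compute capability 2.x)
WARP_SIZE = 32              # threads are scheduled in whole warps


def resident_blocks_per_sm(threads_per_block):
    """Upper bound on blocks resident on one SM, ignoring other resources."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    # Round the block size up to a whole number of warps.
    warps = -(-threads_per_block // WARP_SIZE)
    allocated_threads = warps * WARP_SIZE
    return min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // allocated_threads)


# Small blocks hit the block limit, large blocks hit the thread limit.
limits = {size: resident_blocks_per_sm(size) for size in (64, 192, 256, 512)}
```

With 64-thread blocks the 8-block limit leaves only 512 of the 1536 thread slots in use, whereas 192-thread blocks reach both limits at once.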
3. Why does the first assumption overestimate the amount of data moved?
Due to the semi-random assignment of thread blocks to SMs and because the L1 cache is probably not fully associative, the authors argue that the first assumption does not overestimate the actual amount of data moved by too much.
4. What is the way to determine the input parameters for the multiplication algorithm?
For general-purpose SpMV code, a constant-time fixed rule is required to decide the input parameters for the multiplication algorithm.
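To illustrate what a constant-time fixed rule could look like, here is a purely hypothetical sketch: it picks a number of threads to cooperate on each row from the average row length alone. The rule itself, the function name `pick_threads_per_row` and the cap at the warp size are assumptions made for this example; the answer above does not give the authors' actual rule.

```python
def pick_threads_per_row(num_rows, num_nonzeros, warp_size=32):
    """Hypothetical rule: smallest power of two >= average row length,
    capped at the warp size. Not the authors' rule, only an illustration."""
    if num_rows <= 0:
        raise ValueError("num_rows must be positive")
    avg_row_length = num_nonzeros / num_rows
    threads = 1
    # At most log2(warp_size) iterations, so the cost does not depend
    # on the size of the matrix.
    while threads < warp_size and threads < avg_row_length:
        threads *= 2
    return threads


short_rows = pick_threads_per_row(1000, 3000)   # average of 3 nonzeros per row
long_rows = pick_threads_per_row(10, 1000)      # average of 100 nonzeros per row
```

The point of such a rule is that it needs only quantities already known when the matrix is loaded (row and nonzero counts), so it adds no measurable set-up cost.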





