TL;DR: This paper describes an efficient and robust hybrid parallel solver, the SPIKE algorithm, for narrow-banded linear systems. It is faster than the direct solvers in ScaLAPACK on parallel computing platforms and competitive in achieved accuracy for systems that are dense within the band.
Abstract: This paper describes an efficient and robust hybrid parallel solver, the SPIKE algorithm, for narrow-banded linear systems. Two versions of SPIKE with their built-in options are described in detail: the Recursive SPIKE version for handling non-diagonally dominant systems and the Truncated SPIKE version for diagonally dominant ones. These SPIKE schemes can be used either as direct solvers or as preconditioners for outer iterative schemes. Both versions are faster than the direct solvers in ScaLAPACK on parallel computing platforms, and quite competitive in terms of achieved accuracy for handling systems that are dense within the band.
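The partitioning idea behind SPIKE can be illustrated on the simplest banded case, a tridiagonal system split into two diagonal blocks. The following is a minimal sketch under stated assumptions, not the implementation described in the paper: the function names `thomas` and `spike_two_partitions` are hypothetical, the blocks are solved with the unpivoted Thomas algorithm (so the sketch is only appropriate for well-conditioned, for example diagonally dominant, blocks), and the two block solves run sequentially here although they are independent and could run in parallel.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system without pivoting (Thomas algorithm).

    a: sub-diagonal (a[0] is ignored), b: diagonal,
    c: super-diagonal (c[-1] is ignored), d: right-hand side.
    """
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def spike_two_partitions(a, b, c, d):
    """Solve a tridiagonal system by splitting it into two diagonal blocks."""
    n = len(b)
    m = n // 2  # size of the first block; requires n >= 2

    # Diagonal blocks. The entries c[m-1] and a[m] couple the two blocks;
    # thomas() ignores them because they sit in the unused end positions.
    a1, b1, c1 = a[:m], b[:m], c[:m]
    a2, b2, c2 = a[m:], b[m:], c[m:]
    upper = c[m - 1]  # couples the last row of block 1 to x[m]
    lower = a[m]      # couples the first row of block 2 to x[m-1]

    # Independent block solves: modified right-hand sides and the two spikes.
    g1 = thomas(a1, b1, c1, d[:m])
    g2 = thomas(a2, b2, c2, d[m:])
    v = thomas(a1, b1, c1, [0.0] * (m - 1) + [upper])
    w = thomas(a2, b2, c2, [lower] + [0.0] * (n - m - 1))

    # Reduced 2x2 system for the interface unknowns t = x[m-1], s = x[m]:
    #   t + v[-1] * s = g1[-1]
    #   w[0] * t + s  = g2[0]
    det = 1.0 - v[-1] * w[0]
    t = (g1[-1] - v[-1] * g2[0]) / det
    s = (g2[0] - w[0] * g1[-1]) / det

    # Retrieve the full solution from the interface values.
    x1 = [g1[i] - v[i] * s for i in range(m)]
    x2 = [g2[i] - w[i] * t for i in range(n - m)]
    return x1 + x2


# Small diagonally dominant example (values chosen only for illustration).
n = 8
a = [0.0] + [-1.0] * (n - 1)
b = [4.0] * n
c = [-1.0] * (n - 1) + [0.0]
d = [1.0] * n
x_spike = spike_two_partitions(a, b, c, d)
x_ref = thomas(a, b, c, d)
```

With more than two partitions the reduced system grows to couple all interfaces, which is where the recursive and truncated variants named in the abstract differ in how they treat it.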
TL;DR: A SPIKE scheme with multi-level parallelism is also introduced for solving large banded systems that are sparse within the band, and comparisons with the corresponding algorithms of ScaLAPACK are provided.
TL;DR: This solver is the first numerically stable tridiagonal solver for GPUs, based on the SPIKE algorithm for partitioning a large matrix into small independent matrices, which can be solved in parallel.
Abstract: In this paper, we present a scalable, numerically stable, high-performance tridiagonal solver. The solver is based on the SPIKE algorithm for partitioning a large matrix into small independent matrices, which can be solved in parallel. For each small matrix, our solver applies a general 1-by-1 or 2-by-2 diagonal pivoting algorithm, which is also known to be numerically stable. Our paper makes two major contributions. First, our solver is the first numerically stable tridiagonal solver for GPUs. Our solver provides comparable quality of stable solutions to Intel MKL and Matlab, at speed comparable to the GPU tridiagonal solvers in existing packages like CUSPARSE. It is also scalable to multiple GPUs and CPUs. Second, we present and analyze two key optimization strategies for our solver: a high-throughput data layout transformation for memory efficiency, and a dynamic tiling approach for reducing the memory access footprint caused by branch divergence.
TL;DR: It is proved that the SPIKE matrix is strictly diagonally dominant by rows with a degree no less than the original matrix, establishing tight upper bounds on the decay rate of the spikes as well as the truncation error.
Abstract: The truncated SPIKE algorithm is a parallel solver for linear systems which are banded and strictly diagonally dominant by rows. There are machines for which the current implementation of the algorithm is faster and scales better than the corresponding solver in ScaLAPACK (PDDBTRF/PDDBTRS). In this paper we prove that the SPIKE matrix is strictly diagonally dominant by rows with a degree no less than the original matrix. We establish tight upper bounds on the decay rate of the spikes as well as the truncation error. We analyze the error of the method and present the results of some numerical experiments which show that the accuracy of the truncated SPIKE algorithm is comparable to LAPACK and ScaLAPACK.
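The two claims in this abstract (the SPIKE matrix is at least as diagonally dominant as the original, and the spikes decay) can be checked numerically on a small case. The sketch below makes several assumptions that are not taken from the paper: the degree of dominance is taken to be the minimum over rows of |a_ii| divided by the sum of the absolute off-diagonal entries, the system is tridiagonal with two partitions, and the helper names `thomas`, `tridiagonal_degree`, and `two_partition_spikes` are hypothetical. It illustrates the statement on one example and proves nothing.

```python
def thomas(a, b, c, d):
    """Unpivoted tridiagonal solve; a[0] and c[-1] are ignored."""
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def tridiagonal_degree(a, b, c):
    """Assumed definition: min_i |a_ii| / sum_{j != i} |a_ij|.

    Assumes every row has at least one nonzero off-diagonal entry.
    """
    n = len(b)
    ratios = []
    for i in range(n):
        off = 0.0
        if i > 0:
            off += abs(a[i])
        if i < n - 1:
            off += abs(c[i])
        ratios.append(abs(b[i]) / off)
    return min(ratios)


def two_partition_spikes(a, b, c):
    """Return the spikes v (block 1) and w (block 2) for a split at n // 2."""
    n = len(b)
    m = n // 2
    v = thomas(a[:m], b[:m], c[:m], [0.0] * (m - 1) + [c[m - 1]])
    w = thomas(a[m:], b[m:], c[m:], [a[m]] + [0.0] * (n - m - 1))
    return v, w


# Illustrative strictly diagonally dominant tridiagonal matrix.
n = 16
a = [0.0] + [-1.0] * (n - 1)
b = [4.0] * n
c = [-1.0] * (n - 1) + [0.0]

v, w = two_partition_spikes(a, b, c)

# With two partitions, each row of the SPIKE matrix has a unit diagonal and
# a single off-diagonal entry (one component of v or w), so under the
# assumed definition its degree is 1 / (largest spike entry in magnitude).
degree_A = tridiagonal_degree(a, b, c)
degree_S = 1.0 / max(abs(s) for s in v + w)

# Spike magnitudes, listed from the interface outward, to inspect the decay.
v_from_interface = [abs(s) for s in reversed(v)]
w_from_interface = [abs(s) for s in w]
```

The rapid decay away from the interface is what the truncated variant relies on: only the spike entries nearest the interface are kept when the reduced system is formed, and the discarded entries give rise to the truncation error bounded in the paper.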
TL;DR: In this chapter, several existing optimization strategies are reviewed and summarized, and the purpose for each optimization is systematically analyzed.
Abstract: Tridiagonal solvers are important building blocks for many applications on GPUs. Although a wide range of algorithms and optimizations have been proposed for tridiagonal solvers, there are no comprehensive guidelines for building a high-performance tridiagonal solver for GPUs. In this chapter, we review and summarize several existing optimization strategies, and systematically analyze the purpose for each optimization. Finally, a case study, called SPIKE-CR, is given to demonstrate how to apply the guidelines to build a high-performance GPU tridiagonal solver.