TL;DR: This work analyzes the main features and discusses the tuning of algorithms, developed in a long-term European research project, for the direct solution of sparse linear systems on distributed-memory computers.
Abstract: In this paper, we analyze the main features and discuss the tuning of the algorithms for the direct solution of sparse linear systems on distributed memory computers developed in the context of a long term European research project. The algorithms use a multifrontal approach and are especially designed to cover a large class of problems. The problems can be symmetric positive definite, general symmetric, or unsymmetric matrices, both possibly rank deficient, and they can be provided by the user in several formats. The algorithms achieve high performance by exploiting parallelism coming from the sparsity in the problem and that available for dense matrices. The algorithms use a dynamic distributed task scheduling technique to accommodate numerical pivoting and to allow the migration of computational tasks to lightly loaded processors. Large computational tasks are divided into subtasks to enhance parallelism. Asynchronous communication is used throughout the solution process to efficiently overlap communication with computation.
We illustrate our design choices by experimental results obtained on an SGI Origin 2000 and an IBM SP2 for test matrices provided by industrial partners in the PARASOL project.
TL;DR: This work considers the problem of designing a dynamic scheduling strategy that takes into account both workload and memory information in the context of the parallel multifrontal factorization and shows that a new scheduling algorithm significantly improves both the memory behaviour and the factorization time.
Abstract: We consider the problem of designing a dynamic scheduling strategy that takes into account both workload and memory information in the context of the parallel multifrontal factorization. The originality of our approach is that we base our estimations (work and memory) on a static optimistic scenario during the analysis phase. This scenario is then used during the factorization phase to constrain the dynamic decisions that compute fully irregular partitions in order to better balance the workload. We show that our new scheduling algorithm significantly improves both the memory behaviour and the factorization time. We give experimental results for large challenging real-life 3D problems on 64 and 128 processors.
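The idea of an irregular, memory-constrained partition can be illustrated with a minimal sketch. It is not the scheduling algorithm of the paper: the function name `partition_rows`, the per-row cost model, and the greedy rule are all assumptions made for illustration. Rows of one large task are handed out one at a time to the currently least-loaded processor that still has memory for another row, so the resulting shares are unequal and depend on both workload and memory.

```python
import heapq

def partition_rows(loads, free_mem, n_rows, work_per_row=1.0, mem_per_row=1.0):
    """Split n_rows among processors (hypothetical cost model).

    loads[p]    : work already assigned to processor p
    free_mem[p] : memory still available on processor p
    Returns the number of rows given to each processor.
    """
    shares = [0] * len(loads)
    # Only processors that can hold at least one row are candidates.
    heap = [(load, p) for p, load in enumerate(loads) if free_mem[p] >= mem_per_row]
    heapq.heapify(heap)
    for _ in range(n_rows):
        if not heap:
            raise RuntimeError("not enough memory across processors")
        load, p = heapq.heappop(heap)        # least-loaded candidate
        shares[p] += 1
        # Keep p as a candidate only if it has room for one more row.
        if (shares[p] + 1) * mem_per_row <= free_mem[p]:
            heapq.heappush(heap, (load + work_per_row, p))
    return shares

# Processor 0 is idle but short on memory, so it is capped at 3 rows.
print(partition_rows([0, 4, 2], [3, 10, 10], 6))  # → [3, 1, 2]
```

In this toy setting the memory cap plays the role of the constraint derived during analysis, while the greedy choice stands in for the dynamic decision taken during factorization.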
TL;DR: This paper describes a new parallel distributed-memory multifrontal approach; to handle numerical pivoting efficiently, a parallel asynchronous algorithm with dynamic scheduling of the computing tasks has been developed.
TL;DR: The program given here assembles and solves symmetric positive-definite equations as met in finite element applications; the technique is more involved than the standard band-matrix algorithms, but more efficient in the important case when two-dimensional or three-dimensional elements have other than corner nodes.
Abstract: The program given here assembles and solves symmetric positive-definite equations as met in finite element applications. The technique is more involved than the standard band-matrix algorithms, but it is more efficient in the important case when two-dimensional or three-dimensional elements have other than corner nodes. Artifices are included to improve efficiency when there are many right-hand sides, as in automated design. The organization of the program is described with reference to diagrams, full notation, specimen input data and supplementary comments on the ASA FORTRAN print-out.
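The frontal technique of interleaving assembly and elimination can be sketched as follows. This is an in-memory NumPy illustration under simplifying assumptions, not the FORTRAN program of the paper: the name `frontal_solve` and the `(dofs, Ke, fe)` element format are hypothetical, a single right-hand side is handled, the assembled matrix is assumed symmetric positive definite so that no pivoting is needed, and every unknown is assumed to appear in at least one element. An unknown is eliminated as soon as the last element containing it has been assembled, so only the current front is ever held as a dense matrix.

```python
import numpy as np

def frontal_solve(elements, n):
    """Solve K x = f, with K and f given as sums of element contributions.

    elements: list of (dofs, Ke, fe); dofs are global indices in 0..n-1.
    """
    # An unknown is "fully summed" once its last element has been assembled.
    last = {d: e for e, (dofs, _, _) in enumerate(elements) for d in dofs}
    active = []                      # unknowns currently in the front
    F, g = np.zeros((0, 0)), np.zeros(0)
    stored = []                      # eliminated pivot rows, for back-substitution
    for e, (dofs, Ke, fe) in enumerate(elements):
        # Grow the front to make room for unknowns seen for the first time.
        new = [d for d in dofs if d not in active]
        m = len(active)
        F2 = np.zeros((m + len(new), m + len(new)))
        F2[:m, :m] = F
        g2 = np.zeros(m + len(new))
        g2[:m] = g
        F, g, active = F2, g2, active + new
        # Assemble the element into the front.
        idx = [active.index(d) for d in dofs]
        F[np.ix_(idx, idx)] += Ke
        g[idx] += fe
        # Eliminate every unknown that will receive no further contributions.
        for d in [d for d in dofs if last[d] == e]:
            k = active.index(d)
            row = F[k, :] / F[k, k]          # normalized pivot row
            rhs = g[k] / F[k, k]
            col = F[:, k].copy()
            g -= col * rhs                   # update remaining right-hand side
            F -= np.outer(col, row)          # rank-1 Schur complement update
            others = [v for v in active if v != d]
            stored.append((d, others, np.delete(row, k), rhs))
            F = np.delete(np.delete(F, k, axis=0), k, axis=1)
            g = np.delete(g, k)
            active = others
    # Back-substitution in reverse elimination order.
    x = np.zeros(n)
    for d, others, coeffs, rhs in reversed(stored):
        x[d] = rhs - coeffs @ x[np.array(others, dtype=int)]
    return x

# Three two-node elements on a chain of four unknowns (made-up data).
Ke = np.array([[2.0, -1.0], [-1.0, 2.0]])
fe = np.array([1.0, 1.0])
elements = [([0, 1], Ke, fe), ([1, 2], Ke, fe), ([2, 3], Ke, fe)]
x = frontal_solve(elements, 4)
```

In this example the front never holds more than two unknowns, whereas a band solver would work on the fully assembled matrix. A production code would also write the stored pivot rows to secondary storage and reuse them for further right-hand sides.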
TL;DR: The design, implementation, and performance of a frontal code for the solution of large, sparse, unsymmetric systems of linear equations, and the extensive use of higher-level BLAS kernels within MA42 are described.
Abstract: We describe the design, implementation, and performance of a frontal code for the solution of large, sparse, unsymmetric systems of linear equations. The resulting software package, MA42, is included in Release 11 of the Harwell Subroutine Library and is intended to supersede the earlier MA32 package. We discuss in detail the extensive use of higher-level BLAS kernels within MA42 and illustrate the performance on a range of practical problems on a CRAY Y-MP, an IBM 3090, and an IBM RISC System/6000. We examine extending the frontal solution scheme to use multiple fronts to allow MA42 to be run in parallel. We indicate some directions for future development.
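Why higher-level BLAS kernels matter in a frontal code can be illustrated with a small sketch. It is not MA42's interface or pivoting strategy: the function name `eliminate_block` is hypothetical, the front is assumed dense with its fully summed variables ordered first, and the pivot block is assumed to be safely invertible without row interchanges. Eliminating the whole block at once turns a sequence of rank-1 updates into one matrix-matrix product, which NumPy hands to a Level 3 BLAS routine.

```python
import numpy as np

def eliminate_block(front, k):
    """Eliminate the first k variables of a dense frontal matrix.

    Returns the Schur complement F22 - F21 * inv(F11) * F12.
    """
    F11, F12 = front[:k, :k], front[:k, k:]
    F21, F22 = front[k:, :k], front[k:, k:]
    U12 = np.linalg.solve(F11, F12)   # block solve (LAPACK)
    return F22 - F21 @ U12            # one matrix-matrix product (Level 3 BLAS)

def eliminate_one_by_one(front, k):
    """Same result using k successive rank-1 updates, for comparison."""
    F = front.astype(float)
    for _ in range(k):
        F = F[1:, 1:] - np.outer(F[1:, 0], F[0, 1:]) / F[0, 0]
    return F

rng = np.random.default_rng(0)
front = rng.random((5, 5)) + 5.0 * np.eye(5)   # diagonally dominant test matrix
S = eliminate_block(front, 2)
```

Both functions compute the same Schur complement; the blocked form does the bulk of the arithmetic in a single call, which is the kind of operation that optimized BLAS libraries execute efficiently.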