TL;DR: The most common data exchange operations in parallel numerical methods are examined and different methods for performing them efficiently on each of several ensemble architectures are proposed and analyzed.
Abstract: In this paper we consider different methods for exchanging data among processors in parallel computers. The most common data exchange operations in parallel numerical methods are examined and different methods for performing them efficiently on each of several ensemble architectures are proposed and analyzed. Our objective is to compare the performance of each particular architecture on each of the data exchange operations.
TL;DR: In this paper, the authors describe block algorithms for the reduction of a real symmetric matrix to tridiagonal form and of a general real matrix to bidiagonal or Hessenberg form using Householder transformations, and demonstrate how the initial reduction to tridiagonal or bidiagonal form may be pipelined with the divide and conquer technique for computing the eigensystem of a symmetric matrix or the singular value decomposition of a general matrix to achieve algorithms which are load balanced and rich in matrix-matrix operations.
Abstract: In this paper we describe block algorithms for the reduction of a real symmetric matrix to tridiagonal form and for the reduction of a general real matrix to either bidiagonal or Hessenberg form using Householder transformations. The approach is to aggregate the transformations and to apply them in a blocked fashion, thus achieving algorithms that are rich in matrix-matrix operations. These reductions to condensed form typically comprise a preliminary step in the computation of eigenvalues or singular values. With this in mind, we also demonstrate how the initial reduction to tridiagonal or bidiagonal form may be pipelined with the divide and conquer technique for computing the eigensystem of a symmetric matrix or the singular value decomposition of a general matrix to achieve algorithms which are load balanced and rich in matrix-matrix operations.
TL;DR: Two parallel clustering algorithms are presented and the time complexity of the proposed single-link hierarchical clustering algorithm is reduced from O(MN^2) of the uniprocessor algorithm to O(nN) with MN processors.
Abstract: Clustering techniques play an important role in exploratory pattern analysis, unsupervised learning and image segmentation applications. Many clustering algorithms, both partitional clustering and hierarchical clustering, require intensive computation, even for a modest number of patterns. This paper presents two parallel clustering algorithms. For a clustering problem with N = 2^n patterns and M = 2^m features, the time complexity of the traditional partitional clustering algorithm on a single processor computer is O(MNK), where K is the number of clusters. The proposed algorithm on an SIMD computer with MN processors has a time complexity of O(K(n + m)). The time complexity of the proposed single-link hierarchical clustering algorithm is reduced from O(MN^2) of the uniprocessor algorithm to O(nN) with MN processors.
TL;DR: Bounds are derived under fairly general conditions on the synchronization cost function that have implications for a variety of parallel architectures and can be used to derive several popular ‘laws’ about processor performance and efficiency.
Abstract: The impact of synchronization and communication overhead on the performance of parallel processors is investigated with the aim of establishing upper bounds on the performance of parallel processors under ideal conditions. Bounds are derived under fairly general conditions on the synchronization cost function. These bounds have implications for a variety of parallel architectures and can be used to derive several popular ‘laws’ about processor performance and efficiency.
TL;DR: The purpose is to develop algorithms that are amenable to implementation on modern multiprocessor architectures and to suggest parallel algorithms for implementing Kalman type sequential filters in the analysis and solution of estimation problems in control and signal processing.
Abstract: The process of modifying least squares computations by updating the covariance matrix has been used in control and signal processing for some time in the context of linear sequential filtering. Here we give an alternative derivation of the process and provide extensions to downdating. Our purpose is to develop algorithms that are amenable to implementation on modern multiprocessor architectures. In particular, the inverse Cholesky factor R^{-1} is considered and it is shown that R^{-1} can be updated (downdated) by applying the same sequence of orthogonal (hyperbolic) plane rotations that are used to update (downdate) R. We have attempted to provide some new insights into least squares modification processes and to suggest parallel algorithms for implementing Kalman type sequential filters in the analysis and solution of estimation problems in control and signal processing.
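The updating claim above can be illustrated with a small sequential sketch. It assumes a rank-one update (a new observation row x is appended), stores the transpose of R^{-1}, and uses a hypothetical function name `update_factors`; none of these details come from the abstract, and downdating with hyperbolic rotations is not shown. The plane rotations that annihilate x against R are applied, unchanged, to the transposed inverse factor padded with a zero row.

```python
import math


def update_factors(R, Rinv_T, x):
    """Rank-one update with plane rotations (illustrative sketch).

    R is upper triangular (list of rows) and Rinv_T is the transpose of R^{-1}.
    Returns (R_new, Rinv_T_new) with R_new^T R_new = R^T R + x x^T.
    """
    n = len(R)
    R = [row[:] for row in R]
    S = [row[:] for row in Rinv_T]
    x = list(x)        # extra row appended below R
    y = [0.0] * n      # extra zero row appended below R^{-T}
    for k in range(n):
        r = math.hypot(R[k][k], x[k])
        c, s = R[k][k] / r, x[k] / r       # rotation that zeroes x[k]
        for j in range(n):
            # The same rotation acts on both stacked matrices.
            R[k][j], x[j] = c * R[k][j] + s * x[j], -s * R[k][j] + c * x[j]
            S[k][j], y[j] = c * S[k][j] + s * y[j], -s * S[k][j] + c * y[j]
    return R, S


R0 = [[2.0, 1.0], [0.0, 3.0]]
R0_inv_T = [[0.5, 0.0], [-1.0 / 6.0, 1.0 / 3.0]]   # transpose of the inverse of R0
R_new, S_new = update_factors(R0, R0_inv_T, [1.0, 2.0])
```

Because every rotation touches only row k and the appended row, the two inner updates are independent of each other, which is what makes the scheme attractive for parallel implementation.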
TL;DR: An elegant algorithm is described which makes use of reference counting to reclaim disused storage for reuse and collects garbage as soon as it is created and has minimal overheads in terms of space requirements and interprocess communications.
Abstract: A good programming language permits the programmer to concentrate on his application rather than on low-level implementation details. In particular, he does not have to concern himself with storage allocation because memory management is dealt with efficiently by the implementation of the language. To reclaim disused storage for reuse the implementation incorporates a garbage collection algorithm. When the language is implemented on a distributed multiprocessor architecture, this algorithm ideally collects garbage as soon as it is created and has minimal overheads in terms of space requirements and interprocess communications. We describe here an elegant algorithm with these properties which makes use of reference counting.
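As a point of reference, plain single-process reference counting can be sketched as follows. This is not the distributed algorithm of the paper: the `Cell` class, the `heap` set and the helper names are hypothetical, no interprocess messages are modelled, and cyclic structures are not handled. It only shows the property the abstract highlights, namely that storage is reclaimed at the moment the last reference disappears.

```python
class Cell:
    """A heap cell with an explicit reference count (hypothetical helper)."""

    def __init__(self, name, heap):
        self.name = name
        self.count = 0
        self.children = []
        self.heap = heap
        heap.add(name)


def add_ref(cell):
    """Record one more reference to `cell`."""
    cell.count += 1


def drop_ref(cell):
    """Remove one reference; reclaim the cell at once if none remain."""
    cell.count -= 1
    if cell.count == 0:
        cell.heap.discard(cell.name)    # storage is reclaimed immediately
        for child in cell.children:     # a dead cell no longer references its children
            drop_ref(child)
        cell.children.clear()


def link(parent, child):
    """Make `parent` point to `child`."""
    parent.children.append(child)
    add_ref(child)


heap = set()
root, a, b = Cell("root", heap), Cell("a", heap), Cell("b", heap)
add_ref(root)     # an external reference keeps the root alive
link(root, a)
link(a, b)
drop_ref(root)    # dropping the last reference frees root, then a, then b
```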
TL;DR: A new height-reducing reordering algorithm based on elimination tree rotation is introduced; it is shown to produce trees with minimum or near-minimum height while requiring significantly less reordering time.
Abstract: We consider the problem of finding equivalent reorderings of a sparse matrix so that the reordered matrix is suitable for parallel Gaussian elimination. The elimination tree structure is used as our parallel model. We show that the reordering scheme by Jess and Kees generates an elimination tree with minimum height among all such trees from the class of equivalent reorderings. A new height-reducing algorithm based on elimination tree rotation is also introduced. Experimental results are provided to compare these two approaches. The new reordering algorithm using rotation is shown to produce trees with minimum or near-minimum height. Yet, it requires significantly less reordering time.
TL;DR: The authors' experiments, using up to 6 processors, show that the best strategy is to compute in parallel a local ILU factorization on slightly overlapping blocks; the influence of the overlap size on efficiency is also discussed.
Abstract: We describe the parallel implementation on a vector multiprocessor of two extensions of the preconditioned conjugate gradient algorithm for nonsymmetric systems: the conjugate gradient squared algorithm (CGS) and the generalized minimal residual algorithm GMRES(κ). For both methods, we consider preconditioning by a diagonal matrix and by an incomplete LU factorization. The uniprocessor implementation of CGS and GMRES(κ) is based on a general sparse matrix representation to deal with matrices with an irregular sparsity structure (the ITPACKV format). The parallelization of the non-preconditioned versions is straightforward and leads to very good speedups. The parallelization of the ILU preconditioned versions is more challenging. We describe two parallel preconditioners: (i) we compute the global ILU factorization and partition it into one block per processor; (ii) we compute in parallel local partial factorizations for each block of the matrix. The blocks can overlap, and this partitioning is not restricted to matrices with a special sparsity structure. In the iteration loop the information between blocks is exchanged at each synchronization point, through the matrix-vector product and through the overlap between blocks. We discuss how the size of the overlap influences the efficiency. Our experiments, using up to 6 processors, show that the best strategy is to compute in parallel a local ILU factorization on slightly overlapping blocks.
TL;DR: A novel distributed implementation of the simulated annealing algorithm runs on a linear chain of processors driven by a host processor, which plays only a supervisory role, so that the efficiency of the algorithm remains high as the number of processors is increased.
Abstract: This paper describes and analyses a novel distributed implementation of the simulated annealing algorithm to find a good solution to the travelling salesman problem. The implementation runs on a linear chain of processors driven by a host processor, which plays only a supervisory role, so that the bulk of processing takes place on the chain and the efficiency of the algorithm remains high as the number of processors is increased.
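For orientation, the underlying sequential method can be sketched as below. This is ordinary single-processor simulated annealing with segment-reversal moves; the distributed linear-chain scheme of the paper is not reproduced, and the function names, cooling schedule and parameter values are assumptions chosen for illustration.

```python
import math
import random


def tour_length(tour, pts):
    """Length of the closed tour visiting pts in the given order."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


def anneal(pts, t_start=1.0, t_end=1e-3, cooling=0.995, seed=0):
    """Sequential simulated annealing; returns the best tour found and its length."""
    rng = random.Random(seed)
    tour = list(range(len(pts)))
    cur = tour_length(tour, pts)
    best, best_len = tour[:], cur
    t = t_start
    while t > t_end:
        i, j = sorted(rng.sample(range(len(pts)), 2))
        cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]   # reverse one segment
        cand_len = tour_length(cand, pts)
        # Always accept improvements; accept a worse tour with Boltzmann probability.
        if cand_len < cur or rng.random() < math.exp((cur - cand_len) / t):
            tour, cur = cand, cand_len
            if cur < best_len:
                best, best_len = tour[:], cur
        t *= cooling                                          # geometric cooling
    return best, best_len


_rng = random.Random(1)
cities = [(_rng.random(), _rng.random()) for _ in range(12)]
best_tour, best_length = anneal(cities)
```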
TL;DR: This computational model describes the performance degradation by a single hardware parameter f_{1/2}, which is the number of floating-point operations per memory reference required to achieve half of the peak performance of the computer's arithmetic pipelines.
Abstract: When the arithmetic performance of a computer is greater than the memory or communication bandwidth, the overall speed of the computer is strongly dependent on the amount of arithmetic performed, on average, per main memory reference. The performance degradation can be described by a single hardware parameter f_{1/2}, which is the number of floating-point operations per memory reference required to achieve half of the peak performance of the computer's arithmetic pipelines. Excellent agreement is shown between this computational model and measured timings on the FPS 5000 computer.
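A minimal numerical sketch of such a model follows. The abstract only defines f_{1/2}; the specific functional form r = r_peak / (1 + f_{1/2} / f) used here is an assumption that is consistent with that definition (it yields exactly half of peak at f = f_{1/2}), and the name `achieved_rate` and the sample numbers are hypothetical.

```python
def achieved_rate(r_peak, f, f_half):
    """Estimated rate for an arithmetic intensity of f flops per memory reference.

    Assumed form: r = r_peak / (1 + f_half / f), which gives r_peak / 2 at f == f_half.
    """
    return r_peak / (1.0 + f_half / f)


# Illustrative values only: a peak of 100 Mflop/s and f_half = 2 flops per reference.
for f in (0.5, 2.0, 8.0, 32.0):
    print(f, achieved_rate(100.0, f, 2.0))
```

Under this form the achieved rate approaches the peak only when the arithmetic per memory reference is large compared with f_{1/2}.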
TL;DR: This work focuses on nonsymmetric systems of equations whose eigenvalues lie on both sides of the imaginary axis, or whose symmetric part is not positive definite, and solves these systems using a block Kaczmarz projection method with conjugate gradient acceleration.
Abstract: We consider the iterative solution of large sparse linear systems of equations arising from elliptic and parabolic partial differential equations in two or three space dimensions. Specifically, we focus our attention on nonsymmetric systems of equations whose eigenvalues lie on both sides of the imaginary axis, or whose symmetric part is not positive definite. These systems of equations are solved using a block Kaczmarz projection method with conjugate gradient acceleration. The algorithm has been designed with special emphasis on its suitability for multiprocessors. In the first part of the paper, we study the numerical properties of the algorithm and compare its performance with other algorithms such as the conjugate gradient method on the normal equations, and conjugate gradient-like schemes such as ORTHOMIN(k), GCR(k) and GMRES(k). We also study the effect of using various preconditioners with these methods. In the second part of the paper, we describe the implementation of our algorithm on the CRAY X-MP/48 multiprocessor, and study its behavior as the number of processors is increased.
TL;DR: A multitasked implementation of the scene-labeling algorithm in a distributed global memory system is detailed, in which neither synchronization nor critical sections are necessary to enforce correctness of execution.
Abstract: Asynchronous algorithms have the potential to be more efficient than synchronized algorithms in multiprocessors because the overheads associated with synchronization are removed. The sufficient conditions for the convergence of numerical asynchronous iterations have been established; however, relaxation procedures are also common in non-numerical applications. In this paper, we introduce sufficient conditions for the convergence of asynchronous iterations defined on any set of data, finite or infinite, countable or not. The sufficient conditions are then applied to the scene-labeling problem. A multitasked implementation of the scene-labeling algorithm in a distributed global memory system is detailed. In this implementation, neither synchronization nor critical sections are necessary to enforce correctness of execution.
TL;DR: Design approaches and tools for performance visualization are the subject of this paper and the importance of dynamic, visual scientific data presentation has only recently been recognized.
Abstract: Parallel computer systems are among the most complex of man's creations, making satisfactory performance characterization difficult. Despite this complexity, there are strong, indeed, almost irresistible, incentives to quantify parallel system performance using a single metric. The fallacy lies in succumbing to such temptations. A complete performance characterization requires not only an analysis of the system's constituent levels, it also requires both static and dynamic characterizations. Static or average behavior analysis may mask transients that dramatically alter system performance. Although the human visual system is remarkably adept at interpreting and identifying anomalies in false color data, the importance of dynamic, visual scientific data presentation has only recently been recognized. Large, complex parallel systems pose equally vexing performance interpretation problems. Data from hardware and software performance monitors must be presented in ways that emphasize important events while eliding irrelevant details. Design approaches and tools for performance visualization are the subject of this paper.
TL;DR: This paper considers canonical correlations and a generalization of the singular value decomposition (SVD) that involves three matrices that can be used in important applications such as weighted least squares and optimal prediction.
Abstract: In this paper we consider canonical correlations and a generalization of the singular value decomposition (SVD) that involves three matrices. We show how the two matrix problems are related and how they can be used in important applications such as weighted least squares and optimal prediction. We present two new computational procedures for the problems based on implicit SVD methods for triple matrix products. Our algorithms are well suited for parallel implementation.
TL;DR: The term robustness is introduced and explained in detail with two examples, the SUPRENUM and the hypercube architecture, and it is shown that a simple mapping strategy (optimal clustering of the processes) gives almost as good results as the optimal mapping.
Abstract: Machines with distributed memory have the mapping problem—assigning processes to processors. In this paper we define the mapping problem as an optimization problem and discuss the question, how far is an optimum solution from an average or random solution. The term robustness is introduced and explained in detail with two examples, the SUPRENUM and the hypercube architecture. For the SUPRENUM architecture we show that a simple mapping strategy (optimal clustering of the processes) gives almost as good results as the optimal mapping. Optimal mapping is more important for the hypercube architecture. For nonhomogeneous networks adaptive routing seems promising.
TL;DR: An adaptation of the Block Cyclic Reduction (BCR) algorithm for a multi-vector processor that addresses the main bottleneck in the solution of linear systems whose coefficient matrix is the product of tridiagonal matrices.
Abstract: This paper presents an adaptation of the Block Cyclic Reduction (BCR) algorithm for a multi-vector processor. The main bottleneck of BCR lies in the solution of linear systems whose coefficient matrix is the product of tridiagonal matrices. This bottleneck is handled by expressing the rational function corresponding to the inverse of this product as a sum of elementary fractions. As a result the solution of this system leads to parallel solutions of tridiagonal systems. Numerical experiments performed on an Alliant FX/8 are reported.
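The partial fraction idea can be written out as follows. The notation is assumed for illustration: the product is taken to consist of k shifted copies of one tridiagonal matrix T with pairwise distinct shifts θ_i, which the abstract does not state explicitly.

```latex
% Scalar identity for pairwise distinct shifts \theta_1, \dots, \theta_k:
\frac{1}{\prod_{i=1}^{k} (t - \theta_i)}
  = \sum_{i=1}^{k} \frac{\alpha_i}{t - \theta_i},
\qquad
\alpha_i = \prod_{\substack{j=1 \\ j \neq i}}^{k} \frac{1}{\theta_i - \theta_j}.

% Applied to a tridiagonal matrix T, the solve with the product becomes
% k independent tridiagonal solves followed by a weighted sum:
\Bigl[\, \prod_{i=1}^{k} (T - \theta_i I) \Bigr]^{-1} b
  = \sum_{i=1}^{k} \alpha_i \, (T - \theta_i I)^{-1} b .
```

Each term of the sum requires only one tridiagonal solve with the same right-hand side b, so the k solves can proceed in parallel instead of one after another.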
TL;DR: The parallelization of general DES applications written in SIMULA, a general purpose process oriented language whose structure allows the creation of processes which can participate in a quasi-parallel execution according to an interleaved fashion is proposed.
Abstract: The area of Discrete Event Simulation (DES) is the least impacted by parallel processing even though most of its applications require tremendous amounts of processing time. The common approach of parallelizing individually special purpose programs leads to very limited improvements in performance. We propose here the parallelization of general DES applications written in SIMULA, as a part of an ongoing project that aims towards developing methodologies and architectures for parallel DES. SIMULA is a general purpose process oriented language whose structure allows the creation of processes which can participate in a quasi-parallel execution according to an interleaved fashion. Problems related to the process interference representation, parallel scheduling and process synchronization are defined and solved. To test and verify the theoretical results the parallel execution of experimental and real DES applications has been simulated. The results show that significant improvement in performance can be expected.
TL;DR: This paper presents a specification-methodology for distributed computer architectures based on the axiomatic architecture description language AADL, which allows for a modular and concise specification of multiprocessor architectures at levels of abstraction ranging from compiler/operating-system interface down to chip-level.
Abstract: This paper presents a specification-methodology for distributed computer architectures based on the axiomatic architecture description language AADL. AADL allows for a modular and concise specification of multiprocessor architectures at levels of abstraction ranging from compiler/operating-system interface down to chip-level. The specification method is illustrated by several examples taken from an AADL-definition of an abstract view of DOOM, a distributed object oriented machine, currently developed at Philips Research Laboratories, Eindhoven within ESPRIT-project 415.
TL;DR: Studying direct and iterative numerical methods on nearest neighbor type distributed systems gives speedup evaluations depending on computation, communication and control costs and compares some of them with experimental measurements.
Abstract: Communication and synchronization costs are a key problem in parallel computing. Studying direct and iterative numerical methods on nearest neighbor type distributed systems, we give speedup evaluations depending on computation, communication and control costs and compare some of them with experimental measurements.
TL;DR: An ordering heuristic which appears to perform well with respect to parallel factorization time is one based on finding independent sets of vertices in the matrix adjacency graph.
Abstract: Commonly used matrix ordering techniques are designed to minimize fill, i.e., they are designed to minimize the number of zero elements which become nonzero during matrix factorization by Gaussian elimination. If Gaussian elimination is to be implemented on a parallel machine, however, minimum fill orderings are not necessarily optimal. Rather, the primary concern is to order a matrix so as to minimize the time required to complete its factorization. An ordering heuristic which appears to perform well with respect to parallel factorization time is one based on finding independent sets of vertices in the matrix adjacency graph.
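One way to picture an independent-set ordering is the greedy sketch below. It is an illustrative heuristic, not necessarily the one studied in the paper: the lowest-degree-first selection rule, the function name `independent_set_ordering` and the dictionary-of-sets graph format are assumptions. Each stage collects vertices that are pairwise non-adjacent, so the corresponding pivots could be eliminated concurrently; eliminating a vertex then connects its remaining neighbours, which models fill.

```python
def independent_set_ordering(adj):
    """Group the vertices of a symmetric adjacency structure into elimination stages.

    adj maps each vertex to the set of its neighbours. Every stage is an
    independent set of the current elimination graph.
    """
    graph = {v: set(nb) for v, nb in adj.items()}
    stages = []
    while graph:
        chosen, blocked = [], set()
        # Greedy maximal independent set, visiting low-degree vertices first.
        for v in sorted(graph, key=lambda u: (len(graph[u]), u)):
            if v not in blocked:
                chosen.append(v)
                blocked.add(v)
                blocked |= graph[v]
        # Eliminate the chosen vertices; their neighbours become pairwise adjacent.
        for v in chosen:
            nbrs = graph.pop(v)
            for u in nbrs:
                graph[u].discard(v)
                graph[u] |= nbrs - {u}
        stages.append(chosen)
    return stages


# Adjacency graph of a 5 x 5 tridiagonal matrix (a path on five vertices).
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
stages = independent_set_ordering(path)
```

For the path graph the first stage picks every other vertex, which resembles the first step of odd-even (cyclic) reduction.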
TL;DR: A practical look at the prospect for general-purpose parallel computation is taken and some of the potential limitations are considered by using a simple parametric model of computational performance.
Abstract: It has become generally accepted that continued improvements in high-performance scientific computation will be achieved only through the ‘exploitation of parallelism’. Despite the nebulous nature of this expression, enthusiasm for the potential of parallel computing has led to calls for improvements in computational performance of more than a thousand-fold in the next few years, or for what is sometimes referred to as a Teraflop (one trillion floating-point operations per second) Computer. Such a system is envisioned as a general-purpose tool for accelerating progress in such widely varied applications as astronomy, biochemistry, circuit analysis, computational fluid dynamics, global economic modeling, high energy physics, materials science, structural analysis, and weather prediction. Although parallel architectures appear to offer the greatest promise for significant improvements in overall computational performance, it is not yet clear whether a general-purpose parallel architecture can realize the large increases solicited by the scientific community. This note will take a practical look at the prospect for general-purpose parallel computation and will consider some of the potential limitations by using a simple parametric model of computational performance.
TL;DR: This work describes parallel algorithms for the quadratic assignment problem and presents the computational experience using the massively parallel processor, DAP, and reports the speedup obtained by parallelising algorithms for solving the 2-dimensional and 3-dimensional assignment problems on the DAP.
Abstract: The innovation of parallel computers has added a new dimension to the design of algorithms. Parallel programming is not a simple extension of serial programming. We describe parallel algorithms for the quadratic assignment problem and present our computational experience using the massively parallel processor, DAP. We further report the speedup obtained by parallelising algorithms for solving the 2-dimensional and 3-dimensional assignment problems on the DAP.
TL;DR: A polynomial preconditioner that can be used with the conjugate gradient method to solve symmetric and positive definite systems of linear equations and it is shown this method is asymptotically twice as fast as SSOR when the optimal ω is used.
Abstract: We present a polynomial preconditioner that can be used with the conjugate gradient method to solve symmetric and positive definite systems of linear equations. Each step of the preconditioning is achieved by simultaneously taking an iteration of the SOR method and an iteration of the reverse SOR method (equations taken in reverse order) and averaging the results. This yields a symmetric preconditioner that can be implemented on parallel computers by performing the forward and reverse SOR iterations simultaneously. We give necessary and sufficient conditions for additive preconditioners to be positive definite. We find an optimal parameter, ω, for the SOR-Additive linear stationary iterative method applied to 2-cyclic matrices. We show this method is asymptotically twice as fast as SSOR when the optimal ω is used. We compare our preconditioner to the SSOR polynomial preconditioner for a model problem. With the optimal ω, our preconditioner was found to be as effective as the SSOR polynomial preconditioner in reducing the number of conjugate gradient iterations. Parallel implementations of both methods are discussed for vector and multiple processors. Results show that if the same number of processors are used for both preconditioners, the SSOR preconditioner is more effective. If twice as many processors are used for the SOR-Additive preconditioner, it becomes more efficient than the SSOR preconditioner when the number of equations assigned to a processor is small. These results are confirmed by the Blue Chip emulator at the University of Washington.
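The averaging step described above can be sketched with dense lists as follows. The helper names, the 2 x 2 test system and the choice ω = 1 are assumptions for illustration; the optimal ω and the conjugate gradient wrapper from the paper are not reproduced. The forward and reverse sweeps both start from the same iterate, so they are independent and could run on separate processors.

```python
def sor_sweep(A, b, x, omega, order):
    """One SOR sweep over the equations in the given order, starting from x."""
    y = list(x)
    for i in order:
        sigma = sum(A[i][j] * y[j] for j in range(len(b)) if j != i)
        y[i] = (1.0 - omega) * y[i] + omega * (b[i] - sigma) / A[i][i]
    return y


def sor_additive_step(A, b, x, omega):
    """Average one forward SOR sweep and one reverse SOR sweep from the same x."""
    n = len(b)
    forward = sor_sweep(A, b, x, omega, range(n))
    backward = sor_sweep(A, b, x, omega, range(n - 1, -1, -1))
    return [(f + r) / 2.0 for f, r in zip(forward, backward)]


# Small symmetric positive definite example with exact solution (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = [0.0, 0.0]
for _ in range(50):
    x = sor_additive_step(A, b, x, omega=1.0)
```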
TL;DR: Some simple theory, practical formal methods, and software tools are presented to help design synchronous concurrent algorithms, a class that includes, for example, clocked hardware, systolic algorithms, and neural nets.
Abstract: We consider the specification and programming of synchronous concurrent algorithms. This general class of algorithms includes, for example, clocked hardware, systolic algorithms, and neural nets. We present some simple theory, practical formal methods, and software tools to help design these algorithms.
TL;DR: In this paper, a systolic algorithm for computing the configuration space of an arrangement of arbitrary obstacles in the plane for a rectilinearly convex robot is presented, where the obstacles and the robot are assumed to be represented in digitized form by a √n × √n binary image.
Abstract: In this paper, we present a systolic algorithm for computing the configuration space of an arrangement of arbitrary obstacles in the plane for a rectilinearly convex robot. The obstacles and the robot are assumed to be represented in digitized form by a √n × √n binary image. The algorithm is designed for a Mesh-of-Processors architecture with n processors (using the canonical representation of an image on a processor array) and has an execution time of O(√n) which is asymptotically optimal.
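What is being computed can be stated as a short sequential sketch. It assumes a translating robot described by pixel offsets from a reference pixel, represents images as sets of coordinates, and ignores image boundaries; the function name is hypothetical and the sketch says nothing about the systolic mesh algorithm itself.

```python
def configuration_space(obstacles, robot):
    """Reference positions at which the robot overlaps at least one obstacle pixel.

    obstacles: set of (row, col) obstacle pixels.
    robot: set of (d_row, d_col) offsets of robot pixels from its reference pixel.
    """
    # The robot placed at p covers p + d for every offset d, so it collides
    # exactly when p = o - d for some obstacle pixel o.
    return {(r - dr, c - dc) for (r, c) in obstacles for (dr, dc) in robot}


blocked = configuration_space({(2, 2)}, {(0, 0), (0, 1)})
```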
TL;DR: Different ways of evaluating algorithmic speedup on a distributed memory machine are discussed, and Gaussian elimination on a hypercube computer is used as a target example.
Abstract: We discuss different ways of evaluating algorithmic speedup on a distributed memory machine. We use Gaussian elimination on a hypercube computer as a target example.
TL;DR: In this article, the results of experiments with the multiple polynomial version of the quadratic sieve factorization method on a CYBER 205 and on a NEC SX-2 vector computer were presented.
Abstract: The results are presented of experiments with the multiple polynomial version of the quadratic sieve factorization method on a CYBER 205 and on a NEC SX-2 vector computer. Various numbers in the 50–92 decimal digits range have been factorized, as a contribution to (i) the Cunningham project, (ii) Brent's table of factors of Mersenne numbers, and (iii) a proof by Brent and G. Cohen of the non-existence of odd perfect numbers below 10^200. The factorized 92-decimal-digit number is a record for general purpose factorization methods.
TL;DR: SUSPENSE demonstrates that well-known theoretical concepts from computer science such as high-level specifications, design by ‘stepwise refinement’ etc., work well in the field of numerical analysis and the generation of parallel programs.
Abstract: The basic principles and the overall design of the automatic transformation system SUSPENSE (SUprenum SPEcification tool for Numerical SoftwarE), which transforms specifications into parallel programs, are presented. The system supports a high-level specification language for partial differential equations (PDEs) and related areas in numerical analysis. The language offers facilities to describe and manipulate numerical objects such as vectors, matrices, domains, grids etc. on a high level of abstraction. Sequential algorithms can be formulated by means of general iterators which describe (in contrast to procedural programming languages) only partial orders on objects. Parallelism is obtained in a domain-specific way by splitting numerical objects such as grids, vectors etc. into segments which will be processed in parallel. The target language for the transformation system is the parallel FORTRAN dialect SUPRENUM-FORTRAN. This language is under development as a part of the German supercomputer project SUPRENUM. Algorithms written and transformed in SUSPENSE are tailored to the parallel SUPRENUM machine which consists of up to 256 processors with local memory only and message-based communication. SUSPENSE demonstrates that well-known theoretical concepts from computer science such as high-level specifications, design by ‘stepwise refinement’ etc., work well in the field of numerical analysis and the generation of parallel programs.
TL;DR: A Jacobi-like scheme for computing the generalized Schur form of a regular pencil of matrices σB − A that can efficiently be implemented in parallel on a square array of processors and yields further insight in Stewart's algorithm.
Abstract: We develop a Jacobi-like scheme for computing the generalized Schur form of a regular pencil of matrices σB − A. The method starts with a preliminary triangularization of the matrix B and iteratively reduces A to triangular form, while maintaining B triangular. The scheme heavily relies on the technique of Stewart for computing the Schur form of an arbitrary matrix A. Just as Stewart's algorithm, this one can efficiently be implemented in parallel on a square array of processors. This explains some of its peculiarities, and at the same time yields further insight in Stewart's algorithm.