TL;DR: The algorithm is implemented on the CM-5 and is run repeatedly on two deceptive problems to demonstrate the added implicit parallelism and faster convergence which can result from larger population sizes.
Abstract: This paper introduces and analyzes a parallel method of simulated annealing. Borrowing from genetic algorithms, an effective combination of simulated annealing and genetic algorithms, called parallel recombinative simulated annealing, is developed. This new algorithm strives to retain the desirable asymptotic convergence properties of simulated annealing, while adding the populations approach and recombinative power of genetic algorithms. The algorithm iterates a population of solutions rather than a single solution, employing a binary recombination operator as well as a unary neighborhood operator. Proofs of global convergence are given for two variations of the algorithm. Convergence behavior is examined, and empirical distributions are compared to Boltzmann distributions. Parallel recombinative simulated annealing is amenable to straightforward implementation on SIMD, MIMD, or shared-memory machines. The algorithm, implemented on the CM-5, is run repeatedly on two deceptive problems to demonstrate the added implicit parallelism and faster convergence which can result from larger population sizes.
TL;DR: CAMEL (Cellular Automata environMent for systEms modeLing) is a scalable software environment based on cellular automata theory and implemented on a Transputer-based parallel computer; a load balancing strategy minimizes time costs when the intervals for transition steps are not uniform.
Abstract: This paper describes CAMEL (Cellular Automata environMent for systEms modeLing), a scalable software environment based on the cellular automata theory implemented on a Transputer-based parallel computer. Cellular automata were originally defined as a theory to model the basic mechanisms of dynamic systems, permitting a new approach which is in many cases simpler and more efficient than the traditional approach based on partial differential equations. Today, cellular automata are becoming more attractive because they can be implemented effectively and naturally on parallel computers, achieving high performance. CAMEL allows a user to program computational science applications exploiting the computing power offered by highly parallel computers in a transparent way. CAMEL implements a cellular automaton as a SPMD program. A load balancing strategy is used to minimize time costs when the intervals for transition steps are not uniform. In the paper the programming environment and the parallel architecture of CAMEL are presented and some experiments are discussed.
TL;DR: The REFINE multiprocessor is shown to offer a cost-effective alternative to the Boolean n-cube multiprocessor architecture without substantial loss in performance.
Abstract: A reconfigurable interconnection network based on a multi-ring architecture called REFINE is described. REFINE embeds a single 1-factor of the Boolean hypercube in any given configuration. The mathematical properties of the REFINE topology and the hardware for the reconfiguration switch are described. The REFINE topology is scalable in the sense that the number of interprocessor communication links scales linearly with network size whereas the network diameter scales logarithmically with network size. Primitive parallel operations on the REFINE topology are described and analyzed. These primitive operations could be used as building blocks for more complex parallel algorithms. A large class of algorithms for the Boolean n-cube, which includes the FFT and Batcher's bitonic sort, is shown to map efficiently onto the REFINE topology. The REFINE multiprocessor is shown to offer a cost-effective alternative to the Boolean n-cube multiprocessor architecture without substantial loss in performance.
TL;DR: This work proposes a deterministic approach to finding an optimal distribution of the job's load on a hypercube of processors, which also allows the determination of performance bounds on the hypercube architecture.
Abstract: In this work a problem of finding an optimal distribution of a divisible computational job among a set of processors is considered. In the model of parallel computer systems two important factors must be taken into account: speeds of processors and speeds of communication links. With regard to this, we propose a deterministic approach to finding an optimal distribution of the job's load on a hypercube of processors. The method used also allows the determination of performance bounds on the hypercube architecture.
TL;DR: New algorithms are described for embedding Hamiltonian cycles, along with ranking and unranking algorithms with respect to them, and for embedding the set of cycles.
Abstract: In recent times the use of star and pancake networks as a viable interconnection scheme for parallel computers has been examined by a number of researchers. An attractive feature of these two classes of graphs is that they have sublogarithmic diameter and have a great deal of symmetry akin to the binary hypercube. In this paper we describe new algorithms for embedding: (a) Hamiltonian cycles along with ranking and unranking algorithms with respect to them, and (b) the set of cycles. The analogous problems for star graphs have been solved recently [5].
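The abstract above does not define the two graph families, so the following is only a minimal sketch under the standard textbook definitions (an assumption, not taken from the paper): vertices are permutations of n symbols, pancake-graph edges are prefix reversals, and star-graph edges swap the first symbol with another position. The function names are hypothetical.

```python
def pancake_neighbors(perm):
    """Neighbours of `perm` in the pancake graph.

    Assumed definition: two permutations are adjacent when one is
    obtained from the other by reversing a prefix of length 2..n.
    """
    n = len(perm)
    return [perm[:k][::-1] + perm[k:] for k in range(2, n + 1)]


def star_neighbors(perm):
    """Neighbours of `perm` in the star graph.

    Assumed definition: swap the first symbol with the symbol at
    position k, for k = 1..n-1 (0-based).
    """
    result = []
    for k in range(1, len(perm)):
        p = list(perm)
        p[0], p[k] = p[k], p[0]
        result.append(tuple(p))
    return result


if __name__ == "__main__":
    v = (1, 2, 3, 4)
    print(pancake_neighbors(v))
    print(star_neighbors(v))
```

Both graphs are regular of degree n - 1 on n! vertices, which is where the symmetry mentioned in the abstract comes from.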
TL;DR: A partitioning strategy for the message-passing version is developed that significantly reduces memory requirements and increases model speed, and the T3D and CM-5 are found to execute the “large” model version at roughly the same speed.
Abstract: A two-pronged effort to convert a recently developed ocean circulation model written in Fortran-77 for execution on massively parallel computers is described. A data-parallel version was developed for the CM-5 manufactured by Thinking Machines, Inc., while a message-passing version was developed for both the Cray T3D and the Silicon Graphics ONYX workstation. Since the time differentiation scheme in the ocean model is fully explicit and does not require solution of elliptic partial differential equations, adequate machine utilization has been achieved without major changes to the original algorithms. We developed a partitioning strategy for the message passing version that significantly reduces memory requirements and increases model speed. On a per-node basis (a T3D node is one Alpha processor, a CM-5 node is one Sparc chip and four vector units), the T3D and CM-5 are found to execute our “large” model version consisting of 511 × 511 horizontal mesh points at roughly the same speed.
TL;DR: A new method is proposed, Multiple Recursive Decomposition (MRD), which partitions the data using the prime factors of the dimensions of a multiprocessor network with mesh topology, and introduces a new storage scheme, storage-by-row-of-blocks, that significantly increases the efficiency of the Scatter method.
Abstract: Sparse matrix vector multiplication (SpMxV) is often one of the core components of many scientific applications. Many authors have proposed methods for its data distribution in distributed memory multiprocessors. We can classify these into four groups: Scatter, D-Way Strip, Recursive and Miscellaneous. In this work we propose a new method, Multiple Recursive Decomposition (MRD), which partitions the data using the prime factors of the dimensions of a multiprocessor network with mesh topology. Furthermore, we introduce a new storage scheme, storage-by-row-of-blocks, that significantly increases the efficiency of the Scatter method. We call this new variant the Block Row Scatter (BRS) method. The MRD and BRS methods achieve better results than the other methods analyzed, and they are easier to implement. In fact, the data distributions resulting from the MRD and BRS methods are a generalization of the Block and Cyclic distributions used in dense matrices.
TL;DR: A detailed study of the PDD algorithm is given: the algorithm is extended to solve periodic tridiagonal systems, its scalability is studied, and a reduced PDD algorithm is proposed that has a smaller operation count than the conventional sequential algorithm for many applications.
Abstract: The Parallel Diagonal Dominant (PDD) algorithm is an efficient tridiagonal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm is extended to solve periodic tridiagonal systems and its scalability is studied. Then the reduced PDD algorithm, which has a smaller operation count than that of the conventional sequential algorithm for many applications, is proposed. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric and skew-symmetric Toeplitz tridiagonal systems. Implementation results show that the analysis gives a good bound on the relative error, and the PDD and reduced PDD algorithms are good candidates for emerging massively parallel machines.
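For orientation, here is a minimal sketch of a sequential tridiagonal solver of the kind the abstract above uses as its baseline. It assumes that the "conventional sequential algorithm" is the Thomas algorithm (Gaussian elimination without pivoting), which the abstract does not state; it is not the PDD or reduced PDD algorithm. The function name and argument layout are hypothetical.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is unused),
    diag[i]  multiplies x[i],
    upper[i] multiplies x[i+1] (upper[n-1] is unused).
    No pivoting is done, so the matrix is assumed to be
    diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # Symmetric Toeplitz example: 4 on the diagonal, 1 off the diagonal.
    print(thomas_solve([0, 1, 1], [4, 4, 4], [1, 1, 0], [6, 12, 14]))
```

The forward sweep is inherently sequential, since each row depends on the previous one. That dependency is what parallel tridiagonal solvers have to break.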
TL;DR: This work presents a scalable parallel local search algorithm based on data parallelism for the Traveling Salesman Problem that finds the same quality solutions as the classical 2-opt algorithm and has a good speed-up.
Abstract: We present a scalable parallel local search algorithm based on data parallelism. The concept of distributed neighborhood structures is introduced, and applied to the Traveling Salesman Problem (TSP). Our parallel local search algorithm finds the same quality solutions as the classical 2-opt algorithm and has a good speed-up. The algorithm is implemented on a Parsytec GCel, consisting of 512 transputers. Its performance is empirically analyzed for TSP instances with several thousands of cities.
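As a point of reference for the abstract above, this is a minimal sketch of the classical sequential 2-opt local search that the parallel algorithm is compared against. It does not implement the paper's distributed neighborhood structures; the function names and the list-of-points representation are assumptions made for illustration.

```python
import math


def tour_length(tour, pts):
    """Total Euclidean length of the closed tour."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))


def two_opt(tour, pts):
    """Apply improving 2-opt moves until none is left (a local optimum)."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a city, nothing to exchange
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                # Replace edges (a,b) and (c,d) by (a,c) and (b,d).
                delta = (math.dist(pts[a], pts[c]) + math.dist(pts[b], pts[d])
                         - math.dist(pts[a], pts[b]) - math.dist(pts[c], pts[d]))
                if delta < -1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour


if __name__ == "__main__":
    pts = [(0, 0), (1, 1), (1, 0), (0, 1)]
    best = two_opt([0, 1, 2, 3], pts)
    print(best, tour_length(best, pts))
```

Each candidate move only looks at four cities, so different (i, j) pairs can be evaluated independently. That is the kind of structure a data-parallel version can exploit.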
TL;DR: The key areas of new development are focused on the support of effective parallel computations when the structure of the numerical discretization, the mesh, is evolving, and in fact constructed, during the computation.
Abstract: Consideration is given to the techniques required to support adaptive analysis of automatically generated unstructured meshes on distributed memory MIMD parallel computers. The key areas of new development are focused on the support of effective parallel computations when the structure of the numerical discretization, the mesh, is evolving, and in fact constructed, during the computation. All the procedures presented operate in parallel on already distributed mesh information. Starting from a mesh definition in terms of a topological hierarchy, techniques to support the distribution, redistribution and communication among the mesh entities over the processors are given, together with algorithms to dynamically balance processor workload based on the migration of mesh entities. A procedure to automatically generate meshes in parallel, starting from CAD geometric models, is given. Parallel procedures to enrich the mesh through local mesh modifications are also given. Finally, the combination of these techniques to produce a parallel automated finite element analysis procedure for rotorcraft aerodynamics calculations is discussed and demonstrated.
TL;DR: An algorithm is presented for the reduction to block upper-Hessenberg form, which can be used to solve the nonsymmetric eigenvalue problem on message-passing multicomputers, together with an evaluation of the algorithm’s communication cost.
Abstract: In this paper, we present an algorithm for the reduction to block upper-Hessenberg form which can be used to solve the nonsymmetric eigenvalue problem on message-passing multicomputers. On such multicomputers, a nonsymmetric matrix can be distributed across processing nodes logically configured into a two-dimensional mesh using the block-cyclic data distribution. Based on the matrix partitioning and mapping, the algorithm employs both Householder reflectors and Givens rotations within each reduction step. We analyze the arithmetic and communication complexities and describe the implementation details of the algorithm on message-passing multicomputers. We discuss two different implementations - synchronous and asynchronous - and present performance results on the Intel iPSC/860 and DELTA. We conclude with an evaluation of the algorithm’s communication cost, and suggest areas for further improvement.
TL;DR: The problem of load balancing for parallelized particle simulations implemented on MIMD machines is discussed and a simple dynamic load balancing algorithm is introduced and a mapping to a deterministic cellular automaton is given which allows detailed studies of this algorithm.
Abstract: The problem of load balancing for parallelized particle simulations implemented on MIMD machines is discussed and a simple dynamic load balancing algorithm is introduced. Furthermore a mapping to a deterministic cellular automaton is given which allows detailed studies of this algorithm. In particular it is shown that for strip decomposition, the time, τ, needed for the system to converge from an initially unbalanced state to a well balanced state grows like τ ∝ P^α, with P being the number of processors and α = 0.089 ± 0.009. This convergence time can be considered ‘fast’, enabling this algorithm to track fluctuations in the particle density as the simulation progresses.
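To get a feel for how weak this growth is, the sketch below reads the scaling relation as the power law τ ∝ P^α (an interpretation of the notation in the abstract) and computes by what factor the convergence time would grow when the processor count changes. The function name is hypothetical, and the constant of proportionality is unknown, so only ratios are meaningful.

```python
def convergence_time_ratio(p_from, p_to, alpha=0.089):
    """Factor by which tau grows when going from p_from to p_to processors,
    assuming tau is proportional to P**alpha."""
    return (p_to / p_from) ** alpha


if __name__ == "__main__":
    # Use the quoted uncertainty on alpha (0.089 +/- 0.009) as a range.
    for alpha in (0.080, 0.089, 0.098):
        ratio = convergence_time_ratio(16, 1024, alpha)
        print(f"alpha={alpha:.3f}: 16 -> 1024 processors multiplies tau by {ratio:.2f}")
```

Under this reading, a 64-fold increase in processors lengthens the convergence time by less than a factor of two, which is consistent with the abstract calling the convergence 'fast'.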
TL;DR: This work introduces a new mapping strategy, called the spiral mapping, that reduces the communication overhead of array redistribution, and implements redistribution using the standard linear mapping and the spiral mapping for two dimensional arrays.
Abstract: Languages such as High Performance Fortran implement parallel algorithms by distributing large data structures across a multicomputer system. To enhance parallelism and reduce communication, it is sometimes beneficial for a programmer to change the distribution between phases of the algorithm. We introduce a new mapping strategy, called the spiral mapping, that reduces the communication overhead of array redistribution. Redistribution using the spiral mapping exploits communication locality and reduces global communication conflicts. We implemented redistribution using the standard linear mapping and the spiral mapping for two dimensional arrays; for 1024 × 1024 arrays, redistribution using the spiral mapping is 36% faster than using the linear mapping on a 16 node Intel iPSC/860.
TL;DR: A collection of public-domain Fortran 77 routines for the solution of systems of linear equations using a variety of iterative methods on parallel architectures with either shared or distributed memory.
Abstract: We present a collection of public-domain Fortran 77 routines for the solution of systems of linear equations using a variety of iterative methods. The routines implement methods which have been modified for their efficient use on parallel architectures with either shared or distributed memory. PIM was designed to be portable across different machines. Results are presented for a variety of parallel computers.
TL;DR: A two dimensional domain decomposition strategy is used to achieve parallelism and is implemented by message passing and this parallel algorithm is shown to scale favorably as the number of processors is increased.
Abstract: A new version of the UCLA atmospheric general circulation model suitable for massively parallel computer architectures has been developed. This paper presents the principles for the code's design and examines performance on a variety of distributed memory computers. A two dimensional domain decomposition strategy is used to achieve parallelism and is implemented by message passing. This parallel algorithm is shown to scale favorably as the number of processors is increased. In the fastest configuration, performance roughly equivalent to that of multitasking vector supercomputers is achieved.
TL;DR: A parallel iterative solver for discrete second order elliptic PDEs based on the conjugate gradient algorithm with incomplete factorization preconditioning, using a domain decomposed ordering to allow parallelism in the triangular solves, and resorting to some special recently developed parallelization technique to avoid communication bottleneck for the computation associated to the internal boundary nodes.
Abstract: We present a parallel iterative solver for discrete second order elliptic PDEs. It is based on the conjugate gradient algorithm with incomplete factorization preconditioning, using a domain decomposed ordering to allow parallelism in the triangular solves, and resorting to a special recently developed parallelization technique to avoid a communication bottleneck in the computation associated with the internal boundary nodes. Numerical results are given for a transputer network with up to 512 processors and for a cluster of a few workstations.
TL;DR: A control estimator is introduced which ensures correct computation of the variance in parallel Monte Carlo and monitors the correlation transmitted from the source of random numbers to the results, i.e. the effect on computation velocity.
Abstract: Effects and detection of correlations in parallel Monte Carlo are discussed, on the assumption that each processor uses a sequence of truly random numbers, but the sequences are mutually correlated. Depending on the parallel implementation of the algorithm, effects may concern the mean value of the solution or only its variance. In the first case an alternative implementation of the algorithm is suggested. In the second — where it is possible to lose control of the result's accuracy — a control estimator is introduced which ensures correct computation of the variance. In this way one obtains reliability even when coprocessors are not independent. The same estimator, moreover, monitors the correlation transmitted from the source of random numbers to the results, i.e. the effect on computation velocity. Numerical examples show the sensitivity of the implemented control.
TL;DR: An overview is presented of a number of related iterative methods for the solution of linear systems of equations, so-called Krylov projection type methods, which include such popular methods as Conjugate Gradients, Bi-Conjugate Gradients, CGS, Bi-CGSTAB, QMR, LSQR and GMRES, and it is shown how these methods can be derived from simple basic iteration formulas.
Abstract: In these notes we will present an overview of a number of related iterative methods for the solution of linear systems of equations. These methods are so-called Krylov projection type methods and they include such popular methods as Conjugate Gradients, Bi-Conjugate Gradients, CGS, Bi-CGSTAB, QMR, LSQR and GMRES. We will show how these methods can be derived from simple basic iteration formulas. We will not give convergence proofs, but we will refer for these, as far as available, to the literature.
Iterative methods are often used in combination with so-called preconditioning operators (approximations for the inverses of the operator of the system to be solved). Since these preconditioners are not essential in the derivation of the iterative methods, we will not give much attention to them in these notes. However, in most of the actual iteration schemes, we have included them in order to facilitate the use of these schemes in actual computations.
For the application of the iterative schemes one usually thinks of linear sparse systems, e.g., like those arising in the finite element or finite difference approximations of (systems of) partial differential equations. However, the structure of the operators plays no explicit role in any of these schemes, and these schemes might also successfully be used to solve certain large dense linear systems. Depending on the situation, that might be attractive in terms of numbers of floating point operations.
It will turn out that all of the iterative methods are parallelizable in a straightforward manner. However, especially for computers with a memory hierarchy (e.g., cache or vector registers), and for distributed memory computers, the performance can often be improved significantly through rescheduling of the operations. We will discuss parallel implementations, and occasionally we will report on experimental findings.
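As an illustration of one of the methods named above, here is a minimal sketch of preconditioned Conjugate Gradients for a symmetric positive definite system. The diagonal (Jacobi) preconditioner, the dense list-of-lists matrix, and all function names are assumptions made for this sketch; the notes themselves may formulate the scheme differently.

```python
def matvec(a, x):
    """Dense matrix-vector product; the only place the matrix is used."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def preconditioned_cg(a, b, tol=1e-10, max_iter=1000):
    """Solve a x = b for symmetric positive definite `a`.

    Uses the inverse of the diagonal of `a` as preconditioner (an
    assumption of this sketch, chosen because it is trivial to apply).
    """
    n = len(b)
    inv_diag = [1.0 / a[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                   # residual b - a x for x = 0
    z = [m * ri for m, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [m * ri for m, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


if __name__ == "__main__":
    print(preconditioned_cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

Each iteration consists of one matrix-vector product, a few vector updates, and two inner products. The inner products are global reductions, which is why rescheduling operations can matter on distributed memory machines.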
TL;DR: In this paper, a sequential algorithm which does not require global operations is first presented and a sequence of algorithm transformations is applied to convert it into a parallel algorithm for mesh-connected SIMD computers.
Abstract: The Euclidean distance transform (EDT) converts a binary image into one where each pixel has a value equal to its Euclidean distance to the nearest foreground pixel. It has important uses in image analysis, computer vision, and robotics where high speed computation is essential. In this paper, a sequential algorithm which does not require global operations is first presented. We then apply a sequence of algorithm transformations to convert it into a parallel algorithm for mesh-connected SIMD computers. For an n × n image on an equal-sized processor array, the time complexity is O(n). An algorithm for computing large EDT problems on smaller processor arrays is also given. For an n × n image on a g × g processor array, the time complexity is O((n^2/g) log(n/g)).
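The definition in the first sentence of the abstract above can be written down directly. The sketch below is a brute-force reference implementation of that definition only, useful for checking faster algorithms on small images; it is not the paper's sequential or parallel algorithm, and the function name and nested-list image format are assumptions.

```python
import math


def brute_force_edt(image):
    """Euclidean distance transform by exhaustive search.

    `image` is a list of rows; truthy entries are foreground pixels.
    Every output pixel holds the distance to the nearest foreground pixel.
    """
    foreground = [(r, c)
                  for r, row in enumerate(image)
                  for c, value in enumerate(row) if value]
    if not foreground:
        raise ValueError("image has no foreground pixel")
    return [[min(math.hypot(r - fr, c - fc) for fr, fc in foreground)
             for c in range(len(row))]
            for r, row in enumerate(image)]


if __name__ == "__main__":
    for row in brute_force_edt([[0, 0, 0], [0, 1, 0], [0, 0, 0]]):
        print([round(v, 3) for v in row])
```

Every pixel scans every foreground pixel, so the cost of this version grows with the product of the two counts.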
TL;DR: PARMESH, a program for the automatic parallel generation of triangular meshes in arbitrary bounded plane domains that are a priori divided into several single connected subdomains, can be an efficient ingredient for parallel numerical solvers of discrete problems arising from mathematical physics.
Abstract: We developed a program for the automatic parallel generation of triangular meshes in arbitrary bounded plane domains that are a priori divided into several single connected subdomains. Its output data structure, which is also briefly described, is well suited both to further parallel hierarchical mesh generation in each subdomain, starting from the triangulation obtained there, and to parallel processing itself. Two numerical examples are presented. Consequently, this program package, called PARMESH, can be an efficient ingredient for parallel numerical solvers of discrete problems arising from mathematical physics.
TL;DR: Two distributed algorithms based on Dijkstra's algorithm for the single-source shortest path problem are experimentally compared, and the theoretically less attractive update-driven algorithm turns out to perform much better than the apparently preferable minimum-driven algorithm.
Abstract: We experimentally compare two distributed algorithms based on Dijkstra's algorithm for the single-source shortest path problem. The algorithms are intended for asynchronous distributed systems with a small number of processors without shared memory, and specifically address the situation where communication is costly. Variations of the algorithms have been implemented in occam, and results from experiments on a 16 processor transputer system with randomly generated graphs of different types and densities with up to 20000 vertices and 250000 edges are reported. The distributed algorithms exploit the two obvious sources of parallelism in Dijkstra's algorithm. In the update-driven algorithm several vertices may be selected and scanned simultaneously. Since not all selected vertices can be guaranteed to be correct, the achievable speed-up depends on the problem instance, and no worst-case guarantee better than the sequential algorithm can be given. In contrast the minimum-driven algorithm performs the scanning in parallel but from only one selected vertex at a time. Ideally this algorithm has linear speed-up for dense graphs. Implemented naively both algorithms perform poorly, but can be improved by relaxations and approximations which reduce communication volume. In all cases the theoretically less attractive update-driven algorithm turns out to perform much better than the apparently preferable minimum-driven algorithm.
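The two sources of parallelism mentioned above are easier to see against the sequential algorithm. The following is a minimal sketch of sequential Dijkstra with a binary heap, with comments marking the selection step and the scanning step. It is not either of the distributed algorithms, and the adjacency-dictionary graph format and function name are assumptions.

```python
import heapq


def dijkstra(graph, source):
    """Single-source shortest paths for non-negative edge weights.

    `graph` maps a vertex to a list of (neighbour, weight) pairs.
    Returns a dict of shortest distances to all reachable vertices.
    """
    dist = {source: 0}
    heap = [(0, source)]
    done = set()
    while heap:
        # Selection: pick the unscanned vertex with the smallest distance.
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        # Scanning: relax every edge leaving the selected vertex.
        for v, w in graph.get(u, ()):
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


if __name__ == "__main__":
    g = {"a": [("b", 4), ("c", 1)], "c": [("b", 2)], "b": [("d", 5)]}
    print(dijkstra(g, "a"))
```

In the abstract's terms, the minimum-driven variant parallelizes the scanning loop for one selected vertex, while the update-driven variant lets several vertices be selected and scanned at once.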
TL;DR: In this paper possibilities to obtain a satisfactory multigrid convergence when a domain is partitioned into blocks are discussed and another flexible nonstandard multigrid method, called MG-S, is introduced.
Abstract: In this paper possibilities to obtain a satisfactory multigrid convergence when a domain is partitioned into blocks are discussed. First, standard parallel multigrid with alternating line Gauss-Seidel as smoother is investigated for the Euler equations in a partitioned domain. When the domain is partitioned into blocks the smoother updates lines per block. A possibility to regain the single block multigrid convergence for many block splittings for this problem is given. For singularly perturbed problems with strongly coupled unknowns in one direction only, this method will not be satisfactory. Therefore, the research for nonstandard multigrid methods that are based on point smoothers and are robust is sketched. Another flexible nonstandard multigrid method, called MG-S, is introduced. Because the method is equivalent to standard multigrid with a lower dimensional multigrid smoother, the behaviour of the method is quite clear. Finally, the nonstandard multigrid method is tested for two-dimensional model equations, the rotated anisotropic diffusion equation and the convection diffusion equation.
TL;DR: The algorithms make use of non-blocking, point-to-point communication between processors, which allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization.
Abstract: This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
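The quantities the abstract above ties to the processor template can be computed directly. This small sketch only evaluates the GCD, the LCM, the LCM/GCD step count, and whether the relatively prime case applies; it does not implement the transpose itself, and the function and key names are hypothetical.

```python
import math


def transpose_schedule(p, q):
    """Summarize the communication pattern for a P x Q processor template,
    using only the relations stated in the abstract."""
    gcd = math.gcd(p, q)
    lcm = p * q // gcd
    return {
        "gcd": gcd,                       # number of processor groups
        "lcm": lcm,
        "steps": lcm // gcd,              # LCM/GCD communication steps
        "complete_exchange": gcd == 1,    # P and Q relatively prime
    }


if __name__ == "__main__":
    for template in [(4, 6), (3, 4), (8, 8)]:
        print(template, transpose_schedule(*template))
```

For a square template such as 8 × 8 the step count drops to one, while relatively prime dimensions give the largest count, P · Q.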
TL;DR: This paper describes architecture-independent solutions for each operation, as well as solutions tailored towards the mesh architecture, and shows how the relationship among the parameters of a parallel machine and the relationship of these parameters to the message size determines the best solution.
Abstract: In this paper we consider three frequently arising communication operations, one-to-all, all-to-one, and all-to-all. We describe architecture-independent solutions for each operation, as well as solutions tailored towards the mesh architecture. We show how the relationship among the parameters of a parallel machine and the relationship of these parameters to the message size determines the best solution. We discuss performance and scalability issues of our solutions on the Intel Touchstone Delta. Our results show that in order to cover a broad range of scalability for a particular operation, multiple solutions should be employed.
TL;DR: The parallel WZ matrix factorisation is presented, and numerical results to confirm the superiority of WZ over the LU method are presented.
Abstract: The parallel WZ matrix factorisation is presented. Some pivoting techniques for a modification of this factorisation are considered. Diagonally dominant matrices are studied separately in order to attain improved results, and finally numerical results to confirm the superiority of WZ over the LU method are presented.
TL;DR: A O(log2 P) time barrier algorithm is described for problems when the number of pre-synchronization messages to be received by a process is unknown, for example, in any computation that is largely driven by an unpredictable exchange of messages.
Abstract: Barrier synchronization is a fundamental operation in parallel computation. In many contexts, at the point a process enters a barrier it knows that it has already processed all work required of it prior to the synchronization. It then commits to the barrier, in the sense that the process blocks until every other process has also committed to the barrier. This paper treats the alternative case, when a process cannot enter a barrier with the assurance that it has already performed all necessary pre-synchronization computation. The problem arises when the number of pre-synchronization messages to be received by a process is unknown, for example, in any computation that is largely driven by an unpredictable exchange of messages. We describe a O(log2 P) time barrier algorithm for such problems, study its performance on a large-scale parallel system, and consider extensions to general associative reductions, as well as associative parallel prefix computations.
TL;DR: The design and implementation of an effective and scalable dynamic load-balancing mechanism suitable for a loosely-coupled MIMD system with an interconnected topology is described and the proposed algorithm extends the well-known gradient model to improve the overall execution time.
Abstract: This paper describes the design and the implementation of an effective and scalable dynamic load-balancing mechanism suitable for a loosely-coupled MIMD system with an interconnected topology. The proposed algorithm extends the well-known gradient model to improve the overall execution time. A prototype system has been implemented using a network of transputer processors. A load generator spawns tasks of predetermined processing demand at run time, producing asynchronous events. The objective is to improve maximum resource utilization and also to achieve a transparent placement of processes onto processors.
TL;DR: This work considers a new approach to modeling wave propagation in urban environments, based on the Transmission Line Matrix (TLM) method, which provides good predictions for the intensity of a wave when compared with in-situ measurements and is appropriate to very fast massively parallel computations.
Abstract: We consider a new approach to modeling wave propagation in urban environments, based on the Transmission Line Matrix (TLM) method. Two-dimensional simulations are performed using a map of a city. A renormalization technique is proposed to convert the results to the three-dimensional space. Our approach provides good predictions for the intensity of a wave when compared with in-situ measurements and is appropriate to very fast massively parallel computations. In order to provide a performance analysis, the algorithm has been used as a benchmark on different parallel architectures (CM200, CM5, IBM SP2 and Cray T3D).
TL;DR: A parallel shared memory implementation of multifrontal QR factorization using a combination of tree and node level parallelism and a buddy system based on Fibonacci blocks to achieve high performance for general large and sparse matrices is discussed.
Abstract: We discuss a parallel shared memory implementation of multifrontal QR factorization. To achieve high performance for general large and sparse matrices, a combination of tree and node level parallelism is used. Acceptable load balancing is obtained by the use of a pool-of-tasks approach. For the storage of frontal and update matrices, we use a buddy system based on Fibonacci blocks. It turns out to be more efficient than blocks of size 2^i, as proposed by other authors. Also the order in which memory space for update and frontal matrices is allocated is shown to be of importance. An implementation of the proposed algorithm on the CRAY X-MP/416 (four processors) gives speedups of about three with about 20% of extra real memory space required.
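The difference between the two block-size families can be illustrated with a minimal sketch that only rounds a request up to the nearest allowed block size. It is not the paper's allocator (no splitting, merging, or free lists), and the function names and the choice of 1, 2, 3, 5, 8, ... as the Fibonacci sizes are assumptions.

```python
def fibonacci_block(request):
    """Smallest Fibonacci block size (1, 2, 3, 5, 8, ...) that fits `request`."""
    a, b = 1, 2
    while a < request:
        a, b = b, a + b
    return a


def power_of_two_block(request):
    """Smallest block size of the form 2**i that fits `request`."""
    size = 1
    while size < request:
        size *= 2
    return size


if __name__ == "__main__":
    for request in (35, 65, 100):
        fib = fibonacci_block(request)
        pow2 = power_of_two_block(request)
        print(f"request {request}: Fibonacci block {fib}, power-of-two block {pow2}")
```

Consecutive Fibonacci sizes grow by a factor of roughly 1.6 rather than 2, so the rounded-up block tends to overshoot the request by less. That is one plausible reason for the efficiency gain, although the abstract does not state it.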