TL;DR: A new genetic algorithm which relies on intelligent evolution of individuals is presented, which is inherently parallel and shows a superlinear speedup in multiprocessor systems.
Abstract: Evolution algorithms for combinatorial optimization have been proposed in the 70's. They did not have a major influence. With the availability of parallel computers, these algorithms will become more important. In this paper we discuss the dynamics of three different classes of evolution algorithms: network algorithms derived from the replicator equation, Darwinian algorithms and genetic algorithms inheriting genetic information. We present a new genetic algorithm which relies on intelligent evolution of individuals. With this algorithm, we have computed the best solution of a famous travelling salesman problem. The algorithm is inherently parallel and shows a superlinear speedup in multiprocessor systems.
TL;DR: The design of an interactive system for the semi-automatic transformation of FORTRAN 77 programs into parallel programs for the SUPRENUM machine is described; the system is characterized by a powerful analysis component, a catalog of MIMD and SIMD parallelization transformations, and a flexible dialog facility.
Abstract: This paper describes the design of an interactive system for the semi-automatic transformation of FORTRAN 77 programs into parallel programs for the SUPRENUM machine. The system is characterized by a powerful analysis component, a catalog of MIMD and SIMD parallelization transformations, and a flexible dialog facility. It contains specific knowledge about the parallelization of an important class of numerical algorithms.
TL;DR: A single-program-multiple-data computational model, implemented in the EPEX system to run FORTRAN scientific application programs in parallel mode, is presented, and its applicability to the parallelization of several applications is demonstrated.
Abstract: We present a single-program-multiple-data computational model which we have implemented in the EPEX system to run FORTRAN scientific application programs in parallel mode. The computational model assumes a shared memory organization and is based on the scheme that all processes executing a program in parallel remain in existence for the entire execution; however, the tasks to be executed by each process are determined dynamically during execution by the use of appropriate synchronizing constructs that are embedded in the program. We have demonstrated the applicability of the model in the parallelization of several applications. We discuss parallelization features of these applications and performance issues such as overhead, speedup, and efficiency.
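Illustrative sketch (not from the paper): the scheme in which all processes live for the whole run and pick up work dynamically can be approximated with a shared loop counter guarded by a lock. The names `run_spmd`, `worker`, and `body` are hypothetical, and Python threads stand in for the EPEX synchronizing constructs, which the abstract does not spell out. Because of Python's global interpreter lock this shows the control structure only, not a real speedup.

```python
import threading


def run_spmd(num_workers, num_iterations, body):
    """Run body(i) for i in range(num_iterations) using persistent workers.

    Every worker executes the same code (single program) and claims the
    next unprocessed index from a shared counter (dynamic task assignment).
    """
    next_index = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_index
        while True:
            # Synchronized claim of the next loop iteration.
            with lock:
                i = next_index
                next_index += 1
            if i >= num_iterations:
                return  # no work left; the worker ends with the program
            body(i)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Usage: each index is written by exactly one worker.
results = [0] * 10


def square(i):
    results[i] = i * i


run_spmd(4, 10, square)
```

Which worker handles which index varies from run to run, but every index is processed exactly once.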
TL;DR: It is shown that the SAXPY, GAXPY and DOT algorithms of Dongarra, Gustavson and Karp, as well as parallel versions of the LDMt, LDLt, Doolittle and Cholesky algorithms, can be classified into four task graph models.
Abstract: This paper introduces a graph-theoretic approach to analyse the performances of several parallel Gaussian-like triangularization algorithms on an MIMD computer. We show that the SAXPY, GAXPY and DOT algorithms of Dongarra, Gustavson and Karp, as well as parallel versions of the LDMt, LDLt, Doolittle and Cholesky algorithms, can be classified into four task graph models. We derive new complexity results and compare the asymptotic performances of these parallel versions.
TL;DR: This paper presents and analyzes the computational and parallel complexity of the Livermore Loops and addresses the concern that their computations must be understood thoroughly, so that efficient implementations may be written.
Abstract: This paper presents and analyzes the computational and parallel complexity of the Livermore Loops. The Loops represent the type of computational kernels typically found in large-scale scientific computing and have been used to benchmark computer systems since the mid-60's. On parallel systems, a process's computational structure can greatly affect its efficiency. If the loops are to be used to benchmark such systems, their computations must be understood thoroughly, so that efficient implementations may be written. This paper addresses that concern.
TL;DR: It is proposed that an iterative inversion using a varying background velocity obtains all wavenumbers that are resolvable separately by migration and tomography, because reflectors in the background model make the source and receiver coverage effectively the same as in medical imaging.
Abstract: Seismic inversion, broadly enough defined, is equivalent to doing migration and reflection tomography simultaneously. Diffraction tomography and inversion work best when sources and receivers surround the region of interest, as in medical imaging applications. Theoretical studies typically show that the high vertical wavenumber velocity perturbations are resolved in seismic reflection experiments where the sources and receivers are restricted to the Earth's surface but low vertical wavenumbers must be obtained using a separate step such as a velocity analysis or reflection tomography. I propose that an iterative inversion using a varying background velocity obtains all wavenumbers that are resolvable separately by migration and tomography. (The background velocity must contain abrupt discontinuities.) Reflectors in the background model simulate sources and receivers within the Earth so the source and receiver coverage in seismic reflection inverse problems is effectively the same as in medical imaging. Some synthetic examples verify the theoretical predictions and show that reflector locations and interval velocities can be obtained simultaneously.
TL;DR: This survey focuses on promising approaches for solving large, well-structured constrained problems: dualization of problems with separable objective and constraint functions, and decomposition of hierarchical problems with linking variables.
Abstract: This survey is concerned with variants of nonlinear optimization methods designed for implementation on parallel computers. First, we consider a variety of methods for unconstrained minimization. We consider a particular type of parallelism (simultaneous function and gradient evaluations), and we concentrate on the main sources of inspiration: conjugate directions, homogeneous functions, variable-metric updates, and multi-dimensional searches. The computational process for solving small and medium-size constrained optimization problems is usually based on unconstrained optimization. This provides a straightforward opportunity for the introduction of parallelism. In the present survey, however, we focus on promising approaches for solving large, well-structured constrained problems: dualization of problems with separable objective and constraint functions, and decomposition of hierarchical problems with linking variables (typical for Benders decomposition in the linear case). Finally, we outline the key issues in future computational studies of parallel nonlinear optimization algorithms.
TL;DR: It is shown that the classical inner product algorithms can be nearly as efficient as the usual column sweep algorithm in solving triangular systems with column storage.
Abstract: The solution on parallel computers of systems of equations Lx = b, where L is lower triangular, is considered. Some authors have suggested that it is difficult to solve such systems in parallel on message-passing, local-memory machines when L is stored by columns. It is shown here that this is not necessarily the case if the machine can accomplish fan-in communication with reasonable efficiency.
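Illustrative sketch (not from the paper): the two sequential orderings of forward substitution for Lx = b that the column sweep and inner product variants are built on. The function names are hypothetical. In a distributed setting with L stored by columns, the sum inside `solve_inner_product` would have its terms spread over processors and would need a fan-in (a reduction) to combine them, which is the communication step the abstract refers to; that part is not modeled here.

```python
def solve_column_sweep(L, b):
    """Column-oriented forward substitution.

    As soon as x[j] is known, x[j] times column j of L is subtracted
    from the remaining right-hand side entries.
    """
    n = len(b)
    x = list(b)
    for j in range(n):
        x[j] = x[j] / L[j][j]
        for i in range(j + 1, n):
            x[i] -= L[i][j] * x[j]
    return x


def solve_inner_product(L, b):
    """Row-oriented forward substitution.

    x[i] is obtained from the inner product of row i of L with the
    already computed entries x[0..i-1].
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x


# Usage: a 3x3 lower triangular system whose solution is (1, 2, 3).
L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
b = [2.0, 7.0, 32.0]
x_sweep = solve_column_sweep(L, b)
x_inner = solve_inner_product(L, b)
```

Both orderings perform the same arithmetic; they differ only in which entries of L are touched together, which is what matters for the storage scheme.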
TL;DR: A number of results on Gray codes which characterise a certain family of Gray codes are presented, and it is shown that Fast Fourier Transforms on data distributed among processors according to a Gray code can also be efficiently implemented on the hypercube.
Abstract: Fast Fourier Transforms are a widely-used and powerful tool for the analysis and solution of many problems. They have been used in such diverse areas as medicine, acoustics, image processing, system design and many other fields. By transforming the data the problem may be simpler, more tractable or more efficiently solved and for many applications (e.g. speech processing) the data may be much more easily understood in the transform domain. Therefore fast algorithms for implementing transforms are vital for any powerful computer. This paper describes the implementation of a Fast Fourier Transform on a 64-node INTEL hypercube and shows how the hypercube architecture may be efficiently used. Usually the FFT is only a part of the solution process and the data on the hypercube has to be arranged in a certain manner for the efficient solution of the whole problem. A common way is for the data to be distributed according to a Gray code, so that neighbouring points in the domain are in neighbouring processors. We present a number of results on Gray codes which characterise a certain family of Gray codes, and show that Fast Fourier Transforms on data distributed among processors according to a Gray code can also be efficiently implemented on the hypercube.
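Illustrative sketch (not from the paper): the abstract characterises a certain family of Gray codes without defining it here, so the code below assumes the standard binary reflected Gray code as one example. It shows the property the abstract relies on: consecutive data indices are mapped to processor labels that differ in exactly one bit, that is, to neighbouring nodes of the hypercube. The names `gray` and `hamming` are hypothetical.

```python
def gray(i):
    """Binary reflected Gray code of the non-negative integer i."""
    return i ^ (i >> 1)


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Usage: map 64 consecutive domain indices onto a 6-dimensional
# hypercube (64 nodes, as in the abstract).
dimension = 6
node_of_index = [gray(i) for i in range(2 ** dimension)]

# Neighbouring indices land on nodes one hypercube link apart.
link_distances = [
    hamming(node_of_index[i], node_of_index[i + 1])
    for i in range(len(node_of_index) - 1)
]
```

Every node label is used exactly once, so the mapping is a one-to-one assignment of data blocks to processors.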
TL;DR: An outlook on the performance limits of a future supercomputer architecture of the SUPRENUM type, a message-based system designed to handle a secured message exchange with a considerable degree of hardware support and with the lowest possible software overhead, is presented.
Abstract: The designer of a numerical supercomputer is confronted with fundamental design decisions stemming from some basic dichotomies in supercomputer technology and architecture. On the side of the hardware technology there exists the dichotomy between the use of very high-speed circuitry or very large-scale integrated circuitry. On the side of the architecture there exists the dichotomy between the SIMD vector machine and the MIMD multiprocessor architecture. In the latter case, the ‘nodes’ of the system may communicate through shared memory, or each node has only private memory, and communication takes place through the exchange of messages. All these design decisions have implications with respect to performance, cost-effectiveness, software complexity, and fault-tolerance. In the paper the various dichotomies are discussed and a rationale is provided for the decision to realize the SUPRENUM supercomputer, a large ‘number cruncher’ with 5 Gflops peak performance, in the form of a massively parallel MIMD/SIMD multicomputer architecture. In its present incorporation, SUPRENUM is configurable to up to 256 nodes, where each node is a pipeline vector machine with 20 Mflops peak performance, IEEE double precision. The crucial issues of such an architecture, which we consider the trendsetter for future numerical supercomputer architecture in general, are on the hardware side the need for a bottleneck-free interconnection structure as well as the highest possible node performance obtained with the highest possible packaging density, in order to accommodate a node on a single circuit board. On the side of the system software the design goal is to obtain an adequately high degree of operational safety and data security with minimum software overhead. On the side of the user an appropriate program development environment must be provided. 
Last but not least, the system must exhibit a high degree of fault tolerance, if for nothing else but for the sake of obtaining a sufficiently high MTBF. In the paper a detailed discussion of the hardware and software architecture of the SUPRENUM supercomputer, whose design is based upon the considerations discussed, is presented. A largely bottleneck-free interconnection structure is accomplished in a hierarchical manner: the machine consists of up to 16 ‘clusters’, and each cluster consists of 16 working ‘nodes’ plus some organisational nodes. The node is accommodated on a single circuit board; its architecture is based on the principle of data structure architecture explained in the paper. SUPRENUM is strictly a message-based system; consequently, the local node operating system has been designed to handle a secured message exchange with a considerable degree of hardware support and with the lowest possible software overhead. SUPRENUM is organized as a distributed system—a prerequisite for the high degree of fault tolerance required; therefore, there exists no centralized global operating system. The paper concludes with an outlook on the performance limits of a future supercomputer architecture of the SUPRENUM type.
TL;DR: An optimal algorithm that finds the maximal elements of a set in O(log n) time with O(n) processors is described; the model of parallel computation is the CREW-PRAM.
Abstract: We describe an optimal algorithm for finding the maximal elements of a set that runs in O(log n) time with O(n) processors. The model of parallel computation we consider is the CREW-PRAM, i.e. it is the synchronous shared memory model where concurrent reads are allowed but no two processors can simultaneously attempt to write in the same memory location (even if they are trying to write the same thing).
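Illustrative sketch (not from the paper): a sequential reference for what "maximal elements" means, assuming the set consists of points in the plane ordered by dominance (q dominates p if q is at least as large in both coordinates and differs from p). The abstract does not state the dimension or the ordering, so both are assumptions, and this O(n log n) sweep is not the CREW-PRAM algorithm itself. The function name is hypothetical.

```python
def maximal_points(points):
    """Return the points not dominated by any other point.

    Points are visited in order of decreasing x (ties: decreasing y).
    A point is maximal exactly when its y exceeds every y seen so far.
    """
    best_y = float("-inf")
    result = []
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            result.append((x, y))
            best_y = y
    return result


# Usage
sample = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
maxima = maximal_points(sample)
```

A sequential version like this is mainly useful as a correctness check for a parallel implementation.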
TL;DR: The results show that the ISA concept combines the advantages of standard systolic arrays with those of the MIMD concept; since the ISA architecture also has smaller area requirements than a corresponding systolic array or MIMD machine, it is of strong practical relevance.
Abstract: In this paper we investigate the relationships between three different models of parallel computers based on mesh-connected arrays: the processor array (PA), which is an MIMD-array of independent processors, the instruction broadcasting array (IBA), where the instructions are broadcast to all the processors of a column and executed according to selector information which is broadcast to all the processors of a row, and the instruction systolic array (ISA), where the instructions are pumped through the array row by row and combined with selector information which is pumped through the array column by column. For every two of these models we determine tight bounds on the worst-case delay introduced by a transformation of a program on one model into an equivalent program on the other. The results show that the ISA concept combines the advantages of standard systolic arrays with those of the MIMD concept. Since in addition the ISA architecture has smaller area requirements than a corresponding systolic array or MIMD machine, it is of strong practical relevance.
TL;DR: This work proposes a modification of the partition method of Wang which reduces the amount of data transport considerably, without affecting the computational complexity and which has about the same degree of parallelism as the original version.
Abstract: The partition method of Wang, for the solution of tridiagonal linear systems, is analysed with regard to data transport between the processors of a parallel (local memory) computer. We propose a modification which reduces the amount of data transport considerably, without affecting the computational complexity and which has about the same degree of parallelism as the original version. We will also discuss the effects of this modification to a generalized version for banded systems. The parallel solution of a bidiagonal system is considered.
TL;DR: Three parallel sorting algorithms suitable for implementation on tightly coupled multiprocessors are presented and their performance on the Denelcor HEP is compared; the authors' implementation of quickmerge achieves significantly higher speedup than their implementation of parallel quicksort.
Abstract: We present three parallel sorting algorithms suitable for implementation on tightly coupled multiprocessors and compare their performance on the Denelcor HEP. Two of the algorithms implemented, parallel Shellsort and quickmerge, are new. Shellsort is amenable to parallelization; however, since Shellsort has higher complexity than quicksort, parallel Shellsort is inferior to parallel quicksort. A second new parallel algorithm, called quickmerge, is based upon both quicksort and mergesort. Our implementation of quickmerge achieves significantly higher speedup than our implementation of parallel quicksort.
TL;DR: A general Cartesian data distribution scheme is presented which contains many of the existing distribution schemes as special cases and is used to prove optimality of load balance for the grid distribution.
Abstract: A parallel algorithm is derived for LU decomposition with partial pivoting on a local-memory multiprocessor. A general Cartesian data distribution scheme is presented which contains many of the existing distribution schemes as special cases. This scheme is used to prove optimality of load balance for the grid distribution. Experimental results of an implementation of the algorithm in occam-2 on a square mesh of 36 transputers show an efficiency of 88% and a speed of 21.5 Mflop/s for a matrix of size n=1000.
TL;DR: A parallel multigrid solver for steady-state incompressible Navier-Stokes equations on general domains which is currently being developed at the GMD is described.
Abstract: Multigrid methods are distinguished by their optimal (sequential) efficiency and by the fact that all their algorithmical components are fully parallelizable. For this reason, this class of numerical methods is especially attractive for use on parallel (MIMD, local memory) computers. In this paper, we describe a parallel multigrid solver for steady-state incompressible Navier-Stokes equations on general domains which is currently being developed at the GMD. Due to the geometrical generality of the problem, our approach is based on a non-staggered (nodal-point) finite volume scheme on multi-block boundary fitted grids. The typical instability of non-staggered schemes is overcome by suitably modifying the discrete continuity equation without affecting the overall order of consistency. Starting from the most simple Cartesian case, we discuss several possible multigrid approaches to the general 2D-problem. This motivates the basic design decisions of our multigrid solver in regard to both the discretization and the choice of multigrid components (smoothing schemes). Furthermore, the principal technique of parallelization (grid partitioning) is described as well as some fundamental aspects of the implementation (communication library).
TL;DR: Lower bounds for the expected efficiency of a synchronous relaxation method (SRM) are developed and compared with its performance as obtained through computational experiments in a simulated distributed environment on a sequential machine.
Abstract: Gauss-Seidel type relaxation techniques are applied in the context of strictly convex network optimization problems. The algorithm lends itself to processing in a massively distributed environment. A synchronous relaxation method (SRM) is proposed, based on the k-coloring properties of the network graph. The method is tested in a simulated distributed environment on a sequential machine. Lower bounds for the expected efficiency of SRM are developed and compared with its performance as obtained through computational experiments.
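Illustrative sketch (not from the paper): why a k-coloring of the network graph helps a synchronous relaxation scheme. Nodes that share a color have no edge between them, so their Gauss-Seidel updates do not depend on one another and can be carried out in the same synchronous step. The abstract does not say how the coloring is obtained; the greedy heuristic below is an assumption, and the function names are hypothetical. The relaxation update itself is not shown.

```python
def greedy_coloring(adjacency):
    """Assign each node the smallest color not used by a colored neighbor.

    adjacency maps every node to a list of its neighbors.
    """
    color = {}
    for node in sorted(adjacency):
        used = {color[nb] for nb in adjacency[node] if nb in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color


def color_classes(color):
    """Group nodes by color; each group can be relaxed simultaneously."""
    classes = {}
    for node, c in color.items():
        classes.setdefault(c, []).append(node)
    return [sorted(classes[c]) for c in sorted(classes)]


# Usage: a ring of four nodes needs two synchronous phases.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
coloring = greedy_coloring(ring)
phases = color_classes(coloring)
```

One sweep over all nodes then takes as many synchronous phases as there are colors, rather than one phase per node.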
TL;DR: In this paper, the non-zero elements in a sparse matrix are organized in the form of non-overlapping stripes, and only the elements within the stripe structure of the matrix are manipulated.
Abstract: The multiplication of a vector by a matrix and the solution of triangular linear systems are the most demanding operations in the majority of iterative techniques for the solution of linear systems. Data-driven VLSI networks which perform these two operations, efficiently, for certain sparse matrices are introduced. In order to avoid computations that involve zero operands, the non-zero elements in a sparse matrix are organized in the form of non-overlapping stripes, and only the elements within the stripe structure of the matrix are manipulated. Detailed analysis of the networks proves that both operations may be completed in n global cycles with minimal communication overhead, where n is the order of the linear system. The number of cells in each network as well as the communication overhead, are determined by the stripe structure of the matrix. Different stripe structures for the class of sparse matrices generated in Finite Element Analysis are examined in a separate paper.
TL;DR: The methods suggested here increase the degree to which work can be performed while data are communicated between processors, allowing efficient overlap of computation with communication.
Abstract: Methods for efficient computation of numerical algorithms on a wide variety of MIMD machines are proposed. These techniques reorganize the data dependency patterns to improve the processor utilization. The model problem finds the time-accurate solution to a parabolic partial differential equation discretized in space and implicitly marched forward in time. The algorithms are extensions of Jacobi and SOR. The extensions consist of iterating over a window of several timesteps, allowing efficient overlap of computation with communication. The methods increase the degree to which work can be performed while data are communicated between processors. The effect of the window size and of domain partitioning on the system performance is examined by implementing the algorithm on a simulated multiprocessor system.
TL;DR: In this article, a set of tools that can be used as an aid in the analysis of memory access patterns of FORTRAN programs are described, which can be found in Table 1.
Abstract: This paper describes a set of tools that can be used as an aid in the analysis of memory access patterns of FORTRAN programs.
TL;DR: Some approaches to heterogeneous architectures are discussed, hardware and software issues are identified, and several built or proposed systems are analyzed.
Abstract: Various organization styles have been used in the architecture of supercomputers in order to achieve cost-effective performance and programmability. Traditionally, a particular organization style (e.g., vector pipeline processor, array processor, or multiprocessor) has been selected to satisfy the performance requirements of a class of applications, achieving usually a much lower performance in other applications. In addition, the mapping of ‘foreign’ algorithms to a single-style architecture may create great programming difficulties. Since each architecture style provides attractive cost-performance and programming features, the question of heterogeneity (i.e., combining of several architecture/design styles in a single system) deserves attention. In this paper we discuss some approaches to heterogeneous architectures, identify hardware and software issues, and analyze several built or proposed systems.
TL;DR: It is found that, for dyadic operations using the TSKSTART and TSKWAIT synchronization primitives, r∞ = 130 Mflop/s and s1/2 = 5700 flop; a simplified LOCK synchronization written in CAL code reduces s1/2 to 220 flop, which is probably the minimum possible value for synchronization overhead on the Cray X-MP.
Abstract: We report performance measurements made on the 2-CPU Cray X-MP at ECMWF, Reading. Vector (SIMD) performance on one CPU is interpreted by the two parameters (r∞, n1/2), and we find for dyadic operations using FORTRAN r∞ = 70 Mflop/s, n1/2 = 53 flop. All vector triadic operations produce r∞ = 107 Mflop/s, n1/2 = 45 flop; and a triadic operation with two vectors and one scalar gives r∞ = 148 Mflop/s and n1/2 = 60 flop. MIMD performance using both CPUs on one job is interpreted with the two parameters (r∞, s1/2), where s1/2 is the amount of arithmetic that could have been done during the time taken to synchronize the two CPUs. We find, for dyadic operations using the TSKSTART and TSKWAIT synchronization primitives, that r∞ = 130 Mflop/s and s1/2 = 5700 flop. This means that a job must contain more than ∼6000 floating-point operations if it is to run at more than 50% of the maximum performance when split between both CPUs by this method. Less expensive synchronization methods using LOCKS and EVENTS reduce s1/2 to 4000 flop and 2000 flop respectively. A simplified form of LOCK synchronization written in CAL code further reduces s1/2 to 220 flop. This is probably the minimum possible value for synchronization overhead on the Cray X-MP.
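Illustrative sketch (not from the paper): the two-parameter characterisation quoted above is commonly written as r(s) = r∞ / (1 + s1/2 / s), so the achieved rate is exactly half of r∞ when the job contains s = s1/2 operations. Assuming that form, the code below evaluates the rate for the s1/2 values reported in the abstract. The function names are hypothetical.

```python
def achieved_rate(r_inf, half_work, work):
    """Rate in the same unit as r_inf for a job of `work` flop.

    Assumes the model r = r_inf / (1 + half_work / work), where
    half_work is n1/2 (vector case) or s1/2 (synchronization case).
    """
    return r_inf / (1.0 + half_work / work)


def work_for_fraction(half_work, fraction):
    """Flop count needed to reach `fraction` (0 < fraction < 1) of r_inf."""
    return half_work * fraction / (1.0 - fraction)


# Usage with the two-CPU figures from the abstract (r_inf = 130 Mflop/s).
R_INF = 130.0
S_HALF = {"TSKSTART/TSKWAIT": 5700.0, "LOCKS": 4000.0,
          "EVENTS": 2000.0, "CAL LOCK": 220.0}

# Rate achieved by a 6000-flop job under each synchronization method.
rates_at_6000 = {name: achieved_rate(R_INF, s, 6000.0)
                 for name, s in S_HALF.items()}
```

Under this model a job needs s1/2 flop to reach 50% of r∞ and nine times s1/2 to reach 90%, which is why the smaller s1/2 of the cheaper synchronization methods matters for fine-grained work.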
TL;DR: It is demonstrated that grid applications can be implemented quite easily on dm-mp systems if a hardware-independent process system exists and convenient tools (such as the SUPRENUM mapping and communications library) are available.
Abstract: It was shown in the paper of Solchenbach and Trottenberg (in this special issue) that grid algorithms are inherently parallel and that parallel grid algorithms for regular grids can be efficiently implemented on dm-mp systems using the concept of grid partitioning. In this paper, we demonstrate that grid applications can be implemented quite easily on dm-mp systems if a hardware-independent process system exists and convenient tools (such as the SUPRENUM mapping and communications library) are available. The evaluation of parallel grid algorithms shows that the multiprocessor speedup and efficiency for single grid applications depend on the communication/calculation performance ratio of the hardware, on the communication/calculation ratio of the algorithms, and on the process size. The efficiency of parallel multigrid algorithms additionally depends on the number of nodes.
TL;DR: A new algorithm designed for large, single transforms is presented, which employs a pair of multiple transforms to perform the single transform.
Abstract: The Fast Fourier Transform algorithm does not readily lend itself to efficient implementation on vector computers, especially on machines where sequential access is important. Several authors have commented that the efficiency of computation is much improved if many transforms are performed simultaneously. We present a new algorithm designed for large, single transforms, which employs a pair of multiple transforms to perform the single transform. The merits of the algorithm are discussed with reference to its implementation on a CDC CYBER 205.
TL;DR: A proposal for an objective measure of a library's complexity is put forward along with a collection of subjective issues which should be considered in reference to parallel libraries.
Abstract: Several libraries for parallel processing on supercomputers are analyzed in terms of their parallel processing facilities, complexity, use, and how they reflect the hardware for which they were designed. A proposal for an objective measure of a library's complexity is put forward along with a collection of subjective issues which should be considered in reference to parallel libraries. The libraries discussed include those provided by ETA, CRAY, IBM, and FPS, as well as a portable parallel library developed by one of the authors. A brief discussion of how these libraries address the basic concepts of parallel processing is provided.
TL;DR: The fundamental concepts and structure of the distributed operating system, PEACE, for SUPRENUM, are described; encapsulating operating system services in processes makes an optimal and application-oriented mapping of the entire operating system onto the distributed SUPRENUM architecture feasible.
Abstract: This paper describes the fundamental concepts and structure of the distributed operating system, PEACE, for SUPRENUM. A large scale of distribution is achieved by consistently encapsulating typical operating system services in processes. In this way an optimal and application-oriented mapping of the entire operating system onto the distributed SUPRENUM architecture is made feasible.
TL;DR: This application provides a simple example of the problems associated with load balancing on distributed memory architectures and introduces the use of provably optimal global communication algorithms that are needed for the PIC implementation on the hypercube.
Abstract: We have mapped onto the iPSC hypercube a particle-in-cell (PIC) algorithm that executes a plasma simulation. PIC simulates the movement of charged particles under the influence of an electrostatic field. This application provides a simple example of the problems associated with load balancing on distributed memory architectures. We present several alternative solutions to mappings of the algorithm onto the hypercube. One solution's performance is modeled and benchmarked with data from an implementation on the iPSC. The model is used to predict performance for larger size problems and a state-of-the-art hypercube architecture. We also introduce the use of provably optimal global communication algorithms that are needed for the PIC implementation on the hypercube.