TL;DR: A general purpose multiprocessor should be scalable, i.e. show higher performance when more hardware resources are added to the machine, and its architects must address the loss in processor efficiency due to two fundamental issues: long memory latencies and waits due to synchronization events.
Abstract: A general purpose multiprocessor should be scalable, i.e. show higher performance when more hardware resources are added to the machine. Architects of such multiprocessors must address the loss in processor efficiency due to two fundamental issues: long memory latencies and waits due to synchronization events. It is argued that a well designed processor can overcome these losses provided there is sufficient parallelism in the program being executed. The detrimental effect of long latency can be reduced by instruction pipelining; however, the restriction of a single thread of computation in von Neumann processors severely limits their ability to have more than a few instructions in the pipeline. Furthermore, techniques to reduce the memory latency tend to increase the cost of task switching. The cost of synchronization events in von Neumann machines makes decomposing a program into very small tasks counter-productive. Dataflow machines, on the other hand, treat each instruction as a task, and by paying a small synchronization cost for each instruction executed, offer the ultimate flexibility in scheduling instructions to reduce processor idle time.
TL;DR: It is shown that the mapping problem—assigning processes to processors—can be reduced to the graph partitioning problem and an evolution method derived from biology is applied to the travelling salesman problem.
Abstract: Problems concerning parallel computers can be understood and solved using results of natural sciences. We show that the mapping problem—assigning processes to processors—can be reduced to the graph partitioning problem. We solve the partitioning problem by an evolution method derived from biology. The evolution method is then applied to the travelling salesman problem. The competition part of the evolution method gives an almost linear speedup compared to the sequential method. A cooperation method leads to new heuristics giving better results than the known heuristics.
TL;DR: The features of Blaze are described, along with how the language would be used in typical scientific programming; Blaze is intended to allow highly parallel execution on multiprocessor architectures while still providing the user with conceptually sequential control flow.
Abstract: A Pascal-like scientific programming language, Blaze, is described. Blaze contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of Blaze is portability across a broad range of parallel architectures. The multiple levels of parallelism present in Blaze code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of Blaze are described, and it is shown how this language would be used in typical scientific programming.
TL;DR: A collection of papers on parallel computing projects, including Parallel Supercomputing Today and the Cedar Approach (D.J. Kuck et al.) and Questions and Unexpected Answers in Concurrent Computation (G.C. Fox).
Abstract: Parallel Supercomputing Today and the Cedar Approach (D.J. Kuck et al.). An Overview of the NYU Ultracomputer Project (A. Gottlieb). Questions and Unexpected Answers in Concurrent Computation (G.C. Fox). An Introduction to the IBM Research Parallel Processor Prototype (RP3) (G.F. Pfister et al.). Large Scale Parallel Computation on a Loosely Coupled Array of Processors (E. Clementi, J. Detrich). The Manchester Dataflow Computing System (J. Gurd, C. Kirkham, W. Bohm). The Cornell Parallel Supercomputing Effort (A. Nicolau). The GF11 Parallel Computer (J. Beetem, M. Denneau, D. Weingarten). Index.
TL;DR: A graphical model is described that profiles the execution of the barriers and other parallel programming constructs and shows that in order to achieve the best performance, different situations call for different barrier implementations.
Abstract: A barrier is a method for synchronizing a large number of concurrent computer processes. After considering some basic synchronization mechanisms, a collection of barrier algorithms with either linear or logarithmic depth are presented. A graphical model is described that profiles the execution of the barriers and other parallel programming constructs. This model shows how the interaction between the barrier algorithms and the work that they synchronize can impact their performance. One result is that logarithmic tree structured barriers show good performance when synchronizing fixed length work, while linear self-scheduled barriers show better performance when synchronizing fixed length work with an embedded critical section. The linear barriers are better able to exploit the process skew associated with critical sections. Timing experiments that support these conclusions, performed on an eighteen-processor Flex/32 shared memory multiprocessor, are detailed.
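As an illustration of the simplest kind of linear-depth barrier mentioned above, here is a minimal sketch of a reusable central-counter barrier. It is not one of the paper's algorithms: the class name `CounterBarrier`, the `run_demo` helper, and the use of Python threads in place of processes on a shared memory multiprocessor are all assumptions made for the example.

```python
import threading


class CounterBarrier:
    """Reusable central-counter barrier (hypothetical name, illustrative only).

    Arrivals are serialized through one lock, so the depth is linear
    in the number of participants.
    """

    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.generation = 0  # distinguishes successive uses of the barrier
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.parties:
                # Last arrival: reset the counter and release everyone.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                # Sleep until the last arrival advances the generation.
                while gen == self.generation:
                    self.cond.wait()


def run_demo(parties=4):
    """Each worker logs 'before', crosses the barrier, then logs 'after'."""
    barrier = CounterBarrier(parties)
    log = []
    log_lock = threading.Lock()

    def worker(i):
        with log_lock:
            log.append(("before", i))
        barrier.wait()
        with log_lock:
            log.append(("after", i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because no worker can pass `wait()` until every worker has called it, all "before" entries in the log precede all "after" entries. A logarithmic-depth barrier would replace the single counter with a tree of such counters.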
TL;DR: A row-oriented implementation of Gaussian elimination with partial pivoting on a local-memory multiprocessor is described, together with a simple, inexpensive load-balancing scheme that maintains computational balance in the presence of pivoting; experiments on an Intel hypercube are presented.
Abstract: A row-oriented implementation of Gaussian elimination with partial pivoting on a local-memory multiprocessor is described. In the absence of pivoting, the initial data loading of the node processors leads to a balanced computation. However, if interchanges occur, the computational loads on the processors may become unbalanced, leading to inefficiency. A simple load-balancing scheme is described which is inexpensive and which maintains computational balance in the presence of pivoting. Using some reasonable assumptions about the probability of pivoting occurring, an analysis of the communication costs of the algorithm is developed, along with an analysis of the computation performed in each node processor. This model is then used to derive the expected speedup of the algorithm. Finally, experiments using an Intel hypercube are presented in order to demonstrate the extent to which the analytical model predicts the performance.
TL;DR: A parallel implementation of the QR-algorithm for the eigenvalues of a non-Hermitian matrix is proposed, designed to run efficiently on a linear array of processors that communicate by accessing their neighbor's memory.
Abstract: In this paper a parallel implementation of the QR-algorithm for the eigenvalues of a non-Hermitian matrix is proposed. The algorithm is designed to run efficiently on a linear array of processors that communicate by accessing their neighbor's memory. A module for building such arrays, the Maryland CRAB, is also described.
TL;DR: Text retrieval experiments using three large collections of documents and queries demonstrate the efficiency of the suggested approach to text signatures, fixed-length bit string representations of document content.
Abstract: This paper considers the use of text signatures, fixed-length bit string representations of document content, in an experimental information retrieval system: such signatures may be generated from the list of keywords characterising a document or a query. A file of documents may be searched in a bit-serial parallel computer, such as the ICL Distributed Array Processor, using a two-level retrieval strategy in which a comparison of a query signature with the file of document signatures provides a simple and efficient means of identifying those few documents that need to undergo a computationally demanding, character matching search. Text retrieval experiments using three large collections of documents and queries demonstrate the efficiency of the suggested approach.
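The two-level strategy described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's system: the signature length `SIG_BITS`, the number of bits set per keyword `BITS_PER_WORD`, the SHA-256 based hashing, and the subset test used to compare signatures are all illustrative choices, and the exact word match in `search` stands in for the character matching search.

```python
import hashlib

SIG_BITS = 64      # assumed signature length in bits
BITS_PER_WORD = 3  # assumed number of bits set per keyword


def signature(keywords):
    """Superimpose the hashed bit positions of each keyword into one bit string."""
    sig = 0
    for word in keywords:
        digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
        for k in range(BITS_PER_WORD):
            pos = int.from_bytes(digest[2 * k:2 * k + 2], "big") % SIG_BITS
            sig |= 1 << pos
    return sig


def candidate_docs(query_keywords, doc_signatures):
    """First level: keep documents whose signature covers every query bit."""
    q = signature(query_keywords)
    return [doc_id for doc_id, s in doc_signatures.items() if s & q == q]


def search(query_keywords, docs):
    """Second level: exact keyword match, run only on the first-level candidates."""
    doc_sigs = {doc_id: signature(text.split()) for doc_id, text in docs.items()}
    wanted = {w.lower() for w in query_keywords}
    hits = []
    for doc_id in candidate_docs(query_keywords, doc_sigs):
        words = {w.lower() for w in docs[doc_id].split()}
        if wanted <= words:
            hits.append(doc_id)
    return hits
```

The first level can return false positives (unrelated keywords may set the same bits) but never misses a document that contains all query keywords, which is why the more expensive second pass is still needed.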
TL;DR: A short term solution to the problem of producing transportable mathematical software is presented: a package called SCHEDULE, which provides a standard user interface to several shared memory parallel machines.
Abstract: The emergence of commercially produced parallel computers has greatly increased the problem of producing transportable mathematical software. Exploiting these new parallel capabilities has led to extensions of existing languages such as FORTRAN and to proposals for the development of entirely new parallel languages. We present an attempt at a short term solution to the transportability problem. The motivation for developing the package has been to extend capabilities beyond loop based parallelism and to provide a convenient machine independent user interface. A package called SCHEDULE is described which provides a standard user interface to several shared memory parallel machines. A user writes standard FORTRAN code and calls SCHEDULE routines which express and enforce the large grain data dependencies of his parallel algorithm. Machine dependencies are internal to SCHEDULE and change from one machine to another but the user's code remains essentially the same across all such machines. The semantics and usage of SCHEDULE are described and several examples of parallel algorithms which have been implemented using SCHEDULE are presented.
TL;DR: This work has been testing kernels and codes on a CRAY-2 prior to the delivery of a machine to Harwell in 1987 and reports some results on the solution of sparse equations which indicate that high efficiency can be obtained.
Abstract: The CRAY-2 is sometimes viewed as a CRAY-1 with a faster cycle time. We show that this viewpoint is not completely appropriate by describing some of the important architectural differences between the machines and indicating how they can be used to facilitate numerical calculations. We have been testing kernels and codes on a CRAY-2 prior to the delivery of a machine to Harwell in 1987 and report some results on the solution of sparse equations which indicate that high efficiency can be obtained. We give details of one of the techniques for attaining this performance. We also comment on the use of parallelism in the solution of sparse linear equations on the CRAY-2.
TL;DR: This paper shows how a block version of the two-sided Jacobi method can be used to compute the SVD efficiently on a distributed architecture and compares two variants of this method that differ mainly in the degree to which they diagonalize a given subproblem.
Abstract: Jacobi methods for computing the singular value decomposition (SVD) of a matrix are ideally suited for multiprocessor environments due to their inherent parallelism. In this paper we show how a block version of the two-sided Jacobi method can be used to compute the SVD efficiently on a distributed architecture. We compare two variants of this method that differ mainly in the degree to which they diagonalize a given subproblem. The first method is a true block generalization of the scalar scheme in that each off-diagonal block is completely annihilated. The second method is a scalar Jacobi algorithm reorganized in such a manner that it conforms to the block decomposition of the problem. We have performed experiments on the Loosely Coupled Array Processor (LCAP) system at IBM Kingston which for the purposes of this article can be viewed as a ring of up to ten FPS-164/MAX array processors. These experiments show that the block Jacobi algorithm performs well on a distributed system, especially when the processors have vector processing hardware. As an example, we were able to achieve a sustained performance of 159 MFlops on a 960-by-720 SVD problem using eight processors. A surprising outcome of these experiments is that the determining factor for the performance of the algorithm on a high-performance architecture is the subproblem solver, not the communication overhead of the algorithm.
TL;DR: An alternative decomposition for a tridiagonal matrix, equivalent to the two-sided Gaussian elimination algorithm, is analysed; the decomposition as well as the subsequent solution process can be done in two parallel parts.
Abstract: We analyse an alternative decomposition for a tridiagonal matrix which has the property that the decomposition as well as the subsequent solution process can be done in two parallel parts. This decomposition is equivalent to the two-sided Gaussian elimination algorithm that has been discussed by Babuska. In the context of parallel computing a similar approach has been suggested by Joubert and Cloete. The computational complexity of this alternative decomposition is the same as for the standard decomposition and a remarkable aspect is that it often leads to slightly more accurate solutions than the standard process does. The algorithm can be combined with recursive doubling or cyclic reduction in order to increase the degree of parallelism and vectorizability.
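To make the idea of eliminating from both ends concrete, here is a minimal sequential sketch of a two-sided solve for a tridiagonal system. It is an illustration only, not the paper's formulation: the function name `solve_two_sided`, the storage convention (`a[i]`, `b[i]`, `c[i]` are the sub-diagonal, diagonal and super-diagonal entries of row `i`, with `a[0]` and `c[n-1]` unused), the choice of the middle row as meeting point, and the absence of pivoting are all assumptions.

```python
def solve_two_sided(a, b, c, d):
    """Solve a tridiagonal system by eliminating from both ends toward the middle.

    Row i reads a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
    No pivoting is done, so nonzero pivots are assumed
    (e.g. a diagonally dominant matrix).
    """
    n = len(b)
    m = n // 2  # meeting row (assumed choice)
    bp = [float(v) for v in b]
    dp = [float(v) for v in d]

    # Top sweep: remove the sub-diagonal in rows 1..m-1.
    for i in range(1, m):
        w = a[i] / bp[i - 1]
        bp[i] -= w * c[i - 1]
        dp[i] -= w * dp[i - 1]

    # Bottom sweep: remove the super-diagonal in rows n-2..m+1.
    # It touches a disjoint set of rows, so it could run in parallel with the top sweep.
    for i in range(n - 2, m, -1):
        w = c[i] / bp[i + 1]
        bp[i] -= w * a[i + 1]
        dp[i] -= w * dp[i + 1]

    # The meeting row receives a contribution from each sweep.
    if m > 0:
        w = a[m] / bp[m - 1]
        bp[m] -= w * c[m - 1]
        dp[m] -= w * dp[m - 1]
    if m < n - 1:
        w = c[m] / bp[m + 1]
        bp[m] -= w * a[m + 1]
        dp[m] -= w * dp[m + 1]

    # Substitute outward from the meeting row; the two loops are again independent.
    x = [0.0] * n
    x[m] = dp[m] / bp[m]
    for i in range(m - 1, -1, -1):
        x[i] = (dp[i] - c[i] * x[i + 1]) / bp[i]
    for i in range(m + 1, n):
        x[i] = (dp[i] - a[i] * x[i - 1]) / bp[i]
    return x
```

For example, with diagonal 2, off-diagonals -1 and right-hand side `[0, 0, 0, 0, 6]`, the solution is `[1, 2, 3, 4, 5]`.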
TL;DR: The problem-heap paradigm is illustrated by algorithms which have been implemented and analyzed using the Multi-Maren multiprocessor.
Abstract: The problem-heap paradigm has evolved through four years of experiments with the Multi-Maren multiprocessor. Problem-heap algorithms have been formulated for a number of different tasks such as numerical problems, sorting, searching and optimization. Although these tasks are very different, the analyses of the running times of all the problem-heap algorithms are very similar. The problem-heap paradigm is illustrated by algorithms which have been implemented and analyzed using the Multi-Maren multiprocessor.
TL;DR: A parallel Monte Carlo photon transport algorithm that ensures the reproducibility of results is presented; it introduces a pair of pseudo-random number generators that make it possible to reproduce results exactly in an asynchronous parallel processing environment.
Abstract: We present a parallel Monte Carlo photon transport algorithm that ensures the reproducibility of results. The important feature of this parallel implementation is the introduction of a pair of pseudo-random number generators. This pair of generators is structured in such a manner as to ensure minimal correlation between the two sequences of pseudo-random numbers produced. We term this structure a ‘pseudo-random tree’. Using this structure, we are able to reproduce results exactly in an asynchronous parallel processing environment. The algorithm tracks the history of photons as they interact with two carbon cylinders joined end to end. The algorithm was implemented on both a Denelcor HEP and a CRAY X-MP/48. We describe the algorithm and the pseudo-random tree structure and present speedup results of our implementation.
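One way to picture a pair of generators arranged as a tree is sketched below: one generator advances along a single photon history, the other produces the starting seed of each new history, so every history gets the same numbers no matter when or where it is processed. This is a sketch under assumptions, not the paper's construction: the abstract does not say which generators are used, so the two linear congruential generators, their constants, and the names `left`, `right`, `history_seeds`, `simulate_history` and `run` are hypothetical, and no claim is made here about the correlation between the two sequences.

```python
MOD = 2 ** 64


def left(state):
    """Next state within one photon history (illustrative LCG constants)."""
    return (6364136223846793005 * state + 1442695040888963407) % MOD


def right(state):
    """Seed of the next photon history (a second LCG, constants also illustrative)."""
    return (2862933555777941757 * state + 3037000493) % MOD


def history_seeds(root_seed, n_histories):
    """Walk the 'right' branch to give every history its own starting seed."""
    seeds = []
    s = root_seed
    for _ in range(n_histories):
        seeds.append(s)
        s = right(s)
    return seeds


def simulate_history(seed, n_events=5):
    """Stand-in for tracking one photon: draw n_events numbers in [0, 1)."""
    draws = []
    s = seed
    for _ in range(n_events):
        s = left(s)
        draws.append((s >> 11) / 2 ** 53)  # top 53 bits as a float in [0, 1)
    return draws


def run(root_seed, n_histories, order):
    """Process histories in the given order; results are indexed by history number."""
    seeds = history_seeds(root_seed, n_histories)
    results = {}
    for h in order:
        results[h] = simulate_history(seeds[h])
    return [results[h] for h in range(n_histories)]
```

Running the histories in forward or reverse order yields identical per-history results, which is the reproducibility property the abstract describes for an asynchronous environment.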
TL;DR: The NEC supercomputer SX system is a high-speed, large-scale supercomputer designed for scientific and engineering computations, and the standard vectorizing FORTRAN, various performance tuning tools and the operating system with various features are supported.
Abstract: The NEC supercomputer SX system is a high-speed, large-scale supercomputer designed for scientific and engineering computations. It features 16 vector pipelines with a vector peak speed of 1.3 Gflops, a simplified scalar design with a control and arithmetic processor, and 256 Mbytes of main memory with 512 banks. To achieve ease of use, the standard vectorizing FORTRAN, various performance tuning tools and an operating system with various features are supported.
TL;DR: This work presents a parallel algorithm for symbolic Cholesky factorization of sparse symmetric matrices that complements a parallel numeric factorization algorithm published earlier and describes two enhancements that improve performance.
Abstract: We present a parallel algorithm for symbolic Cholesky factorization of sparse symmetric matrices. The symbolic factorization algorithm complements a parallel numeric factorization algorithm published earlier. The implementation is designed for a message-passing, distributed-memory multiprocessor. In addition to discussing the basic algorithm and data structures required, we also describe two enhancements that improve performance. Empirical test results obtained on an Intel iPSC hypercube are given.
TL;DR: In this communication, a computational network for the Gauss-Jordan algorithm is presented and it is shown that this network compares favorably with optimal implementations of the Gauss elimination/back substitution algorithm.
Abstract: Any factorization/back substitution scheme for the solution of linear systems consists of two phases which are different in nature, and hence may be inefficient for parallel implementation on a single computational network. The Gauss-Jordan elimination scheme unifies the nature of the two phases of the solution process and thus seems to be more suitable for parallel architectures, especially if reconfiguration of the communication pattern is not permitted. In this communication, a computational network for the Gauss-Jordan algorithm is presented. This network compares favorably with optimal implementations of the Gauss elimination/back substitution algorithm.
TL;DR: It is observed that the FLEX/32 does not have any communication bottleneck and probably will not suffer substantial communication performance degradation if the processor speeds are increased by a factor of 10, and the partitioning method works reasonably well even here where communication costs are negligible.
Abstract: We consider modeling, predicting and evaluating the performance of methods for solving PDEs in parallel architectures. We have developed a method for coarse grain partitioning of computations for parallel architectures and we apply it to three PDE applications: (a) Cholesky factorization, (b) spline collocation, and (c) an application complete from processing text input to plotting the PDE solution. Our partitioning method is oriented to minimizing interprocessor communication and we review some ‘uniform’ architectures and models of their communication. We apply this method to the three applications implemented on the FLEX/32 multicomputer. We review the architecture of the FLEX/32 and the results of applying the partitioning method to computation running on the FLEX/32. We observe that the FLEX/32 does not have any communication bottleneck and probably will not suffer substantial communication performance degradation if the processor speeds are increased by a factor of 10. Our partitioning method works reasonably well even here where communication costs are negligible. The coarse grain structure of two of these applications is not highly parallel and we observe speedups of about k/2 for k processors. The other application is highly parallel and we observe optimal speedups for any number of processors as the problem size increases.
TL;DR: A library of parallel vector and matrix operations for hypercube multiprocessors that supports both full and sparse matrices is described and it is shown that these algorithms perform at high computational efficiency on both the Caltech and Intel hypercubes.
Abstract: We describe a library of parallel vector and matrix operations for hypercube multiprocessors that supports both full and sparse matrices. The library includes operations such as vector arithmetic, inner products, matrix transpose, matrix-vector and matrix-matrix multiplication and rank one updates. The library should be generally applicable to a wide range of architectures. Performance of the library routines depends on the ability to map various topological graphs onto the processor network. In the case of hypercubes we have used such mappings for binary trees, hierarchies of rings and rectangular grids. We describe algorithms for the solution of elliptic and hyperbolic equations on parallel computers, and present results of several implementations. The library is a fundamental tool in the development of the PDE solution algorithms and all machine dependencies of these algorithms are hidden in the linear algebra package. We show that these algorithms perform at high computational efficiency on both the Caltech and Intel hypercubes. Solution methods involved include preconditioned conjugate gradient, multigrid methods, and for hyperbolic problems, both explicit finite differences and the random choice method. These algorithms implement substantial parts of many fluid dynamics calculations.
TL;DR: A number of parameters are defined for describing the variation of computer performance with vector length, work per memory reference, grain size, and the number of processors used: respectively n_{1/2}, f_{1/2}, s_{1/2}, and the parameter pair (p_{1/2}, p).
Abstract: A number of parameters are defined for describing the variation of computer performance with vector length, work per memory reference, grain size, and the number of processors used. These are, respectively, n_{1/2}, f_{1/2}, s_{1/2}, and the parameter pair (p_{1/2}, p). Where possible, parameters are related to quantities that are likely to be known by the programmer, and which are easily measured. Measurements of the parameters are given for the Denelcor HEP, CRAY X-MP, FPS 5000 and an IBM 4381 hosting ten FPS-164 computers.
TL;DR: The linear equation solver in the analysis part of SESAM is described, which is well suited for vector and parallel processing and uses substructure techniques at the highest level.
Abstract: The task of converting large scale engineering programs to new computer architectures is expensive and nontrivial. An example of such a program is the structural analysis system SESAM. This paper describes the linear equation solver in the analysis part of SESAM. The algorithm is well suited for vector and parallel processing. The method uses substructure techniques at the highest level. Block sparsity is exploited at an intermediate level, while a new, sparse implementation of the extended BLAS routines forms the basis for the lowest level of the algorithm. Several problems unique to large scale general programs are described in relation to new computer technology.
TL;DR: A programming strategy for migrating codes from a conventional sequential system to a parallel one is discussed, and the performance of a variety of applications programs is analyzed to demonstrate the merits of this approach.
Abstract: We discuss two experimental parallel computer systems, LCAP-1 and LCAP-2, which can be applied to the entire spectrum of scientific and engineering applications. These systems achieve “supercomputer” levels of performance by spreading large scale computations across multiple cooperating processors, several with vector capabilities. We outline system hardware and software, and discuss our programming strategy for migrating codes from a conventional sequential system to a parallel one. The performance of a variety of applications programs is analyzed to demonstrate the merits of this approach. Finally, we discuss LCAP-3, an extension to this computing system, which has been recently assembled.
TL;DR: This paper tries to resolve the controversy over the statement that in parallel algorithms superlinear speedup can be achieved.
Abstract: In the July 1986 edition of Parallel Computing two articles appeared claiming that A and ¬A, respectively, were true, where A is the statement that in parallel algorithms superlinear speedup can be achieved. This paper tries to resolve the controversy.
TL;DR: The systolic array construction of extrapolation tables for ODE initial value problems is examined first for Euler's method combined with extrapolation and then extended to the Bulirsch and Stoer algorithm, giving a generic form for such systolic arrays.
Abstract: We consider here the systolic array construction of extrapolation tables used in the solution of Ordinary Differential Equations (ODEs) associated with initial value type problems. The technique is examined first for a low order formula, i.e. Euler's method which is combined with extrapolation to improve the estimates of the solution. We also show how this is extended to the Bulirsch and Stoer algorithm and hence a generic form can be given to systolic arrays for the construction of extrapolation tables.
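For readers unfamiliar with extrapolation tables, the following sequential sketch builds one for Euler's method. It illustrates only the table itself, not the systolic array design: the function names `euler` and `extrapolation_table`, the step-halving sequence (1, 2, 4, ... steps), and the standard Richardson recurrence for an error expansion in powers of the step size are assumptions of the example.

```python
def euler(f, t0, y0, t_end, n_steps):
    """Integrate y' = f(t, y) from t0 to t_end with n_steps explicit Euler steps."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y += h * f(t, y)
        t += h
    return y


def extrapolation_table(f, t0, y0, t_end, rows):
    """Build a triangular extrapolation table.

    Column 0 of row i is the Euler result with 2**i steps.
    Column k combines two entries of column k-1 to cancel the h**k error term:
        T[i][k] = T[i][k-1] + (T[i][k-1] - T[i-1][k-1]) / (2**k - 1)
    """
    table = []
    for i in range(rows):
        row = [euler(f, t0, y0, t_end, 2 ** i)]
        for k in range(1, i + 1):
            row.append(row[k - 1] + (row[k - 1] - table[i - 1][k - 1]) / (2 ** k - 1))
        table.append(row)
    return table
```

Each entry depends only on its left neighbour and the entry above that neighbour, which is the kind of regular, local data flow that maps naturally onto a systolic array. For y' = y, y(0) = 1 on [0, 1], the last diagonal entry of a six-row table is far closer to e than the plain Euler value in the same row.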
TL;DR: The relative performance comparison of different algorithms on the EC 1045-EC 2345 system is presented and the effectiveness of algorithms and optimal evaluation of programs is discussed.
Abstract: This paper examines the performance and implementation of algorithms on host-computer/attached array-processor systems. A performance analysis method for such systems is reviewed with the application to the host-computer EC 1045/array-processor EC 2345 system produced in the USSR. The relative performance comparison of different algorithms on the EC 1045-EC 2345 system is presented and the effectiveness of algorithms and optimal evaluation of programs is discussed. The results are demonstrated on several numerical linear algebra algorithms. Theoretical results are compared with numerical experiments.
TL;DR: Comment on the two short communications in the July 1986 issue of Parallel Computing entitled "Superlinear speedup of an efficient serial algorithm is not possible" and "Parallel efficiency can be greater than unity" with opposite viewpoints.
Abstract: We are compelled to comment on the two short communications in the July 1986 issue of Parallel Computing on the same subject but with opposite viewpoints. They are entitled "Superlinear speedup of an efficient serial algorithm is not possible" [1] and "Parallel efficiency can be greater than unity" [2]. The first communication was authored by us and showed that from an algorithm point of view, superlinear speedup is impossible unless hardware considerations like context switching or memory size are considered. The proof that we employed was to take any algorithm that purported to have superlinear speedup and show that by 'emulating' the parallel algorithm in a serial manner, we arrive at a sequential algorithm that is at worst P times slower than the parallel (where P is the number of processors). It is important to note that this statement is a theorem, which follows logically from two assumptions: (1) hardware considerations are ignored and (2) the definition of speedup. There are two ways in which one may argue with our theorem. One might disagree with our assumptions or one might claim our proof to be invalid. If one chooses to claim that hardware considerations cannot be ignored, then our theorem may not hold. However, in evaluating a particular parallel algorithm, it is clear that the correct definition of speedup should be
TL;DR: The Fitted Diagonals method is proposed in order to halve the number of processors within systolic arrays for matrix computation and it is shown that processor utilization is improved and the time spent by the algorithms is kept invariant.
Abstract: In this work the Fitted Diagonals method is proposed in order to halve the number of processors within systolic arrays for matrix computation. The method is applied to Leiserson's systolic arrays and it is shown that processor utilization is improved. The time spent by the algorithms is kept invariant but at the hardware level some modifications are required.
TL;DR: Seven internal methods for sorting a set (a1, a2,…,an) of real numbers into non-descending order are compared with regard to their performance on the vector computers CRAY-1S, CRAY-1M, CRAY X-MP, AMDAHL 1100, AMDAHL 1200 and the AMDAHL 470/V7.
Abstract: Seven internal methods for sorting a set (a1, a2,…,an) of real numbers into non-descending order are compared with regard to their performance on the vector computers CRAY-1S, CRAY-1M, CRAY X-MP, AMDAHL 1100, AMDAHL 1200 and the AMDAHL 470/V7. The algorithms considered are: Bubble sort, odd-even transposition sort, Batcher's parallel merge-exchange sort, heapsort, quicksort, vector quicksort and diamond sort. Moreover, certain variants of some of these algorithms are also considered. The suitability of the algorithms with respect to vector machine implementation is discussed and the FORTRAN Cray codes for Batcher's parallel merge-exchange sort as well as Diamond sort are given.
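Of the algorithms listed above, odd-even transposition sort is the easiest to show in a few lines. The sketch below is a plain sequential Python version written for illustration (the function name is an assumption); it is not the FORTRAN code from the paper.

```python
def odd_even_transposition_sort(values):
    """Sort into non-descending order using alternating even and odd phases.

    Within a phase, the compare-exchange steps touch disjoint pairs,
    so they are independent of each other and could all run at once.
    After n phases a list of n elements is sorted.
    """
    a = list(values)
    n = len(a)
    for phase in range(n):
        start = phase % 2  # even phases pair (0,1),(2,3),...; odd phases pair (1,2),(3,4),...
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

The independence of the comparisons within a phase is what makes this method a natural candidate for vector hardware, even though it performs more comparisons overall than heapsort or quicksort.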
TL;DR: It is shown that simple extrapolations from few processor/single problem environments have very little relevance to environments with many processors.
Abstract: The ‘cost’ of most computations of a multiple processor system includes components due to the need to communicate data amongst the processors. These costs for a set of global algorithms are examined and a figure of merit, the balance factor, is defined for mesh connected processors. In particular the costs of the SUMMATION operator are considered in a wide range of contexts and it is shown that simple extrapolations from few processor/single problem environments have very little relevance to environments with many processors.
TL;DR: A new parallel algorithm for transforming an arithmetic infix expression into a parse tree is presented, based on a result due to Fischer (1980) which enables the construction of the parse tree, by appropriately scanning the vector of precedence values associated with the elements of the expression.
Abstract: A new parallel algorithm for transforming an arithmetic infix expression into a parse tree is presented. The technique is based on a result due to Fischer (1980) which enables the construction of the parse tree, by appropriately scanning the vector of precedence values associated with the elements of the expression. The algorithm presented here is suitable for execution on a shared memory model of an SIMD machine with no read/write conflicts permitted. It uses O(n) processors and has a time complexity of O(log2n) where n is the expression length. Parallel algorithms for generating code for an SIMD machine are also presented.