TL;DR: The partition method of Wang for tridiagonal equations is generalized to the arbitrary band case and the algorithm is compared to Gaussian elimination and cyclic reduction.
Abstract: The partition method of Wang for tridiagonal equations is generalized to the arbitrary band case. A stability criterion is given. The algorithm is compared to Gaussian elimination and cyclic reduction.
TL;DR: A broad classification of MIMD computers is proposed, and the computers are discussed in this framework, together with brief details of the architecture and performance of each machine.
Abstract: It is often said that the 1980s are becoming the decade of multiinstruction stream or MIMD computers, while the 1970s could be described as the decade of the SIMD (single instruction stream multiple data stream) computers. The availability of microprocessors and VLSI facilities has led to the proposal and construction of novel computer architectures based on linking many hundreds or even thousands of microprocessors, or specially designed VLSI chips. Some of the larger manufacturers offer computers with a small number of CPUs. Because of the variety of the new developments, it was decided to conduct a survey of proposed and existing MIMD computers in the U.S., taking into account a simple classification of the different devices. Particular attention is given to computers which are designed for numerical work with floating-point numbers and the solution of large problems in physics, chemistry, and engineering.
TL;DR: Variants of the numerical Schwarz algorithms for solving elliptic partial differential equations on multiprocessing systems are described, and it is shown that, under certain matrix nonnegativity conditions, the convergence rate of the global iteration is invariant to the amount of overlap of the subdomains.
Abstract: Variants of the numerical Schwarz algorithms for solving elliptic partial differential equations on multiprocessing systems are described and analyzed. The methods are described in terms of domain decomposition techniques and mathematically cast into an inner/outer iterative form. It is shown that, under certain matrix nonnegativity conditions, the convergence rate of the global iteration is invariant to the amount of overlap of the subdomains.
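As an illustration of the domain decomposition idea, the following is a minimal sketch of the classical alternating (multiplicative) Schwarz iteration on a 1D model problem, -u'' = 1 on (0, 1) with zero boundary values. It is not one of the variants analyzed in the paper; the model problem, the function names `solve_local` and `schwarz_1d`, and the index parameters `a` and `b` are assumptions made for this example.

```python
def solve_local(rhs, left, right):
    """Solve tridiag(-1, 2, -1) u = rhs on one subdomain (Thomas algorithm).

    `left` and `right` are the Dirichlet values just outside the subdomain.
    """
    m = len(rhs)
    d = list(rhs)
    d[0] += left           # move known boundary values to the right-hand side
    d[-1] += right
    c = [0.0] * m          # modified superdiagonal
    c[0] = -1.0 / 2.0
    d[0] = d[0] / 2.0
    for i in range(1, m):  # forward elimination
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (d[i] + d[i - 1]) / denom
    for i in range(m - 2, -1, -1):  # back substitution
        d[i] -= c[i] * d[i + 1]
    return d


def schwarz_1d(n, a, b, sweeps):
    """Alternating Schwarz with subdomains [0, b) and [a, n), 0 < a < b < n."""
    h = 1.0 / (n + 1)
    f = [h * h] * n        # scaled right-hand side for -u'' = 1
    u = [0.0] * n          # initial guess
    for _ in range(sweeps):
        # Inner solve on subdomain 1, using the current value at index b.
        u[0:b] = solve_local(f[0:b], 0.0, u[b])
        # Inner solve on subdomain 2, using the updated value at index a - 1.
        u[a:n] = solve_local(f[a:n], u[a - 1], 0.0)
    return u
```

Each sweep is one outer iteration and each call to `solve_local` is an inner solve, which mirrors the inner/outer form mentioned in the abstract. For example, `schwarz_1d(19, 8, 12, 40)` iterates with an overlap of four grid points.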
TL;DR: An overview of the promises and accomplishments of parallel processing, as well as the problems and work that remain, shows that the field is at an interesting juncture.
Abstract: We are on the threshold of a new era in computer architecture. It is becoming increasingly difficult to obtain more performance from the time-honored von Neumann model, and many of the technological constraints that influenced its design over thirty years ago have changed drastically. Many of the arguments for processing a single instruction at a time no longer apply, and a number of enthusiastic parallel processing projects are working on various ways to allow many processors to work on a single problem at the same time. However, this re-opens a Pandora's box of questions about how computation should be done, and some of the strengths of the von Neumann model which temporarily closed this box three decades ago become especially apparent when one tries to replace it. This overview treats the promises and accomplishments of parallel processing as well as the problems and work that remain. The paper is organized as follows: Current driving forces for parallel processing; Definitions and fundamental questions; Survey of projects; Emerging answers. As will be shown, the field is at an interesting juncture. Much work has been done, and the ideas are now there for putting it all together. But some large experiments are needed to provide real results from real programs if the pace of progress is to be maintained.
TL;DR: A method for the large-scale numerical simulation of fluid flow and fundamental principles of vector programming in FORTRAN are discussed in order to set the stage for the main topic, the vector coding and execution of the finite-volume procedure on the CYBER 205.
Abstract: The paper reviews a method for the large-scale numerical simulation of fluid flow and discusses fundamental principles of vector programming in FORTRAN in order to set the stage for the main topic, the vector coding and execution of the finite-volume procedure on the CYBER 205. With the proper structure given to the data by the grid transformation, each coordinate direction can be differenced throughout the entire grid in one vector operation. Boundary conditions must be interleaved, which tends to inhibit the concurrency of the overall scheme, but a strategy of no data motion together with only inner-loop vectorization is judged to be the best compromise. The computed example of transonic vortex flow separating from the sharp leading edge of a delta wing demonstrates the processing performance of the procedure. Vectors over 40000 elements long are obtained, and a rate of over 125 megaflops sustained over the entire computation indicates the high degree of vectorization achieved.
TL;DR: A cellular programming language, named CEPROL, is presented which offers means for programming and controlling cellular automata that process systolic algorithms.
Abstract: Realized cellular automata may be operated by universal computer systems as programmable special-purpose processors for parallelizable problems. Because of their architecture (local neighbourhood, small storage size per cell), they are well suited for processing systolic algorithms. A cellular programming language, named CEPROL, is presented which offers means for programming and controlling cellular automata processing such algorithms.
TL;DR: A modest collection of primitives for synchronization and control in parallel numerical algorithms is proposed, phrased in a syntax that is compatible with FORTRAN, creating a publication language for parallel software.
Abstract: We propose a modest collection of primitives for synchronization and control in parallel numerical algorithms. These are phrased in a syntax that is compatible with FORTRAN, creating a publication language for parallel software. A preprocessor may be used to map code written in this extended FORTRAN into standard FORTRAN with calls to the run-time libraries of the various parallel systems now in use. We solicit the reader's comments on the clarity, as well as the adequacy, of the primitives we have proposed.
TL;DR: A model of a general class of asynchronous, iterative solution methods for linear systems is developed and a data transfer model predicting both the probability that data must be transferred between two tasks and the amount of data to be transferred is presented.
Abstract: A model of a general class of asynchronous, iterative solution methods for linear systems is developed. In the model, the system is solved by creating several cooperating tasks that each compute a portion of the solution vector. A data transfer model predicting both the probability that data must be transferred between two tasks and the amount of data to be transferred is presented. This model is used to derive an execution time model for predicting parallel execution time and an optimal number of tasks given the dimension and sparsity of the coefficient matrix and the costs of computation, synchronization, and communication. The suitability of different parallel architectures for solving randomly sparse linear systems is discussed. Based on the complexity of task scheduling, one parallel architecture, based on a broadcast bus, is presented and analyzed.
TL;DR: The key components that give the FACOM VP system its ease of use and higher performance are described, together with actual performance figures on the VP system.
Abstract: FUJITSU has developed pipelined supercomputers, the FACOM VP-100/200, with the latest technology and new architecture. Based on extensive analyses of application programs, the following advanced features are employed in the VP system: 1) dynamically reconfigurable vector registers with large capacity, 2) efficient vector operations for vectorizing IF-statements in DO-loops, 3) high level concurrency for parallel scalar-vector and vector-vector operations, 4) powerful vectorizing compiler for utilizing the advanced features, 5) effective tuning tools to extract higher performance of application programs, and 6) keeping good affinity with general-purpose computer systems. The final goals of development of the VP system are both ease of use and higher performance for various scientific and engineering applications. These goals have been achieved successfully. This paper describes the key components to establish the ease of use and higher performance, and also describes actual performances on the VP system.
TL;DR: This paper establishes a one-to-one correspondence between the set of nodes that possess a right sibling and the set of leaf nodes for any forest, which leads to a simple parallel algorithm for pre-order traversal.
Abstract: Three commonly used traversal methods for binary trees (forests) are pre-order, in-order and post-order. It is well known that sequential algorithms for these traversals take O(N) time, where N is the total number of nodes. This paper establishes a one-to-one correspondence between the set of nodes that possess a right sibling and the set of leaf nodes for any forest. For the case of pre-order traversal, this result is shown to provide an alternate characterization that leads to a simple and elegant parallel algorithm of time complexity O(log N) with or without read-conflicts on an N-processor SIMD shared memory model, where N is the total number of nodes in a forest.
TL;DR: Concurrency issues of programming languages are surveyed, including synchronization and communication by means of semaphores, messages and mailboxes, and monitors, and the concurrency aspects of ADA are presented as a case study of a state-of-the-art programming language.
Abstract: This paper surveys concurrency issues of programming languages. The evolution of these issues is analyzed in the context of the evolution of other language concepts, such as data and control abstraction. Specific concurrency concepts discussed in the paper include: granularity of parallelism, degree of parallelism, synchronization and communication, and physical distribution. The review of the problems of synchronization and communication includes semaphores, messages and mailboxes, and monitors. Concurrency aspects of ADA are also presented as a case study of a state-of-the-art programming language.
TL;DR: Two parallel algorithms for determining the convex hull of a set of data points in two dimensional space are presented and experimental results on a MIMD parallel system of 4 processors are analysed and presented.
Abstract: Two parallel algorithms for determining the convex hull of a set of data points in two dimensional space are presented. Both are suitable for MIMD parallel systems. The first is based on the strategy of divide-and-conquer, in which some simplest convex-hulls are generated first and then the final convex hull of all points is achieved by the processes of merging 2 sub-convex hulls. The second algorithm is by the process of picking up the points that are necessarily in the convex hull and discarding the points that are definitely not in the convex hull. Experimental results on a MIMD parallel system of 4 processors are analysed and presented.
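The divide-and-conquer strategy can be sketched as follows. This is a sequential illustration under stated assumptions, not the paper's algorithm: the merge step here simply recomputes the hull of the two sub-hulls' vertices with Andrew's monotone chain, and the names `cross`, `hull_direct`, `hull_divide_and_conquer`, and the `leaf_size` parameter are hypothetical.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_direct(points):
    """Convex hull by Andrew's monotone chain, counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def _hull_sorted(pts, leaf_size):
    """Recursive helper; `pts` is sorted and free of duplicates."""
    if len(pts) <= leaf_size:
        return hull_direct(pts)      # a "simplest" hull, built directly
    mid = len(pts) // 2
    # The two halves are independent, so each could go to its own processor.
    left = _hull_sorted(pts[:mid], leaf_size)
    right = _hull_sorted(pts[mid:], leaf_size)
    # Merge two sub-hulls: only their vertices can be on the combined hull.
    return hull_direct(left + right)


def hull_divide_and_conquer(points, leaf_size=4):
    return _hull_sorted(sorted(set(points)), leaf_size)
```

For a 4 x 4 integer grid, `hull_divide_and_conquer([(x, y) for x in range(4) for y in range(4)])` returns the four corner points. On a MIMD system the recursive calls would be distributed among processors and the merges performed pairwise.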
TL;DR: A hybrid granularity model is proposed for general concurrent solution and relevance to a many-processor CRAY X-MP is demonstrated by simulation.
Abstract: A hybrid granularity model is proposed for general concurrent solution. It is applied to the triangular factorization of a dense matrix ranging in size from 4 to 1024. Concurrency is achieved at two levels: (1) with small (micro) task granularity and (2) with large (blocked) task granularity. Relevance to a many-processor CRAY X-MP is demonstrated by simulation.
TL;DR: An alternative approach, based on function-based computing, is reviewed that to a large degree eliminates or avoids much of the Von Neumann bottleneck, and offers opportunities for the exploitation of parallelism in ways not even conceivable in classical computing.
Abstract: One of today's most popular computing folk theorems states that true parallel processing and conventional computing techniques are mutually incompatible. The term Von Neumann bottleneck summarizes what many feel are the basic stumbling blocks preventing the successful application of parallelism in day-to-day computing. This paper reviews an alternative approach, based on function-based computing, that to a large degree eliminates or avoids much of the Von Neumann bottleneck, and offers opportunities for the exploitation of parallelism in ways not even conceivable in classical computing. Topics covered include a review of the Von Neumann bottleneck and imperative languages, the mathematical foundation of functional computing, namely lambda calculus, how this foundation provides opportunities for parallelism, and characteristics of the design space for implementation of these concepts in real computing hardware.
TL;DR: A programming methodology for multiprocessors that leads to well-structured code, ease of debugging, and, most important, portability among multiprocessors offering different synchronization primitives is described.
Abstract: We describe here a programming methodology for multiprocessors that leads to well-structured code, ease of debugging, and, most important, portability among multiprocessors offering quite different synchronization primitives. The emphasis in this paper is on the implementation of this methodology for the Lemur, an eight-processor machine built at Argonne National Laboratory. Included are several complete programs illustrating the methodology.
TL;DR: It seems possible to create a Standard-Processor STP that unifies the many different operation modes in computing; its performance will be higher on average over all modes than could be achieved, e.g., by transposing a typical APP-Problem onto a conventional GPP-Processor.
Abstract: The question is raised whether flexibility of computer structures, which proved to be a fruitful concept in computer history, can be extended to a selectable use of different operation modes like General Purpose Processor (GPP), High Level Language Processor (HLL), Reduction Automation (RED), Data Flow Processor (FLO), Associative Parallel Processor (APP), Cellular Automation (CEL), and, e.g., Digital Differential Analyser (DDA). It is argued that all these principles (each one having a certain merit) are not incompatible in principle. Instead, it seems to be possible to create a Standard-Processor STP, which unifies the many different operation modes. These modes are made selectable by the programmer. The resulting performance will not be the highest possible one with respect to one specific operation mode. Nevertheless, the performance will be higher on average over all modes than could be achieved, e.g., if one tries to transpose a typical APP-Problem onto a conventional GPP-Processor (or to transpose in the reverse direction!). The STP is not designed in detail. The paper is intended rather as a stimulus to investigate a universal hardware set of registers, control, and logic circuits which admit quite different interpretation modes in computing.
TL;DR: A simple taxonomy for the interconnected computer systems is presented by using the address space or buffer type as the key identifying element to distinguish the major difference between multicomputer and multiprocessor systems.
Abstract: This paper presents a simple taxonomy for the interconnected computer systems by using the address space or buffer type as the key identifying element. The main aim of this classification is to distinguish the major difference between multicomputer and multiprocessor systems and to derive the definitions for the same.
TL;DR: An overview of Japanese research and development efforts on parallel processing architectures is given, with general trends and some examples of research projects for each application domain, such as artificial intelligence, numerical processing, and others like database, image, and graphics processing.
Abstract: This paper gives an overview of Japanese research and development efforts on the parallel processing architectures. Projects are categorized by their application domains. Following an introduction, general trends and some examples of research projects for each of the application domains such as artificial intelligence, numerical processing, and others like database, image, graphics, etc. are presented.
TL;DR: The parallel approaches to AI are divided into three broad categories, though the boundaries between them are often fuzzy: the general programming approach, applications of parallelism to the processing of specialized programming languages, and massively parallel active memory systems.
Abstract: Intelligence, whether in a machine or in a living creature, is a mixture of many abilities. Our current artificial intelligence (AI) technology does a good job of emulating some aspects of human intelligence, generally those things that, when they are done by people, seem to be serial and conscious. AI is very far from being able to match other human abilities, generally those things that seem to happen “in a flash” and without any feeling of sustained mental effort. We are left with an unbalanced technology that is powerful enough to be of real commercial value, but that is very far from exhibiting intelligence in any broad, human-like sense of the word. It is ironic that AI’s successes have come in emulating the specialized performance of human experts, and yet we cannot begin to approach the common sense of a five-year-old child or the sensory abilities and physical coordination of a rat.
TL;DR: This paper generalizes the traditional dataflow model of computation and defines the essential problems in multiprocessing: control implementation, program partitioning, scheduling, synchronization, and memory access.
Abstract: This paper generalizes the traditional dataflow model of computation and defines the essential problems in multiprocessing: control implementation, program partitioning, scheduling, synchronization, and memory access. The paper assumes that these essential problems are axes of a multiprocessor design space and that the solutions to these problems are values on the axes. Each point in the space represents a multiprocessor including a computational paradigm that a user must follow to achieve high performance and efficiency on the particular machine. Thus, a classification of machines from the user's point of view is introduced naturally. Five well-known multiprocessors are compared using this classification scheme.
TL;DR: The first method transforms products to sums and applies one of the known methods for rounding exact summation in time complexity O(n^2) with n processors (n denoting the "length" of the expression).
Abstract: We propose two parallel algorithms for the rounding exact evaluation of sums of products. The first method transforms products to sums and applies one of the known methods for rounding exact summation in time complexity O(n^2) with n processors (n denoting the "length" of the expression). The second method approximates the products as well as the sum and has average time complexity O(ld(n)) for n/2 processors and has average time complexity O(n) viewed as a sequential algorithm.
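The idea of transforming products to sums can be sketched sequentially as follows. This is an illustration under stated assumptions, not the paper's parallel algorithm: each product is split into a rounded value plus an exact error term with Dekker's error-free product (Veltkamp splitting), and the resulting sum is rounded once with Python's `math.fsum`. It assumes IEEE 754 doubles and no overflow or underflow; the names `two_product`, `exact_rounded_dot`, and `reference_dot` are hypothetical.

```python
import math
from fractions import Fraction

_SPLITTER = 134217729.0  # 2**27 + 1, splitting constant for 53-bit doubles


def _split(a):
    """Veltkamp split: a == hi + lo, each half fitting in 26 bits."""
    c = _SPLITTER * a
    hi = c - (c - a)
    lo = a - hi
    return hi, lo


def two_product(a, b):
    """Return (p, e) with p = fl(a * b) and a * b == p + e exactly."""
    p = a * b
    a_hi, a_lo = _split(a)
    b_hi, b_lo = _split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e


def exact_rounded_dot(xs, ys):
    """Sum of products x_i * y_i, rounded once at the very end."""
    terms = []
    for a, b in zip(xs, ys):
        # Each product becomes two summands; these steps are independent,
        # so they could be distributed across processors.
        terms.extend(two_product(a, b))
    return math.fsum(terms)  # correctly rounded sum of the summands


def reference_dot(xs, ys):
    """Check value computed with exact rational arithmetic."""
    return float(sum(Fraction(a) * Fraction(b) for a, b in zip(xs, ys)))
```

With `xs = [1e16, 1.0, -1e16]` and `ys = [1.0, 1.0, 1.0]`, naive left-to-right accumulation loses the middle term, whereas `exact_rounded_dot(xs, ys)` agrees with `reference_dot(xs, ys)`.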
TL;DR: The algorithm is designed to be particularly suited for parallel computation, in which floating-point multiplication, floating-point addition and fixed-point arithmetic can be performed simultaneously.
Abstract: An algorithm is presented for finding x^(-1/2), given x. The algorithm is designed to be particularly suited for parallel computation, in which floating-point multiplication, floating-point addition and fixed-point arithmetic can be performed simultaneously.
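For orientation, one common way to compute x^(-1/2) uses only multiplications, additions, and integer exponent handling. The abstract does not specify the paper's algorithm, so the sketch below substitutes a standard Newton iteration, y <- y(1.5 - 0.5 m y^2), for it. The function name `rsqrt`, the starting value 1.0, and the iteration count are assumptions made for this example.

```python
import math


def rsqrt(x, iterations=10):
    """Approximate x**(-1/2) for x > 0 by Newton iteration."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    # Integer (fixed-point) part: split off the exponent, x = m * 2**e.
    m, e = math.frexp(x)          # 0.5 <= m < 1
    if e % 2:                     # make the exponent even
        m *= 2.0                  # now 0.5 <= m < 2
        e -= 1
    # Floating-point part: only multiplications and additions.
    y = 1.0                       # starting value; converges for m < 3
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * m * y * y)
    # x**(-1/2) = m**(-1/2) * 2**(-e/2), with e even.
    return math.ldexp(y, -e // 2)
```

The exponent adjustment is independent of the mantissa iteration, which loosely reflects the abstract's point that fixed-point and floating-point work can overlap. For example, `rsqrt(0.25)` yields 2.0.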
TL;DR: By choosing a particular representation, the grid file, and analyzing its behaviour, this work wants to point out the difficulties encountered in trying to achieve speed improvements from a multiprocessor.
Abstract: By using a multiprocessor to implement the lowest level of a relational database we want to achieve fast execution of database operations such as join, find, and update. But the potential speed improvements provided by a multiprocessor can only be achieved if one can construct algorithms and corresponding physical data representations that can utilize the potential. By choosing a particular representation, the grid file, and analyzing its behaviour, we want to point out the difficulties encountered in trying to achieve speed improvements from a multiprocessor.
TL;DR: The problems of designing such MMPSs are discussed as well as some realisations of a data exchange module as a register module and some algorithms for parallel data exchange between the MPMs.
Abstract: In SIMD/MIMD functionally reconfigurable multimicroprocessor systems (MMPS), some of the microprocessor modules (MPMs) can execute a common program (SIMD mode) while the rest of the MPMs execute their own programs (MIMD mode). Every MPM can be functionally reconfigured from one mode to the other at any moment. In this paper the problems of designing such MMPSs are discussed, as well as some realisations of a data exchange module as a register module and some algorithms for parallel data exchange between the MPMs. A hierarchically structured MMPS is developed.
TL;DR: This paper proposes a method to merge two communicating sequential processes (that would be adjacent in the pipeline) into one communicating sequential process by matching the output expressions of the first communicating sequential process with the appropriate input expressions of the second.
Abstract: The segments of a pipelined process can be represented as communicating sequential processes. The communication between the segments of the pipeline is represented as channel communication between the communicating sequential processes. It is possible to merge two communicating sequential processes (that would be adjacent in the pipeline) into one communicating sequential process. This is done by matching the output expressions of the first communicating sequential process (e.g. ch!expr) with the appropriate input expressions of the second communicating sequential process (e.g. ch?var) and replacing each pair by a single assignment statement (var = expr).
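The merging step can be illustrated with a small sketch. It is a hypothetical example rather than the paper's formal method: two pipeline segments are modelled as Python threads connected by a `queue.Queue` standing in for the channel `ch` (a buffered queue, whereas CSP channels are synchronous), and the merged version replaces each send/receive pair by one assignment. The function names, the `None` end-of-stream sentinel, and the stage computations are assumptions.

```python
import queue
import threading


def pipeline_two_processes(inputs):
    """Two adjacent pipeline segments communicating over channel `ch`."""
    ch = queue.Queue()
    results = []

    def first_segment():
        for x in inputs:
            ch.put(x * x)          # ch!expr, with expr = x * x
        ch.put(None)               # sentinel marking end of stream (assumed)

    def second_segment():
        while True:
            var = ch.get()         # ch?var
            if var is None:
                break
            results.append(var + 1)

    workers = [threading.Thread(target=first_segment),
               threading.Thread(target=second_segment)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


def merged_process(inputs):
    """The same computation after merging the two segments into one."""
    results = []
    for x in inputs:
        var = x * x                # the pair ch!expr / ch?var becomes var = expr
        results.append(var + 1)
    return results
```

Both `pipeline_two_processes([1, 2, 3])` and `merged_process([1, 2, 3])` produce the same list, which is the behaviour the merge is meant to preserve.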