Top 136 papers presented at Parallel Computing in 1988

Journal Article•10.1016/0167-8191(88)90098-1•

Evolution algorithms in combinatorial optimization

Heinz Mühlenbein, Martina Gorges-Schleuter, Ottmar Krämer

1 Apr 1988

TL;DR: A new genetic algorithm which relies on intelligent evolution of individuals is presented, which is inherently parallel and shows a superlinear speedup in multiprocessor systems.

Abstract: Evolution algorithms for combinatorial optimization have been proposed in the 70's. They did not have a major influence. With the availability of parallel computers, these algorithms will become more important. In this paper we discuss the dynamics of three different classes of evolution algorithms: network algorithms derived from the replicator equation, Darwinian algorithms and genetic algorithms inheriting genetic information. We present a new genetic algorithm which relies on intelligent evolution of individuals. With this algorithm, we have computed the best solution of a famous travelling salesman problem. The algorithm is inherently parallel and shows a superlinear speedup in multiprocessor systems.

435 citations
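
A minimal sketch of the idea of individuals that improve themselves before recombining, applied to a travelling salesman tour. Everything here is an assumption made for illustration: 2-opt is used as the local improvement step, order crossover as the recombination, and the names `two_opt`, `order_crossover` and `evolve` are hypothetical. It is not the authors' algorithm and it runs sequentially.

```python
import math
import random


def tour_length(tour, pts):
    """Length of the closed tour visiting pts in the given order."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))


def two_opt(tour, pts):
    """Local improvement: reverse segments for as long as that shortens the tour."""
    best = tour_length(tour, pts)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                length = tour_length(cand, pts)
                if length < best - 1e-12:
                    tour, best, improved = cand, length, True
    return tour


def order_crossover(a, b, rng):
    """Keep a slice of parent a in place, fill the rest in the order of parent b."""
    i, j = sorted(rng.sample(range(len(a)), 2))
    middle = a[i:j]
    rest = [c for c in b if c not in middle]
    return rest[:i] + middle + rest[i:]


def evolve(pts, pop_size=8, generations=10, seed=0):
    """Evolve a population of locally improved tours and return the shortest."""
    rng = random.Random(seed)
    n = len(pts)
    pop = [two_opt(rng.sample(range(n), n), pts) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: tour_length(t, pts))
        parents = pop[:pop_size // 2]
        children = [
            two_opt(order_crossover(*rng.sample(parents, 2), rng), pts)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return min(pop, key=lambda t: tour_length(t, pts))
```

Each call to `two_opt` is independent of the others within a generation, which is where a parallel version could distribute the work.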

Journal Article•10.1016/0167-8191(88)90002-6•

SUPERB: A tool for semi-automatic MIMD/SIMD parallelization

Hans P. Zima¹, Heinz-J Bast¹, Michael Gerndt¹•Institutions (1)

University of Bonn¹

1 Jan 1988

TL;DR: The design of an interactive system for the semi-automatic transformation of FORTRAN 77 programs into parallel programs for the SUPRENUM machine is described; the system is characterized by a powerful analysis component, a catalog of MIMD and SIMD parallelization transformations, and a flexible dialog facility.

Abstract: This paper describes the design of an interactive system for the semi-automatic transformation of FORTRAN 77 programs into parallel programs for the SUPRENUM machine. The system is characterized by a powerful analysis component, a catalog of MIMD and SIMD parallelization transformations, and a flexible dialog facility. It contains specific knowledge about the parallelization of an important class of numerical algorithms.

384 citations

Journal Article•10.1016/0167-8191(88)90094-4•

A single-program-multiple-data computational model for EPEX/FORTRAN

Frederica Darema¹, David A. George¹, Vern Alan Norton¹, Gregory Francis Pfister¹•Institutions (1)

IBM¹

1 Apr 1988

TL;DR: A single-program-multiple-data computational model, implemented in the EPEX system to run FORTRAN scientific application programs in parallel mode, is presented, and its applicability to the parallelization of several applications is demonstrated.

Abstract: We present a single-program-multiple-data computational model which we have implemented in the EPEX system to run FORTRAN scientific application programs in parallel mode. The computational model assumes a shared memory organization and is based on the scheme that all processes executing a program in parallel remain in existence for the entire execution; however, the tasks to be executed by each process are determined dynamically during execution by the use of appropriate synchronizing constructs that are embedded in the program. We have demonstrated the applicability of the model in the parallelization of several applications. We discuss parallelization features of these applications and performance issues such as overhead, speedup, and efficiency.

210 citations
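
The scheme in this abstract, where every process lives for the whole run and picks up its tasks dynamically through a synchronizing construct, can be sketched as follows. This is an illustration only: it uses Python threads and a lock-protected shared counter rather than EPEX/FORTRAN constructs, and `run_spmd` is a hypothetical name.

```python
import threading


def run_spmd(num_procs, num_tasks, work):
    """Run the same code in num_procs threads; tasks are claimed at run time."""
    results = [None] * num_tasks
    next_task = 0                      # shared memory: index of the next unclaimed task
    lock = threading.Lock()            # synchronizing construct guarding next_task

    def process():
        nonlocal next_task
        while True:
            with lock:                 # claim one task atomically
                task = next_task
                next_task += 1
            if task >= num_tasks:      # nothing left: this process finishes
                return
            results[task] = work(task)

    procs = [threading.Thread(target=process) for _ in range(num_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return results
```

For example, `run_spmd(4, 20, lambda i: i * i)` fills the result list with the squares of 0 to 19, whichever thread happens to claim each index.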

Journal Article•10.1016/0167-8191(88)90070-1•

Parallel Gaussian elimination on an MIMD computer

Michel Cosnard, Mounir Marrakchi, Yves Robert, Denis Trystram¹•Institutions (1)

École Centrale Paris¹

1 Mar 1988

TL;DR: It is shown that the SAXPY, GAXPY and DOT algorithms of Dongarra, Gustavson and Karp, as well as parallel versions of the LDM^t, LDL^t, Doolittle and Cholesky algorithms, can be classified into four task graph models.

Abstract: This paper introduces a graph-theoretic approach to analyse the performances of several parallel Gaussian-like triangularization algorithms on an MIMD computer. We show that the SAXPY, GAXPY and DOT algorithms of Dongarra, Gustavson and Karp, as well as parallel versions of the LDM^t, LDL^t, Doolittle and Cholesky algorithms, can be classified into four task graph models. We derive new complexity results and compare the asymptotic performances of these parallel versions.

118 citations

Journal Article•10.1016/0167-8191(88)90037-3•

An analysis of the computational and parallel complexity of the Livermore Loops

John Feo¹•Institutions (1)

Lawrence Livermore National Laboratory¹

1 Jun 1988

TL;DR: This paper presents and analyzes the computational and parallel complexity of the Livermore Loops and addresses the concern that their computations must be understood thoroughly, so that efficient implementations may be written.

Abstract: This paper presents and analyzes the computational and parallel complexity of the Livermore Loops. The Loops represent the type of computational kernels typically found in large-scale scientific computing and have been used to benchmark computer systems since the mid-60's. On parallel systems, a process's computational structure can greatly affect its efficiency. If the loops are to be used to benchmark such systems, their computations must be understood thoroughly, so that efficient implementations may be written. This paper addresses that concern.

84 citations

Book Chapter•10.1007/3-540-51604-2_6•

Inversion = Migration + Tomography

Peter Mora

1 Jun 1988

TL;DR: It is proposed that an iterative inversion using a varying background velocity obtains all wavenumbers that are resolvable separately by migration and tomography; reflectors in the background model act as sources and receivers within the Earth, so the coverage is effectively the same as in medical imaging.

Abstract: Seismic inversion, broadly enough defined, is equivalent to doing migration and reflection tomography simultaneously. Diffraction tomography and inversion work best when sources and receivers surround the region of interest, as in medical imaging applications. Theoretical studies typically show that the high vertical wavenumber velocity perturbations are resolved in seismic reflection experiments where the sources and receivers are restricted to the Earth's surface but low vertical wavenumbers must be obtained using a separate step such as a velocity analysis or reflection tomography. I propose that an iterative inversion using a varying background velocity obtains all wavenumbers that are resolvable separately by migration and tomography. (The background velocity must contain abrupt discontinuities.) Reflectors in the background model simulate sources and receivers within the Earth so the source and receiver coverage in seismic reflection inverse problems is effectively the same as in medical imaging. Some synthetic examples verify the theoretical predictions and show that reflector locations and interval velocities can be obtained simultaneously.

81 citations

Journal Article•10.1016/0167-8191(88)90080-4•

State-of-the-art in parallel nonlinear optimization

Freerk A. Lootsma¹, K. M. Ragsdell²•Institutions (2)

Delft University of Technology¹, University of Missouri²

1 Feb 1988

TL;DR: This survey focuses on promising approaches for solving large, well-structured constrained problems: dualization of problems with separable objective and constraint functions, and decomposition of hierarchical problems with linking variables.

Abstract: This survey is concerned with variants of nonlinear optimization methods designed for implementation on parallel computers. First, we consider a variety of methods for unconstrained minimization. We consider a particular type of parallelism (simultaneous function and gradient evaluations), and we concentrate on the main sources of inspiration: conjugate directions, homogeneous functions, variable-metric updates, and multi-dimensional searches. The computational process for solving small and medium-size constrained optimization problems is usually based on unconstrained optimization. This provides a straightforward opportunity for the introduction of parallelism. In the present survey, however, we focus on promising approaches for solving large, well-structured constrained problems: dualization of problems with separable objective and constraint functions, and decomposition of hierarchical problems with linking variables (typical for Benders decomposition in the linear case). Finally, we outline the key issues in future computational studies of parallel nonlinear optimization algorithms.

76 citations
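
One simple reading of "simultaneous function evaluations" is a finite-difference gradient whose function calls are all issued at once. The sketch below assumes central differences and a thread pool; the name `parallel_gradient` and the step size `h` are hypothetical choices, not taken from the survey.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_gradient(f, x, h=1e-6, workers=4):
    """Central-difference gradient of f at x, with all evaluations submitted together."""

    def shifted(k, sign):
        y = list(x)
        y[k] += sign * h               # perturb only coordinate k
        return f(y)

    # Two independent evaluations per coordinate: f(x + h e_k) and f(x - h e_k).
    jobs = [(k, s) for k in range(len(x)) for s in (1, -1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        vals = list(pool.map(lambda ks: shifted(*ks), jobs))
    return [(vals[2 * k] - vals[2 * k + 1]) / (2 * h) for k in range(len(x))]
```

With `f` as the sum of squares, the gradient at `[1.0, 2.0, 3.0]` comes out close to `[2, 4, 6]`. Threads only pay off here when `f` itself releases the interpreter lock or is expensive enough to justify a process pool.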

Journal Article•10.1016/0167-8191(88)90021-X•

1988 International conference on supercomputing

W.E Nagel

1 Dec 1988

65 citations

Journal Article•10.1016/0167-8191(88)90009-9•

Parallel solution of triangular systems of equations

Charles H Romine¹, James M. Ortega¹•Institutions (1)

University of Virginia¹

1 Jan 1988

TL;DR: It is shown that the classical inner product algorithms can be nearly as efficient as the usual column sweep algorithm in solving triangular systems with column storage.

Abstract: The solution on parallel computers of systems of equations Lx = b, where L is lower triangular, is considered. Some authors have suggested that it is difficult to solve such systems in parallel on message-passing, local-memory machines when L is stored by columns. It is shown here that this is not necessarily the case if the machine can accomplish fan-in communication with reasonable efficiency.

56 citations
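
For orientation, the two sequential orderings usually meant by "column sweep" and "inner product" for Lx = b can be written as below. This is a textbook sketch with hypothetical function names, not the parallel algorithms of the paper. The first loop reads L by columns and the second by rows, which is why the storage scheme matters on local-memory machines.

```python
def solve_column_sweep(L, b):
    """Forward substitution, column oriented: once x[j] is known, update all later rows."""
    n = len(b)
    x = [float(v) for v in b]
    for j in range(n):
        x[j] /= L[j][j]
        for i in range(j + 1, n):      # uses column j of L
            x[i] -= L[i][j] * x[j]
    return x


def solve_inner_product(L, b):
    """Forward substitution, row oriented: x[i] needs the inner product of row i with x."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))   # uses row i of L
        x[i] = (b[i] - s) / L[i][i]
    return x
```

When L is distributed by columns, the inner product in the second variant has terms living on different processors, so it has to be accumulated by a fan-in style reduction.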

Journal Article•10.1016/0167-8191(88)90087-7•

Gray codes, Fast Fourier Transforms and hypercubes

R.M. Chamberlain¹•Institutions (1)

Christian Michelsen Institute¹

1 Feb 1988

TL;DR: A number of results on Gray codes which characterise a certain family of Gray codes are presented, and it is shown that Fast Fourier Transforms on data distributed among processors according to a Gray code can also be efficiently implemented on the hypercube.

Abstract: Fast Fourier Transforms are a widely-used and powerful tool for the analysis and solution of many problems. They have been used in such diverse areas as medicine, acoustics, image processing, system design and many other fields. By transforming the data the problem may be simpler, more tractable or more efficiently solved and for many applications (e.g. speech processing) the data may be much more easily understood in the transform domain. Therefore fast algorithms for implementing transforms are vital for any powerful computer. This paper describes the implementation of a Fast Fourier Transform on a 64-node INTEL hypercube and shows how the hypercube architecture may be efficiently used. Usually the FFT is only a part of the solution process and the data on the hypercube has to be arranged in a certain manner for the efficient solution of the whole problem. A common way is for the data to be distributed according to a Gray code, so that neighbouring points in the domain are in neighbouring processors. We present a number of results on Gray codes which characterise a certain family of Gray codes, and show that Fast Fourier Transforms on data distributed among processors according to a Gray code can also be efficiently implemented on the hypercube.

49 citations
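
The property that makes a Gray code useful on a hypercube is that consecutive codes differ in exactly one bit, so consecutive data points land on directly connected nodes. The sketch below uses the binary reflected Gray code, which is only one member of the family of codes the paper characterises; the function names are hypothetical.

```python
def gray(i):
    """Binary reflected Gray code of i."""
    return i ^ (i >> 1)


def gray_inverse(g):
    """Recover i from its binary reflected Gray code."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i


def node_for_point(i):
    """Hypercube node that holds data point i under a Gray-code distribution."""
    return gray(i)
```

On a 64-node hypercube (6 address bits), points `i` and `i + 1` map to node numbers that differ in a single bit, which is exactly the condition for two hypercube nodes to be neighbours.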

Journal Article•10.1016/0167-8191(88)90048-8•

SUPRENUM: A trendsetter in modern supercomputer development

Wolfgang Giloi¹•Institutions (1)

Technical University of Berlin¹

1 Sep 1988

TL;DR: An outlook on the performance limits of a future supercomputer architecture of the SUPRENUM type, a message-based system designed to handle a secured message exchange with a considerable degree of hardware support and with the lowest possible software overhead, is presented.

Abstract: The designer of a numerical supercomputer is confronted with fundamental design decisions stemming from some basic dichotomies in supercomputer technology and architecture. On the side of the hardware technology there exists the dichotomy between the use of very high-speed circuitry or very large-scale integrated circuitry. On the side of the architecture there exists the dichotomy between the SIMD vector machine and the MIMD multiprocessor architecture. In the latter case, the ‘nodes’ of the system may communicate through shared memory, or each node has only private memory, and communication takes place through the exchange of messages. All these design decisions have implications with respect to performance, cost-effectiveness, software complexity, and fault-tolerance. In the paper the various dichotomies are discussed and a rationale is provided for the decision to realize the SUPRENUM supercomputer, a large ‘number cruncher’ with 5 Gflops peak performance, in the form of a massively parallel MIMD/SIMD multicomputer architecture. In its present incorporation, SUPRENUM is configurable to up to 256 nodes, where each node is a pipeline vector machine with 20 Mflops peak performance, IEEE double precision. The crucial issues of such an architecture, which we consider the trendsetter for future numerical supercomputer architecture in general, are on the hardware side the need for a bottleneck-free interconnection structure as well as the highest possible node performance obtained with the highest possible packaging density, in order to accommodate a node on a single circuit board. On the side of the system software the design goal is to obtain an adequately high degree of operational safety and data security with minimum software overhead. On the side of the user an appropriate program development environment must be provided. 
Last but not least, the system must exhibit a high degree of fault tolerance, if for nothing else but for the sake of obtaining a sufficiently high MTBF. In the paper a detailed discussion of the hardware and software architecture of the SUPRENUM supercomputer, whose design is based upon the considerations discussed, is presented. A largely bottleneck-free interconnection structure is accomplished in a hierarchical manner: the machine consists of up to 16 ‘clusters’, and each cluster consists of 16 working ‘nodes’ plus some organisational nodes. The node is accommodated on a single circuit board; its architecture is based on the principle of data structure architecture explained in the paper. SUPRENUM is strictly a message-based system; consequently, the local node operating system has been designed to handle a secured message exchange with a considerable degree of hardware support and with the lowest possible software overhead. SUPRENUM is organized as a distributed system—a prerequisite for the high degree of fault tolerance required; therefore, there exists no centralized global operating system. The paper concludes with an outlook on the performance limits of a future supercomputer architecture of the SUPRENUM type.

Journal Article•10.1016/0167-8191(88)90042-7•

An optimal parallel algorithm for solving the maximal elements problem in the plane

Ivan Stojmenovic¹, Masahiro Miyakawa•Institutions (1)

University of Novi Sad¹

1 Jun 1988

TL;DR: An optimal algorithm is described that finds the maximal elements of a set in O(log n) time with O(n) processors on the CREW-PRAM model of parallel computation.

Abstract: We describe an optimal algorithm, running in O(log n) time with O(n) processors, for finding the maximal elements of a set. The model of parallel computation we consider is the CREW-PRAM, i.e. it is the synchronous shared memory model where concurrent reads are allowed but no two processors can simultaneously attempt to write in the same memory location (even if they are trying to write the same thing).
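
To make the problem concrete: a point in the plane is maximal if no other point is at least as large in both coordinates. The sketch below is the usual sequential sort-and-sweep solution, given only to define the output; it is not the CREW-PRAM algorithm of the paper, and it assumes the input points are distinct.

```python
def maximal_elements(points):
    """Return the maximal points, ordered by decreasing x."""
    result = []
    best_y = float("-inf")
    # Visit points from right to left; on equal x, the higher point comes first.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_y:                 # nothing to the right is at least as high
            result.append((x, y))
            best_y = y
    return result
```

For the points (1,5), (2,3), (3,4), (4,1) and (2,2), the maximal elements are (4,1), (3,4) and (1,5).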

Journal Article•10.1016/0167-8191(88)90095-6•

The instruction systolic array and its relation to other models of parallel computers

Manfred Kunde¹, Hans-Werner Lang¹, Manfred Schimmler¹, Hartmut Schmeck¹, Heiko Schröder¹•Institutions (1)

University of Kiel¹

1 Apr 1988

TL;DR: The results show that the ISA concept combines the advantages of standard systolic arrays with those of the MIMD concept; since the ISA architecture also has smaller area requirements than a corresponding systolic array or MIMD machine, it is of strong practical relevance.

Abstract: In this paper we investigate the relationships between three different models of parallel computers based on mesh-connected arrays: the processor array (PA), which is an MIMD-array of independent processors, the instruction broadcasting array (IBA), where the instructions are broadcast to all the processors of a column and executed according to selector information which is broadcast to all the processors of a row, and the instruction systolic array (ISA), where the instructions are pumped through the array row by row and combined with selector information which is pumped through the array column by column. For every two of these models we determine tight bounds on the worst-case delay introduced by a transformation of a program on one model into an equivalent program on the other. The results show that the ISA concept combines the advantages of standard systolic arrays with those of the MIMD concept. Since in addition the ISA architecture has smaller area requirements than a corresponding systolic array or MIMD machine, it is of strong practical relevance.

Journal Article•10.1016/0167-8191(88)90078-6•

A two-layered mesh array for matrix multiplication

Subhash Kak¹•Institutions (1)

Louisiana State University¹

1 Mar 1988

TL;DR: A two-layered mesh array for matrix multiplication that computes the matrix product faster than the standard array is presented.

Abstract: A two-layered mesh array for matrix multiplication is presented. It computes the matrix product faster than the standard array.

Journal Article•10.1016/0167-8191(88)90044-0•

Third conference on hypercube concurrent computers and applications

Richard Chamberlain

1 Jun 1988

Journal Article•10.1016/0167-8191(88)90099-3•

Data transport in Wang's partition method

Peter Michielse¹, Henk A. van der Vorst¹•Institutions (1)

Delft University of Technology¹

1 Apr 1988

TL;DR: This work proposes a modification of the partition method of Wang which reduces the amount of data transport considerably, without affecting the computational complexity and which has about the same degree of parallelism as the original version.

Abstract: The partition method of Wang, for the solution of tridiagonal linear systems, is analysed with regard to data transport between the processors of a parallel (local memory) computer. We propose a modification which reduces the amount of data transport considerably, without affecting the computational complexity and which has about the same degree of parallelism as the original version. We will also discuss the effects of this modification to a generalized version for banded systems. The parallel solution of a bidiagonal system is considered.

Journal Article•10.1016/0167-8191(88)90075-0•

Parallel sorting algorithms for tightly coupled multiprocessors

Michael J. Quinn¹•Institutions (1)

University of New Hampshire¹

1 Mar 1988

TL;DR: Three parallel sorting algorithms suitable for tightly coupled multiprocessors are presented and compared on the Denelcor HEP; the authors' implementation of quickmerge achieves significantly higher speedup than their implementation of parallel quicksort.

Abstract: We present three parallel sorting algorithms suitable for implementation on tightly coupled multiprocessors and compare their performance on the Denelcor HEP. Two of the algorithms implemented, parallel Shellsort and quickmerge, are new. Shellsort is amenable to parallelization; however, since Shellsort has higher complexity than quicksort, parallel Shellsort is inferior to parallel quicksort. A second new parallel algorithm, called quickmerge, is based upon both quicksort and mergesort. Our implementation of quickmerge achieves significantly higher speedup than our implementation of parallel quicksort.

Book Chapter•10.1007/3-540-51604-2_5•

Parallel LU Decomposition on a Transputer Network

Rob H. Bisseling¹, Johannes G. G. van de Vorst¹•Institutions (1)

Royal Dutch Shell¹

1 Jun 1988

TL;DR: A general Cartesian data distribution scheme is presented which contains many of the existing distribution schemes as special cases and is used to prove optimality of load balance for the grid distribution.

Abstract: A parallel algorithm is derived for LU decomposition with partial pivoting on a local-memory multiprocessor. A general Cartesian data distribution scheme is presented which contains many of the existing distribution schemes as special cases. This scheme is used to prove optimality of load balance for the grid distribution. Experimental results of an implementation of the algorithm in occam-2 on a square mesh of 36 transputers show an efficiency of 88% and a speed of 21.5 Mflop/s for a matrix of size n=1000.
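
A small sketch of what a grid distribution can look like, under the assumption (not stated in the abstract) that matrix element (i, j) is stored on processor (i mod M, j mod N) of an M x N processor mesh. The function names are hypothetical. The load computed here counts stored matrix elements only, which is a rough proxy for the balance argument.

```python
def grid_owner(i, j, M, N):
    """Processor coordinates that hold element (i, j) under the assumed distribution."""
    return (i % M, j % N)


def elements_per_processor(n, M, N):
    """Number of elements of an n x n matrix stored on each processor (s, t)."""
    rows = [len(range(s, n, M)) for s in range(M)]   # rows i with i % M == s
    cols = [len(range(t, n, N)) for t in range(N)]   # columns j with j % N == t
    return {(s, t): rows[s] * cols[t] for s in range(M) for t in range(N)}
```

For n = 1000 on a 6 x 6 mesh, as in the reported experiment, every processor holds either 166 or 167 rows and columns, so the per-processor element counts lie between 166 * 166 and 167 * 167.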

Journal Article•10.1016/0167-8191(88)90066-X•

Parallel multigrid solution of the Navier-Stokes equations on general 2D domains

Johannes Linden, Barbara Steckel, Klaus Stüben

1 Sep 1988

TL;DR: A parallel multigrid solver for steady-state incompressible Navier-Stokes equations on general domains which is currently being developed at the GMD is described.

Abstract: Multigrid methods are distinguished by their optimal (sequential) efficiency and by the fact that all their algorithmical components are fully parallelizable. For this reason, this class of numerical methods is especially attractive for use on parallel (MIMD, local memory) computers. In this paper, we describe a parallel multigrid solver for steady-state incompressible Navier-Stokes equations on general domains which is currently being developed at the GMD. Due to the geometrical generality of the problem, our approach is based on a non-staggered (nodal-point) finite volume scheme on multi-block boundary fitted grids. The typical instability of non-staggered schemes is overcome by suitably modifying the discrete continuity equation without affecting the overall order of consistency. Starting from the most simple Cartesian case, we discuss several possible multigrid approaches to the general 2D-problem. This motivates the basic design decisions of our multigrid solver in regard to both the discretization and the choice of multigrid components (smoothing schemes). Furthermore, the principal technique of parallelization (grid partitioning) is described as well as some fundamental aspects of the implementation (communication library).

Journal Article•10.1016/0167-8191(88)90004-X•

A distributed algorithm for convex network optimization problems

Stavros A. Zenios¹, John M. Mulvey²•Institutions (2)

University of Pennsylvania¹, Princeton University²

1 Jan 1988

TL;DR: A synchronous relaxation method (SRM) for strictly convex network optimization problems is tested in a simulated distributed environment on a sequential machine; lower bounds for its expected efficiency are developed and compared with its performance in computational experiments.

Abstract: Gauss-Seidel type relaxation techniques are applied in the context of strictly convex network optimization problems. The algorithm lends itself to processing in a massively distributed environment. A synchronous relaxation method (SRM) is proposed, based on the k-coloring properties of the network graph. The method is tested in a simulated distributed environment on a sequential machine. Lower bounds for the expected efficiency of SRM are developed and compared with its performance as obtained through computational experiments.
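
The role of a k-coloring in a synchronous relaxation can be sketched as follows: nodes that share a color have no edge between them, so a Gauss-Seidel style update of one node never reads a value that another node of the same color is writing. The greedy coloring and the function names below are assumptions for illustration; the paper's method may construct its coloring differently.

```python
def greedy_coloring(adj):
    """Give each node the smallest color not used by an already colored neighbor."""
    color = {}
    for v in sorted(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def color_classes(color):
    """Group nodes by color; each group could be relaxed simultaneously."""
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, []).append(v)
    return [classes[c] for c in sorted(classes)]
```

One sweep then visits the color classes in sequence, updating all nodes of a class at the same time.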

Journal Article•10.1016/0167-8191(88)90082-8•

Parallel solution of linear systems with striped sparse matrices

Rami Melhem¹•Institutions (1)

University of Pittsburgh¹

1 Feb 1988

TL;DR: In this paper, the non-zero elements in a sparse matrix are organized in the form of non-overlapping stripes, and only the elements within the stripe structure of the matrix are manipulated.

Abstract: The multiplication of a vector by a matrix and the solution of triangular linear systems are the most demanding operations in the majority of iterative techniques for the solution of linear systems. Data-driven VLSI networks which perform these two operations, efficiently, for certain sparse matrices are introduced. In order to avoid computations that involve zero operands, the non-zero elements in a sparse matrix are organized in the form of non-overlapping stripes, and only the elements within the stripe structure of the matrix are manipulated. Detailed analysis of the networks proves that both operations may be completed in n global cycles with minimal communication overhead, where n is the order of the linear system. The number of cells in each network as well as the communication overhead, are determined by the stripe structure of the matrix. Different stripe structures for the class of sparse matrices generated in Finite Element Analysis are examined in a separate paper.

Journal Article•10.1016/0167-8191(88)90003-8•

Towards developing robust algorithms for solving partial differential equations on MIMD machines

Joel H. Saltz¹, Vijay K Naik¹•Institutions (1)

Yale University¹

1 Jan 1988

TL;DR: The methods suggested here increase the degree to which work can be performed while data are communicated between processors, allowing efficient overlap of computation with communication.

Abstract: Methods for efficient computation of numerical algorithms on a wide variety of MIMD machines are proposed. These techniques reorganize the data dependency patterns to improve the processor utilization. The model problem finds the time-accurate solution to a parabolic partial differential equation discretized in space and implicitly marched forward in time. The algorithms are extensions of Jacobi and SOR. The extensions consist of iterating over a window of several timesteps, allowing efficient overlap of computation with communication. The methods increase the degree to which work can be performed while data are communicated between processors. The effect of the window size and of domain partitioning on the system performance is examined both by implementing the algorithm on a simulated multiprocessor system.

Journal Article•10.1016/0167-8191(88)90016-6•

Tools to aid in the analysis of memory access patterns for FORTRAN programs

Orlie Brewer¹, Jack Dongarra¹, Danny C. Sorensen¹•Institutions (1)

Argonne National Laboratory¹

1 Dec 1988

TL;DR: A set of tools that can be used as an aid in the analysis of memory access patterns of FORTRAN programs is described.

Abstract: This paper describes a set of tools that can be used as an aid in the analysis of memory access patterns of FORTRAN programs.

Journal Article•10.1016/0167-8191(88)90055-5•

Heterogeneity in supercomputer architectures

Milos D. Ercegovac¹•Institutions (1)

University of California, Los Angeles¹

1 Sep 1988

TL;DR: Some approaches to heterogeneous architectures are discussed, hardware and software issues are identified, and several built or proposed systems are analyzed.

Abstract: Various organization styles have been used in the architecture of supercomputers in order to achieve cost-effective performance and programmability. Traditionally, a particular organization style (e.g., vector pipeline processor, array processor, or multiprocessor) has been selected to satisfy the performance requirements of a class of applications, achieving usually a much lower performance in other applications. In addition, the mapping of ‘foreign’ algorithms to a single-style architecture may create great programming difficulties. Since each architecture style provides attractive cost-performance and programming features, the question of heterogeneity (i.e., combining of several architecture/design styles in a single system) deserves attention. In this paper we discuss some approaches to heterogeneous architectures, identify hardware and software issues, and analyze several built or proposed systems.

Book Chapter•10.1007/978-3-642-83248-2_5•

(r∞, n1/2, s1/2) Measurements on the 2-CPU Cray X-MP

Roger W. Hockney¹•Institutions (1)

University of Reading¹

1 Jan 1988

TL;DR: For dyadic operations split across both CPUs with the TSKSTART and TSKWAIT synchronization primitives, r∞ = 130 Mflop/s and s1/2 = 5700 flop are measured; a simplified LOCK synchronization written in CAL reduces s1/2 to 220 flop, which is probably the minimum possible synchronization overhead on the Cray X-MP.

Abstract: We report performance measurements made on the 2-CPU Cray X-MP at ECMWF, Reading. Vector (SIMD) performance on one CPU is interpreted by the two parameters (r∞, n1/2), and we find for dyadic operations using FORTRAN r∞ = 70 Mflop/s, n1/2 = 53 flop. All-vector triadic operations produce r∞ = 107 Mflop/s, n1/2 = 45 flop; and a triadic operation with two vectors and one scalar gives r∞ = 148 Mflop/s and n1/2 = 60 flop. MIMD performance using both CPUs on one job is interpreted with the two parameters (r∞, s1/2), where s1/2 is the amount of arithmetic that could have been done during the time taken to synchronize the two CPUs. We find, for dyadic operations using the TSKSTART and TSKWAIT synchronization primitives, that r∞ = 130 Mflop/s and s1/2 = 5700 flop. This means that a job must contain more than ∼6000 floating-point operations if it is to run at more than 50% of the maximum performance when split between both CPUs by this method. Less expensive synchronization methods using LOCKS and EVENTS reduce s1/2 to 4000 flop and 2000 flop respectively. A simplified form of LOCK synchronization written in CAL code further reduces s1/2 to 220 flop. This is probably the minimum possible value for synchronization overhead on the Cray X-MP.
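
The 50% statement follows directly if the two-parameter characterisation is taken in its usual form, r = r∞ / (1 + s1/2 / s), where s is the amount of arithmetic in the job (and likewise with n1/2 and vector length n for the single-CPU case). That functional form is assumed here rather than quoted from the abstract, and `hockney_rate` is a hypothetical name.

```python
def hockney_rate(r_inf, half_length, work):
    """Achieved rate for a job of size `work` under the assumed two-parameter model.

    r_inf        asymptotic rate (Mflop/s)
    half_length  n1/2 or s1/2: the job size at which half of r_inf is reached
    work         vector length n or amount of arithmetic s (same unit as half_length)
    """
    return r_inf / (1.0 + half_length / work)
```

With the TSKSTART/TSKWAIT figures, `hockney_rate(130.0, 5700.0, 5700.0)` gives 65 Mflop/s, half of r∞, which matches the remark that a job needs roughly 6000 floating-point operations to exceed 50% of peak. With s1/2 = 220 flop the same job size reaches about 125 Mflop/s.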

Journal Article•10.1016/0167-8191(88)90053-1•

Grid applications on distributed memory architectures: Implementation and evaluation

Karl Solchenbach

1 Sep 1988

TL;DR: It is demonstrated that grid applications can be implemented quite easily on dm-mp systems if a hardware-independent process system exists and convenient tools (such as the SUPRENUM mapping and communications library) are available.

Abstract: It was shown in the paper of Solchenbach and Trottenberg (in this special issue) that grid algorithms are inherently parallel and that parallel grid algorithms for regular grids can be efficiently implemented on dm-mp systems using the concept of grid partitioning. In this paper, we demonstrate that grid applications can be implemented quite easily on dm-mp systems if a hardware-independent process system exists and convenient tools (such as the SUPRENUM mapping and communications library) are available. The evaluation of parallel grid algorithms shows that the multiprocessor speedup and efficiency for single grid applications depend on the communication/calculation performance ratio of the hardware, on the communication/calculation ratio of the algorithms, and on the process size. The efficiency of parallel multigrid algorithms additionally depends on the number of nodes.


Journal Article•10.1016/0167-8191(88)90086-5•

A segmented FFT algorithm for vector computers


Mike Ashworth¹, Andrew Lyne•Institutions (1)

University of Manchester¹

1 Feb 1988

TL;DR: A new algorithm designed for large, single transforms is presented, which employs a pair of multiple transforms to perform the single transform.


Abstract: The Fast Fourier Transform algorithm does not readily lend itself to efficient implementation on vector computers, especially on machines where sequential access is important. Several authors have commented that the efficiency of computation is much improved if many transforms are performed simultaneously. We present a new algorithm designed for large, single transforms, which employs a pair of multiple transforms to perform the single transform. The merits of the algorithm are discussed with reference to its implementation on a CDC CYBER 205.
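The idea of computing one long transform through a pair of multiple short transforms can be sketched with the standard index-splitting decomposition for a length N = n1 · n2. This is an illustrative assumption about the general technique, not the authors' algorithm as implemented on the CYBER 205. The names `dft` and `segmented_dft` are hypothetical, and the naive O(n²) `dft` merely stands in for a vectorised multiple-FFT kernel.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform; a stand-in for a fast multiple-transform kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def segmented_dft(x, n1, n2):
    """Length n1*n2 transform built from n2 transforms of length n1,
    a twiddle-factor multiplication, and n1 transforms of length n2."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # First batch: for each j2, transform the strided subsequence x[j2], x[j2+n2], ...
    a = [dft(x[j2::n2]) for j2 in range(n2)]  # a[j2][k1]

    # Twiddle factors exp(-2*pi*i*j2*k1/n) couple the two batches.
    b = [
        [a[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n) for k1 in range(n1)]
        for j2 in range(n2)
    ]

    # Second batch: for each k1, transform over j2; result lands at index k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        c = dft([b[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = c[k2]
    return out
```

Each batch consists of many independent transforms of equal length, which is the situation the abstract describes as favourable on vector machines.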


Journal Article•10.1016/0167-8191(88)90129-9•

A comparative study of libraries for parallel processing


David F. Snelling¹, Geerd-R. Hoffmann¹•Institutions (1)

European Centre for Medium-Range Weather Forecasts¹

1 Oct 1988

TL;DR: A proposal for an objective measure of a library's complexity is put forward along with a collection of subjective issues which should be considered in reference to parallel libraries.


Abstract: Several libraries for parallel processing on supercomputers are analyzed in terms of their parallel processing facilities, complexity, use, and how they reflect the hardware for which they were designed. A proposal for an objective measure of a library's complexity is put forward along with a collection of subjective issues which should be considered in reference to parallel libraries. The libraries discussed include those provided by ETA, CRAY, IBM, and FPS, as well as a portable parallel library developed by one of the authors. A brief discussion of how these libraries address the basic concepts of parallel processing is provided.


Journal Article•10.1016/0167-8191(88)90051-8•

PEACE: The distributed SUPRENUM operating system


Wolfgang Schröder¹•Institutions (1)

Technical University of Berlin¹

1 Sep 1988

TL;DR: The fundamental concepts and structure of the distributed operating system, PEACE, for SUPRENUM, are described and an optimal and application-oriented mapping of the entire operating system onto the distributed SUPRENUM architecture is made feasible.


Abstract: This paper describes the fundamental concepts and structure of the distributed operating system, PEACE, for SUPRENUM. A high degree of distribution is achieved by consistently encapsulating typical operating system services in processes. In this way, an optimal and application-oriented mapping of the entire operating system onto the distributed SUPRENUM architecture is made feasible.


Journal Article•10.1016/0167-8191(88)90017-8•

Modeling the performance of hypercubes: A case study using the particle-in-cell application


Olaf M. Lubeck¹, Vance Faber¹•Institutions (1)

Los Alamos National Laboratory¹

1 Dec 1988

TL;DR: This application provides a simple example of the problems associated with load balancing on distributed memory architectures and introduces the use of provably optimal global communication algorithms that are needed for the PIC implementation on the hypercube.


Abstract: We have mapped onto the iPSC hypercube a particle-in-cell (PIC) algorithm that executes a plasma simulation. PIC simulates the movement of charged particles under the influence of an electrostatic field. This application provides a simple example of the problems associated with load balancing on distributed memory architectures. We present several alternative solutions to mappings of the algorithm onto the hypercube. One solution's performance is modeled and benchmarked with data from an implementation on the iPSC. The model is used to predict performance for larger size problems and a state-of-the-art hypercube architecture. We also introduce the use of provably optimal global communication algorithms that are needed for the PIC implementation on the hypercube.

