Top 30 papers presented at Parallel Computing in 1986

Journal Article•10.1016/0167-8191(86)90019-0•

Parallel implementation of multifrontal schemes

[...]

20 Jul 1986

TL;DR: This work considers the direct solution of large sparse sets of linear equations in an MIMD environment using a multifrontal approach and shows how to distribute tasks among processors according to an elimination tree that can be automatically generated from any pivot strategy.

Abstract: We consider the direct solution of large sparse sets of linear equations in an MIMD environment. We base our algorithm on a multifrontal approach and show how to distribute tasks among processors according to an elimination tree that can be automatically generated from any pivot strategy. We study organizational aspects of the scheme for shared memory multiprocessor configurations and consider implications for multiprocessors with local memories and a communication network.

179 citations

Journal Article•10.1016/0167-8191(86)90030-X•

Effects of synchronization barriers on multiprocessor performance

[...]

T. S. Axelrod¹•Institutions (1)

Lawrence Livermore National Laboratory¹

1 May 1986

TL;DR: It is found that the performance of a recently proposed ‘butterfly’ barrier is significantly higher than the more traditional ‘two lock’ barrier.

Abstract: Synchronization barriers are frequently required by numerical algorithms for multiprocessors. When the number of processors becomes large these barriers may cause significant performance degradation. This paper examines the performance of two alternative types of synchronization barriers using simulation tools. It is found that the performance of a recently proposed ‘butterfly’ barrier is significantly higher than the more traditional ‘two lock’ barrier.

78 citations
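
The abstract above names the ‘butterfly’ barrier without describing it. The following is a minimal sketch of the scheme as it is commonly described, not the paper's own code: with P threads (assumed here to be a power of two), each thread synchronizes pairwise with partner `i XOR 2^r` in round `r`, so every thread learns that all others have arrived after log2(P) rounds. The function name `butterfly_barrier_demo` and the use of `threading.Event` flags are illustrative assumptions.

```python
import threading

def butterfly_barrier_demo(num_threads=8):
    """Run one butterfly barrier episode; num_threads is assumed to be a power of two."""
    rounds = num_threads.bit_length() - 1          # log2(num_threads)
    # One single-use flag per (round, thread): "thread i has reached round r".
    flags = [[threading.Event() for _ in range(num_threads)] for _ in range(rounds)]
    arrived = [False] * num_threads
    saw_everyone = [False] * num_threads

    def worker(i):
        arrived[i] = True                          # work before the barrier
        for r in range(rounds):
            partner = i ^ (1 << r)                 # butterfly partner in this round
            flags[r][i].set()                      # announce own arrival at round r
            flags[r][partner].wait()               # wait for the partner's announcement
        # After the last round, every thread has (transitively) heard from all others.
        saw_everyone[i] = all(arrived)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saw_everyone
```

Each thread only ever waits on one partner per round, which is why this style of barrier avoids the single shared counter or lock that every processor must contend for.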

Journal Article•10.1016/0167-8191(86)90006-2•

The performance of FORTRAN implementations for preconditioned conjugate gradients on vector computers

[...]

H.A. van der Vorst¹•Institutions (1)

Delft University of Technology¹

1 Mar 1986

TL;DR: This work considers in detail the performance of FORTRAN implementations for the conjugate gradient algorithm on a number of well-known vector computers: CRAY-1, CRAY X-MP and CYBER 205.

Abstract: We will consider in detail the performance of FORTRAN implementations for the conjugate gradient algorithm, for the solution of large linear systems, on a number of well-known vector computers: CRAY-1, CRAY X-MP and CYBER 205. Lower bounds on the CPU-times, required for separate parts of the algorithm, are presented and these are compared to the actually observed CPU-times. It appears that these lower bounds are reasonably sharp.

65 citations
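
For orientation, here is a small sketch of a preconditioned conjugate gradient iteration. It is written in Python rather than the FORTRAN studied in the paper, and it assumes a simple diagonal (Jacobi) preconditioner, which the abstract does not specify; the function name `pcg` and its parameters are hypothetical. The matrix-vector product, the dot products and the vector updates are the "separate parts" whose cost dominates such an iteration.

```python
def pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite A (list of lists).

    Uses a diagonal (Jacobi) preconditioner as an illustrative assumption.
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    x = [0.0] * n
    r = list(b)                                   # residual b - A x with x = 0
    z = [r[i] / A[i][i] for i in range(n)]        # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:                # converged
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)                   # step length
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz                        # direction update coefficient
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x
```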

Journal Article•10.1016/0167-8191(86)90004-9•

Implementation of some concurrent algorithms for matrix factorization

[...]

Jack Dongarra¹, Ahmed H. Sameh², Danny C. Sorensen¹•Institutions (2)

Argonne National Laboratory¹, University of Illinois at Urbana–Champaign²

1 Mar 1986

TL;DR: Three parallel algorithms for computing the QR-factorization of a matrix are presented and computational results indicate that the Pipelined Givens method is preferred and that this is primarily due to the number of array references required by the various algorithms.

Abstract: Three parallel algorithms for computing the QR-factorization of a matrix are presented. The discussion is primarily concerned with implementation of these algorithms on a computer that supports tightly coupled parallel processes sharing a large common memory. The three algorithms are a Householder method based upon high-level modules, a Windowed Householder method that avoids fork-join synchronization, and a Pipelined Givens method that is a variant of the data-flow type algorithms offering large enough granularity to mask synchronization costs. Numerical experiments were conducted on the Denelcor HEP computer. The computational results indicate that the Pipelined Givens method is preferred and that this is primarily due to the number of array references required by the various algorithms.

57 citations
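
As background for the Givens-based method mentioned above, the following is a plain sequential sketch of QR-factorization by Givens rotations. It is not the paper's pipelined variant, and the function name `givens_qr_r` is hypothetical. Each rotation combines two adjacent rows to zero one subdiagonal entry; only the triangular factor R is formed here.

```python
import math

def givens_qr_r(A):
    """Return the upper triangular factor R of A (list of rows) via Givens rotations."""
    R = [row[:] for row in A]                     # work on a copy
    m, n = len(R), len(R[0])
    for j in range(n):                            # eliminate column j below the diagonal
        for i in range(m - 1, j, -1):             # work upward from the bottom row
            a, b = R[i - 1][j], R[i][j]
            if b == 0.0:
                continue                          # entry is already zero
            h = math.hypot(a, b)
            c, s = a / h, b / h                   # rotation cosine and sine
            for k in range(j, n):                 # rotate rows i-1 and i
                u, v = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * u + s * v
                R[i][k] = -s * u + c * v
    return R
```

Because each rotation touches only two rows, rotations acting on disjoint row pairs are independent, which is the property a pipelined or data-flow formulation can exploit.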

Journal Article•10.1016/0167-8191(86)90028-1•

Structuring parallel algorithms in an MIMD, shared memory environment

[...]

Harry F. Jordan¹•Institutions (1)

University of Colorado Boulder¹

1 May 1986

TL;DR: The automatic generation of programs with global parallelism seems to be a promising possibility for algorithms in which parallelism is introduced at the top of the program structure hierarchy; such algorithms have performance advantages in a shared-memory MIMD computational model.

Abstract: This paper discusses the class of algorithms having global parallelism, i.e. those in which parallelism is introduced at the top of the program structure hierarchy. Such algorithms have performance advantages in a shared-memory, MIMD computational model. A programming environment consisting of FORTRAN, enhanced by some pre-processed macros, has been built to aid in writing programs for such algorithms for the Denelcor HEP multiprocessor. Applications ranging from tens to hundreds of FORTRAN statements have been written and tested in this environment. A few parallelism constructs suffice to yield understandable programs with a high degree of parallelism. The automatic generation of programs with global parallelism seems to be a promising possibility.

51 citations

Journal Article•10.1016/0167-8191(86)90005-0•

Parallelizing conjugate gradient for the CRAY X-MP

[...]

Mark Seager¹•Institutions (1)

Lawrence Livermore National Laboratory¹

1 Mar 1986

TL;DR: A standard preconditioned conjugate gradient algorithm for the solution of symmetric linear systems is studied in the context of multiprocessing based on a computational model of an MIMD machine with shared global memory and efficiencies near one are observed.

Abstract: A standard preconditioned conjugate gradient algorithm for the solution of symmetric linear systems is studied in the context of multiprocessing. A totally parallel approach is taken based on a computational model of an MIMD machine with shared global memory. In order to assure mathematical correctness of the algorithm, four barrier syncs are required during each iteration. Large linear systems are solved with this parallel adaptation of conjugate gradient, and efficiencies near one are observed on the CRAY X-MP24 running COS 1.13. Further, segments of the code which could be considered independent (i.e. between syncs) clocked in at speedups very close to the number of tasks. The latter indicates that the loss of efficiency in this implementation of the algorithm on the X-MP24 is connected with the cost of barrier syncs. On the CRAY X-MP48 running COS 1.15 similar results are observed utilizing two CPUs. When all four CPUs execute parallel tasks in vector mode, memory bank conflicts cause a 30% loss of overall speedup.

45 citations

Journal Article•10.1016/0167-8191(86)90024-4•

Superlinear speedup of an efficient sequential algorithm is not possible

[...]

Vance Faber, Olaf M. Lubeck, Andrew B. White

20 Jul 1986

45 citations

Journal Article•10.1016/0167-8191(86)90033-5•

Two and three dimensional FFTs on highly parallel computers

[...]

Andy Brass¹, G.S Pawley¹•Institutions (1)

University of Edinburgh¹

1 May 1986

TL;DR: An algorithm is described for computing two and three dimensional Fourier transforms on computers of SIMD architecture, including the use of the 2-d results in conjunction with base ‘r1 + r2’ FFT algorithms to calculate 3-d Fourier transforms on a set of $N^3$ complex data points.

Abstract: An algorithm is described for computing two and three dimensional Fourier transforms on computers of SIMD architecture. The algorithm assumes the existence of a library routine for the calculation of a 2-d Fourier transform on a set of $N_p^2$ data points where $N_p^2$ is the number of processing elements. The paper discusses how to use this routine to calculate 2-d Fourier transforms on a set of $N^2$ data points where NpN is a power of two, using an interleaving technique. The paper also discusses the use of the 2-d results in conjunction with base ‘r1 + r2’ FFT algorithms to calculate 3-d Fourier transforms on a set of $N^3$ complex data points. In the final section a general program is described to calculate 3-d Fourier transforms for any values of $N$ and $N_p$ such that NpN is a power of two. Timings are given for the algorithms run on an ICL Distributed Array Processor.

22 citations

Journal Article•10.1016/0167-8191(86)90015-3•

Modelling, measurement, and simulation of memory interference in the CRAY X-MP

[...]

W Oed, O Lange¹•Institutions (1)

RWTH Aachen University¹

1 Oct 1986

TL;DR: Some analytical results regarding the access in vector mode to an interleaved memory system and the number and type of memory conflicts that were encountered are presented.

Abstract: Memory interleaving and multiple access ports are the key to a high memory bandwidth in vector processing systems. Each of the active ports supports an independent access stream to memory among which access conflicts may arise. Such conflicts lead to a decrease in memory bandwidth and consequently to longer execution times. We present some analytical results regarding the access in vector mode to an interleaved memory system. In order to demonstrate the practical effects of our analytical results we have done time measurements of some simple vector loops on a 2-CPU, 16-bank CRAY X-MP. By corresponding simulations we obtained the number and type of memory conflicts that were encountered.

21 citations

Journal Article•10.1016/0167-8191(86)90008-6•

A note on the vectorization of scalar recursions (Short Communication)

[...]

O Axelsson¹, V Eijkhout¹•Institutions (1)

The Catholic University of America¹

1 Mar 1986

TL;DR: A comparison is made of the performance of three methods for scalar recursions (with Horner's scheme as a special case) on vector computers.

Abstract: A comparison is made of the performance of three methods for scalar recursions (with Horner's scheme as a special case) on vector computers.

13 citations
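
To make the terms concrete: a first-order scalar recursion has the form x_i = a_i + b_i * x_{i-1}, and Horner's scheme for evaluating a polynomial is the special case where every b_i equals the evaluation point. The sketch below shows only this sequential baseline; the three vectorized methods compared in the paper are not named in the abstract and are not reproduced here. The function names are hypothetical.

```python
def linear_recursion(a, b, x0):
    """Return [x_1, ..., x_n] where x_i = a[i-1] + b[i-1] * x_{i-1}, starting from x0."""
    x = x0
    values = []
    for ai, bi in zip(a, b):
        x = ai + bi * x          # each step depends on the previous result
        values.append(x)
    return values

def horner(coeffs, t):
    """Evaluate a polynomial at t; coeffs are ordered from highest degree to lowest."""
    values = linear_recursion(coeffs, [t] * len(coeffs), 0)
    return values[-1] if values else 0
```

The loop-carried dependence on `x` is what prevents straightforward vectorization of this loop.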

Journal Article•10.1016/0167-8191(89)90015-X•

Message Length Effects for Solving Polynomial Systems on a Hypercube

[...]

Wolfgang Pelz¹, Layne T. Watson²•Institutions (2)

University of Akron¹, University of Michigan²

1 Jan 1986

TL;DR: The solution of polynomial systems of equations via a globally convergent homotopy algorithm on a hypercube and some timing results for different situations are considered.

Abstract: Comparisons between problems solved on uniprocessor systems and those solved on distributed computing systems generally ignore the overhead associated with information transfer from one process to another. This paper considers the solution of polynomial systems of equations via a globally convergent homotopy algorithm on a hypercube and some timing results for different situations.

Journal Article•10.1016/0167-8191(86)90012-8•

Romberg integration using systolic arrays

[...]

David J. Evans¹, Graham M. Megson¹•Institutions (1)

Loughborough University¹

1 Oct 1986

TL;DR: A systolic array is presented to improve numerical approximations to integrals using Richardson's extrapolation procedure in the form of Romberg integration, which shows a significant improvement on the $O(n^2)$ steps required to construct the extrapolation table sequentially.

Abstract: A systolic array is presented to improve numerical approximations to integrals using Richardson's extrapolation procedure in the form of Romberg integration. Two designs are presented: the first is an intuitive linear systolic array, and the second is a systolic ring using approximately 1/3 of the cells of the first array. Both systolic arrays have a computation time of 3n cycles, which is a significant improvement on the $O(n^2)$ steps required to construct the extrapolation table sequentially.
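
For reference, the sequential construction that the abstract compares against can be sketched as follows. This is a standard textbook formulation of Romberg integration, not the systolic design; the function name `romberg` and the `levels` parameter are hypothetical. The nested loops over the triangular table are where the quadratic number of steps comes from.

```python
def romberg(f, a, b, levels=5):
    """Approximate the integral of f over [a, b] with a Romberg table of the given size."""
    R = [[0.0] * levels for _ in range(levels)]
    h = b - a
    R[0][0] = 0.5 * h * (f(a) + f(b))                 # trapezoid rule, one interval
    for i in range(1, levels):
        h /= 2.0
        # Refine the trapezoid rule by adding only the new midpoints.
        new_points = sum(f(a + (2 * k - 1) * h) for k in range(1, 2 ** (i - 1) + 1))
        R[i][0] = 0.5 * R[i - 1][0] + h * new_points
        # Richardson extrapolation along row i of the table.
        for j in range(1, i + 1):
            R[i][j] = R[i][j - 1] + (R[i][j - 1] - R[i - 1][j - 1]) / (4 ** j - 1)
    return R[levels - 1][levels - 1]
```

Each entry `R[i][j]` depends only on its left neighbour and the entry diagonally above it, a local dependence pattern of the kind systolic arrays are built to exploit.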

Journal Article•10.1016/0167-8191(86)90032-3•

Solving the generalized eigenvalue problem on a synchronous linear processor array

[...]

Daniel Boley¹•Institutions (1)

University of Minnesota¹

1 May 1986

TL;DR: A parallel method to solve the generalized eigenvalue problem on a linear array of processors, each connected to their nearest neighbors and operating synchronously, based on the well-known QZ algorithm of Moler and Stewart, which simultaneously reduces two n × n matrices to upper triangular form by orthogonal or unitary transformations.

Abstract: We present a parallel method to solve the generalized eigenvalue problem on a linear array of processors, each connected to their nearest neighbors and operating synchronously. We also include a wrap-around connection from end to end. Our method is based on the well-known QZ algorithm of Moler and Stewart, which simultaneously reduces two n × n matrices to upper triangular form by orthogonal or unitary transformations. We show how this algorithm may be partitioned and distributed over n + 1 processors, achieving a speed-up over the serial algorithm of O(n). We use the concept of windows to describe the action of each processor at each step. We show how to incorporate single shifts, and how to apply orthogonal plane rotations on either side of a matrix without the need to transpose the matrix itself.

Journal Article•10.1016/0167-8191(86)90003-7•

Systolic sorting in a sequential input/output environment

[...]

Selim G. Akl¹, Hartmut Schmeck¹•Institutions (1)

Queen's University¹

1 Mar 1986

TL;DR: The 2-way sorter qualifies to be a perfect systolic architecture: It is built from simple cells having a constant number of inputs and outputs and constant area and time.

Abstract: A new parallel sorting architecture called the 2-way sorter is presented which is especially well-suited for use in an environment with sequential input and output. A 2-way sorter having an area of $n(k + 1)a/2$ can sort $m$ sequences of $n$ $k$-bit numbers in time $((\lceil m/2 \rceil + 1)n + k)t$, where $a$ and $t$ are the area and the time of its bit-level building block, the 2-way cell. Using the same hardware, $mn$ $k$-bit numbers can be sorted in time $O(mn \log_2 m)$ without needing more memory than for storing the $mn$ numbers. The 2-way sorter qualifies to be a perfect systolic architecture: It is built from simple cells having a constant number of inputs and outputs and constant area and time. Except for a one-bit control information all communication is local. All its cells are active at the same time. In a sequential input/output environment the 2-way sorter has optimal area, period and time.

Journal Article•10.1016/0167-8191(86)90021-9•

The mapping of 2-D array processors to 1-D array processors

[...]

Chang-Biau Yang¹, Richard C. T. Lee¹,²•Institutions (2)

National Tsing Hua University¹, Academia Sinica²

20 Jul 1986

TL;DR: It is shown that in such a situation, this 2-dimensional wavefront processor can be mapped to a linear array processor if the wavefronts never backtrack, and the mapping will not increase the number of registers in each processor element.

Abstract: We consider the case of a 2-dimensional wavefront array processor where only one wavefront appears at any time. We show that in such a situation, this 2-dimensional wavefront processor can be mapped to a linear array processor if the wavefronts never backtrack. The mapping will not increase the number of registers in each processor element. Two examples, the spoken words recognition problem and the longest common subsequence problem, are given to demonstrate the feasibility of this method.
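
To illustrate what a wavefront looks like for one of the two examples named above, the sketch below computes the length of a longest common subsequence by sweeping the usual dynamic-programming table along anti-diagonals. It is an illustration of the wavefront order only, not the authors' mapping onto a linear array; the function name `lcs_length_wavefront` is hypothetical. All cells on one anti-diagonal depend only on the two preceding anti-diagonals, so they could be computed simultaneously, and the front only ever moves forward.

```python
def lcs_length_wavefront(a, b):
    """Length of a longest common subsequence of a and b, computed wavefront by wavefront."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(2, m + n + 1):                      # d = i + j indexes the wavefront
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1          # extend a common subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]
```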

Journal Article•10.1016/0167-8191(86)90011-6•

Parallel adaptive full-multigrid methods on message-based multiprocessors

[...]

H C Hoppe, H Mühlenbein

1 Oct 1986

TL;DR: It is shown that the nonnumeric parts of the algorithm—the initialization, the termination and the mapping of processes to processors—are very important for the overall performance.

Abstract: This paper explores the macro data flow approach for solving numerical applications on distributed memory systems. We discuss the problems of this approach with a sophisticated ‘real life’ algorithm, the adaptive full multigrid method. It is shown that the nonnumeric parts of the algorithm (the initialization, the termination and the mapping of processes to processors) are very important for the overall performance. To avoid unnecessary global synchronization points we propose to use distributed supervisors. We compare this solution with more centralized algorithms. The performance evaluation is done for nearest neighbour and bus connected multiprocessors using a simulation system.

Journal Article•10.1016/0167-8191(86)90020-7•

Two parallel SOR variants of the Schwarz alternating procedure

[...]

U Meier

20 Jul 1986

TL;DR: Two parallel variants of the Schwarz alternating procedure are presented for solving two-dimensional elliptic partial differential equations by decomposing the domain into overlapping rectangles and applying the SOR-algorithm in each subdomain.

Abstract: Two parallel variants of the Schwarz alternating procedure for solving two-dimensional elliptic partial differential equations by using a decomposition of the domain into overlapping rectangles are presented. In each of these the SOR-algorithm is applied to the linear systems that arise from finite difference approximations in each subdomain. The convergence behaviour of the methods is examined and compared with the serial SOR-algorithm. Some numerical results, carried out on a CRAY X-MP with two processors, are presented.

Journal Article•10.1016/0167-8191(86)90009-8•

The parallel neighbour sort and 2-way merge algorithm (Short Communication)

[...]

David J. Evans¹, N Y Yousif¹•Institutions (1)

Loughborough University¹

1 Mar 1986

TL;DR: The implementation of the neighbour sort and 2-way merge algorithms on a parallel MIMD computer and their computational complexity are described.

Abstract: This paper briefly describes the implementation of the neighbour sort and 2-way merge algorithms on a parallel MIMD computer and analyses their computational complexity.
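
As a rough illustration, the sketch below assumes that "neighbour sort" refers to the scheme usually known as odd-even transposition sort, in which alternating phases compare and exchange disjoint pairs of adjacent elements. The abstract does not spell out the algorithm, so this reading, the function name `neighbour_sort` and the sequential simulation of each phase are all assumptions.

```python
def neighbour_sort(values):
    """Sort a sequence by alternating even and odd compare-exchange phases."""
    a = list(values)
    n = len(a)
    for phase in range(n):                     # n phases are enough for n elements
        start = phase % 2                      # even phases: pairs (0,1),(2,3),...; odd: (1,2),(3,4),...
        for i in range(start, n - 1, 2):       # pairs within a phase are disjoint
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

Because the pairs inside one phase do not overlap, the inner loop is the part that separate processors could execute concurrently.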

Journal Article•10.1016/0167-8191(86)90018-9•

A strategy for vectorization

[...]

B. L. Buzbee¹•Institutions (1)

Los Alamos National Laboratory¹

20 Jul 1986

TL;DR: The intent of this paper is to provide many of these new users of vector processors with a high-level discussion of some of the fundamental aspects of vector processing.

Abstract: The community of people using vector processors is growing rapidly. First, within the United States, the National Science Foundation has established several vector supercomputer centers, and a large number of scientists in academe will be using these resources. Second, IBM has added a vector capability to its high-end mainframe system, and the widespread use of these systems will dramatically increase the community of people using vector processors. Finally, a host of minicomputer manufacturers have added vector capability to their latest systems. So, as a result, there will likely be a revival of interest in vectorization and some exciting additions to the associated technology. The intent of this paper is to provide many of these new users of vector processors with a high-level discussion of some of the fundamental aspects of vector processing.

Journal Article•10.1016/0167-8191(86)90029-3•

Some issues in parallel processing as encountered on the Denelcor HEP

[...]

Robert Hiromoto¹•Institutions (1)

Los Alamos National Laboratory¹

1 May 1986

TL;DR: A collection of differing parallel implementations of a single, computationally intensive algorithm is presented; the algorithm models the collisionless, electrostatic interaction between two relatively moving plasma beams using the Particle-in-Cell (PIC) method.

Abstract: We present a collection of differing parallel implementations of a single, computationally intensive algorithm that models the collisionless, electrostatic interaction between two relatively moving plasma beams. This numerical simulation uses a method, important in many scientific applications, known as the Particle-in-Cell (PIC) method. Our aim in this study is to determine the advantages and disadvantages associated with those various parallel implementations. Our experiments with parallelizing this particular numerical simulation, referred to throughout this paper as the PIC code, were performed on a single Denelcor HEP Process Execution Module (PEM). A complete set of parallel processing speedups and execution times as a function of number of processes is presented.

Journal Article•10.1016/0167-8191(86)90007-4•

An extension of the language C for concurrent programming

[...]

M Sonnenschein¹•Institutions (1)

RWTH Aachen University¹

1 Mar 1986

TL;DR: This paper presents processes, modified ports, and modified signals as concepts for extending C and shows that many classical concepts of concurrent programming can be simulated by ports and signals and, therefore, these primitives are sufficiently powerful.

Abstract: C is a well-known language for systems programming in UNIX systems. Its concepts are very efficient rather than very safe and, therefore, an extension of C for concurrent programming also has to focus on an efficient implementation instead of on very safe programming concepts. We will present processes, modified ports, and modified signals as concepts for extending C. These concepts are defined close to hardware structures such as mailboxes and interrupts and, therefore, they can be implemented efficiently. On the other hand we will show that many classical concepts of concurrent programming can be simulated by ports and signals and, therefore, these primitives are sufficiently powerful.

Journal Article•10.1016/0167-8191(86)90013-X•

Restructuring SIMPLE for the CHiP architecture

[...]

D Gannon¹, J Panetta¹•Institutions (1)

Purdue University¹

1 Oct 1986

TL;DR: A mapping of the algorithms to a configurable highly parallel (CHiP) computer being designed at the University of Washington is described and the way in which parallelism can be used to speed up execution is discussed.

Abstract: The SIMPLE program is a commonly used benchmark for testing new architectures designed for high speed scientific computation. As the name implies, the code is a simple example of a Lagrangian hydrodynamics application. In this paper we describe the SIMPLE benchmark in detail and discuss the way in which parallelism can be used to speed up execution. The focus of the work is a mapping of the algorithms to a configurable highly parallel (CHiP) computer being designed at the University of Washington.

Proceedings Article•

Complexity of the parallel QR decomposition of a rectangular matrix

[...]

Michel Cosnard, Yves Robert

1 Jan 1986

Journal Article•10.1016/0167-8191(86)90023-2•

A lattice model for cellular (systolic) algorithms

[...]

E. Katona¹•Institutions (1)

Hungarian Academy of Sciences¹

20 Jul 1986

TL;DR: A lattice model (here ‘lattice’ means a net of points) is introduced for homogeneous cellular algorithms and a transformation methodology is developed, which makes it possible to produce many different versions of a given cellular algorithm.

Abstract: In this paper a lattice model (here ‘lattice’ means a net of points) is introduced for homogeneous cellular algorithms. On the basis of this model a transformation methodology is developed, which makes it possible to produce many different versions of a given cellular algorithm. These versions may have quite different structural properties, but they perform the same computation as the original algorithm. In this way a great variety of cellular algorithms can be offered to choose the best version in practice and, on the other hand, cellular algorithms can be classified according to their inherent structures.

Journal Article•10.1016/0167-8191(86)90026-8•

Some problems of exploiting a pipeline processor

[...]

J. J. Modi¹, J S Rollett²•Institutions (2)

University of Cambridge¹, University of Oxford²

20 Jul 1986

TL;DR: The design of parallel algorithms in general is discussed and it is shown that using the low-level language APAL it is possible to gain a significant speed up over the Mathlib routines provided by the manufacturers.

Abstract: For the Admiralty Marine Technology Establishment (AMTE) we have designed a number of routines which are now in practical use. In this paper we shall discuss the design of parallel algorithms in general and show that using the low-level language APAL it is possible to gain a significant speed up over the Mathlib routines provided by the manufacturers. However, in order to achieve this it is necessary to perform floating-point addition, floating-point multiplication and fixed-point arithmetic as well as accessing operands to and from the main data storage, all in parallel. Furthermore there are certain restrictions on the inputs to the arithmetic units, which make it difficult to exploit the arithmetic speed of the machine fully. We will explain the features of the design of the AP-120B which make it difficult to write FORTRAN programs so that they run at speeds close to those for equivalent routines written in assembler code. We will also attempt to suggest ways in which the design of the machine could be changed to improve this situation.

Journal Article•10.1016/0167-8191(86)90031-1•

Fast scan-line conversion using vectorisation

[...]

M Goldapp

1 May 1986

TL;DR: A new method for polygons is described that allows almost total vectorisation and affords considerably faster execution speeds even in scalar mode; although primarily created for small computers or future hardware, the methods have been run on a CRAY-1 for demonstration purposes.

Abstract: Algorithms to convert surfaces to scan-lines are used in many colour graphics applications. Fast methods are needed in real-time simulators and computerised film making. This paper discusses vectorisation of common algorithms, and describes a new method for polygons that allows almost total vectorisation and affords considerably faster execution speeds even in scalar mode. Its main idea is a decomposition of a polygon into certain trapezia and triangles that allow vector operations both horizontally and vertically. Biparametric surfaces can be displayed after generating a grid of quadrilaterals. Although primarily created for small computers or future hardware, the methods described have been run on a CRAY-1 for demonstration purposes.

Journal Article•10.1016/0167-8191(86)90022-0•

Fault-tolerance and performance analysis of beta-networks

[...]

John Paul Shen¹, John P. Hayes², L Ciminiera³, A Serra³•Institutions (3)

Carnegie Mellon University¹, University of Michigan², Polytechnic University of Turin³

20 Jul 1986

TL;DR: It is shown that the class of DPR-networks is unique in achieving the maximum possible fault tolerance.

Abstract: The relationship between fault tolerance and performance is explored for β-networks used as interconnection networks in multicomputer systems. The networks of interest are composed of 2 × 2 switches (β-elements) and are represented by a graph model called a β-graph. Two parameters derived from β-graphs are used to characterize β-networks. The fault tolerance (FT) parameter is the maximum number of β-element faults that can be tolerated. The communication delay (CD) parameter, representing the worst-case delay between any pair of computers, is used as a measure of the performance of the β-networks. Tight bounds for both FT and CD parameters are derived. Two important classes of β-networks are introduced, namely, DPR-networks and MISE-networks. It is shown that DPR-networks possess the maximal fault tolerance, and the class of DPR-networks is unique in achieving the maximum possible fault tolerance. The class of MISE-networks is minimally fault tolerant, but has the minimum communication delay. A class of β-networks, called RDTT-networks, that achieve an optimal balance of the FT and CD parameters is also presented.

Journal Article•10.1016/0167-8191(86)90014-1•

Computational models and task scheduling for parallel sparse Cholesky factorization

[...]

Joseph W. H. Liu¹•Institutions (1)

York University¹

1 Oct 1986

TL;DR: A new medium-grained model based on column-oriented tasks is introduced and shown to correspond structurally to the filled graph of the given sparse matrix; combined with a heuristic critical path scheduling method, it gives an overall scheme for parallel sparse Cholesky factorization, appropriate for parallel machines with shared-memory architecture like the Denelcor HEP.

Abstract: In this paper, a systematic and unified treatment of computational task models for parallel sparse Cholesky factorization is presented. They are classified as fine-, medium-, and large-grained graph models. In particular, a new medium-grained model based on column-oriented tasks is introduced, and it is shown to correspond structurally to the filled graph of the given sparse matrix. The task scheduling problem for the various task graphs is also discussed. A practical algorithm to schedule the column tasks of the medium-grained model for multiple processors is described. It is based on a heuristic critical path scheduling method. This will give an overall scheme for parallel sparse Cholesky factorization, appropriate for parallel machines with shared-memory architecture like the Denelcor HEP.

Journal Article•10.1016/0167-8191(86)90025-6•

Parallel efficiency can be greater than unity

[...]

Dennis Parkinson¹•Institutions (1)

Queen Mary University of London¹

20 Jul 1986

TL;DR: It is shown that in some cases parallel architectures with p processors can show speed-ups greater than p and efficiencies greater than unity.

Abstract: It is shown that in some cases parallel architectures with p processors can show speed-ups greater than p and efficiencies greater than unity.

Journal Article•10.1016/0167-8191(86)90002-5•

Framework for formulation and analysis of parallel computation structures

[...]

James C Browne¹•Institutions (1)

University of Texas at Austin¹

1 Mar 1986

TL;DR: A systematic methodology for the formulation of parallel computation structures and algorithms is presented, in which a computation structure is a graph where each node is the binding of an action to a data object and the arcs are the dependency relationships between the unit computations executed at the nodes.

Abstract: This paper gives a systematic methodology for the formulation of parallel computation structures and algorithms. The fundamental definition of a computation structure is a graph where each node is the binding of an action to a data object and the arcs are the dependency relationships between the unit computations executed at the nodes. The structure of the graph is determined by the selection of elements for the model of computation in which the graph is expressed. An abstract machine is created by defining the resources, including for example instruction sets for the processors, which realize the conceptual elements of the model of parallel computations. An algorithm is a mapping of the computation graph to the abstract machine and a program which traverses the mapped graph to execute the computation. The methodology proceeds by describing parallel computations on successively more fully specified abstract machines. A model of parallel computation is selected and an abstract machine implementing the model of computation is defined. Specification of increasingly resolved abstract machines is structured both by increasing the span of elements from the model of computation represented in the machine and by increasing the level of detail resolved for each element.
