Top 154 papers presented at Parallel Computing in 1992

Proceedings Article•

Determining scan flip-flops in partial-scan designs

2 Jan 1992

TL;DR: Results of applying optimal and heuristic procedures on a set of benchmark circuits indicate that heuristic methods give fast and near minimal solutions.

Abstract: A report is presented on procedures investigated to determine flip-flops to be scanned in partial-scan designs for sequential circuits. The main idea pursued is to derive a minimal feedback vertex set of the so-called S-graphs. Results of applying optimal and heuristic procedures on a set of benchmark circuits indicate that heuristic methods give fast and near-minimal solutions.
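
To make the feedback-vertex-set idea concrete, here is a minimal sketch of one simple greedy heuristic. It is not one of the procedures evaluated in the report: the graph encoding (a dict of successor sets), the function name, and the scoring rule (in-degree times out-degree) are all assumptions chosen for illustration. Vertices that cannot lie on any cycle are stripped first; whatever remains must contain a cycle, so one vertex is selected and removed.

```python
def greedy_feedback_vertex_set(graph):
    """graph: dict mapping each vertex to a set of successor vertices.

    Returns a list of vertices whose removal leaves the graph acyclic.
    Greedy and illustrative only; the result is not guaranteed minimal.
    """
    # Private copy in which every vertex appears as a key.
    succ = {v: set(ws) for v, ws in graph.items()}
    for ws in graph.values():
        for w in ws:
            succ.setdefault(w, set())
    chosen = []

    def remove(v):
        succ.pop(v)
        for ws in succ.values():
            ws.discard(v)

    while succ:
        pred_count = {v: 0 for v in succ}
        for ws in succ.values():
            for w in ws:
                pred_count[w] += 1
        # A vertex with no predecessor or no successor cannot be on a cycle.
        dead = [v for v in succ if pred_count[v] == 0 or not succ[v]]
        if dead:
            for v in dead:
                remove(v)
            continue
        # Every remaining vertex has in- and out-degree >= 1, so a cycle exists.
        best = max(succ, key=lambda v: pred_count[v] * len(succ[v]))
        chosen.append(best)
        remove(best)
    return chosen
```

In the partial-scan setting, the vertices would stand for flip-flops and the returned set for the flip-flops to scan; that mapping is the report's, while the heuristic above is only a stand-in.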

140 citations

Book Chapter•10.1016/B978-0-444-88712-2.50012-3•

Computer support for machine-independent parallel programming in Fortran D

Seema Hiranandani¹, Ken Kennedy¹, Chau-Wen Tseng¹•Institutions (1)

Rice University¹

3 Jan 1992

TL;DR: The design of a prototype Fortran D compiler for the iPSC/860, a MIMD distributed-memory machine, is presented; issues addressed include data decomposition analysis, guard introduction, communications generation and optimization, program transformations, and storage assignment.

Abstract: Because of the complexity and variety of parallel architectures, an efficient machine-independent parallel programming model is needed to make parallel computing truly usable for scientific programmers. We believe that Fortran D, a version of Fortran enhanced with data decomposition specifications, can provide such a programming model. This paper presents the design of a prototype Fortran D compiler for the iPSC/860, a MIMD distributed-memory machine. Issues addressed include data decomposition analysis, guard introduction, communications generation and optimization, program transformations, and storage assignment. A test suite of scientific programs will be used to evaluate the effectiveness of both the compiler technology and programming model for the Fortran D compiler.

115 citations

Book Chapter•10.1016/B978-0-444-88712-2.50007-X•

Vienna Fortran—a Fortran language extension for distributed memory multiprocessors

Barbara Chapman¹, Piyush Mehrotra², Hans P. Zima¹•Institutions (2)

University of Vienna¹, Langley Research Center²

3 Jan 1992

TL;DR: This paper presents the basic features of Vienna Fortran along with a set of examples illustrating the use of these features and presents the advantages of a shared memory programming paradigm while explicitly controlling the placement of data.

Abstract: Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.

79 citations

Journal Article•10.1016/0167-8191(92)90048-C•

Distributed algorithms for the quickest path problem

Yung-Chen Hung¹, Gen-Huey Chen¹•Institutions (1)

National Taiwan University¹

1 Jul 1992

TL;DR: Distributed algorithms are developed for the quickest path problem in an asynchronous communication network to find paths in N to transmit a given amount of data such that the transmission time is minimized.

Abstract: Let N = (V, A, C, L) be a network with node set V, arc set A, positive arc capacity function C, and nonnegative arc lead time function L. The quickest path problem is to find paths in N to transmit a given amount of data such that the transmission time is minimized. In this paper, distributed algorithms are developed for the quickest path problem in an asynchronous communication network. For the one-source quickest path problem, we present three algorithms that require O(rn2) messages and O(2) time, O(n) messages and O(rn) time, and O1+elog w) messages and O(rn1+elog w) time for any e, 0
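
The abstract does not spell out the transmission-time formula. The sketch below assumes the usual definition, in which sending sigma units along a path takes the sum of the arc lead times plus sigma divided by the smallest arc capacity on the path. It is a sequential illustration, not one of the distributed algorithms of the paper; the function name and the arc encoding are hypothetical. For each distinct capacity value c it runs a shortest-lead-time search restricted to arcs of capacity at least c and keeps the best total.

```python
import heapq


def quickest_path_time(arcs, source, target, sigma):
    """arcs: list of (u, v, capacity, lead_time) tuples for directed arcs.

    Returns the minimum time to send `sigma` units from source to target,
    or infinity if target is unreachable.
    Assumed model: time(P) = sum of lead times on P + sigma / min capacity on P.
    """
    best = float("inf")
    for c in sorted({cap for _, _, cap, _ in arcs}):
        # Keep only arcs whose capacity is at least the threshold c.
        adj = {}
        for u, v, cap, lead in arcs:
            if cap >= c:
                adj.setdefault(u, []).append((v, lead))
        # Dijkstra on lead times (valid because lead times are nonnegative).
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue
            for v, lead in adj.get(u, []):
                nd = d + lead
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        if target in dist:
            best = min(best, dist[target] + sigma / c)
    return best
```

The example shows why the quickest path depends on the amount of data: a short, narrow arc wins for small transfers, while a longer, wider route wins for large ones.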

76 citations

Journal Article•10.1016/0167-8191(92)90011-U•

Reduction to condensed form for the eigenvalue problem on distributed memory architectures

Jack Dongarra¹,², Robert A. van de Geijn³•Institutions (3)

University of Tennessee¹, Oak Ridge National Laboratory², University of Texas at Austin³

1 Jan 1992

TL;DR: This paper describes a parallel implementation for the reduction of general and symmetric matrices to Hessenberg and tridiagonal form, respectively, based on LAPACK sequential codes and using a panel-wrapped mapping of matrices to nodes.

Abstract: In this paper, we describe a parallel implementation for the reduction of general and symmetric matrices to Hessenberg and tridiagonal form, respectively. The methods are based on LAPACK sequential codes and use a panel-wrapped mapping of matrices to nodes. Results from experiments on the Intel Touchstone Delta are given.

73 citations

Journal Article•10.1016/0167-8191(92)90066-G•

Cooley-Tukey FFT on the Connection Machine

S. Lennart Johnsson, Robert L Krawitz

1 Nov 1992

TL;DR: An implementation of the Cooley-Tukey complex-to-complex FFT on the Connection Machine is described, which is designed to make effective use of the communications bandwidth of the architecture, its memory bandwidth, and storage with precomputed twiddle factors.

Abstract: We describe an implementation of the Cooley-Tukey complex-to-complex FFT on the Connection Machine. The implementation is designed to make effective use of the communications bandwidth of the architecture, its memory bandwidth, and storage with precomputed twiddle factors. The peak data motion rate that is achieved for the interprocessor communication stages is in excess of 7 Gbytes/s for a Connection Machine system CM-200 with 2048 floating-point processors. The peak rate of FFT computations local to a processor is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision. The same FFT routine is used to perform both one- and multi-dimensional FFT without any explicit data rearrangement. The peak performance for a one-dimensional FFT on data distributed over all processors is 5.4 Gflops/s in 32-bit precision and 3.2 Gflops/s in 64-bit precision. The peak performance for square, two-dimensional transforms is 3.1 Gflops/s in 32-bit precision, and for cubic, three-dimensional transforms, the peak is 2.0 Gflops/s in 64-bit precision. Certain oblong shapes yield better performance. The number of twiddle factors stored in each processor is P/2N + log₂ N for an FFT on P complex points uniformly distributed among N processors. To achieve this level of storage efficiency we show that a decimation-in-time FFT is required for normal order input, and a decimation-in-frequency FFT is required for bit-reversed input order.
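
The twiddle-factor storage count quoted in the abstract can be evaluated directly. The short sketch below assumes that "P/2N" means P divided by 2N; the function name and the sample problem size are illustrative, not taken from the paper.

```python
import math


def twiddles_per_processor(points, processors):
    """Evaluate P/(2N) + log2(N), reading the abstract's 'P/2N' as P/(2N)."""
    return points / (2 * processors) + math.log2(processors)


# Hypothetical example: 2**20 complex points spread over 2048 processors.
per_processor = twiddles_per_processor(2**20, 2048)  # 256 + 11
```

For comparison, a radix-2 FFT on P points uses P/2 distinct twiddle factors in total (524288 for this example), so under this reading each processor holds only a small slice plus a logarithmic term.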

62 citations

Journal Article•10.1016/0167-8191(92)90069-J•

Global multidimensional optimization on parallel computer

Roman G. Strongin, Yaroslav D. Sergeyev

1 Nov 1992

TL;DR: A parallel algorithm is proposed for solving multiextremal multidimensional global optimization problems by applying Peano-type space-filling curves, and conditions that guarantee considerable speedup with respect to the sequential version of the algorithm are established.

Abstract: A parallel algorithm for solving multiextremal multidimensional global optimization problems is proposed. The algorithm is based on reducing multidimensional problems to the one-dimensional ones by applying Peano-type space-filling curves. A new parallel scheme to construct such curves is presented. For reduced optimization problems a parallel global optimization method is constructed. Sufficient conditions of global convergence are investigated. Conditions, which guarantee considerable speedup with respect to the sequential version of the algorithm, are established. Numerical experiments executed on ALLIANT FX/80 are also presented.

58 citations

Journal Article•10.1016/0167-8191(92)90141-S•

1991 International conference on supercomputing

H. C. Burg¹, J. Helin²•Institutions (2)

Forschungszentrum Jülich¹, Tampere University of Technology²

1 Apr 1992

51 citations

Book Chapter•10.1016/B978-0-444-88712-2.50014-7•

Distributed memory compiler methods for irregular problems—data copy reuse and runtime partitioning

Raja Das¹, Ravi Ponnusamy², Joel H. Saltz¹, Dimitri J. Mavriplis¹•Institutions (2)

Langley Research Center¹, Syracuse University²

3 Jan 1992

TL;DR: This paper outlines two methods which it is believed will play an important role in any distributed memory compiler able to handle sparse and unstructured problems and describes a viable mechanism for tracking and reusing copies of off-processor data.

Abstract: Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods.

49 citations

Journal Article•10.1016/0167-8191(92)90039-A•

Cray Y-MP C90: system features and early benchmark results

Wilfried Oed¹•Institutions (1)

Cray¹

1 Aug 1992

TL;DR: Results of various benchmarks and a description of new architectural features are presented and the first completed Cray Y-MP C90 supercomputer was delivered to the customer in February 1992.

Abstract: On November 19, 1991, at Albuquerque's Supercomputing '91, Cray Research introduced its new top of the line: the Cray Y-MP C90 supercomputer. With 16 CPUs, 2 Gbytes (256 Mwords) of central memory, and a new dual-vector pipeline architecture, the system offers peak performance of 16 Gflops, and unequalled sustained ‘real world’ performance. The first completed system was on-line at Supercomputing '91, and delivered to the customer in February 1992. In this paper, results of various benchmarks and a description of new architectural features are presented.

43 citations

Proceedings Article•

Dependability modeling of a heterogeneous VAX-cluster system using stochastic reward nets

Jogesh K. Muppala, Archana S. Sathaye, Richard C. Howe, Kishor S. Trivedi

2 Jan 1992

Journal Article•10.1016/0167-8191(92)90030-B•

A random number generator for parallel computers

Srinivas Aluru¹, G. M. Prabhu¹, John L. Gustafson¹•Institutions (1)

Iowa State University¹

1 Aug 1992

TL;DR: An efficient parallelization of the Generalized Feedback Shift Register (GFSR) algorithm for generating pseudorandom numbers is presented and works on any parallel computer where the number of processors is a power of two and requires the same amount of memory per processor as required by the sequential GFSR algorithm.

Abstract: Running huge simulational computations on a system of parallel processors requires the generation of uniform random sequences on each processor. Various techniques useful for the generation of parallel random sequences are analyzed for their suitability to parallel architectures. An efficient parallelization of the Generalized Feedback Shift Register (GFSR) algorithm for generating pseudorandom numbers is presented. The algorithm works on any parallel computer where the number of processors is a power of two and requires the same amount of memory per processor as required by the sequential GFSR algorithm.
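
For orientation, the sequential GFSR recurrence that the paper parallelizes can be sketched as follows. The generator keeps the last p words and produces each new word as the XOR of the words p and q positions back. The parallelization scheme itself is not reproduced here, and the seed words and lag values in the usage lines are placeholders rather than recommended parameters.

```python
from collections import deque
from itertools import islice


def gfsr(seed_words, q):
    """Yield x[n] = x[n-p] XOR x[n-q], where p = len(seed_words) and 0 < q < p."""
    p = len(seed_words)
    if not 0 < q < p:
        raise ValueError("q must satisfy 0 < q < len(seed_words)")
    state = deque(seed_words, maxlen=p)  # holds x[n-p] ... x[n-1]
    while True:
        new = state[0] ^ state[p - q]    # x[n-p] XOR x[n-q]
        state.append(new)                # appending drops the oldest word
        yield new


# Illustrative lags and seed only; real use needs carefully chosen values.
seed = [1, 2, 3, 4, 5, 6, 7]
first_words = list(islice(gfsr(seed, 3), 20))
```

The memory requirement is the p stored words, which is the per-processor figure the abstract says the parallel version preserves.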

Proceedings Article•

Sigma II: A Tool Kit for Building Parallelizing Compilers and Performance Analysis Systems

Dennis Gannon, Jenq Kuen Lee, Bruce Shei, Sekhar R. Sarukkai, Srinivas Narayana, Neelakantan Sundaresan, Daya Atapattu, François Bodin

6 Apr 1992

Journal Article•10.1016/0167-8191(92)90059-G•

Divide and conquer algorithms for the bandsymmetric eigenvalue problem

Peter Arbenz

1 Oct 1992

TL;DR: The new algorithms are compared to the traditional solution path offered by EISPACK: tridiagonalization of the band matrix followed by the tridiagonal QR algorithm.

Abstract: Divide and conquer algorithms are formulated for the solution of the eigenvalue problem for symmetric band matrices. The new algorithms are compared to the traditional solution path offered by EISPACK: tridiagonalization of the band matrix followed by the tridiagonal QR algorithm.

Journal Article•10.1016/0167-8191(92)90004-Q•

Broadcasting in wraparound meshes with parallel monodirectional links

Jean-Claude Bermond¹, Philippe Michallon, Denis Trystram•Institutions (1)

Centre national de la recherche scientifique¹

1 Jun 1992

TL;DR: An asymptotically optimal broadcasting algorithm for wraparound meshes is given that improves on the preceding results by constructing two edge-disjoint spanning trees of minimum depth rooted at a given node.

Abstract: In this paper we give an algorithm to broadcast a message in a wraparound mesh distributed-memory parallel architecture with parallel monodirectional links. This algorithm uses a general strategy based on the diffusion of the message in edge-disjoint spanning trees. We first present in this setting the results of Saad and Schultz and the improvements obtained by Simmen. We then give an asymptotically optimal broadcasting algorithm improving the preceding results. It uses in the wraparound mesh the constructions of two edge-disjoint spanning trees rooted at a given node and of minimum depth.

Journal Article•10.1016/0167-8191(92)90091-K•

Parallel molecular dynamics of biomolecules

Hellfried Schreiber¹, Othmar Steinhauser¹, Peter Schuster¹•Institutions (1)

University of Vienna¹

1 May 1992

TL;DR: Communication trees are found to be a highly efficient means of sending, receiving and gathering the computed data, especially for large processor numbers.

Abstract: The basic principles of a typical sequential Molecular Dynamics (MD) program suitable for the study of solvated biomolecules are described, the inherent parallelism of MD is analysed and strategies for parallelisation are developed. Because computation and communication are treated separately, a high level of portability is achieved and both tasks can be optimized independently. Communication trees are found to be a highly efficient means of sending, receiving and gathering the computed data, especially for large processor numbers. A current implementation on a transputer system is presented. Due to the tight memory budget, slight modifications are necessary. Nevertheless, we get excellent performance with an average degree of parallelization of 82%.

Proceedings Article•

Data Visualization and Performance Analysis in the Prism Programming Environment

Steve Sistare, Donald C. Allen, Rich Bowker, Karen C. Jourdenais, Josh Simons, Rich Title

6 Apr 1992

Journal Article•10.1016/0167-8191(92)90076-J•

Improved universal k-selection in hypercubes

Hong Shen¹•Institutions (1)

Åbo Akademi University¹

1 Feb 1992

TL;DR: An improved algorithm for universal k-selection in hypercubes shows a maximum speedup of O(log k) over the known result for the same problem in the case kp = O(n).

Abstract: This paper presents an improved algorithm for universal k-selection in hypercubes. The algorithm has a worst-case time complexity of O(n/p log p log (kp)/n) for selecting k smallest numbers from n given numbers in a hypercube of p processors (p⩽n). This result shows a maximum speedup of O(log k) over the known result for the same problem in the case kp = O(n).

Journal Article•10.1016/0167-8191(92)90047-B•

An improved parallel algorithm for 0/1 knapsack problem

Gen-Huey Chen¹, Jin Hwang Jang¹•Institutions (1)

National Taiwan University¹

1 Jul 1992

TL;DR: Experiments show that the new parallel algorithm proposed for solving the 0/1 knapsack problem has a better performance than Chen et al.'s algorithm.

Abstract: A parallel algorithm based on a technique called delayed dominance is proposed for solving the 0/1 knapsack problem. This parallel algorithm is a modification of Chen, Chern and Jang's algorithm. Experiments show that the new algorithm has a better performance than Chen et al.'s algorithm.
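
The abstract does not define "delayed dominance", so the sketch below shows only the plain dominance rule that such methods build on, in a sequential setting. A partial solution (weight, profit) is discarded when another one is no heavier and at least as profitable. The function name and the state encoding are assumptions, and profits are assumed to be nonnegative.

```python
def knapsack_with_dominance(items, capacity):
    """items: list of (weight, profit) pairs with nonnegative profits.

    Returns the best achievable total profit within `capacity`.
    Sequential illustration of dominance pruning, not the paper's algorithm.
    """
    states = [(0, 0)]  # undominated (weight, profit) pairs
    for weight, profit in items:
        extended = [(w + weight, p + profit)
                    for w, p in states if w + weight <= capacity]
        # Sort by weight, breaking ties by higher profit first.
        merged = sorted(states + extended, key=lambda s: (s[0], -s[1]))
        states = []
        best_profit = -1
        for w, p in merged:
            if p > best_profit:  # no lighter-or-equal state is as profitable
                states.append((w, p))
                best_profit = p
    return max(p for _, p in states)
```

After each item the surviving list has strictly increasing weight and profit, which keeps it short when many partial solutions are dominated.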

Journal Article•10.1016/0167-8191(92)90061-B•

Parallel algorithms for finding the most vital edge with respect to minimum spanning tree

Lih-Hsing Hsu¹, Peng-Fei Wang², Chu-Tao Wu²•Institutions (2)

National Chiao Tung University¹, National Tsing Hua University²

1 Oct 1992

TL;DR: This paper presents several cost-optimal parallel algorithms, under different computation models, to find the most vital edge in a weighted graph.

Abstract: Given a weighted graph G, the weight of a spanning tree T, denoted by w(T), is defined as the total weight of all edges in T. A spanning tree T in G is called a minimum spanning tree if w(T)⩽w(T′) for all spanning trees T′ in G. Let w(G) denote the weight of the minimum spanning tree of G if G is connected; otherwise, w(G) = ∞. An edge e is called a most vital edge in G if w(G−e) ⩾ w(G−e′) for every edge e′ of G where G−e′ denotes the partial graph obtained by removing e′ from G. In this paper, we present several cost-optimal parallel algorithms, under different computation models, to find the most vital edge in a weighted graph.
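
The definition above can be checked with a brute-force sequential sketch; it is not one of the cost-optimal parallel algorithms of the paper. It relies on the observation that removing an edge outside the minimum spanning tree leaves w(G) unchanged, so only tree edges need to be tried. Function names and the edge-list encoding are assumptions, and the input graph is assumed to be connected.

```python
def mst_weight(n, edges, skip=None):
    """Kruskal's algorithm on vertices 0..n-1, ignoring edge index `skip`.

    edges: list of (u, v, weight). Returns (weight, tree_edge_indices),
    or (infinity, []) when the remaining graph is disconnected.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, used = 0, []
    for i in sorted(range(len(edges)), key=lambda i: edges[i][2]):
        if i == skip:
            continue
        u, v, w = edges[i]
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            used.append(i)
    if len(used) != n - 1:
        return float("inf"), []
    return total, used


def most_vital_edge(n, edges):
    """Index of an edge e maximizing w(G - e); assumes G is connected."""
    _, tree = mst_weight(n, edges)
    return max(tree, key=lambda i: mst_weight(n, edges, skip=i)[0])
```

A bridge is always most vital under this definition, because removing it disconnects the graph and makes w(G−e) infinite.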

Journal Article•10.1016/0167-8191(92)90111-J•

A dedicated massively parallel architecture for the Boltzmann machine

A. De Gloria¹, Paolo Faraboschi¹, S. Ridella¹•Institutions (1)

University of Genoa¹

1 Jan 1992

TL;DR: This paper shows a massively parallel architecture specifically designed to support the Boltzmann machine neural network, built around simplicity and reliability together with a low implementation cost.

Abstract: A key task for neural network research is the development of neurocomputers able to speed up the learning algorithms to allow their application and test in real cases. This paper shows a massively parallel architecture specifically designed to support the Boltzmann machine (BM) neural network. The heart of this architecture is its simplicity and reliability together with a low implementation cost. Despite the impressive speedup obtained by accelerating the standard BM algorithm, the architecture does not use particular techniques to expose parallelism in the simulated annealing task, such as the change of state of multiple neurons. Features of the architecture include: (1) speed: the architecture allows a speedup of N (N is the number of neurons constituting the BM) with respect to standard implementation on sequential machines; (2) low cost: the architecture requires the same amount of memory as a sequential application, the only additional cost being the inclusion of an adder for each neuron; (3) WSI capabilities: the processor interconnection is limited to a single bus for any number of implemented processors, the architecture is scalable in terms of number of processors without any software or hardware modification, and the simplicity of the processors makes it possible to implement built-in self-test techniques; (4) high weight dynamics: the architecture performs computation by using 32-bit integer values, therefore offering a wide range of variability of weights.

Journal Article•10.1016/0167-8191(92)90007-T•

The scheduling of sparse matrix-vector multiplication on a massively parallel DAP computer

Johannes Andersen¹, Gautam Mitra¹, D. Parkinson¹•Institutions (1)

Brunel University London¹

1 Jun 1992

TL;DR: An efficient data structure is presented which supports general unstructured sparse matrix-vector multiplications on a Distributed Array of Processors (DAP) and organises the operations in batches of massively parallel steps by a heuristic scheduling procedure performed on the host computer.

Abstract: An efficient data structure is presented which supports general unstructured sparse matrix-vector multiplications on a Distributed Array of Processors (DAP). This approach seeks to reduce the inter-processor data movements and organises the operations in batches of massively parallel steps by a heuristic scheduling procedure performed on the host computer. The resulting data structure is of particular relevance to iterative schemes for solving linear systems. Performance results for matrices taken from well-known Linear Programming (LP) test problems are presented and analysed.

Journal Article•10.1016/S0167-8191(09)80001-X•

Parallel Computing 91

Hans-Christian Hege, Renate Knecht¹•Institutions (1)

Forschungszentrum Jülich¹

1 Apr 1992

Journal Article•10.1016/0167-8191(92)90018-3•

Comparison of communications on the Intel iPSC/860 and Touchstone Delta

Roger W. Hockney¹, Edward A. Carmona²•Institutions (2)

University of Southampton¹, Phillips Laboratory²

1 Sep 1992

TL;DR: The Touchstone Delta is found to have an asymptotic bandwidth of 6.7 MB/s which is 2.4 times faster than the iPSC/860, but only about a quarter of the advertised rate.

Abstract: The Touchstone Delta is found to have an asymptotic bandwidth of 6.7 MB/s which is 2.4 times faster than the iPSC/860, but only about a quarter of the advertised rate of 25 MB/s. The Delta's measured startup time of 61 μs is very little less than the iPSC value of 76 μs, however unlike the iPSC, it is independent (within the error of measurement) of the separation between nodes.
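
To see what these figures imply for a given message size, the sketch below plugs them into a simple linear cost model: startup time plus message length divided by asymptotic bandwidth. The linear model and the function names are assumptions made for illustration; the abstract reports only the startup times and bandwidths. The iPSC/860 bandwidth is derived from the stated factor of 2.4, and 1 MB/s is taken as one byte per microsecond.

```python
def message_time_us(n_bytes, startup_us, bandwidth_mb_per_s):
    """Estimated transfer time in microseconds under a linear cost model.

    Assumes 1 MB/s = 10**6 bytes/s, i.e. one byte per microsecond.
    """
    return startup_us + n_bytes / bandwidth_mb_per_s


def delta_time_us(n_bytes):
    # Touchstone Delta: 61 us startup, 6.7 MB/s asymptotic bandwidth.
    return message_time_us(n_bytes, 61.0, 6.7)


def ipsc_time_us(n_bytes):
    # iPSC/860: 76 us startup; bandwidth inferred as 6.7 / 2.4 MB/s.
    return message_time_us(n_bytes, 76.0, 6.7 / 2.4)
```

Under this model, short messages cost nearly the same on both machines because the startup term dominates, while the bandwidth advantage of the Delta shows only for long messages.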

Proceedings Article•

The Design of the General Parallel Monitoring System

M. van Riek, Bernard Tourancheau

6 Apr 1992

Journal Article•10.1016/0167-8191(92)90032-3•

Comparative analysis of methods for broadcast elimination

Marjan Gusev, Jurij F. Tasic¹•Institutions (1)

University of Ljubljana¹

1 Aug 1992

TL;DR: A comparative analysis is completed on the methods for broadcast elimination by reindexing multistep algorithms at the algorithm representation level and decomposing the algorithm and pipelining and routing dataflow for each step.

Abstract: A comparative analysis is completed on the methods for broadcast elimination. Some authors use approaches to determine the best affine schedule. Another approach is by reindexing multistep algorithms at the algorithm representation level, decomposing the algorithm and pipelining and routing dataflow for each step. Transformations at the algorithm model level and some heuristic approaches are also considered.

Journal Article•10.1016/0167-8191(92)90060-K•

A parallel algorithm for determining all eigenvalues of large real symmetric tridiagonal matrices

Achim Basermann, Peter Weidner

1 Oct 1992

TL;DR: Both the sequential and parallel execution time of the algorithm ALLEV (ALL Eigen Values) presented in this paper are considerably shorter than the execution times of the vectorized EISPACK-routine TQL1 which uses the QL method.

Abstract: A method for determining all eigenvalues of large real symmetric tridiagonal matrices on multiprocessor systems with vector facilities is presented. For finding the eigenvalues of a tridiagonal matrix, the method of the Sturm sequence is a standard method. The method uses bisection first to isolate all eigenvalues and then to extract the eigenvalues to a predefined accuracy. For extracting the eigenvalues, bisection is accelerated by a superlinearly convergent zero finder, the Pegasus method. The evaluation of the Sturm sequence is the central component for both isolation and extraction. Some new ideas are presented, such as a method for weighting the values of the characteristic polynomial to avoid under- or overflow, a method for combining the Pegasus method with preceding bisection steps, and a vectorization and parallelization strategy over intervals. The method was implemented and the results were measured on a SUPRENUM multiprocessor system with 16 processors and on a CRAY Y-MP8/832 with 8 processors. On the latter machine, both the sequential and parallel execution times of our algorithm ALLEV (ALL Eigen Values) presented in this paper are considerably shorter than the execution times of the vectorized EISPACK routine TQL1, which uses the QL method.
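
The Sturm-sequence idea can be sketched sequentially as below: a count of the eigenvalues lying under a trial value, combined with plain bisection. This is not ALLEV. The Pegasus acceleration, the weighting against under- and overflow, and the vectorization over intervals are all omitted, and the function names, the zero-pivot guard, and the tolerance are assumptions.

```python
def count_below(diag, off, x):
    """Number of eigenvalues < x of the symmetric tridiagonal matrix with
    main diagonal `diag` and off-diagonal `off` (len(off) == len(diag) - 1)."""
    count, q = 0, 1.0
    for i, a in enumerate(diag):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        q = a - x - b2 / q
        if q == 0.0:
            q = 1e-30  # crude guard against division by zero (assumption)
        if q < 0:
            count += 1
    return count


def kth_eigenvalue(diag, off, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0 is the smallest), found by bisection."""
    n = len(diag)
    # Gershgorin discs give an interval containing every eigenvalue.
    radius = [(abs(off[i - 1]) if i > 0 else 0.0)
              + (abs(off[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(a - r for a, r in zip(diag, radius))
    hi = max(a + r for a, r in zip(diag, radius))
    for _ in range(200):  # iteration cap in case tol is unreachable
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if count_below(diag, off, mid) > k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Because each eigenvalue is located by an independent sequence of counts, different intervals can be processed at the same time, which is the property a parallel strategy over intervals would exploit.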

Journal Article•10.1016/0167-8191(92)90010-5•

LU decomposition optimized for a parallel computer with a hierarchical distributed memory

Scott M. Stark¹, Antony N. Beris¹•Institutions (1)

University of Delaware¹

1 Sep 1992

TL;DR: This study shows that efficient memory use, both in terms of shared memory and cache utilization, is the key to optimal performance when dealing with memory hierarchies such as in the TC2000.

Abstract: The implementation of an efficient hybrid parallel block LU decomposition procedure for dense systems on a BBN TC2000 parallel computer is discussed. The TC2000 is a MIMD architecture with distributed memory. The key characteristic of this architecture is a hierarchical memory structure (register, cache, local, shared). This study shows that efficient memory use, both in terms of shared memory and cache utilization, is the key to optimal performance when dealing with memory hierarchies such as in the TC2000. Although the Mflops rate per processor decreases as the number of processors increases for a system of equations of fixed size, almost constant performance has been obtained when the number of equations is increased together with the number of processors used.

Book Chapter•10.1016/B978-0-444-88712-2.50009-3•

Applications of the “phase abstractions” for portable and scalable parallel programming

Lawrence H. Snyder¹•Institutions (1)

University of Washington¹

3 Jan 1992

TL;DR: Recently developed parallel programming abstractions are illustrated in a complete example by programming the Jacobi iterative approximation computation to illustrate the way in which the new concepts can assist scaling and portability.

Abstract: Recently developed parallel programming abstractions are illustrated in a complete example by programming the Jacobi iterative approximation computation. The program, written in pseudocode, is designed to illustrate the way in which the new concepts can assist scaling and portability. The specific abstractions exhibited, collectively called the “phase abstractions,” are the data, code and port ensembles and the XYZ programming levels.

Journal Article•10.1016/0167-8191(92)90108-J•

A time-parallel multigrid-extrapolation method for parabolic partial differential equations

Graham Horton¹, Ralf Knirsch¹•Institutions (1)

University of Erlangen-Nuremberg¹

1 Jan 1992

TL;DR: The efficiencies obtained by an implementation on a message-passing multiprocessor demonstrate the suitability of the time-parallel extrapolation method for this type of equation.

Abstract: We consider the problem of solving unsteady partial differential equations on an MIMD machine. Conventional parallel methods use a data partitioning type approach in which the solution grid at each time-step is divided amongst the available processors. The sequential nature of the time integration is, however, retained. The algorithm presented in this paper makes use of a time-parallel approach, whereby several processors may be employed to solve at several time-steps simultaneously. The time-parallel method enables the inherent parallelism of the extrapolation scheme to be efficiently exploited, allowing a significant increase both in accuracy and in the degree of parallelism. The efficiencies obtained by an implementation on a message-passing multiprocessor demonstrate the suitability of the time-parallel extrapolation method for this type of equation.
