Showing papers presented at "Parallel Computing in 1996"
Journal Article•10.1016/0167-8191(96)00024-5•
A high-performance, portable implementation of the MPI message passing interface standard

William Gropp1, Ewing Lusk1, Nathan E. Doss2, Anthony Skjellum2•
Argonne National Laboratory1, Mississippi State University2
1 Sep 1996
TL;DR: MPI (Message Passing Interface) is a specification for a standard message-passing library, defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists; this paper describes MPICH, an implementation designed to combine portability with high performance.
Abstract: MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.

2,420 citations

Journal Article•10.1016/0167-8191(95)00004-6•
Computer aided parallelisation tools (CAPTools)—conceptual overview and performance on the parallelisation of structured mesh codes

C. S. Ierotheou1, Stephen Johnson1, Mark Cross1, P. F. Leggett1•
University of Greenwich1
1 Feb 1996
TL;DR: The authors discuss the parallelisations of a number of case study codes showing how the various component tools may be used to develop a highly efficient parallel implementation in a few hours or days.
Abstract: Computer Aided Parallelisation Tools (CAPTools) is a toolkit designed to automate as much as possible of the process of parallelising scalar FORTRAN 77 codes. The toolkit combines a very powerful dependence analysis together with user supplied knowledge to build an extremely comprehensive and accurate dependence graph. The initial version has been targeted at structured mesh computational mechanics codes (e.g. heat transfer, Computational Fluid Dynamics (CFD)) and the associated simple mesh decomposition paradigm is utilised in the automatic code partition, execution control mask generation and communication call insertion. In this, the first of a series of papers [1–3], the authors discuss the parallelisations of a number of case study codes showing how the various component tools may be used to develop a highly efficient parallel implementation in a few hours or days. The details of the parallelisation of the TEAMKE1 CFD code are described together with the results of three other numerical codes. The resulting parallel implementations are then tested on workstation clusters using PVM and an i860-based parallel system showing efficiencies well over 80%.

86 citations

Journal Article•10.1016/0167-8191(96)00022-1•
Parallel iterative S-step methods for unsymmetric linear systems

Anthony T. Chronopoulos1, Charles D. Swanson2•
Wayne State University1, Cray2
1 Aug 1996
TL;DR: GCR (Generalized Conjugate Residual) and Omin (Orthomin) are iterative methods for approximating the solution of unsymmetric linear systems and the new S-step Omin, implemented on the CRAY C90, attained an execution rate greater than 10 Gflops.
Abstract: GCR (Generalized Conjugate Residual) and Omin (Orthomin) are iterative methods for approximating the solution of unsymmetric linear systems. The S-step generalization of these methods has been derived and studied in past work. The S-step methods exhibit improved convergence properties. Also, their data locality and parallel properties are enhanced by forming blocks of s search direction vectors. However, s is limited (to s ≤ 5) by numerical stability considerations. The following new contributions are described in this article. The Modified Gram-Schmidt method is used to A^T A-orthogonalize the s direction vectors within each S-step block. It is empirically shown that use of values of s, up to s = 16, preserves the numerical stability of the new iterative methods. Finally, the new S-step Omin, implemented on the CRAY C90, attained an execution rate greater than 10 Gflops (billion floating point operations per second).

77 citations
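
The A^T A-orthogonalization step named in the abstract above can be sketched as follows. This is only a minimal illustration of Modified Gram-Schmidt under the inner product (A u) · (A v), not the authors' CRAY C90 implementation; the function names (`matvec`, `dot`, `mgs_ata_orthogonalize`) and the dense list-of-lists matrix layout are assumptions made for the example.

```python
def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def mgs_ata_orthogonalize(A, P):
    """Modified Gram-Schmidt in the A^T A inner product (illustrative sketch).

    Returns (Q, AQ) where AQ[i] = A * Q[i], the vectors AQ[i] are mutually
    orthogonal, and each has unit Euclidean norm.
    """
    Q, AQ = [], []
    for p in P:
        q = list(p)
        aq = matvec(A, q)
        # Modified variant: each projection uses the already-updated vector.
        for qi, aqi in zip(Q, AQ):
            c = dot(aq, aqi)  # aqi has unit norm, so no division is needed
            q = [x - c * y for x, y in zip(q, qi)]
            aq = [x - c * y for x, y in zip(aq, aqi)]
        norm = dot(aq, aq) ** 0.5
        if norm == 0.0:
            raise ValueError("direction vectors are linearly dependent")
        Q.append([x / norm for x in q])
        AQ.append([x / norm for x in aq])
    return Q, AQ
```

Carrying A q along with q avoids a fresh matrix-vector product after every projection; whether the paper organizes the computation this way is not stated in the abstract.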

Journal Article•10.1016/0167-8191(95)00010-0•
Minimum-weight degree-constrained spanning tree problem: heuristics and implementation on an SIMD parallel machine

Bruce Boldon1, Narsingh Deo1, Nishit Kumar1•
University of Central Florida1
1 Mar 1996
TL;DR: Four heuristics for approximate solutions to the minimum-weight degree-constrained spanning tree problem are developed and implemented on a massively parallel SIMD machine, the MasPar MP-1, and shown to produce solutions close to the optimal in less than 10 seconds.
Abstract: The minimum spanning tree problem with an added constraint that no node in the spanning tree has degree greater than a specified integer, d, is known as the minimum-weight degree-constrained spanning tree (d-MST) problem. Such a constraint arises, for example, in VLSI routing trees, in backplane wiring, or in minimizing single-point failures for communication networks. The d-MST problem is NP-complete. Here, we develop four heuristics for approximate solutions to the problem and implement them on a massively parallel SIMD machine, the MasPar MP-1. An extensive empirical study shows that for random graphs on up to 5000 nodes (about 12.5 million edges), the heuristics produce solutions close to the optimal in less than 10 seconds. The heuristics were also tested on a number of TSP benchmark problems to compute spanning trees with a degree bound d = 3.

67 citations
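
To make the d-MST constraint in the abstract above concrete, here is a small sequential sketch of one plausible greedy heuristic: a Prim-style growth that never attaches a new node to a tree node already at degree d. The abstract does not say which four heuristics the paper uses, so this is an assumed stand-in rather than the authors' method; the function name and the dense weight-matrix input are likewise hypothetical.

```python
def degree_constrained_prim(weights, d):
    """Greedy Prim-like heuristic for a degree-bounded spanning tree.

    weights: symmetric n x n matrix of edge weights (complete graph assumed).
    d: maximum allowed degree of any node in the tree.
    Returns a list of (u, v, weight) tree edges; the result is feasible but
    not guaranteed to be optimal.
    """
    n = len(weights)
    in_tree = [False] * n
    degree = [0] * n
    in_tree[0] = True
    edges = []
    for _ in range(n - 1):
        best = None
        for u in range(n):
            # Only tree nodes with spare degree may accept a new neighbour.
            if not in_tree[u] or degree[u] >= d:
                continue
            for v in range(n):
                if in_tree[v]:
                    continue
                if best is None or weights[u][v] < best[0]:
                    best = (weights[u][v], u, v)
        if best is None:
            raise ValueError("no feasible edge under the degree bound")
        w, u, v = best
        in_tree[v] = True
        degree[u] += 1
        degree[v] += 1
        edges.append((u, v, w))
    return edges
```

On a graph where one hub node has cheap edges to everyone, lowering d forces the heuristic onto more expensive edges, which is the trade-off the degree bound introduces.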

Journal Article•10.1016/S0167-8191(96)00096-8•
Key Concepts for Parallel Out-Of-Core LU Factorization

Jack Dongarra1, Sven Hammarling, David W. Walker2•
University of Tennessee1, Oak Ridge National Laboratory2
1 Apr 1996
TL;DR: A left-looking variant of the LU factorization algorithm is shown to require less I/O to disk than the right-looking variant, and is used to develop a parallel, out-of-core implementation.
Abstract: This paper considers key ideas in the design of out-of-core dense LU factorization routines. A left-looking variant of the LU factorization algorithm is shown to require less I/O to disk than the right-looking variant, and is used to develop a parallel, out-of-core implementation. This implementation makes use of a small library of parallel I/O routines, together with ScaLAPACK and PBLAS routines. Results for runs on an Intel Paragon are presented and interpreted using a simple performance model.

60 citations

Journal Article•10.1016/0167-8191(95)00057-7•
A performance model for Krylov subspace methods on mesh-based parallel computers

E. de Sturler1•
École Polytechnique Fédérale de Lausanne1
1 Jan 1996
TL;DR: A performance model is developed for Krylov subspace methods implemented on distributed memory parallel computers whose underlying communication network is a two-dimensional mesh; the performance turns out to be characterized completely by the sequential runtime and Pmax.
Abstract: We develop a performance model for Krylov subspace methods implemented on distributed memory parallel computers for which the underlying communication network is a two-dimensional mesh. The model is based on the runtime of a single iteration or cycle of iterations (for methods like GMRES(m)), because the iteration count is problem dependent. Moreover, we intend to use the model only for parallel implementations that do not change the mathematical properties of the method (significantly). The main purpose of this model is a qualitative analysis of the performance; the model is not meant for very accurate predictions. We express the efficiency, speed-up, and runtime as functions of the number of processors scaled by the number of processors that gives the minimal runtime for the given problem size (Pmax). This provides a natural way to analyze the performance characteristics for the range of the numbers of processors that can be used effectively. The approach is particularly interesting because it turns out that the performance is characterized completely by the sequential runtime and Pmax. The efficiency as a function of the number of processors relative to Pmax is independent of the problem size and parameters describing the machine and solution method. Analogous relations can be obtained for the speed-up and runtime. Pmax itself, of course, depends on N and the other parameters, and a simple equation for Pmax is given. The performance model is also used to evaluate the improvements in the performance if we reduce the communication as described in [7,9,8]. Although the scope of the performance model is limited by assumptions on the load balance and the processor grid, there are several obvious generalizations. One important and straightforward generalization is to higher dimensional meshes. We will discuss such generalizations at the end of this article.

47 citations

Book Chapter•10.1007/3-540-62095-8_50•
Partitioning an Array onto a Mesh of Processors

Fredrik Manne1, Tor Sørevik1•
University of Bergen1
18 Aug 1996
TL;DR: This paper considers the problem of partitioning an array into a number of blocks such that the maximum amount of work in any block is as low as possible and presents new approximation algorithms for computing a well balanced generalized block distribution and an algorithm for computing an optimal semi-generalized block distribution.
Abstract: Achieving an even load balance with a low communication overhead is a fundamental task in parallel computing. In this paper we consider the problem of partitioning an array into a number of blocks such that the maximum amount of work in any block is as low as possible. We review different proposed schemes for this problem and the complexity of their communication pattern. We present new approximation algorithms for computing a well balanced generalized block distribution as well as an algorithm for computing an optimal semi-generalized block distribution. The various algorithms are tested and compared on a number of different matrices.

45 citations
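
The objective stated in the abstract above (make the maximum work in any block as low as possible) is easiest to see in one dimension. The sketch below solves only that simplified case, splitting a sequence of non-negative integer work items into at most p contiguous blocks by binary search on the bottleneck value; it is not the generalized or semi-generalized two-dimensional block distribution algorithm from the paper, and the function name is an assumption.

```python
def min_max_partition(work, p):
    """Smallest achievable maximum block sum when `work` (a non-empty list of
    non-negative integers) is cut into at most p contiguous blocks."""

    def fits(cap):
        # Greedily fill blocks left to right without exceeding cap.
        blocks, current = 1, 0
        for w in work:
            if current + w > cap:
                blocks += 1
                current = w
            else:
                current += w
        return blocks <= p

    lo, hi = max(work), sum(work)
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For example, `min_max_partition([7, 2, 5, 10, 8], 2)` evaluates the best split into two blocks, which is `[7, 2, 5]` and `[10, 8]` with a bottleneck of 18.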

Journal Article•10.1016/0167-8191(95)00059-3•
Finite termination of asynchronous iterative algorithms

Serap A. Savari1, Dimitri P. Bertsekas1•
Massachusetts Institute of Technology1
1 Jan 1996
TL;DR: This paper proposes several ways to modify asynchronous iterative algorithms, and/or supplement them with an interprocessor communication protocol, so that local termination conditions lead to global termination; some of the resulting procedures can be recast as termination detection schemes for arbitrary finite, distributed computations.
Abstract: We consider n-processor distributed systems where the ith processor executes asynchronously the iteration x_i = f_i(x). It is natural to terminate the iteration of the ith processor when some local condition, such as x_i − f_i(x) being "small", holds. However, local termination conditions of this type may not lead to global termination because of the asynchronous character of the algorithm. In this paper, we propose several approaches to modify the original algorithm and/or supplement it with an interprocessor communication protocol so that this difficulty does not arise. Some of the resulting procedures can be recast as termination detection schemes for arbitrary finite, distributed computations.

44 citations

Journal Article•10.1016/0167-8191(95)00058-5•
Communication-efficient matrix multiplication on hypercubes

Himanshu Gupta1, P. Sadayappan2•
Stanford University1, Ohio State University2
1 Jan 1996
TL;DR: An efficient dense matrix multiplication algorithm for distributed memory computers with a hypercube topology that performs better than all previously proposed algorithms for a wide range of matrix sizes and number of processors, especially for large matrices.
Abstract: In this paper we present an efficient dense matrix multiplication algorithm for distributed memory computers with a hypercube topology. The proposed algorithm performs better than all previously proposed algorithms for a wide range of matrix sizes and number of processors, especially for large matrices. We analyze the performance of the algorithms for two types of hypercube architectures, one in which each node can use (to send and receive) at most one communication link at a time and the other in which each node can use all communication links simultaneously.

41 citations

Journal Article•10.1016/S0167-8191(97)00014-8•
Early experiences in evaluating the parallel disk model with the ViC implementation

Thomas H. Cormen1, Melissa Hirschl1•
Dartmouth College1
1 Sep 1996
TL;DR: This paper analyzes timing results on a uniprocessor with several disks for two PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of the PDM.
Abstract: Although several algorithms have been developed for the Parallel Disk Model (PDM), few have been implemented. Consequently, little has been known about the accuracy of the PDM in measuring I/O time and total time to perform an out-of-core computation. This paper analyzes timing results on a uniprocessor with several disks for two PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of the PDM. The results indicate the following. First, good PDM algorithms are usually not I/O bound. Second, of the four PDM parameters, two (problem size and memory size) are good indicators of I/O time and running time, but the other two (block size and number of disks) are not. Third, because PDM algorithms tend not to be I/O bound, asynchronous I/O effectively hides I/O times. The software interface to the PDM is part of the ViC* run-time library. The interface is a set of wrappers that are designed to be both efficient and portable across several parallel file systems and target machines.

39 citations

Book Chapter•10.1007/3-540-62095-8_20•
A Hierarchical Approach for Performance Analysis of ScaLAPACK-Based Routines Using the Distributed Linear Algebra Machine

Krister Dackland1, Bo Kågström1•
Umeå University1
18 Aug 1996
TL;DR: A hierarchical approach for design of performance models for parallel algorithms in linear algebra based on a parallel machine model and the hierarchical structure of the ScaLAPACK library is presented.
Abstract: Performance models are important in the design and analysis of linear algebra software for scalable high performance computer systems. They can be used for estimation of the overhead in a parallel algorithm and measuring the impact of machine characteristics and block sizes on the execution time. We present a hierarchical approach for design of performance models for parallel algorithms in linear algebra based on a parallel machine model and the hierarchical structure of the ScaLAPACK library. This suggests three levels of performance models corresponding to existing ScaLAPACK routines. As a proof of concept, a performance model of the high level QR factorization routine pdgeqrf is presented. We also derive performance models of lower level ScaLAPACK building blocks such as pdgeqr2, pdlarft, pdlarfb, pdlarfg, pdlarf, pdnrm2, and pdscal, which are used in the high level model for pdgeqrf. Predicted performance results are compared to measurements on an Intel Paragon XP/S system. The accuracy of the top level model is over 90% for measured matrix and block sizes and different process grid configurations.
Book Chapter•10.1007/3-540-62095-8_17•
A Parallel Version of the Quasi-Minimal Residual Method, Based on Coupled Two-Term Recurrences

H. Martin Bücker1, Manfred Sauren1•
Forschungszentrum Jülich1
18 Aug 1996
TL;DR: A new version of QMR is proposed with the following properties: Firstly, the Lanczos process is based on coupled two-term recurrences; secondly, both sequences of Lanczos vectors are scalable; and finally, there is only a single global synchronization point per iteration.
Abstract: For the solution of linear systems of equations with unsymmetric coefficient matrix, Freund and Nachtigal (SIAM J. Sci. Comput. 15 (1994), 313–337) proposed a Krylov subspace method called Quasi-Minimal Residual method (QMR). The two main ingredients of QMR are the unsymmetric Lanczos algorithm and the quasi-minimal residual approach that minimizes a factor of the residual vector rather than the residual itself. The Lanczos algorithm spans a Krylov subspace by generating two sequences of biorthogonal vectors called Lanczos vectors. Due to the orthogonalization and scaling of the Lanczos vectors, algorithms that make use of the Lanczos process contain inner products leading to global communication and synchronization on parallel processors. For massively parallel computers, these effects cause delays preventing scalability of the implementation. Consequently, parallel algorithms should avoid global synchronization as far as possible. We propose a new version of QMR with the following properties: Firstly, the Lanczos process is based on coupled two-term recurrences; secondly, both sequences of Lanczos vectors are scalable; and finally, there is only a single global synchronization point per iteration. The efficiency of this algorithm is demonstrated by numerical experiments on a PARAGON system using up to 121 processors.
Journal Article•10.1016/0167-8191(96)00002-6•
Exploitation of symbolic information in interprocedural dependence analysis

Stephen Johnson1, Mark Cross1, Martin G. Everett1•
University of Greenwich1
1 Feb 1996
TL;DR: The limitations of conventional dependence analysis techniques are summarised, and a series of extensions is described that enables the production of a much more accurate dependence graph, reducing the total dependencies estimated by conventional analysis by up to 50%.
Abstract: The requirement for a very accurate dependence analysis to underpin software tools to aid the generation of efficient parallel implementations of scalar code is argued. The current status of dependence analysis is shown to be inadequate for the generation of efficient parallel code, causing too many conservative assumptions to be made. This paper summarises the limitations of conventional dependence analysis techniques, and then describes a series of extensions which enable the production of a much more accurate dependence graph. The extensions include analysis of symbolic variables, the development of a symbolic inequality disproof algorithm and its exploitation in a symbolic Banerjee inequality test; the use of inference engine proofs; the exploitation of exact dependence and dependence pre-domination attributes; interprocedural array analysis; conditional variable definition tracing; integer array tracing and division calculations. Analysis case studies on typical numerical code are shown to reduce the total dependencies estimated from conventional analysis by up to 50%. The techniques described in this paper have been embedded within a suite of tools, CAPTools, which combines analysis with user knowledge to produce efficient parallel implementations of numerical mesh based codes.
Journal Article•10.1016/0167-8191(96)00036-1•
Parallel solutions of compressible flows using overlapping and non-overlapping mesh partitioning strategies

Stéphane Lanteri1•
French Institute for Research in Computer Science and Automation1
1 Oct 1996
TL;DR: This paper presents parallel solutions of realistic three-dimensional flows obtained on the Intel Paragon, the Cray T3D and the IBM SP2 MPPs (Massively Parallel Processors) and compares in detail performance results obtained with parallel solution strategies based on overlapping and non-overlapping mesh partitions.
Abstract: Defining a good strategy for the parallelisation of an unstructured mesh based solver is a challenge, particularly when one aims at reaching a high level of performance while maintaining portability of the source code between scalar, vector and parallel machines. In this paper, we present parallel solutions of realistic three-dimensional flows obtained on the Intel Paragon, the Cray T3D and the IBM SP2 MPPs (Massively Parallel Processors). The solver under consideration is a representative subset of an existing industrial code, N3S-MUSCL, which implements a mixed finite element/finite volume formulation on unstructured tetrahedral meshes. The adopted parallelisation strategy combines mesh partitioning techniques and a message-passing programming model. We compare in detail performance results obtained with parallel solution strategies based on overlapping and non-overlapping mesh partitions.
Journal Article•10.1016/0167-8191(96)00034-8•
Early prediction of MPP performance: the SP2, T3D, and Paragon experiences

Zhiwei Xu1, Kai Hwang2•
Chinese Academy of Sciences1, University of Hong Kong2
1 Oct 1996
TL;DR: The main contribution of this work lies in providing a systematic procedure to estimate the computational work-load, to determine the application attributes, and to reveal the communication overhead in using these MPPs.
Abstract: The performance of Massively Parallel Processors (MPPs) is attributed to a large number of machine and program factors. Software development for MPP applications is often very costly. The high cost is partially caused by a lack of early prediction of MPP performance. The program development cycle may iterate many times before achieving the desired performance level. In this paper, we present an early prediction scheme we have developed at the University of Southern California for reducing the cost of application software development. Using workload analysis and overhead estimation, our scheme optimizes the design of parallel algorithms before entering the tedious coding, debugging, and testing cycle of the applications. The scheme is generally applied at user/programmer level, not tied to any particular machine platform or any specific software environment. We have tested the effectiveness of this early performance prediction scheme by running the MIT/STAP benchmark programs on a 400-node IBM SP2 system at the Maui High-Performance Computing Center (MHPCC), on a 400-node Intel Paragon system at the San Diego Supercomputing Center (SDSC), and on a 128-node Cray T3D at the Cray Research Eagan Center in Wisconsin. Our predictions prove to be rather accurate compared with the actual performance measured on these machines. We use the SP2 data to illustrate the early prediction scheme. The main contribution of this work lies in providing a systematic procedure to estimate the computational workload, to determine the application attributes, and to reveal the communication overhead in using these MPPs. These results can be applied to develop any MPP applications other than the STAP benchmarks by which this prediction scheme was developed.
Journal Article•10.1016/0167-8191(95)00070-4•
Interactive debugging and performance analysis of massively parallel applications

Roland Wismüller1, Michael Oberhuber1, Johann Krammer1, Olav Hansen•
Ludwig Maximilian University of Munich1
1 Mar 1996
TL;DR: A novel tool environment, consisting of a parallel debugger (DETOP), a performance analyzer (PATOP), and a common monitoring system for PowerPC-based parallel computers, is presented, which addresses the topics of scalability, usability for dynamic, multithreaded programming models, minimal intrusion, debugging and tuning methodology and comfortable user interfaces.
Abstract: In the field of high performance computing, massively parallel processing systems (MPPs) are becoming more and more important. A rising number of complex applications is parallelized for execution on these machines. Still, a significant portion of the time needed for parallelization is spent on debugging and performance tuning. A main reason for this is the absence of adequate tools supporting this phase of program development. In this article, we present a novel tool environment, consisting of a parallel debugger (DETOP), a performance analyzer (PATOP), and a common monitoring system for PowerPC-based parallel computers. The environment specifically addresses the topics of scalability, usability for dynamic, multithreaded programming models, minimal intrusion, debugging and tuning methodology and comfortable user interfaces. We derive requirements for tools monitoring the runtime behavior of parallel programs, present the concepts used to meet these requirements in our tool environment, and describe its implementation and its usage. DETOP is based on the event-action paradigm and supports both data parallel codes and programs based on functional decomposition. Special features are provided for applications that dynamically create new threads or consist of multiple executables. PATOP supports a systematic search for performance bottlenecks in massively parallel applications using the concept of attributed measurements and distributed evaluation. Both tools are based on a common, distributed on-line monitoring system providing the necessary runtime information.
Journal Article•10.1016/S0167-8191(96)00032-4•
Analyzing parallel program performance using normalized performance indices and trace transformation techniques

Jerry C. Yan1, Sekhar R. Sarukkai1•
Ames Research Center1
1 Nov 1996
TL;DR: How a performance tuning tool-set, AIMS, guides the user towards developing efficient and scalable production-level parallel programs by locating performance improvement opportunities and determining optimization benefits is described.
Abstract: In this paper we describe how a performance tuning tool-set, AIMS, guides the user towards developing efficient and scalable production-level parallel programs by locating performance improvement opportunities and determining optimization benefits. AIMS's Xisk helps identify potential optimizations by computing various pre-defined normalized performance indices from program traces. Inspection of these indices points to specific optimizations that may benefit program performance. After identifying and characterizing performance problems, AIMS's MK can provide quantitative estimates of performance benefits to help the user avoid arduous optimizations that may not lead to the expected performance improvements. MK also helps identify potential pitfalls or benefits of changing any of various system parameters. Based on MK's performance projection, an informed decision regarding the most beneficial program optimizations or upgrades in execution environments can be made.
Journal Article•10.1016/S0167-8191(96)00039-7•
Implementation of parallel FFT algorithms on distributed memory machines with a minimum overhead of communication

Christophe Calvin
1 Nov 1996
TL;DR: Methods for the many mono-dimensional and multi-dimensional FFT algorithms which minimize the communication overhead are presented and implementations onto mesh and torus topologies are derived by using emulation results of hypercube communications on these topologies.
Abstract: We present in this paper methods for the many mono-dimensional and multi-dimensional FFT algorithms which minimize the communication overhead. We describe the implementation of these algorithms on a hypercube architecture. Moreover, we derive these implementations onto mesh and torus topologies by using emulation results of hypercube communications on these topologies. We compute optimal sizes of blocks in order to obtain a maximal overlap of communications by computations. Experimental results on Intel iPSC/860 and Paragon, IBM SP1 and Cray T3D machines are given, which confirm the theoretical analysis.
Proceedings Article•
Parallel computing : state-of-the-art and perspectives

E. H. D'Hollander, G. R. Joubert, F. J. Peters, D. Trystram, Jan Van Campenhout
1 Jan 1996
TL;DR: The aim of this volume is to provide an overview of the state-of-the-art of the development, application and future trends in parallel computing.
Abstract: The aim of this volume is to provide an overview of the state-of-the-art of the development, application and future trends in parallel computing. The emphasis is on the applications of all types of parallel computers, which is reflected by the invited and contributed papers and posters included in these proceedings. The applications section looks at various aspects of parallel computers, including: simulation; flow calculations; sorting; image processing; industrial computing; financial applications; and optimization and linear algebra. The software section looks at software engineering for application development, including: programming environments; object-orientated methods; data parallelism; languages; compilers; monitoring and debugging; operating systems; scheduling; and load balancing and threads. The hardware section examines: systolic processors; networks; communication; and performance measurement and analysis. The industrial section looks at research and development activities of manufacturers of parallel computers and associated software tools and systems.
Journal Article•10.1016/0167-8191(96)00023-3•
Adaptive broadcasting with faulty nodes

Leszek Gąsieniec1, Andrzej Pelc2•
Max Planck Society1, Université du Québec2
1 Sep 1996
TL;DR: This work considers broadcasting from a fault-free source to all nodes of a completely connected n-node network in the presence of k faulty nodes, and constructs a broadcasting algorithm working in worst-case time O(log^2 n).
Abstract: We consider broadcasting from a fault-free source to all nodes of a completely connected n-node network in the presence of k faulty nodes. Every node can communicate with at most one other node in a unit of time and during this period every pair of communicating nodes can exchange information packets. Faulty nodes cannot send information. Broadcasting is adaptive, i.e., a node schedules its next communication on the basis of information currently available to it. Assuming that the fraction of faulty nodes is bounded by a constant smaller than 1, we construct a broadcasting algorithm working in worst-case time O(log^2 n).
Journal Article•10.1016/0167-8191(95)00056-9•
Wavelet packet best basis selection on moderate parallel MIMD architectures

Andreas Uhl1•
University of Salzburg1
1 Jan 1996
TL;DR: A subband based parallelization which overcomes most of the difficulties of a straightforward parallel version of the sequential algorithm is introduced and is easier to implement than its classical version.
Abstract: Wavelet packet best basis selection has become a popular method in image compression. This paper introduces a subband based parallelization which overcomes most of the difficulties of a straightforward parallel version of the sequential algorithm. Besides its higher efficiency, the algorithm is easier to implement than its classical version.
Journal Article•10.1016/0167-8191(96)00021-X•
Scalability study of the KSR-1

Umakishore Ramachandran1, Gautam Shah1, S. Ravikumar1, Jeyakumar Muthukumarasamy1•
Georgia Institute of Technology1
1 Aug 1996
TL;DR: The communication network of KSR-1, a pipelined unidirectional ring, is fairly resilient in supporting simultaneous remote memory accesses from several processors, and the multiple communication paths realized through this pipelining help in the efficient implementation of tournament-style barrier synchronization algorithms.
Abstract: Scalability of parallel architectures is an interesting area of current research. Shared memory parallel programming is attractive stemming from its relative ease in transitioning from sequential programming. However, there has been concern in the architectural community regarding the scalability of shared memory parallel architectures owing to the potential for large latencies for remote memory accesses. KSR-1 is a commercial shared memory parallel architecture, and the scalability of KSR-1 is the focus of this research. The study is conducted using a range of experiments spanning latency measurements, synchronization, and analysis of parallel algorithms for three computational kernels and an application. The key conclusions from this study are as follows: The communication network of KSR-1, a pipelined unidirectional ring, is fairly resilient in supporting simultaneous remote memory accesses from several processors. The multiple communication paths realized through this pipelining help in the efficient implementation of tournament-style barrier synchronization algorithms. Parallel algorithms that have fairly regular and contiguous data access patterns scale well on this architecture. The architectural features of KSR-1 such as the poststore and prefetch are useful for boosting the performance of parallel applications. The sizes of the caches available at each node may be too small for efficiently implementing large data structures. The network does saturate when there are simultaneous remote memory accesses from a fully populated (32 node) ring.
Journal Article•10.1016/0167-8191(96)00005-1•
Automatic parallel code generation for message passing on distributed memory systems


Stephen Johnson1, C. S. Ierotheou1, Mark Cross1•
University of Greenwich1
1 Feb 1996
TL;DR: The code generation strategies described here are embedded within the Computer Aided Parallelisation tools (CAPTools) software as a key part of a toolkit for automating as much as possible of the parallelisation process for mesh based numerical codes.
Abstract: The availability of a very accurate dependence graph for a scalar code is the basis for the automatic generation of an efficient parallel implementation. The strategy for this task, which is encapsulated in a comprehensive data partitioning code generation algorithm, is described. This algorithm involves the data partition, calculation of assignment ranges for partitioned arrays, addition of a comprehensive set of execution control masks, altering loop limits, and addition and optimisation of communications for all data. In this context, the development and implementation of strategies to merge communications wherever possible has proved an important feature in producing efficient parallel implementations for numerical mesh based codes. The code generation strategies described here are embedded within the Computer Aided Parallelisation Tools (CAPTools) software as a key part of a toolkit for automating as much as possible of the parallelisation process for mesh based numerical codes. The algorithms used enable parallelisation of real computational mechanics codes with only minor user interaction and without any prior manual customisation of the serial code to suit the parallelisation tool.
Book Chapter•10.1007/3-540-62095-8_10•
A Software Architecture for Massively Parallel Input-Output


Peter Brezany1, Thomas A. Mück1, Erich Schikuta1•
University of Vienna1
18 Aug 1996
TL;DR: The main focus of this paper is the parallel I/O runtime system support provided for software-generated programs produced by parallelizing compilers in the context of High Performance FORTRAN efforts.
Abstract: For an increasing number of data intensive scientific applications, parallel I/O concepts are a major performance issue. Tackling this issue, we provide an outline of an input/output system designed for highly efficient, scalable and conveniently usable parallel I/O on distributed memory systems. The main focus of this paper is the parallel I/O runtime system support provided for software-generated programs produced by parallelizing compilers in the context of High Performance FORTRAN efforts. Specifically, our design is presented in the context of the Vienna Fortran Compilation System.
Journal Article•10.1016/0167-8191(95)00064-X•
Parallel reduction of banded matrices to bidiagonal form


Bruno Lang
1 Jan 1996
TL;DR: A parallel algorithm for reducing banded matrices to bidiagonal form based on Householder transforms will outperform the standard method on a serial computer with a distinct memory hierarchy and can be efficiently implemented in a distributed memory environment.
Abstract: A parallel algorithm for reducing banded matrices to bidiagonal form is presented. In contrast to the rotation-based "standard approach", our algorithm is based on Householder transforms, therefore exhibiting considerably higher data locality (BLAS level 2 instead of level 1). The update of the transformation matrices, which involves the vast majority of the operations, can even be blocked to allow the use of level 3 BLAS. Thus, our algorithm will outperform the standard method on a serial computer with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.
Journal Article•10.1016/0167-8191(95)00071-2•
Heterogeneous computing machines and Amdahl's law


David Moncrieff1, Richard E. Overill2, Stephen Wilson3•
Florida State University1, King's College London2, Rutherford Appleton Laboratory3
1 Mar 1996
TL;DR: It is demonstrated that, for some choices of this function, the performance of a heterogeneous system significantly exceeds that of each of the component homogeneous systems.
Abstract: We consider the implications of Amdahl's Law for a heterogeneous system of computers each of which is a homogeneous parallel processor. We introduce a parallelism profile function to describe the level of parallelism achieved in different fragments of the computation and demonstrate that, for some choices of this function, the performance of a heterogeneous system significantly exceeds that of each of the component homogeneous systems.
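The claim in this abstract can be illustrated with a small numerical sketch. The model below is an assumption made for illustration, not the paper's formulation: a parallelism profile is a hypothetical list of (work fraction, degree of parallelism) pairs, each machine is a hypothetical (per-processor speed, processor count) pair, and the heterogeneous system simply runs each fragment on whichever machine finishes it fastest. Only `amdahl_speedup` is the classical Amdahl's Law.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Classical Amdahl's Law: speedup of a code whose parallel part scales perfectly."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def fragment_time(work, parallelism, speed, processors):
    """Time for one fragment: it can use at most `parallelism` processors (assumed model)."""
    return work / (speed * min(parallelism, processors))


def homogeneous_time(profile, machine):
    """Total time when every fragment runs on the same machine."""
    speed, processors = machine
    return sum(fragment_time(w, p, speed, processors) for w, p in profile)


def heterogeneous_time(profile, machines):
    """Total time when each fragment runs on its best machine (transfer costs ignored)."""
    return sum(
        min(fragment_time(w, p, speed, procs) for speed, procs in machines)
        for w, p in profile
    )


# Hypothetical profile: half the work is serial, half has 64-way parallelism.
profile = [(0.5, 1), (0.5, 64)]
# Hypothetical machines: one fast processor, and 64 processors that are 10x slower.
fast_serial = (10.0, 1)
massively_parallel = (1.0, 64)

t_fast = homogeneous_time(profile, fast_serial)
t_mpp = homogeneous_time(profile, massively_parallel)
t_het = heterogeneous_time(profile, [fast_serial, massively_parallel])
print(t_fast, t_mpp, t_het)
```

Under these assumed numbers the combined system beats both of its components, because the serial fragment goes to the fast processor and the parallel fragment to the many slow ones. Other profiles give little or no gain.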
Journal Article•10.1016/0167-8191(95)00066-6•
Parallel computation of exact Euclidean distance transform


Yu-Hua Lee, Shi-Jinn Horng, Tzong-Wann Kao, Ferng-Shi Jaung, Yuung-Jih Chen, Horng-Ren Tsai 
1 Feb 1996
TL;DR: This paper studies the parallel computation of the exact Euclidean distance transform for two parallel architectures: the EREW PRAM model and the SIMD hypercube computer.
Abstract: Distance transform is extensively used in image processing, such as expanding, shrinking, thinning, computing shape factor, etc. There are many approximate Euclidean distance transform (EDT) algorithms in the literature, but finding the exact Euclidean distance transform (EDT) with respect to the Euclidean distance metric is better in various application fields. Unless the digital image is very small, it is rather time consuming to find the exact Euclidean distance transform of an image. So, it is important to improve the computing speed. In this paper we study the parallel computation of the exact Euclidean distance transform for two parallel architectures: the EREW PRAM model and the SIMD hypercube computer. The parallel algorithm is given for the computation of exact Euclidean distance transform for all pixels with respect to black pixels in an N × N black and white image. The running time is O(log2 N) both in the EREW PRAM model and the hypercube computer with N × N processors.
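To make the quantity being computed concrete, here is a brute-force sequential sketch of the exact Euclidean distance transform. It is a reference definition only, not the paper's EREW PRAM or hypercube algorithm; the function name `exact_edt` and the convention that 1 marks a black pixel are assumptions for the example.

```python
import math


def exact_edt(image):
    """Return, for every pixel, the exact Euclidean distance to the nearest black pixel.

    `image` is a list of rows; a value of 1 is treated as black (assumed convention).
    Cost is proportional to (number of pixels) x (number of black pixels).
    """
    black = [
        (r, c)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value == 1
    ]
    if not black:
        raise ValueError("image contains no black pixels")
    return [
        [
            math.sqrt(min((r - br) ** 2 + (c - bc) ** 2 for br, bc in black))
            for c in range(len(row))
        ]
        for r, row in enumerate(image)
    ]


image = [
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]
distances = exact_edt(image)
print(distances[2][2])  # distance from the far corner to the single black pixel
```

The quadratic cost of this direct approach on large images is the motivation the abstract gives for parallel algorithms.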
Journal Article•10.1016/S0167-8191(96)00049-X•
Mapping affine loop nests


Michèle Dion1, Yves Robert1•
École normale supérieure de Lyon1
1 Dec 1996
TL;DR: It is shown that maximizing the number of local communications in the access graph is an NP-complete problem in the strong sense and several heuristics based upon the access graph for mapping affine loop nests onto DMPCs are presented.
Abstract: This paper deals with the problem of aligning data and computations when mapping affine loop nests onto Distributed Memory Parallel Computers (DMPCs). We formulate the problem by introducing a new graph, the access graph, to model affine communications (with rectangular access matrices) more adequately than with the previously introduced tool, the communication graph. We show that maximizing the number of local communications in the access graph is an NP-complete problem in the strong sense and we present several heuristics based upon the access graph for mapping affine loop nests onto DMPCs.
Journal Article•10.1016/0167-8191(95)00003-8•
Integrating user knowledge with information from parallelisation tools to facilitate the automatic generation of efficient parallel FORTRAN code


P. F. Leggett1, A. T. J. Marsh1, Stephen Johnson1, Mark Cross1•
University of Greenwich1
1 Feb 1996
TL;DR: The necessary components that such a parallelisation toolkit should possess to provide an effective environment to identify, extract and embed relevant user knowledge are considered and their effectiveness discussed.
Abstract: User supplied knowledge and interaction is a vital component of a toolkit for producing high quality parallel implementations of scalar FORTRAN numerical code. In this paper we consider the necessary components that such a parallelisation toolkit should possess to provide an effective environment to identify, extract and embed relevant user knowledge. We also examine to what extent these facilities are available in leading parallelisation tools; in particular we discuss how these issues have been addressed in the development of the user interface of the Computer Aided Parallelisation Tools (CAPTools). The CAPTools environment has been designed to enable user exploration, interaction and insertion of user knowledge to facilitate the automatic generation of very efficient parallel code. A key issue in the user's interaction is control of the volume of information so that the user is focused on only that which is needed. User control over the level and extent of information revealed at any phase is supplied using a wide variety of filters. Another issue is the way in which information is communicated. Dependence analysis and its resulting graphs involve a lot of sophisticated, rather abstract concepts unlikely to be familiar to most users of parallelising tools. As such, considerable effort has been made to communicate with the user in terms that they will understand. These features, amongst others, and their use in the parallelisation process are described and their effectiveness discussed.
Journal Article•10.1016/0167-8191(96)00015-4•
Real and complex fast Fourier transforms on the Fujitsu VPP 500


Markus Hegland1•
Australian National University1
1 Jun 1996
TL;DR: Very long vector length stride one implementations of multiple FFTs on one node, as described by the author in 1994, are combined with optimized transpositions.
Abstract: Fast Fourier transforms parallelize well but need large amounts of communication. An algorithm which concentrates all the communication in one or two transposition steps is the transpose split algorithm. Different transposition algorithms can be used depending on data size and communication latency. A new transpose split algorithm for real and hermitian data is presented for one, two and three dimensional transforms. This algorithm is implemented on the Fujitsu VPP 500. The Fujitsu VPP 500 is a parallel processor with a moderate number of very fast vector processors connected by a crossbar switch. Each processor has a peak performance of 1.6 Gflop/s and can simultaneously read and write 400 MByte/s. Very long vector length stride one implementations of multiple FFTs on one node, as described by the author in 1994, are combined with optimized transpositions. One third of peak performance was achieved on a configuration with up to 32 processors.
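The idea of concentrating communication in transposition steps can be sketched for a two-dimensional transform. The code below is a sequential illustration under stated assumptions, not the paper's VPP 500 implementation or its real/hermitian variant: it uses a naive O(n^2) DFT in place of an FFT, and the function names are hypothetical. Each row transform would be local to one processor; the `transpose` calls stand in for the only steps that would need interprocessor communication.

```python
import cmath


def dft(x):
    """Naive 1D discrete Fourier transform (stand-in for a fast local FFT)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    """Matrix transpose; in a distributed setting this is the communication step."""
    return [list(column) for column in zip(*matrix)]


def dft2_transpose_split(a):
    """2D DFT as: local row transforms, transpose, local row transforms, transpose back."""
    rows_done = [dft(row) for row in a]
    swapped = transpose(rows_done)
    cols_done = [dft(row) for row in swapped]
    return transpose(cols_done)


def dft2_direct(a):
    """Direct 2D DFT from the definition, used only to check the result."""
    n1, n2 = len(a), len(a[0])
    return [
        [
            sum(
                a[j1][j2] * cmath.exp(-2j * cmath.pi * (j1 * k1 / n1 + j2 * k2 / n2))
                for j1 in range(n1)
                for j2 in range(n2)
            )
            for k2 in range(n2)
        ]
        for k1 in range(n1)
    ]


data = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.5, 0.0]]
split_result = dft2_transpose_split(data)
direct_result = dft2_direct(data)
max_error = max(
    abs(s - d)
    for s_row, d_row in zip(split_result, direct_result)
    for s, d in zip(s_row, d_row)
)
print(max_error)
```

If the output may stay in transposed order, the final `transpose` can be dropped, which corresponds to the "one or two transposition steps" mentioned in the abstract.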