TL;DR: This paper argues for an alternative way of designing coordination models for parallel and distributed environments based on a complete symmetry between and decoupling of producers and consumers, as well as a clear distinction between the computational and the coordination/communication work performed by each process.
Abstract: This paper argues for an alternative way of designing coordination models for parallel and distributed environments based on a complete symmetry between and decoupling of producers and consumers, as well as a clear distinction between the computational and the coordination/communication work performed by each process. The novel idea is to allow both producer and consumer processes to communicate with each other in a fashion that does not dictate any one of them to have specific knowledge about the rest of the processes involved in a coordinated activity. Furthermore, the model is inherently control-driven where communicating processes observe state changes and react to the presence of events and where the main communication mechanism is limited broadcasting (as opposed to either point-to-point or unrestricted broadcasting communication). Although a direct realisation of this model in terms of a concrete coordination language does already exist, we argue that the underlying principles can be applied to other similar models. We demonstrate our point by comparing our model with an established and widely used coordination framework, namely the Linda-type Shared Dataspace model, and we show how the functionality of the former can be embedded into the latter, thus yielding an alternative Linda-based coordination framework.
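For readers unfamiliar with the Linda-type Shared Dataspace model that the abstract uses as its point of comparison, the following is a minimal single-process sketch of a tuple space. It is an illustration only, not the coordination model proposed in the paper: the class name `TupleSpace` and the use of `None` as a wildcard are assumptions, and only the non-blocking operations `out`, `rdp`, and `inp` are shown.

```python
class TupleSpace:
    """Toy, single-process tuple space with non-blocking Linda-style operations."""

    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        # Producer side: deposit a tuple without naming any consumer.
        self._tuples.append(tuple(fields))

    @staticmethod
    def _matches(pattern, candidate):
        # Assumption: None in a pattern acts as a wildcard field.
        return len(pattern) == len(candidate) and all(
            p is None or p == c for p, c in zip(pattern, candidate)
        )

    def rdp(self, *pattern):
        # Non-destructive read: return the first matching tuple, or None.
        for t in self._tuples:
            if self._matches(pattern, t):
                return t
        return None

    def inp(self, *pattern):
        # Destructive read: remove and return the first matching tuple, or None.
        t = self.rdp(*pattern)
        if t is not None:
            self._tuples.remove(t)
        return t
```

Producers and consumers interact only through the space, which is the decoupling the abstract builds on; the event-driven, limited-broadcast behaviour the paper argues for is not modelled here.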
TL;DR: This paper describes static and dynamic task allocation tools in the PVM environment for distributed memory parallel systems; augmented simulated annealing and heuristic move exchange methods are also implemented in distributed form.
Abstract: This paper describes the static and dynamic task allocation tools in the PVM environment for distributed memory parallel systems. For the static mapping, an objective function is used to evaluate the optimality of the allocation of a task graph onto a processor graph. Together with our optimization method, augmented simulated annealing and heuristic move exchange methods are also implemented in distributed form. For dynamic task allocation, a semidistributed approach was designed based on the division of the processor network topology into independent and symmetric spheres. The distributed static mapping (DSM) and dynamic load balancing (DLB) tools are controlled through a user window interface. The DSM and DLB tools are integrated together with a software monitor (PG-PVM) in the graphical GRAPNEL environment.
TL;DR: The formal framework proposed here combines a formal operational model of abstract programs, a set of decision methods for the analysis of RP schemes, a formal operational model for the interpreted programs, and translation results stating how some behavioural properties of the concrete programs can be correctly checked on the corresponding scheme.
Abstract: RP programs are imperative programs with parallelism and recursion and only a limited way of synchronizing parallel processes. The formal framework we propose here combines (1) a formal operational model of abstract programs (or RP schemes), (2) a set of decision methods for the analysis of RP schemes, (3) a formal operational model for the interpreted programs, and (4) translation results stating how some behavioural properties of the concrete programs can be correctly checked on the corresponding scheme.
TL;DR: The ILP model for the MIPS R8000 is extended by including memory optimization and the entire model is presented in detail, aiming to produce optimal schedules.
Abstract: In parallelizing code for high-performance processors, software pipelining of innermost loops is of fundamental importance. In order to benefit from software pipelining, two separate tasks need to be performed: (i) software pipelining proper (find the rate-optimal legal schedule), and (ii) register allocation (allocate registers to the found schedule). Software pipelining and register allocation can be formulated as an integer linear programming (ILP) problem, aiming to produce optimal schedules. In this paper, we discuss the application of integer linear programming to software pipelining on the MIPS R8000 superscalar microprocessor. Some of the results were presented at PLDI96 [14], where they were compared to the MIPSpro software pipeliner. In this paper we further extend the ILP model for the MIPS R8000 by including memory optimization and present the entire model in detail.
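To make the notion of a legal schedule at a given rate more concrete, here is a small sketch of the resource side of the problem only. It is not the ILP model of the paper. It assumes every operation occupies one unit of one resource for a single cycle and ignores data dependences and registers; the function names and the example resource names are hypothetical.

```python
import math
from collections import Counter


def resource_mii(ops, units):
    """Lower bound on the initiation interval implied by resource usage alone.

    ops maps operation name -> resource name; units maps resource name -> count.
    """
    demand = Counter(ops.values())
    return max(math.ceil(demand[r] / units[r]) for r in demand)


def fits_modulo_table(ops, schedule, units, ii):
    """Check that no resource is oversubscribed in any slot modulo ii.

    schedule maps operation name -> start cycle within one iteration.
    """
    usage = Counter((ops[o], schedule[o] % ii) for o in ops)
    return all(count <= units[res] for (res, _), count in usage.items())
```

With three memory operations and a single memory unit, `resource_mii` returns 3, so no schedule can start a new iteration more often than every three cycles regardless of how the operations are placed.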
TL;DR: A software pipeline construction scheme for DO-loops, while-loops, and loops with multiple exits is presented, which unifies, simplifies, and generalizes the separate techniques previously required to build a complete software pipeline from a local schedule computed by modulo scheduling.
Abstract: We present a software pipeline construction scheme for DO-loops, while-loops, and loops with multiple exits, which unifies, simplifies, and generalizes the separate techniques previously required to build a complete software pipeline from a local schedule computed by modulo scheduling. In the setting of this software pipeline construction scheme, we demonstrate a simple way of implementing a general form of modulo expansion. Then we introduce inductive relaxation, a technique that replaces generalized modulo expansion when the variable to expand is a simple induction. These techniques do not require any architectural support from the target processor, and have been extensively tested as part of the software pipeliner that comes with the 3.0 compiler releases for the Cray T3E massively parallel computer.
TL;DR: This paper presents a formal basis for validating large distributed systems described by composition of coloured Petri-Net modules and proposes a compositional proof technique for such systems.
Abstract: In this paper we present a formal basis for validating large distributed systems. Distributed systems are described by composition of coloured Petri-Net modules. A compositional proof technique for such systems, where properties are specified in terms of a linear time temporal logic, is proposed.
TL;DR: A general background is presented for developing parallel applications in the domain of Computational Fluid Dynamics based upon block Jacobi preconditioned iterative methods for solving partial differential equations, and the simulation of an unsteady compressible flow is discussed.
Abstract: In the present paper, a general background is presented for developing parallel applications in the domain of Computational Fluid Dynamics. This framework is based upon block Jacobi preconditioned iterative methods for solving partial differential equations. It is shown how the parallelism potential of such a preconditioning can be efficiently exploited by associating it with Finite Element discretization and Object Oriented Programming. The resulting parallel applications are characterized by coarse granularity, ease of maintaining good load balance and the possibility of using the same object in either a serial or a parallel computing context. As an application of our parallel approach, the simulation of an unsteady compressible flow is discussed.
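The source of the coarse-grained parallelism mentioned above can be seen in a small sketch of how a block Jacobi preconditioner is applied. This is a generic illustration, not the object-oriented finite element code of the paper: the matrix is stored as dense nested lists, the blocks have a fixed size, and the helper names `solve_dense` and `apply_block_jacobi` are hypothetical.

```python
def solve_dense(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            f = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= f * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def apply_block_jacobi(a, r, block_size):
    """Return z with M z = r, where M keeps only the diagonal blocks of a."""
    n = len(r)
    z = [0.0] * n
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        block = [row[start:end] for row in a[start:end]]
        # Each block solve uses only local data, so blocks can run in parallel.
        z[start:end] = solve_dense(block, r[start:end])
    return z
```

Because every diagonal block is solved independently, assigning one or more blocks to each processor requires no communication during the preconditioning step itself.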
TL;DR: This paper studies a realistic but non-adaptive irregular application and shows that HPF can easily express its natural parallelism.
Abstract: Data-parallel languages, in particular HPF, provide a high-level view of operators over parallel data structures and hide the details of data partitioning and communication. One of the most difficult issues in compiling such languages is managing irregular data-dependent parallelism. This paper presents the study of a realistic, but non-adaptive irregular application. We show that HPF can easily express the natural parallelism of the application. Experimental results and a detailed examination of the compilation process are presented.
TL;DR: The paper describes different methods of performance optimization of serial and parallel algorithms for modern superscalar RISC processor based computers and the comparative performance analysis of different computers and their architectural peculiarities.
Abstract: The paper describes different methods of performance optimization of serial and parallel algorithms for modern superscalar RISC processor based computers. The limitations imposed on the performance by hierarchical organization of computer memories are discussed, followed by the comparative performance analysis of different computers and their architectural peculiarities. Finally the parallelization aspects of the solution of 3-dimensional CFD problems are considered, along with the comparison of communication characteristics of parallel computers.
TL;DR: A 3D architecture is presented, based on the principles of parallel access window, shifting, and pipelining, which can be used in designing specially designed coprocessors for cellular processing.
Abstract: Cellular processing, especially in the 3D real-time case, needs high computing performance. With a specially designed coprocessor, the requirements can be fulfilled at relatively low cost. First, the architectural principles are described which can be used in designing such coprocessors. Second, a 3D architecture is presented based on the principles of parallel access window, shifting, and pipelining. The implementation uses two Field Programmable Logic Arrays, thereby performing 66 million 3D cell operations per second.
TL;DR: This paper describes a more efficient alternative, namely Improved ParIMGS (IParIMGS) which avoids the global communication of inner products and only requires local communications, therefore, the cost of communication can be significantly reduced.
Abstract: In this paper we study the parallel aspects of the Incomplete Modified Gram-Schmidt (IMGS) preconditioner, which can be used for efficiently solving large sparse linear systems and least squares problems on massively parallel distributed memory computers. The performance of this preconditioning technique on this kind of architecture is always limited by the global communication required for the inner products, even for ParIMGS, a parallel version of IMGS in which the computation can be overlapped with the communication. We describe a more efficient alternative, namely Improved ParIMGS (IParIMGS), which avoids the global communication of inner products and only requires local communications. Therefore, the cost of communication can be significantly reduced. Several numerical experiments carried out on a Parsytec GC/PowerPlus are presented as well.
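To show where the inner products come from, the following sketch implements the complete (not incomplete) modified Gram-Schmidt process on dense column vectors. It is not ParIMGS or IParIMGS; it assumes linearly independent columns, and the function names are illustrative. An incomplete variant would additionally drop small entries, under a rule the abstract does not specify.

```python
import math


def dot(u, v):
    # On a distributed-memory machine each call would be a global reduction.
    return sum(a * b for a, b in zip(u, v))


def modified_gram_schmidt(columns):
    """Orthonormalise a list of column vectors; return (Q columns, R)."""
    n = len(columns)
    q = [list(c) for c in columns]
    r = [[0.0] * n for _ in range(n)]
    for k in range(n):
        r[k][k] = math.sqrt(dot(q[k], q[k]))
        q[k] = [x / r[k][k] for x in q[k]]
        for j in range(k + 1, n):
            # One inner product per remaining column at every step.
            r[k][j] = dot(q[k], q[j])
            q[j] = [x - r[k][j] * y for x, y in zip(q[j], q[k])]
    return q, r
```

Every step `k` performs one inner product for the normalisation and one for each later column, which is why the global reductions dominate the communication cost on distributed memory machines.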
TL;DR: This work examines the present trends toward standardisation of communication for parallel machines and networks of workstations, and discusses some software and hardware features to improve performance compared to the usual PVM-Unix-TCP/IP-Ethernet stack of protocols.
Abstract: Standardisation, which is the rule for PCs and workstations, is quickly expanding to parallel machines and networks of workstations. The use of commodity components is the key to reducing the cost/performance ratio: standard microprocessors, operating systems, libraries, and so on are used. Standardisation of communications is more difficult if both high performance and protection in a multi-user context are wanted. We examine the present trends toward standardisation of communication for parallel machines and networks of workstations. We discuss some software and hardware features to improve performance compared to the usual PVM-Unix-TCP/IP-Ethernet stack of protocols.
TL;DR: It is shown that, because it generates parallel and asynchronous execution of a large number of small tasks, the task migration paradigm allows a direct exploitation of these irregularly structured problems on distributed memory architectures.
Abstract: The most successful compilation techniques for distributed memory architectures are based on static analysis of the memory accesses. Loop iterations with similar behavior on the parallel memories are combined in order to form coarse grain parallel tasks. But for irregularly structured applications, the behavior of each iteration of a parallel loop on the memories is data dependent and cannot be predicted at compile-time, so the only exploitable parallelism is fine-grain. We show that, because it generates parallel and asynchronous execution of a large number of small tasks, the task migration paradigm allows a direct exploitation of these irregularly structured problems on distributed memory architectures.
TL;DR: Logarithmic time lower bounds for computing the distance between two arbitrary vertices, in a proper interval graph represented by a family of intervals on a real line, and in a bipartite permutation graph represented by a permutation function, on exclusive write PRAM are proved here.
Abstract: Logarithmic time lower bounds for computing the distance between two arbitrary vertices, in a proper interval graph represented by a family of intervals on a real line, and in a bipartite permutation graph represented by a permutation function, on exclusive write PRAM are proved here. The lower bounds are also valid for these classes of graphs represented by adjacency matrices and for their superclasses. Shortest paths on interval and permutation graphs, which, respectively, strictly contain proper interval and bipartite permutation graphs, are known to be computable in logarithmic time on exclusive write PRAM. It follows that the lower bounds derived here are tight.
TL;DR: This study introduces a new file structure called Spatial Grid File, which enables us to index data objects by different and independent high-dimensional attributes and is very suitable for parallelization.
Abstract: In multimedia databases, spatial or high-dimensional data manipulation is important for storage and retrieval. In this study, we introduce a new file structure called Spatial Grid File. This file enables us to index data objects by different and independent high-dimensional attributes. With it, well-known spatial query types, such as range queries, nearest neighbor queries and spatial join operations, can be efficiently performed. Although the performance of the Spatial Grid File structure depends on the indexing methods used, it has the unique feature of combining sets of spatial data, each having different properties. Furthermore, this file structure is very suitable for parallelization.
TL;DR: Arithmetic in the negabinary coding system is realized by five simple rules, each inverting a certain combination of bits on a cellular plane; the notion of cellular program is introduced and the independence of a computational result from solitary cell defects is investigated.
Abstract: Parallel computations on the lower level in the negabinary coding system are considered. All arithmetical operations are realized by means of five simple rules. Every rule represents the inversion of a certain combination of bits on a cellular plane. The notion of cellular program is introduced. The independence of a computational result from solitary defects of cells is investigated. Examples illustrate this approach. Questions concerning the properties of rule systems are discussed.
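For orientation, negabinary is positional notation with base -2, so every integer, positive or negative, has a representation using only the digits 0 and 1 and no sign bit. The sketch below converts between integers and negabinary digit strings; it does not implement the five cellular rules of the paper, and the function names are illustrative.

```python
def to_negabinary(n: int) -> str:
    """Return the base -2 digits of n, most significant digit first."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:
            # Python gives a remainder in {0, -1}; shift it into {0, 1}.
            n, r = n + 1, r + 2
        digits.append(str(r))
    return "".join(reversed(digits))


def from_negabinary(s: str) -> int:
    """Evaluate a string of base -2 digits by Horner's rule."""
    value = 0
    for ch in s:
        value = value * -2 + int(ch)
    return value
```

For example, 2 is written as `110` (4 - 2 + 0) and -1 as `11` (-2 + 1).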
TL;DR: This tutorial contains a brief review of the theoretical background, the restricted ALT-language and simple examples, and may be used in teaching Parallel Computing.
Abstract: ALT (Animating Language Tools) is a computer tool for designing and simulating computational processes in cellular arrays. It combines interfaces for visual and textual representation of fine-grained parallel algorithms. A special high-level language is developed whose statements are graphically given arrays and subarrays. The simulation process allows one to observe computation dynamics at different levels of detail: at program blocks, at statements to be executed in parallel over the given array, and in quasiparallel mode at cell-state changes. The tutorial contains a brief review of the theoretical background, the restricted ALT-language and simple examples. ALT runs on a PC under MS-DOS and may be used in teaching Parallel Computing.
TL;DR: A group-theoretic approach to the decomposition of the basic operations of digital signal processing (DSP), such as the discrete Fourier transform and convolution, is proposed and a vector DFT algorithm is described.
Abstract: A group-theoretic approach to the decomposition of the basic operations of digital signal processing (DSP), such as the discrete Fourier transform (DFT) and convolution, is proposed. The distinctive feature of the approach is that it is oriented toward parallel processing from the outset. The recurrent description of the decomposition process produces fast parallel algorithms that are effective for both parallel and sequential processing. The main properties of these algorithms are formulated and a vector DFT algorithm is described.
TL;DR: An improved version of the unsymmetric Lanczos process combining elements of numerical stability and parallel algorithm design is proposed, derived such that all inner products and matrix-vector multiplications of a single iteration step are independent and the communication time required for inner products can be overlapped efficiently with computation time.
Abstract: For the eigenvalues of a large and sparse unsymmetric coefficient matrix, we have proposed an improved version of the unsymmetric Lanczos process combining elements of numerical stability and parallel algorithm design. The algorithm is derived such that all inner products and matrix-vector multiplications of a single iteration step are independent and the communication time required for inner products can be overlapped efficiently with computation time. Therefore, the cost of global communication on parallel distributed memory computers can be significantly reduced. The resulting algorithm maintains the favorable properties of the Lanczos process while not increasing computational costs. In this paper, we describe an efficient implementation of this method which is particularly well suited to problems with irregular sparsity patterns. The corresponding communication cost is independent of the sparsity pattern, and several performance improvement techniques are used, such as overlapping computation with communication and balancing the computational load. The performance is demonstrated by numerical experiments carried out on the massively parallel distributed memory computer Parsytec GC/PowerPlus.
TL;DR: It is studied how self-connection weight values influence the main characteristic of CNAM, namely the strong stability to k-distortions of stored prototypes, and an expression for the self-connection weights that provides maximal strong stability for each prototype is obtained.
Abstract: Cellular-Neural Associative Memory (a Hopfield memory with local connection structure), with a weight matrix designed by any one of the existing methods ensuring individual stability of the network, is considered. It is studied how self-connection weight values influence the main characteristic of CNAM, namely the strong stability to k-distortions of stored prototypes. An expression for determining the self-connection weight values is obtained that provides maximal strong stability for each prototype. Two strategies are proposed to determine the most acceptable value according to the required accuracy. The obtained results are valid not only for CNAM but also for fully connected Hopfield associative memory designed with the help of any learning method.
TL;DR: Two methods for solving non-symmetric sparse systems of linear equations (SLEs), the Biconjugate Gradient (BiCG) and Generalized Minimal Residual (GMRES) methods, are described.
Abstract: Two methods for solving non-symmetric sparse systems of linear equations (SLEs), the Biconjugate Gradient (BiCG) and Generalized Minimal Residual (GMRES) methods, are described in this paper. An analysis of memory and computational speed requirements is given, and some results of application to finite-difference and finite-element SLEs are shown. Some features of these methods connected with the ABS class presented by Abaffy and Spedicato are also discussed.
TL;DR: The T-model is an algorithmic model of a task that reflects the structure, description, and volume of information flows; it makes it possible to evaluate the quality of parallel execution, find bottlenecks and deadlocks, and correct the discovered weak points.
Abstract: The concept of creating and debugging a parallel task model is discussed. The T-model is an algorithmic model of the task which reflects the structure, description, and volume of information flows. Creating a T-model makes it possible to evaluate the quality of parallel execution, to find bottlenecks and deadlocks, and to correct the discovered weak points by changing the volume and discipline of exchanges as well as by carrying out the decomposition again. A T-model description consists of a hardware description and a description of the parallel processes.
TL;DR: The base module of a multiprocessor system with structural-procedural computation is considered and this module provides the system performance close to peak.
Abstract: The base module of a multiprocessor system with structural-procedural computation is considered. This module provides the system performance close to peak.
TL;DR: A visual system is reported on, which uses a three dimensional model to describe the data dependencies in the computational domain and to visualise transformations of this domain relevant to the main steps in parallel program design.
Abstract: Parallel program design and analysis is a complex activity, where many difficulties stem from the principal inadequacy of purely textual formalisms to specify parallelism in an understandable fashion. In this paper we report on a visual system, which uses a three dimensional model to describe the data dependencies in the computational domain and to visualise transformations of this domain relevant to the main steps in parallel program design. We believe three dimensional interactive graphics provides an extra degree of freedom for conveying information. Using 3D graphics, we can visualise all the data dependencies evolving in time as a three dimensional graph, displaying it on a screen. Visual abstraction, animation of transformations and visualisation of their effects will shift information to the perceptual level to help assimilate mathematical notions.
TL;DR: Experiments indicate the performance of multithreading, with a small number of threads per processor, is very comparable to that of programs written using message-passing and has an added advantage over message-passing in that it is relatively insensitive to initial data distribution.
Abstract: A recent trend in high-performance computing focuses on networks of workstations (NOWs) as a way of realizing cost-effective parallel machines. This has been due to the availability of powerful wide-issue processors, high-speed networks, and software infrastructure systems. Due to its distributed nature, message-passing has been the choice of communication model for NOWs. This paper, however, examines the viability of using multithreading on NOWs. A matrix multiplication algorithm was studied by simulating a shared-memory abstraction on top of Parallel Virtual Machine (PVM) to characterize the behavior of multithreading. Our experiments indicate the performance of multithreading, with a small number of threads per processor, is very comparable to that of programs written using message-passing. Our studies also show multithreading has an added advantage over message-passing in that it is relatively insensitive to initial data distribution.
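As a rough picture of the kind of decomposition involved, the sketch below multiplies two matrices with a small pool of threads, each computing a subset of the result rows from shared inputs. It is an illustration only and does not reproduce the PVM-based shared-memory simulation of the paper: it uses Python's standard `concurrent.futures` module, the cyclic row split is an assumption, and in CPython it shows the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a, b, rows):
    """Compute the rows of the product a*b whose indices are listed in rows."""
    cols = range(len(b[0]))
    inner = range(len(b))
    return [(i, [sum(a[i][k] * b[k][j] for k in inner) for j in cols]) for i in rows]


def threaded_matmul(a, b, num_threads=2):
    """Multiply a by b, giving each thread a cyclic slice of the result rows."""
    n = len(a)
    chunks = [range(t, n, num_threads) for t in range(num_threads)]
    c = [None] * n
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # All threads read the shared inputs a and b; no data is sent explicitly.
        for part in pool.map(lambda rows: multiply_rows(a, b, rows), chunks):
            for i, row in part:
                c[i] = row
    return c
```

Because each thread reads whatever parts of the shared operands it needs, the code does not depend on how the matrices were initially laid out, which is the property the abstract contrasts with message-passing.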
TL;DR: It is observed that for the shared-nothing parallel TP system, the negative effect of 2PL can be alleviated significantly by scheduling transactions judiciously, and a new transaction scheduling algorithm called FCFSP (FCFS with Priority) is therefore proposed.
Abstract: Shared-nothing parallel transaction processing (TP) systems have great potential to serve the ever-increasing demands for high transaction processing rates. This potential, however, may not be reached due to the negative effect of the widely used two-phase locking (2PL) concurrency control method. We observed that for the shared-nothing parallel TP system, this negative effect of 2PL can be alleviated significantly by scheduling transactions judiciously. In this paper, a new transaction scheduling algorithm called FCFSP (FCFS with Priority) is therefore proposed. In order to study the performance of transaction scheduling algorithms, a comprehensive simulator for shared-nothing parallel TP systems is developed. Using the developed simulator, the performance of FCFSP is compared with that of the conventional FCFS and the previously proposed SCST (Synchronizing Completion of SubTransactions) transaction scheduling algorithms. Simulation results demonstrate the effectiveness of FCFSP. Simulation results also show that FCFSP outperforms FCFS greatly, and overcomes the drawback of SCST.
TL;DR: A multithreaded vector co-processor design is described, intended to be placed, with its private vector memory, on an expansion board linked to the scalar processor and its cache-based memory hierarchy.
Abstract: A multithreaded vector co-processor design is described. It is intended to be placed, with its private vector memory, on an expansion board linked to the scalar processor and its cache-based memory hierarchy. The vector co-processor can run up to 8 vector tasks (threads) in parallel. Vector registers can be accessed either as independent sets of scalar values or as array sets. Tomasulo's algorithm, simplified to keep the issue and termination logic simple in a multithreaded context, dynamically schedules the dependent instructions. A locking feature is provided to handle both reductions and complex recurrences in vector form.