TL;DR: In this article, a computer cluster architecture including a plurality of CPUs at each of a plurality of nodes is described, where each CPU has the property of coherency and includes a primary cache.
Abstract: A computer cluster architecture including a plurality of CPUs at each of a plurality of nodes. Each CPU has the property of coherency and includes a primary cache. A local bus at each node couples: all the local caches, a local main memory having physical space assignable as shared space and non-shared space, and a local external coherency unit (ECU). An inter-node communication bus couples all the ECUs. Each ECU includes a monitoring section for monitoring the local and inter-node busses and a coherency section for a) responding to a non-shared cache-line request appearing on the local bus by directing the request to the non-shared space of the local memory and b) responding to a shared cache-line request appearing on the local bus by examining its coherence state to further determine if inter-node action is required to service the request and, if such action is required, transmitting a unique identifier and a coherency command to all the other ECUs. Each unit of information present in the shared space of the local memory is assigned, by the local ECU, a coherency state which may be: 1) exclusive (the local copy of the requested information is unique in the cluster); 2) modified (the local copy has been updated by a CPU in the same node); 3) invalid (a local copy either does not exist or is known to be out-of-date); or 4) shared (the local copy is one of a plurality of current copies present in a plurality of nodes).
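The four coherency states and the ECU's check for inter-node action can be sketched as follows. This is a minimal illustration, not the patented logic: the abstract does not say which state and request combinations trigger inter-node traffic, so the policy in `needs_inter_node_action` is an assumption modeled on conventional invalidation protocols, and all names are hypothetical.

```python
from enum import Enum


class CoherencyState(Enum):
    EXCLUSIVE = "exclusive"  # local copy is unique in the cluster
    MODIFIED = "modified"    # local copy updated by a CPU in the same node
    INVALID = "invalid"      # no local copy, or copy known to be out-of-date
    SHARED = "shared"        # one of several current copies across nodes


def needs_inter_node_action(state: CoherencyState, is_write: bool) -> bool:
    """Hypothetical policy for a shared cache-line request on the local bus.

    Assumption (not stated in the abstract): an invalid line always needs
    other nodes; a write to a shared line needs them so that remote copies
    can be dealt with; everything else is served inside the node.
    """
    if state is CoherencyState.INVALID:
        return True
    if is_write and state is CoherencyState.SHARED:
        return True
    return False


if __name__ == "__main__":
    for state in CoherencyState:
        for is_write in (False, True):
            kind = "write" if is_write else "read"
            print(state.value, kind, needs_inter_node_action(state, is_write))
```

Under this assumed policy, only the invalid state and writes to shared lines would cause the ECU to transmit an identifier and coherency command to the other ECUs.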
TL;DR: The Data Diffusion Machine (DDM), a cache-only memory architecture (COMA) that relies on a hierarchical network structure, is described.
Abstract: The Data Diffusion Machine (DDM), a cache-only memory architecture (COMA) that relies on a hierarchical network structure, is described. The key ideas behind DDM are introduced by describing a small machine, which could be a COMA on its own or a subsystem of a larger COMA, and its protocol. A large machine with hundreds of processors is also described. The DDM prototype project is discussed, and simulated performance results are presented.
TL;DR: MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs, is described; it introduces the notion of data oriented, in addition to code oriented, performance tuning.
Abstract: To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior: if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task. This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data oriented, in addition to code oriented, performance tuning. Thus, for both source level code objects and data objects, MemSpy provides information such as cache miss rates, causes of cache misses, and in multiprocessors, information on cache invalidations and local versus remote memory misses. MemSpy also introduces a concise matrix presentation to allow programmers to view both code and data oriented statistics at the same time. This paper presents design and implementation issues for MemSpy, and gives a detailed case study using MemSpy to tune a parallel sparse matrix application. It shows how MemSpy helps pinpoint memory system bottlenecks, such as poor spatial locality and interference among data structures, and suggests paths for improvement.
TL;DR: In this paper, the authors propose a memory control unit for controlling access by one or more devices within a processor to a memory array unit external to the processor via ports of the processor.
Abstract: A memory control unit for controlling access, by one or more devices within a processor, to a memory array unit external to the processor via one or more memory ports of the processor. The memory control unit includes a switch network to transfer data between the one or more devices of the processor and the one or more memory ports of the processor. The memory control unit also includes a switch arbitration unit to arbitrate for the switch network, and a port arbitration unit to arbitrate for the one or more memory ports.
TL;DR: In this article, the authors present a unified image and graphics processing system, which includes a parallel vector processing unit, a graphics subsystem, a shared memory and a set of high-speed data buses for connecting all of the other components.
Abstract: The present invention provides a unified image and graphics processing system that provides both image and graphics processing at high speeds. The system includes a parallel vector processing unit, a graphics subsystem, a shared memory and a set of high-speed data buses for connecting all of the other components. Generally, the parallel vector processing unit includes a series of vector processors. Each processor includes a vector address generator for efficient generation of memory addresses for regular address sequences. In order to synchronize and control the vector processors' accesses to shared memory, the parallel vector processing unit includes shared memory access logic. The logic is incorporated into each vector processor. The graphics subsystem includes a series of polygon processors in a pipelined configuration. Each processor is connected in the pipeline by a first-in-first-out (FIFO) buffer for passing data results. Additionally, each polygon processor is connected to a local shared memory in which program instructions and data are stored. The graphics subsystem also includes a device addressing mechanism for identifying a destination device using a tagged address. The shared memory, the parallel vector processor and the graphics subsystem also incorporate an abbreviated addressing scheme, which reduces the amount of information required to request sequential addresses from the shared memory.
TL;DR: In this paper, the design, implementation, and performance of heterogeneous distributed shared memory (HDSM) systems are studied and a prototype HDSM system that integrates very different types of hosts has been developed, and a number of applications of this system are reported.
Abstract: The design, implementation, and performance of heterogeneous distributed shared memory (HDSM) are studied. A prototype HDSM system that integrates very different types of hosts has been developed, and a number of applications of this system are reported. Experience shows that despite a number of difficulties in data conversion, HDSM is implementable with minimal loss in functional and performance transparency when compared to homogeneous DSM systems.
TL;DR: This work presents a methodology to easily write efficient, high performance and portable massively parallel programs, based on the definition of a new explicitly parallel programming language, namely P3L, and of a set of compiling tools that perform automatic adaptation of the program features to the target architecture hardware.
TL;DR: This paper introduces and classifies a number of advanced optimizations needed to achieve acceptable performance on MIMD distributed-memory machines; they are analyzed and empirically evaluated for stencil computations.
Abstract: The Fortran D compiler uses data decomposition specifications to automatically translate Fortran programs for execution on MIMD distributed-memory machines. This paper introduces and classifies a number of advanced optimizations needed to achieve acceptable performance; they are analyzed and empirically evaluated for stencil computations. Profitability formulas are derived for each optimization. Results show that exploiting parallelism for pipelined computations, reductions, and scans is vital. Message vectorization, collective communication, and efficient coarse-grain pipelining also significantly affect performance.
TL;DR: This thesis presents a novel approach, the constraint-based approach, to the problem of automatic data partitioning for numeric programs, implemented as part of a compiler called PARADIGM that accepts Fortran 77 programs and specifies the partitioning scheme to be used for each array in the program.
Abstract: Distributed-memory parallel computers are increasingly being used to provide high levels of performance for scientific applications. Unfortunately, such machines are not very easy to program. A number of research efforts seek to alleviate this problem by developing compilers that take over the task of generating communication. The communication overheads and the extent of parallelism exploited in the resulting target program are determined largely by the manner in which data is partitioned across different processors of the machine. Most of the compilers provide no assistance to the programmer in the crucial task of determining a good data partitioning scheme.
This thesis presents a novel approach, the constraint-based approach, to the problem of automatic data partitioning for numeric programs. In this approach, the compiler identifies some desirable requirements on the distribution of various arrays being referenced in each statement, based on performance considerations. These desirable requirements are referred to as constraints. For each constraint, the compiler determines a quality measure that captures its importance with respect to the performance of the program. The quality measure is obtained through static performance estimation, without actually generating the target data-parallel program with explicit communication. Each data distribution decision is taken by combining all the relevant constraints. The compiler attempts to resolve any conflicts between constraints such that the overall execution time of the parallel program is minimized.
This approach has been implemented as part of a compiler called PARADIGM, which accepts Fortran 77 programs and specifies the partitioning scheme to be used for each array in the program. We have obtained results on some programs taken from the Linpack and Eispack libraries, and the Perfect Benchmarks. These results are quite promising, and demonstrate the feasibility of automatic data partitioning for a significant class of scientific application programs with regular computations.
TL;DR: Discusses the essential design features of a library of scalable software for performing dense linear algebra computations on distributed memory concurrent computers and proposes the square block scattered decomposition as a flexible and general-purpose way of decomposing most, if not all, dense matrix problems.
Abstract: Discusses the essential design features of a library of scalable software for performing dense linear algebra computations on distributed memory concurrent computers. The square block scattered decomposition is proposed as a flexible and general-purpose way of decomposing most, if not all, dense matrix problems. An object-oriented interface to the library permits more portable applications to be written, and is easy to learn and use, since details of the parallel implementation are hidden from the user. Experiments on the Intel Touchstone Delta system with a prototype code that uses the square block scattered decomposition to perform LU factorization are presented and analyzed. It was found that the code was both scalable and efficient, performing at about 14 GFLOPS (double precision) for the largest problem considered.
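As a rough illustration of the decomposition named above, the sketch below maps each matrix element to a process on a 2-D grid. It assumes that "square block scattered" means the matrix is cut into square blocks that are dealt out cyclically along both grid dimensions (often called a block-cyclic layout); the function name and parameters are hypothetical and are not taken from the library described.

```python
def block_scattered_owner(i, j, block, grid_rows, grid_cols):
    """Return the (row, col) grid position of the process owning element (i, j).

    Assumption: square blocks of side `block` are assigned cyclically over a
    grid_rows x grid_cols process grid.
    """
    return ((i // block) % grid_rows, (j // block) % grid_cols)


if __name__ == "__main__":
    # Owner map for an 8 x 8 matrix, 2 x 2 blocks, on a 2 x 2 process grid.
    for i in range(8):
        print(" ".join(
            "{}{}".format(*block_scattered_owner(i, j, 2, 2, 2))
            for j in range(8)
        ))
```

Because blocks wrap around the grid, every process keeps a share of the trailing submatrix as a factorization such as LU proceeds, which is the usual load-balance argument for scattered layouts.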
TL;DR: The cache performance of a commercial System V UNIX running on a four-CPU multiprocessor is characterized and three major sources of OS misses are revealed: instruction fetches, process migration, and data accesses in block operations.
Abstract: Good cache memory performance is essential to achieving high CPU utilization in shared-memory multiprocessors. While the performance of caches is determined by both application and operating system (OS) references, most research has focused on the cache performance of applications alone. This is partially due to the difficulty of measuring OS activity and, as a result, the cache performance of the OS is largely unknown. In this paper, we characterize the cache performance of a commercial System V UNIX running on a four-CPU multiprocessor. The related issue of the performance impact of the OS synchronization activity is also studied. For our study, we use a hardware monitor that records the cache misses in the machine without perturbing it. We study three multiprocessor workloads: a parallel compile, a multiprogrammed load, and a commercial database. Our results show that OS misses occur frequently enough to stall CPUs for 17-21% of their non-idle time. Further, if we include application misses induced by OS interference in the cache, then the stall time reaches 25%. A detailed analysis reveals three major sources of OS misses: instruction fetches, process migration, and data accesses in block operations. As for synchronization behavior, we find that OS synchronization has low overhead if supported correctly and that OS locks show good locality and low contention.
TL;DR: In this paper, a write generate mode is implemented for updating cache by first allocating lines of shared memory as write before read areas and cache tags are updated directly on cache misses without reading from memory.
Abstract: A plurality of program processors, shared memory, dual port memory, external cache memory and a control processor form a multiprocessor system. A shared memory bus links the program processors, shared memory, dual port memory and external cache memory. Program processor I/O occurs through a pair of serial I/O channels coupled to one port of the dual port memory. A write generate mode is implemented for updating cache by first allocating lines of shared memory as write before read areas. For such lines, cache tags are updated directly on cache misses without reading from memory. A hit is forced for such line, resulting in valid data at the updated part and invalid data at the remaining portion. Thus, part of the line is written to and the rest invalidated. The invalid portions are not read, unless preceded by a write operation. The mode reduces the number of bus cycles by making write misses more efficient.
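The write generate behavior described above can be modeled in a few lines. This is a toy sketch under stated assumptions: the per-word valid bits, the class name, and the error raised on an unwritten read are all hypothetical choices made for illustration, since the abstract does not say how the hardware tracks which portion of a line is valid.

```python
class WriteGenerateLine:
    """Toy model of one cache line in a write-before-read area."""

    def __init__(self, words_per_line=4):
        self.tag = None
        self.data = [None] * words_per_line
        self.valid = [False] * words_per_line  # assumed per-word valid bits
        self.memory_reads = 0  # never incremented: no line fill on a write miss

    def write(self, tag, word, value):
        if self.tag != tag:
            # Write miss: update the tag directly and force a hit,
            # without reading the line from shared memory.
            self.tag = tag
            self.data = [None] * len(self.data)
            self.valid = [False] * len(self.valid)
        self.data[word] = value
        self.valid[word] = True

    def read(self, tag, word):
        if self.tag != tag or not self.valid[word]:
            raise LookupError("word not valid: a write must precede the read")
        return self.data[word]


if __name__ == "__main__":
    line = WriteGenerateLine()
    line.write(tag=0x40, word=1, value=99)
    print(line.read(0x40, 1), line.valid, line.memory_reads)
```

The saving comes from the write miss path: a conventional write-allocate cache would spend bus cycles fetching the whole line before the write, whereas here only the written word becomes valid.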
TL;DR: A fundamental relationship between three quantities that characterize an irregular parallel computation is shown: the total available parallelism, the optimal grain size, and the statistical variance of execution times for individual tasks, which yields a dynamic scheduling algorithm that substantially reduces the overhead of executing irregular parallel operations.
Abstract: This paper develops a methodology for compiling and executing irregular parallel programs. Such programs implement parallel operations whose size and work distribution depend on input data. We show a fundamental relationship between three quantities that characterize an irregular parallel computation: the total available parallelism, the optimal grain size, and the statistical variance of execution times for individual tasks. This relationship yields a dynamic scheduling algorithm that substantially reduces the overhead of executing irregular parallel operations. We incorporated this algorithm into an extended Fortran compiler. The compiler accepts as input a subset of Fortran D which includes blocked and cyclic decompositions and perfect alignment; it outputs Fortran 77 augmented with calls to library routines written in C. For irregular parallel operations, the compiled code gathers information about available parallelism and task execution time variance and uses this information to schedule the operation. On distributed memory architectures, the compiler encodes information about data access patterns for the runtime scheduling system so that it can preserve communication locality. We evaluated these compilation techniques using a set of application programs including climate modeling, circuit simulation, and x-ray tomography, that contain irregular parallel operations. The results demonstrate that, for these applications, the dynamic techniques described here achieve near-optimal efficiency on large numbers of processors. In addition, they perform significantly better, on these problems, than any previously proposed static or dynamic scheduling algorithm.
TL;DR: This paper presents the basic features of Vienna Fortran along with a set of examples illustrating the use of these features and presents the advantages of a shared memory programming paradigm while explicitly controlling the placement of data.
Abstract: Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.
TL;DR: A system that automatically determines efficient ways of mapping data onto processors is described and evaluated; it is applicable and effective across a variety of architectures.
Abstract: The performance of a program on a distributed-memory parallel computer depends on the algorithms employed, the structure and speed of the machine's communication network, and the ways in which data are distributed to the processors. This paper addresses the last of these concerns, the problem of data mapping. The paper describes and evaluates a system which automatically determines efficient ways of mapping data onto processors. The system is applicable and effective across a variety of architectures. Simulation results for machines with different interconnection schemes, including linear arrays, two-dimensional meshes, and hypercubes, and measured running times for the CM-2 show that good data mapping often improves performance by at least 20% and in some cases by more than a factor of two.
TL;DR: This paper describes a parallel implementation for the reduction of general and symmetric matrices to Hessenberg and tridiagonal form, respectively; the methods are based on LAPACK sequential codes and use a panel-wrapped mapping of matrices to nodes.
Abstract: In this paper, we describe a parallel implementation for the reduction of general and symmetric matrices to Hessenberg and tridiagonal form, respectively. The methods are based on LAPACK sequential codes and use a panel-wrapped mapping of matrices to nodes. Results from experiments on the Intel Touchstone Delta are given.
TL;DR: It is found that substantial benefits can be accrued by using relaxed models but the magnitudes of the benefits depend on the architecture being modeled, the benchmarks, and how the code is scheduled.
Abstract: Recent advances in technology are such that the speed of processors is increasing faster than memory latency is decreasing. Therefore the relative cost of a cache miss is becoming more important. However, the full cost of a cache miss need not be paid every time in a multiprocessor. The frequency with which the processor must stall on a cache miss can be reduced by using a relaxed model of memory consistency. In this paper, we present the results of instruction-level simulation studies on the relative performance benefits of using different models of memory consistency. Our vehicle of study is a shared-memory multiprocessor with processors and associated write-back caches connected to global memory modules via an Omega network. The benefits of the relaxed models, and their increasing hardware complexity, are assessed with varying cache size, line size, and number of processors. We find that substantial benefits can be accrued by using relaxed models but the magnitudes of the benefits depend on the architecture being modeled, the benchmarks, and how the code is scheduled. We did not find any major difference in levels of improvement among the various relaxed models.
TL;DR: This paper introduces the notion of a weak snapshot scan, a slightly weaker primitive that has a more efficient implementation, and gives two examples of algorithms whose performance can be enhanced while retaining a simple modular structure: bounded concurrent timestamping, and bounded randomized consensus.
Abstract: A snapshot scan algorithm takes an "instantaneous" picture of a region of shared memory that may be updated by concurrent processes. Many complex shared memory algorithms can be greatly simplified by structuring them around the snapshot scan abstraction. Unfortunately, the substantial decrease in conceptual complexity is quite often counterbalanced by an increase in computational complexity. In this paper, we introduce the notion of a weak snapshot scan, a slightly weaker primitive that has a more efficient implementation. We propose the following methodology for using this abstraction: first, design and verify an algorithm using the more powerful snapshot scan, and second, replace the more powerful but less efficient snapshot with the weaker but more efficient snapshot, and show that the weaker abstraction nevertheless suffices to ensure the correctness of the enclosing algorithm. We give two examples of algorithms whose performance can be enhanced while retaining a simple modular structure: bounded concurrent timestamping, and bounded randomized consensus. The resulting timestamping protocol is the fastest known bounded concurrent timestamping protocol. The resulting randomized consensus protocol matches the computational complexity of the best known protocol that uses only bounded values.
TL;DR: The report discusses the main issues raised in the CRPC workshop, and describes proposed desirable features of a message passing standard for distributed memory environments.
Abstract: This report presents a summary of the main ideas presented at the First CRPC Workshop on Standards for Message Passing in a Distributed Memory Environment, held April 29-30, 1992, in Williamsburg, Virginia. This workshop attracted 68 attendees including representatives from major hardware and software vendors, and was the first in a series of workshops sponsored by the Center for Research on Parallel Computation. The aim of this series of workshops is to develop and implement a standard for message passing on distributed memory concurrent computers, thereby making it easier to develop efficient, portable application codes for such machines. The report discusses the main issues raised in the CRPC workshop, and describes proposed desirable features of a message passing standard for distributed memory environments.
TL;DR: The basic architecture of the shared-memory multiprocessor is described, and computational and I/O performance is measured for both serial and parallel programs.
Abstract: Initial performance results and early experiences are reported for the Kendall Square Research multiprocessor. The basic architecture of the shared-memory multiprocessor is described, and computational and I/O performance is measured for both serial and parallel programs. Experiences in porting various applications are described.
TL;DR: The synchronization topic of MIMD combining trees (their motivation, structure, and parameters) is developed and illustrated using fetch-and-add, and the concept of a combining window is defined, which shows how to bound node buffer size.
Abstract: Philip Bitar, Aquarius Project, Computer Science Division, University of California, Berkeley, CA 94720, bitar@berkeley.edu. We develop the synchronization topic of MIMD combining trees (their motivation, their structure, their parameters) and we illustrate these principles using fetch-and-add. We define the concept of a combining window, an interval of time during which a request is held in a combining node in order to allow it to combine with subsequent incoming requests. We show that the combining window is necessary in order to realize the dual forms of concurrency, execution concurrency and storage concurrency, that a combining tree is designed to achieve. Execution concurrency among the nodes of a combining tree enables the tree to achieve the speed up that it is designed to give. Without sufficient execution concurrency, the tree will not achieve the desired speed up. Storage concurrency among the nodes of a combining tree enables the tree to achieve the buffer storage that is necessary in order to implement the combining of requests. Without sufficient storage concurrency, node buffers will overflow. More specifically, the combining window shows how to bound node buffer size.
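How a combining node merges two fetch-and-add requests can be sketched as below. This shows only the generic combine and decombine arithmetic of a single node, which is the textbook form of the technique; it does not model the combining window or buffer bounds from the paper, and all function names are hypothetical.

```python
def fetch_and_add(memory, addr, inc):
    """Atomically (in this single-threaded model) add inc and return the old value."""
    old = memory[addr]
    memory[addr] = old + inc
    return old


def combine(first_inc, second_inc):
    """Merge two requests to the same address into one.

    Returns the combined increment to forward up the tree, plus the first
    increment, which the node must buffer until the reply comes back.
    """
    return first_inc + second_inc, first_inc


def decombine(returned_value, saved_first_inc):
    """Split the single reply into replies for the two original requests."""
    return returned_value, returned_value + saved_first_inc


if __name__ == "__main__":
    memory = {"X": 10}
    combined, saved = combine(3, 5)
    old = fetch_and_add(memory, "X", combined)  # one access instead of two
    print(decombine(old, saved), memory["X"])
```

The two replies match what a serial execution of the first request followed by the second would return, while the memory module sees a single access. The saved increment is the per-request buffer storage whose bound the abstract ties to the combining window.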
TL;DR: The paper presents the major features of Vienna Fortran 90, which combines the advantages of the shared memory programming paradigm with mechanisms for explicit user control of those aspects of the program which have the greatest impact on efficiency, and gives examples of their use.
Abstract: Vienna Fortran 90 is a language extension of Fortran 90 which enables the user to write programs for distributed memory multiprocessors using global data references only. Performance of software on such systems is profoundly influenced by the manner in which data is distributed to the processors. Hence, Vienna Fortran 90 provides the user with a wide range of facilities for the mapping of data to processors. It combines the advantages of the shared memory programming paradigm with mechanisms for explicit user control of those aspects of the program which have the greatest impact on efficiency. The paper presents the major features of Vienna Fortran 90 and gives examples of their use.
TL;DR: This paper presents a mathematically equivalent rearrangement of the standard algorithm that reduces the number of communication phases and gives a second derivation of the modified conjugate gradient algorithm in terms of the natural relationship with the underlying Lanczos process.
Abstract: The standard formulation of the conjugate gradient algorithm involves two inner product computations. The results of these two inner products are needed to update the search direction and the computed solution. In a distributed memory parallel environment, the computation and subsequent distribution of these two values requires two separate communication and synchronization phases. In this paper, we present a mathematically equivalent rearrangement of the standard algorithm that reduces the number of communication phases. We give a second derivation of the modified conjugate gradient algorithm in terms of the natural relationship with the underlying Lanczos process. We also present empirical evidence of the stability of this modified algorithm.
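For reference, the standard formulation mentioned above can be sketched in plain Python with the two inner products marked. This is the textbook algorithm, not the paper's rearrangement; the helper names are hypothetical and the code is serial, so the comments only indicate where a distributed implementation would need a global reduction.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    """Standard CG for a symmetric positive definite matrix A (list of rows)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # initial search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol:    # tol is compared against the squared residual norm
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)   # inner product 1: (p, Ap), first reduction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)        # inner product 2: (r, r), second reduction
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


if __name__ == "__main__":
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

The second inner product uses the residual that was just updated with `alpha`, which itself depends on the first inner product. That data dependence is why the two reductions cannot simply be merged into one communication phase without rearranging the algorithm.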
TL;DR: A hybrid strategy to ray tracing parallelization is applied, using ray-dataflow within an image partition approach, which allows the flexible and effective management of huge datasets on architectures with limited local memory.
Abstract: A parallel solution to the visualization of high resolution volume data is presented. Based on the ray tracing (RT) visualization technique, the system works on a distributed memory MIMD architecture. A hybrid strategy to ray tracing parallelization is applied, using ray-dataflow within an image partition approach. This strategy allows the flexible and effective management of huge datasets on architectures with limited local memory. The dataset is distributed over the nodes using a slice-partitioning technique. The simple data partition chosen implies a straightforward communications pattern of the visualization processes and this improves both software design and efficiency, while providing deadlock prevention. The partitioning technique used and the network interconnection topology allow for the efficient implementation of a static load balancing technique through pre-rendering of a low resolution image. Details related to the practical issues involved in the parallelization of volumetric RT are discussed, with particular reference to deadlock and termination issues.
TL;DR: This paper outlines two methods which it is believed will play an important role in any distributed memory compiler able to handle sparse and unstructured problems and describes a viable mechanism for tracking and reusing copies of off-processor data.
Abstract: Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods.
TL;DR: PDS is a collection of Fortran subroutines for solving unconstrained nonlinear optimization problems using direct search methods, written for execution on sequential, distributed memory, and shared memory parallel machines.
Abstract: PDS is a collection of Fortran subroutines for solving unconstrained nonlinear optimization problems using direct search methods. The software is written so that execution on sequential machines is straightforward while execution on Intel distributed memory machines, such as the iPSC/2, the iPSC/860 or the Touchstone Delta, can be accomplished simply by including a few well-defined routines containing calls to Intel-specific Fortran libraries. Those interested in using the methods on other distributed memory machines, even something as basic as a network of workstations or personal computers, need only modify these few subroutines to handle the global communication requirements. Furthermore, since the parallelism is clearly defined at the "doloop" level, it is a simple matter to insert compiler directives that allow for execution on shared memory parallel machines. Included here is an example of such directives, contained in comment statements, for execution on a Sequent Symmetry S81.
TL;DR: In this article, a method of communication between processors used with a multiprocessor system comprises the steps of: storing information for specifying a processor connected to the shared memory for direct access thereto in a predetermined register of the shared memory; feeding a communication instruction for instructing a first processor to communicate with a second processor via the shared memory; and checking, in response to the communication instruction, whether or not the first and second processors are connected to the shared memory to enable direct access.
Abstract: An information processing system comprises: plural processors; a shared memory connected to the plurality of processors for enabling communication between the processors; a unit disposed in the shared memory for storing information for specifying a processor connected thereto; and a unit for checking, when a first processor communicates with a second processor, whether or not the first and second processors are connected to the shared memory for direct access thereto by referring to the information storing means. A method of communication between processors used with a multiprocessor system, comprises the steps of: storing information for specifying a processor connected to the shared memory for direct access thereto in a predetermined register of the shared memory; feeding a communication instruction for instructing a first processor to communicate with a second processor via the shared memory; checking, in response to the communication instruction, whether or not the first and second processors are connected to the shared memory to enable direct access; storing communication information from the first processor in the shared memory, in response to confirmation that the first and second processors are connected to the shared memory; feeding a communication read interruption from the shared memory to the second processor; and reading out, in response to the communication read interruption, the communication information from the shared memory to feed the communication information to the second processor.
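The sequence of steps in the method can be modeled as a small simulation. This is only a sketch under assumptions: the class, the single-slot mailbox per destination, and the list standing in for read interruptions are hypothetical simplifications, and the abstract does not say what happens when the connection check fails (here `send` just returns `False`).

```python
class SharedMemory:
    """Toy model of shared-memory communication between processors."""

    def __init__(self, connected_processors):
        # Register holding the processors that have direct access.
        self.connected_register = set(connected_processors)
        self.mailbox = {}             # destination -> (source, payload)
        self.pending_interrupts = []  # destinations awaiting a read

    def send(self, src, dst, payload):
        # Check that both processors are connected for direct access.
        if src not in self.connected_register or dst not in self.connected_register:
            return False
        # Store the communication information, then signal the receiver.
        self.mailbox[dst] = (src, payload)
        self.pending_interrupts.append(dst)
        return True

    def handle_read_interrupt(self, dst):
        # The receiver reads the stored information in response to the interrupt.
        self.pending_interrupts.remove(dst)
        return self.mailbox.pop(dst)


if __name__ == "__main__":
    shm = SharedMemory({"P0", "P1"})
    print(shm.send("P0", "P1", "hello"))    # both connected
    print(shm.handle_read_interrupt("P1"))
    print(shm.send("P0", "P2", "hello"))    # P2 is not in the register
```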
TL;DR: Coarse grain message passing and shared memory algorithms for solving the quasi-triangular Sylvester equation are discussed and estimators based on the Frobenius norm and the 1-norm, respectively are presented.
Abstract: Coarse grain message passing and shared memory algorithms for solving the quasi-triangular Sylvester equation are discussed. The basic algorithm is of block type, i.e., rich in matrix-matrix operations. The focus is on computing reliable estimates of the $\operatorname{sep}^{-1}$ function (a natural condition number for the Sylvester equation and the invariant subspace problem). Estimators based on the Frobenius norm and the 1-norm, respectively, are presented. Accuracy, efficiency, and reliability results are presented. The applicability of the estimators to both the shared memory and distributed memory paradigms is discussed. Some performance results of the parallel block algorithms with condition estimators are also presented. The reliability of both estimators is very good. The Frobenius norm-based estimator is much more efficient in both sequential and parallel settings (on average between four and five times). Further, it is applicable to both the standard and generalized problems.