TL;DR: A new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models is introduced and is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization.
Abstract: Scalable shared-memory multiprocessors distribute memory among the processors and use scalable interconnection networks to provide high bandwidth and low latency communication. In addition, memory accesses are cached, buffered, and pipelined to bridge the gap between the slow shared memory and the fast processors. Unless carefully controlled, such architectural optimizations can cause memory accesses to be executed in an order different from what the programmer expects. The set of allowable memory access orderings forms the memory consistency model or event ordering model for an architecture. This paper introduces a new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models. A framework for classifying shared accesses and reasoning about event ordering is developed. The release consistency model is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization. Possible performance gains from the less strict constraints of the release consistency model are explored. Finally, practical implementation issues are discussed, concentrating on issues relevant to scalable architectures.
TL;DR: This paper focuses on the design and use of Munin's memory coherence mechanisms, and compares the approach to previous work in this area.
Abstract: We are developing Munin, a system that allows programs written for shared memory multiprocessors to be executed efficiently on distributed memory machines. Munin attempts to overcome the architectural limitations of shared memory machines, while maintaining their advantages in terms of ease of programming. Our system is unique in its use of loosely coherent memory, based on the partial order specified by a shared memory parallel program, and in its use of type-specific memory coherence. Instead of a single memory coherence mechanism for all shared data objects, Munin employs several different mechanisms, each appropriate for a different class of shared data object. These type-specific mechanisms are part of a runtime system that accepts hints from the user or the compiler to determine the coherence mechanism to be used for each object. This paper focuses on the design and use of Munin's memory coherence mechanisms, and compares our approach to previous work in this area.
TL;DR: A priority-based synchronization protocol that explicitly uses shared-memory primitives is defined and analyzed; the underlying priority considerations for a shared-memory synchronization protocol are studied, and priority assignments to be used by the protocol are derived.
Abstract: A priority-based synchronization protocol that explicitly uses shared-memory primitives is defined and analyzed. A solution that has been proposed for bounding and minimizing synchronization delays in real-time systems is briefly reviewed. The waiting times introduced by synchronization requirements in multiple-processor environments are identified, and a set of goals for priority-based multiprocessor synchronization protocols is derived. The underlying priority considerations for a shared-memory synchronization protocol are studied, and priority assignments to be used by the protocol are derived.
TL;DR: It is shown that the correct choice of algorithm is determined largely by the memory access behavior of the applications, and some limitations of distributed shared memory are noted.
Abstract: Four basic algorithms for implementing distributed shared memory are compared. Conceptually, these algorithms extend local virtual address spaces to span multiple hosts connected by a local area network, and some of them can easily be integrated with the hosts' virtual memory systems. The merits of distributed shared memory and the assumptions made with respect to the environment in which the shared memory algorithms are executed are described. The algorithms are then described, and a comparative analysis of their performance in relation to application-level access behavior is presented. It is shown that the correct choice of algorithm is determined largely by the memory access behavior of the applications. Two particularly interesting extensions of the basic algorithms are described, and some limitations of distributed shared memory are noted.
TL;DR: This work examines the effectiveness of optimizations aimed at allowing distributed machines to efficiently compute inner loops over globally defined data structures, targeting loops in which some array references are made through a level of indirection.
TL;DR: A new programming environment for distributed memory architectures is presented, providing a global name space and allowing direct access to remote parts of data values; the efficiency of the resulting code on the NCUBE/7 and IPSC/2 hypercubes is also presented.
Abstract: Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece “owned” by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. This paper presents a new programming environment for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. We describe the analysis and program transformations required to implement this environment, and present the efficiency of the resulting code on the NCUBE/7 and IPSC/2 hypercubes.
TL;DR: An optical volume memory based on the two-photon effect is proposed; it allows for high density and parallel access, and its high capacity and throughput may overcome the disadvantages of current memories.
Abstract: The advent of optoelectronic computers and highly parallel electronic processors has brought about a need for storage systems with enormous memory capacity and memory bandwidth. These demands cannot be met with current memory technologies (i.e., semiconductor, magnetic, or optical disk) without having the memory system completely dominate the processors in terms of the overall cost, power consumption, volume, and weight. As a solution, we propose an optical volume memory based on the two-photon effect which allows for high density and parallel access. In addition, the two-photon 3-D memory system has the advantages of having high capacity and throughput which may overcome the disadvantages of current memories.
TL;DR: In this paper, a method and system for independently resetting primary and secondary processors 20 and 120 respectively under program control in a multiprocessor, cache memory system is presented.
Abstract: A method and system for independently resetting primary and secondary processors 20 and 120 respectively under program control in a multiprocessor, cache memory system. Processors 20 and 120 are reset without causing cache memory controllers 24 and 124 to reset.
TL;DR: A user-transparent checkpointing recovery scheme and a new twin-page disk storage management technique are presented for implementing recoverable distributed shared virtual memory.
Abstract: The problem of rollback recovery in distributed shared virtual environments, in which the shared memory is implemented in software in a loosely coupled distributed multicomputer system, is examined. A user-transparent checkpointing recovery scheme and a new twin-page disk storage management technique are presented for implementing recoverable distributed shared virtual memory. The checkpointing scheme can be integrated with the memory coherence protocol for managing the shared virtual memory. The twin-page disk design allows checkpointing to proceed in an incremental fashion without an explicit undo at the time of recovery. The recoverable distributed shared virtual memory allows the system to restart computation from a checkpoint without a global restart.
TL;DR: A discussion is presented of two ways of mapping the cells in a two-dimensional area of a chip onto processors in an n-dimensional hypercube such that both small and large cell moves can be applied.
Abstract: A discussion is presented of two ways of mapping the cells in a two-dimensional area of a chip onto processors in an n-dimensional hypercube such that both small and large cell moves can be applied. Two types of move are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described, along with a distributed data structure that needs to be stored in the hypercube to support such a parallel cost evaluation. A novel tree broadcasting strategy is presented for the hypercube that is used extensively in the algorithm for updating cell locations in the parallel environment. A dynamic parallel annealing schedule is proposed that estimates the errors due to interacting parallel moves and adapts the rate of synchronization automatically. Two novel approaches in controlling error in parallel algorithms are described: heuristic cell coloring and adaptive sequence control. The performance on an Intel iPSC-2/D4/MX hypercube is reported.
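The abstract does not spell out its novel tree broadcasting strategy, so the following is only a sketch of the standard binomial-tree broadcast on a hypercube, given as background rather than as the authors' method. The function name `binomial_broadcast` and its parameters are hypothetical. In step k, every node that already holds the message forwards it to its neighbor across dimension k, so all 2^n nodes are reached in n steps.

```python
def binomial_broadcast(dim, root=0):
    """Return, for each step, the (sender, receiver) pairs of a
    binomial-tree broadcast on a hypercube with 2**dim nodes.

    This is the textbook scheme, NOT the novel strategy of the paper.
    """
    informed = {root}          # nodes that currently hold the message
    schedule = []
    for k in range(dim):
        # Each informed node sends across dimension k (flip bit k).
        step = [(src, src ^ (1 << k)) for src in sorted(informed)]
        informed.update(dst for _, dst in step)
        schedule.append(step)
    return schedule


if __name__ == "__main__":
    for k, step in enumerate(binomial_broadcast(3)):
        print(f"step {k}: {step}")
```

The number of informed nodes doubles at every step, which is why the broadcast completes in a number of steps equal to the cube dimension.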
TL;DR: In this paper, the authors propose a message passing mechanism for a plurality of processors interconnected by a shared intelligent memory for secure passing of messages between tasks operated on said processors, where each processor includes serving means for getting the messages to the task operated by each processor.
Abstract: In the environment of a plurality of processors interconnected by a shared intelligent memory, a mechanism for the secure passing of messages between tasks operated on said processors is provided. Inter-task message passing is provided by shared intelligent memory for storing the messages transmitted by sending tasks. Further, each processor includes serving means for getting the messages to be sent to the task operated by said each processor. The passing of messages from a processor to the shared intelligent memory and from the latter to another processor is made, using a set of high-level microcoded commands. A process is provided using the message passing mechanism together with redundancies built into the shared memory, to ensure fault-tolerant message passing in which the tasks operated primarily on a processor are automatically replaced by back-up tasks executed on another processor if the first processor fails.
TL;DR: This paper examines the design of a highly efficient, reliable, machine-independent protocol used by the remote memory server to communicate with the client machines, and outlines the algorithms and data structures employed by the Remote Memory Model to efficiently locate the data stored on the server.
Abstract: This paper describes a new model for constructing distributed systems called the Remote Memory Model. The remote memory model consists of several client machines, one or more dedicated machines called remote memory servers, and a communication channel interconnecting them. In the remote memory model, client machines share the memory resources located on the remote memory server. Client machines that exhaust their local memory move portions of their address space to the remote memory server and retrieve pieces as needed. Because the remote memory server uses a machine-independent protocol to communicate with client machines, the remote memory server can support multiple heterogeneous client machines simultaneously. This paper describes the remote memory model and discusses the advantages and issues of systems that use this model. It examines the design of a highly efficient, reliable, machine-independent protocol used by the remote memory server to communicate with the client machines. It also outlines the algorithms and data structures employed by the remote memory server to efficiently locate the data stored on the server. Finally, it presents measurements of a prototype implementation that clearly demonstrate the viability and competitive performance of the remote memory model.
TL;DR: In this paper, the active and backup processors are coupled asynchronously with hardware assist functions: a memory change detector, which captures memory changes in the memory of the active processor, and a mirroring control circuit, which dumps those changes into the memory of the backup processor when they are committed by establish recovery point signals generated by the active processor.
Abstract: A checkpointing mechanism implemented in a data processing system comprising a dual processor configuration gives the system a fault tolerance capability while minimizing the complexity of both the software and the hardware. The active and backup processors are coupled asynchronously with some hardware assist functions comprising a memory change detector which captures the memory changes in the memory of the active processor and a mirroring control circuit which causes the memory changes, when committed by establish recovery point signals generated by the active processor, to be dumped into the memory of the backup processor so that the backup processor can resume the operations of the active processor from the last established recovery point. The active and backup processors may each be connected to a dedicated memory and recovery point storing means, or to a memory including two dual sides shared by all the processors for storing data structures and recovery points.
TL;DR: A new model of asynchronous shared memory parallel computation is introduced and shown to fulfil all the listed requirements; the complexity of several fundamental parallel algorithms is also analyzed in this model.
Abstract: The contributions of this paper are twofold. First, we outline criteria by which any model of asynchronous shared memory parallel computation can be judged. Previous models are considered with respect to these factors. Next, we introduce a new model, and show that this model fulfils all the listed requirements. We also analyze in our model the complexity of several fundamental parallel algorithms.
TL;DR: The goal of the Pandore system is to allow the execution of parallel algorithms on DMPCs (Distributed Memory Parallel Computers) without having to take into account the low-level characteristics of the target distributed computer to program the algorithm.
Abstract: The goal of the Pandore system is to allow the execution of parallel algorithms on DMPCs (Distributed Memory Parallel Computers) without having to take into account the low-level characteristics of the target distributed computer to program the algorithm. No explicit process definition and interprocess communications are needed. Parallelization is achieved through logical data organization. The Pandore system provides the user with a means to specify data partitioning and data distribution over a domain of virtual processors for each parallel step of the algorithm. At compile time, Pandore splits the original program into parallel processes. Each process will execute some appropriate parts of the original code, according to the given data decomposition. In order to achieve a correct utilization of the data structures distributed over the processors, the Pandore system provides an execution scheme based on a communication layer, which is an abstraction of a message-passing architecture. This intermediate level is then implemented using the effective primitives of the real architecture (in our specific case, an Intel iPSC/2).
TL;DR: The paper presents a new programming environment, Kali, which provides a global name space and allows direct access to remote data values and a system of annotations, allowing the user to control those aspects of the program critical to performance, such as data distribution and load balancing.
Abstract: Programming nonshared memory systems is more difficult than programming shared memory systems, in part because of the relatively low level of current programming environments for such machines. A new programming environment is presented, Kali, which provides a global name space and allows direct access to remote data values. In order to retain efficiency, Kali provides a system of annotations, allowing the user to control those aspects of the program critical to performance, such as data distribution and load balancing. The primitives and constructs provided by the language are described, and some of the issues raised in translating a Kali program for execution on distributed memory systems are also discussed.
TL;DR: ASPAR (Automatic and Symbolic PARallelization), which consists of a source-to-source parallelizer and a set of interactive graphic tools, is described; it is designed for easy modification for other languages such as Fortran.
Abstract: This paper describes ASPAR (Automatic and Symbolic PARallelization) which consists of a source-to-source parallelizer and a set of interactive graphic tools. While the issues of data dependency have already been explored and used in many parallel computer systems such as vector and shared memory machines, distributed memory parallel computers require, in addition, explicit data decomposition. New symbolic analysis and data-dependency analysis methods are used to determine an explicit data decomposition scheme. Automatic parallelization models using high level communications are also described in this paper. The target applications are of the “regular-mesh” type typical of many scientific calculations.
The system has been implemented for the language C, and is designed for easy modification for other languages such as Fortran.
TL;DR: In this article, a multiprocessor system linked by a fiber optic ring network uses some of the bandwidth of the ring network as a shared memory resource, which can carry message packets from one processor to another or network memory packets which circulate indefinitely on the network.
Abstract: A multiprocessor system linked by a fiber optic ring network uses some of the bandwidth of the ring network as a shared memory resource. Data slots are defined on the network which can carry message packets from one processor to another or network memory packets which circulate indefinitely on the network. One use of these network memory packets is as a lock management system for controlling concurrent access to a shared database by the multiple processors. The network memory packets are treated as lock entities. A processor indicates that it wants to procure a lock entity by circulating a packet, having a first network memory type, around the network. If no conflicting packets are detected when the circulated packet returns, the type of the slot is changed to a second network memory type indicating a procured lock entity.
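The lock procedure in this abstract can be illustrated with a small sequential simulation. It is only a sketch under stated assumptions: the class names `Ring` and `Packet`, the type labels `REQUEST` and `LOCKED`, and the back-off rule on conflict (the requester simply withdraws its packet) are all hypothetical, since the abstract does not say how conflicts are resolved. A real ring would circulate packets concurrently; here one full circulation is modeled as a scan over all slots.

```python
from dataclasses import dataclass
from typing import List, Optional

REQUEST = "request"  # stands in for the "first network memory type"
LOCKED = "locked"    # stands in for the "second network memory type"


@dataclass
class Packet:
    kind: str
    lock_id: str
    owner: int


class Ring:
    """Toy model of data slots circulating on the ring network."""

    def __init__(self, num_slots: int):
        self.slots: List[Optional[Packet]] = [None] * num_slots

    def try_acquire(self, owner: int, lock_id: str) -> bool:
        if None not in self.slots:
            return False                      # no free slot to use
        idx = self.slots.index(None)
        self.slots[idx] = Packet(REQUEST, lock_id, owner)
        # One trip around the ring: look for a conflicting packet.
        conflict = any(
            p is not None and i != idx and p.lock_id == lock_id
            for i, p in enumerate(self.slots)
        )
        if conflict:
            self.slots[idx] = None            # assumed policy: withdraw
            return False
        self.slots[idx].kind = LOCKED         # lock entity procured
        return True

    def release(self, owner: int, lock_id: str) -> bool:
        for i, p in enumerate(self.slots):
            if p and p.kind == LOCKED and p.owner == owner and p.lock_id == lock_id:
                self.slots[i] = None
                return True
        return False
```

A second processor asking for the same `lock_id` fails until the holder calls `release`, while requests for other lock entities succeed as long as a free slot exists.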
TL;DR: A probabilistic protocol is presented that solves the Processor Identity Problem for asynchronous processors that communicate through a common shared memory; it simplifies shared memory processor design by eliminating the need to encode processor identifiers in system hardware or software structures.
TL;DR: A general-purpose multiprocessor solution for ray tracing which may be used to reduce execution time without restricting development of the ray tracing code is described.
Abstract: The ray tracing algorithm continues to attract much research and development to improve the quality of the images that are generated, and to reduce the time taken to produce them. By identifying the key requirements of a development system from the user's point of view, we describe a general-purpose multiprocessor solution for ray tracing which may be used to reduce execution time without restricting development of the ray tracing code. The solution is based upon a distributed memory multiprocessor system in which each processor addresses a small amount of memory relative to the size of the model database. Methods for exploiting the coherence of references to entries in the database are described which use a combination of dynamic and static caching techniques. This scheme allows databases of arbitrary size to be supported on multiprocessors with limited distributed memory.
TL;DR: It is shown that among the classical processor network topologies (ring, 2D torus, or n-cube), the hypercube topology minimizes the communications.
Abstract: This paper introduces the parallelization on a distributed memory multicomputer of two iterative methods for finding all the roots of a given polynomial. The parallel algorithms share the computation of the roots among the processors and perform a total exchange of the data at each step. Since the amount of communications is the main drawback of this approach, we study the effect of the network topology on the performance of the algorithms. In particular, we show that among the classical processor network topologies (ring, 2D torus, or n-cube), the hypercube topology minimizes the communications. The optimal number of processors is computed for each topology. Experiments on the hypercube FPS T40 illustrate the results.
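As a rough illustration of why topology matters for a total exchange, the sketch below compares the network diameter (the longest shortest path between two processors) of the three topologies for the same processor count. Diameter is only a proxy chosen here for illustration; it is not the cost model of the paper, and the function names are hypothetical. The formulas are the standard ones: floor(p/2) for a ring, 2·floor(s/2) for an s×s torus, and log2(p) for a hypercube.

```python
import math


def ring_diameter(p: int) -> int:
    """Longest shortest path on a bidirectional ring of p nodes."""
    return p // 2


def torus2d_diameter(p: int) -> int:
    """Diameter of a square 2D torus; p must be a perfect square."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square")
    return 2 * (side // 2)


def hypercube_diameter(p: int) -> int:
    """Diameter of a hypercube; p must be a power of two."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return p.bit_length() - 1


if __name__ == "__main__":
    for p in (16, 64, 256):
        print(p, ring_diameter(p), torus2d_diameter(p), hypercube_diameter(p))
```

For 64 processors this gives 32 hops for the ring, 8 for the torus, and 6 for the hypercube, and the gap widens as the processor count grows.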
TL;DR: This work treats parallel Gaussian elimination through three case studies (vector multiprocessor, hypercube, and systolic computing) and presents models and tools: task graph scheduling, analysis of distributed algorithms, and design methodologies for systolic arrays.
Abstract: Introduction: background (Gaussian elimination, speedup and efficiency); vector and parallel architectures (pipeline computers, vector computers, parallel computers); three case studies. Part 1 Parallel algorithm design: vector multiprocessor computing (vectorization of vector-vector operations, Gaussian elimination in terms of vector-vector kernels, vector register re-use, Gaussian elimination in terms of matrix-vector kernels, cache re-use, Gaussian elimination in terms of matrix-matrix kernels, vectorization epilogue, fine-grain parallelism, parallel Gaussian elimination); hypercube computing (topological properties of hypercubes, broadcasting, centralized Gaussian elimination, local pipelined algorithms, a word on speedup evaluation, matrices over finite fields); systolic computing (2D arrays, solving the triangular system on the fly, 1D arrays, matrices over finite fields). Part 2 Models and tools: task graph scheduling (task system for Gaussian elimination, bounds for parallel execution, an optimal schedule, with an arbitrary number of processors); analysis of distributed algorithms (data allocation strategies, speedup evaluation on distributed memory machines); design methodologies for systolic arrays (dependence mapping method, complexity results, folding).
TL;DR: In this paper, a linear block code error detection scheme is implemented with each shared memory, wherein the effect of random memory faults is sufficiently detected such that the inherent fault tolerance of a pair-spare architecture is not compromised.
Abstract: A highly reliable data processing system using the pair-spare architecture obviates the need for separate memory arrays for each processor. A single memory is shared between each pair of processors wherein a linear block code error detection scheme is implemented with each shared memory, wherein the effect of random memory faults is sufficiently detected such that the inherent fault tolerance of a pair-spare architecture is not compromised.
TL;DR: Tests of a 3D parallel implicit reservoir simulator on an Intel iPSC/2 hypercube with 16 vector processors are presented, demonstrating that up to 96% of the available CPU time on the hypercube can be used.
Abstract: Tests of a 3D parallel implicit reservoir simulator on an Intel iPSC/2 hypercube with 16 vector processors are presented. The simulator is based on an oil/water model. A correlation of computation efficiency with problem size and the number of processors demonstrates that up to 96% of the available CPU time on the hypercube can be used. Such high efficiencies were achieved by developing special algorithms well suited for multiple processors and distributed memory.
TL;DR: In this article, a data driven method for coordinating the processing of arithmetic tasks in a multiple computer system having a multiplicity of arithmetic processors by determining whether an arithmetic task is in a blocked condition or is in an execution ready condition is presented.
Abstract: A data driven method for coordinating the processing of arithmetic tasks in a multiple computer system having a multiplicity of arithmetic processors by determining whether an arithmetic task is in a blocked condition or is in an execution ready condition. A source distributed processor stores data in a local memory for processing by a local processor and then transfers the processed data to a global memory for buffering in preparation for subsequent processing by a destination distributed processor. The source distributed processor generates a produce message to a destination distributed processor to indicate that the data to be transferred is available in a buffer in the global memory. The destination distributed processor loads the data to be transferred from the buffer in the global memory and then generates a consume message to the source distributed processor to indicate that the data has been transferred from the global memory and the buffer in the global memory is now available.
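The produce/consume handshake described above can be sketched as a tiny single-threaded model. Everything named here is hypothetical (`Channel`, `source_step`, `dest_step`, the squaring used as a stand-in for local processing), and the sketch assumes one buffer per source/destination pair that starts out free, which is modeled by pre-seeding one consume message. A task is treated as execution-ready when the message it waits for is present, and as blocked otherwise.

```python
from collections import deque


class Channel:
    """One global-memory buffer plus the message queues around it."""

    def __init__(self):
        self.buffer = None
        self.produce_msgs = deque()             # source -> destination
        self.consume_msgs = deque(["consume"])  # assumed: buffer starts free


def source_step(ch, local_data):
    """Run the source task if it is ready; return False if blocked."""
    if not ch.consume_msgs:
        return False                            # buffer still in use
    ch.consume_msgs.popleft()
    ch.buffer = [x * x for x in local_data]     # stand-in for local processing
    ch.produce_msgs.append("produce")           # tell destination data is ready
    return True


def dest_step(ch):
    """Run the destination task if it is ready; return None if blocked."""
    if not ch.produce_msgs:
        return None                             # nothing produced yet
    ch.produce_msgs.popleft()
    data, ch.buffer = ch.buffer, None           # load data out of global memory
    ch.consume_msgs.append("consume")           # tell source buffer is free
    return data


if __name__ == "__main__":
    ch = Channel()
    print(dest_step(ch))             # blocked: no produce message yet
    print(source_step(ch, [1, 2, 3]))
    print(dest_step(ch))
```

Because each side runs only when its message has arrived, the two tasks never touch the buffer at the same time, and neither needs to poll the global memory itself.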
TL;DR: The authors describe a new concurrent B-tree algorithm designed to work well in large-scale parallel or distributed systems in which the number of processors sharing the tree is large or the communication delay between processors is large relative to the speed of local computation.
Abstract: The authors describe a new concurrent B-tree algorithm. The algorithm is designed to work well in large-scale parallel or distributed systems in which the number of processors sharing the tree is large or the communication delay between processors (or between processors and the global memory for a shared-memory system) is large relative to the speed of local computation. The basis of the algorithm is an abstraction that is similar to coherent shared memory, but provides a weaker semantics; this abstraction is called multi-version memory. Multi-version memory uses caches but weakens the semantics of ordinary shared memory by allowing a process reading data to be given an old version of the data. This semantics is adequate for the non-leaf nodes in the B-tree algorithms presented.
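The weaker semantics of multi-version memory can be illustrated with a minimal sketch. The class `MultiVersionMemory` and its `write`, `read`, and `refresh` methods are hypothetical names, not the interface from the paper: each reader keeps a private cache, a read may return a stale cached version, and a reader obtains the newest version only when it explicitly refreshes.

```python
class MultiVersionMemory:
    """Toy model: reads may legally return an old version of the data."""

    def __init__(self):
        self._store = {}    # key -> (version, value), authoritative copy
        self._caches = {}   # reader id -> {key: (version, value)}

    def write(self, key, value):
        """Install a new version; reader caches are NOT invalidated."""
        version = self._store.get(key, (0, None))[0] + 1
        self._store[key] = (version, value)

    def read(self, reader, key):
        """Return the reader's cached copy, fetching only on a miss."""
        cache = self._caches.setdefault(reader, {})
        if key not in cache:
            cache[key] = self._store[key]
        return cache[key]

    def refresh(self, reader, key):
        """Explicitly pull the newest version into the reader's cache."""
        self._caches.setdefault(reader, {})[key] = self._store[key]
        return self._store[key]


if __name__ == "__main__":
    mem = MultiVersionMemory()
    mem.write("root", "A")
    print(mem.read("p1", "root"))
    mem.write("root", "B")
    print(mem.read("p1", "root"))     # still the old version
    print(mem.refresh("p1", "root"))  # now the new one
```

The point of the design is that writers never have to invalidate remote caches, which avoids coherence traffic for data, such as interior B-tree nodes, where a slightly old copy is acceptable.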
TL;DR: In this article, a digital data processing system including a plurality of processors processes a program in parallel to load process data into a two-dimensional matrix having a plurality of matrix entries; because each processor can separately generate process data for different matrix entries from the preliminary data, there is no conflict in accessing the memory locations among the various processors during generation of the process data.
Abstract: A digital data processing system including a plurality of processors processes a program in parallel to load process data into a two-dimensional matrix having a plurality of matrix entries. So that the processors will not have to synchronize loading of process data into particular locations in the matrix, the matrix has a third dimension defining a plurality of memory locations, with each series of locations along the third dimension being associated with one of the matrix entries. Each processor initially loads preliminary process data into a memory location along the third dimension. After that has been completed, each processor generates process data for an entry of the two-dimensional matrix from the preliminary process data in the locations along the third dimension related thereto. Since the processors separately load preliminary process data into different memory locations along the third dimension, there is no conflict with accessing of memory locations among the various processors during generation of preliminary process data. Further, since the processors can separately generate process data for different matrix entries from the preliminary data, there is no conflict in accessing of the memory locations among the various processors during generation of the process data.
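The two-phase scheme can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the function `build_matrix` is hypothetical, the third dimension is indexed by processor number, and the step that turns preliminary data into process data is assumed to be a sum, which the abstract does not specify. In phase 1, worker p writes only to index p along the third dimension; in phase 2, each matrix row is reduced by exactly one worker, so neither phase needs a lock.

```python
from concurrent.futures import ThreadPoolExecutor


def build_matrix(rows, cols, contributions_per_proc):
    """contributions_per_proc[p] is a list of (i, j, value) items
    produced by processor p; at least one processor is assumed."""
    nproc = len(contributions_per_proc)
    # cube[i][j][p]: preliminary data of processor p for entry (i, j)
    cube = [[[0] * nproc for _ in range(cols)] for _ in range(rows)]

    def load(p):
        # Phase 1: processor p touches only index p, so no conflicts.
        for i, j, value in contributions_per_proc[p]:
            cube[i][j][p] += value

    with ThreadPoolExecutor(max_workers=nproc) as pool:
        list(pool.map(load, range(nproc)))   # wait for phase 1 to finish

    matrix = [[0] * cols for _ in range(rows)]

    def reduce_row(i):
        # Phase 2: one worker per row, so entries are written once each.
        for j in range(cols):
            matrix[i][j] = sum(cube[i][j])   # assumed combining rule

    with ThreadPoolExecutor(max_workers=nproc) as pool:
        list(pool.map(reduce_row, range(rows)))
    return matrix


if __name__ == "__main__":
    work = [[(0, 0, 1), (1, 1, 2)], [(0, 0, 3), (0, 1, 4)]]
    print(build_matrix(2, 2, work))
```

The trade-off is memory for synchronization: the structure is larger by a factor equal to the number of processors, but no two workers ever write the same location.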
TL;DR: It is demonstrated that S-threads permit a parallelization of SAC-2 down to the lowest algebraic level, and it is shown how a key parameter of the S-threads memory design influences parallel performance.
Abstract: We describe the design of PARSAC-2, a parallel version of the SAC-2 Computer Algebra system. In PARSAC-2, parallelism is based on multiple threads (lightweight processes) executing on a shared memory multiprocessor. The S-threads subsystem provides threads which are capable of parallel list processing on a shared heap. The S-threads heap memory is designed to allow concurrent list cell allocation by multiple threads with minimal synchronization overhead. S-threads may also perform parallel garbage collection, and a slightly weaker form of storage management called preventive garbage collection. We present an example of algorithm development in PARSAC by parallelizing the SAC-2 algorithm IPRODK, an integer multiplication routine based on Karatsuba's method. Using empirical data from this experiment, we demonstrate that S-threads permit a parallelization of SAC-2 down to the lowest algebraic level. Finally, we show how a key parameter of the S-threads memory design influences parallel performance.
TL;DR: A DMH system is presented, the tradeoffs between conservative and aggressive update propagation strategies are defined, and promising new strategies are identified.
Abstract: A distributed memory hierarchy (DMH) is a memory system consisting of storage modules distributed over a high-bandwidth local area network. It provides transaction applications with an abstraction of a single virtual memory space to which shared data are mapped. As in a conventional memory hierarchy (MH) in a single-machine system, a DMH is responsible for locating, migrating, and caching data pages; however, unlike a conventional MH, a DMH must do so across the storage modules in a network. In addition, a DMH must handle the problem of propagating transaction updates while preserving serializability of transactions. The performance of a DMH system is strongly influenced by concurrency control and update propagation. It is also crucial that performance analysis account for memory resources and network requirements. A DMH system is presented, the tradeoffs between conservative and aggressive update propagation strategies are defined, and promising new strategies are identified.