TL;DR: A new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models is introduced and is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization.
Abstract: Scalable shared-memory multiprocessors distribute memory among the processors and use scalable interconnection networks to provide high bandwidth and low latency communication. In addition, memory accesses are cached, buffered, and pipelined to bridge the gap between the slow shared memory and the fast processors. Unless carefully controlled, such architectural optimizations can cause memory accesses to be executed in an order different from what the programmer expects. The set of allowable memory access orderings forms the memory consistency model or event ordering model for an architecture.This paper introduces a new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models. A framework for classifying shared accesses and reasoning about event ordering is developed. The release consistency model is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization. Possible performance gains from the less strict constraints of the release consistency model are explored. Finally, practical implementation issues are discussed, concentrating on issues relevant to scalable architectures.
TL;DR: This work discusses the experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system, which allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory.
Abstract: Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value.
TL;DR: This work evaluates the implementation of Munin and describes the execution of two Munin programs that achieve performance within ten percent of message passing implementations of the same programs.
Abstract: Munin is a distributed shared memory (DSM) system that allows shared memory parallel programs to be executed efficiently on distributed memory multiprocessors. Munin is unique among existing DSM systems in its use of multiple consistency protocols and in its use of release consistency. In Munin, shared program variables are annotated with their expected access pattern, and these annotations are then used by the runtime system to choose a consistency protocol best suited to that access pattern. Release consistency allows Munin to mask network latency and reduce the number of messages required to keep memory consistent. Munin's multiprotocol release consistency is implemented in software using a delayed update queue that buffers and merges pending outgoing writes. A sixteen-processor prototype of Munin is currently operational. We evaluate its implementation and describe the execution of two Munin programs that achieve performance within ten percent of message passing implementations of the same programs. Munin achieves this level of performance with only minor annotations to the shared memory programs.
TL;DR: Four techniques for reducing the amount of communication needed to keep the distributed memories consistent are presented: software release consistency; multiple consistency protocols; write-shared protocols; and an update-with-timeout mechanism.
Abstract: Distributed shared memory (DSM) is an abstraction of shared memory on a distributed-memory machine. Hardware DSM systems support this abstraction at the architecture level; software DSM systems support the abstraction within the runtime system. One of the key problems in building an efficient software DSM system is to reduce the amount of communication needed to keep the distributed memories consistent. In this article we present four techniques for doing so: software release consistency; multiple consistency protocols; write-shared protocols; and an update-with-timeout mechanism. These techniques have been implemented in the Munin DSM system. We compare the performance of seven Munin application programs: first to their performance when implemented using message passing, and then to their performance when running on a conventional software DSM system that does not embody the preceding techniques. On a 16-processor cluster of workstations, Munin's performance is within 5% of message passing for four out of the seven applications. For the other three, performance is within 29 to 33%. Detailed analysis of two of these three applications indicates that the addition of a function-shipping capability would bring their performance to within 7% of the message-passing performance. Compared to a conventional DSM system, Munin achieves performance improvements ranging from a few to several hundred percent, depending on the application.
TL;DR: Compared with existing approaches for heterogeneous computing, this system transparently handles both the hardware differences and the distributed nature of the system and is therefore much easier to program.