Top 526 papers published in the topic of Distributed memory in 1996

Showing papers on "Distributed memory" published in 1996

TreadMarks: shared memory computing on networks of workstations

Cristiana Amza¹, Alan L. Cox¹, Sandhya Dwarkadas¹, P. Keleher¹, Honghui Lu¹, Ramakrishnan Rajamony¹, Weimin Yu¹, Willy Zwaenepoel¹•Institutions (1)

Rice University¹

01 Feb 1996-IEEE Computer

TL;DR: This work discusses the experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system, which allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory.

Abstract: Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value.

951 citations

Book Chapter•10.1007/3-540-62852-5_6•

A Note on Distributed Computing

Jim Waldo¹, Geoff Wyant¹, Ann M. Wollrath¹, Samuel C. Kendall¹•Institutions (1)

Sun Microsystems¹

8 Jul 1996

TL;DR: In this paper, the authors look at a number of distributed systems that have attempted to paper over the distinction between local and remote objects, and show that such systems fail to support basic requirements of robustness and reliability.

Abstract: We argue that objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space. These differences are required because distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure. We look at a number of distributed systems that have attempted to paper over the distinction between local and remote objects, and show that such systems fail to support basic requirements of robustness and reliability. These failures have been masked in the past by the small size of the distributed systems that have been built. In the enterprise-wide distributed systems foreseen in the near future, however, such a masking will be impossible. We conclude by discussing what is required of both systems-level and application-level programmers and designers if one is to take distribution seriously.

451 citations

Proceedings Article•10.1145/232973.232983•

Memory Bandwidth Limitations of Future Microprocessors

Doug Burger¹, James R. Goodman¹, Alain Kagi¹•Institutions (1)

University of Wisconsin-Madison¹

1 May 1996

TL;DR: It is predicted that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips, and pin bandwidth limitations will make more complex on-chip caches cost-effective.

Abstract: This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches---implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.

396 citations

Proceedings Article•10.1145/237090.237179•

Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Daniel J. Scales, Kourosh Gharachorloo, Chandramohan A. Thekkath

1 Sep 1996

TL;DR: The primary focus of this paper is to describe the techniques used in Shasta to reduce the checking overhead for supporting fine granularity sharing in software, including careful layout of the shared address space and scheduling the checking code for efficient execution on modern processors.

Abstract: This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. In addition, the system allows the coherence granularity to vary across different shared data structures in a single application. Shasta implements the shared address space by transparently rewriting the application executable to intercept loads and stores. For each shared load or store, the inserted code checks to see if the data is available locally and communicates with other processors if necessary. The system uses numerous techniques to reduce the run-time overhead of these checks. Since Shasta is implemented entirely in software, it also provides tremendous flexibility in supporting different types of cache coherence protocols. We have implemented an efficient cache coherence protocol that incorporates a number of optimizations, including support for multiple communication granularities and use of relaxed memory models. This system is fully functional and runs on a cluster of Alpha workstations. The primary focus of this paper is to describe the techniques used in Shasta to reduce the checking overhead for supporting fine granularity sharing in software. These techniques include careful layout of the shared address space, scheduling the checking code for efficient execution on modern processors, using a simple method that checks loads using only the value loaded, reducing the extra cache misses caused by the checking code, and combining the checks for multiple loads and stores. To characterize the effect of these techniques, we present detailed performance results for the SPLASH-2 applications running on an Alpha processor. Without our optimizations, the checking overheads are excessively high, exceeding 100% for several applications. However, our techniques are effective in reducing these overheads to a range of 5% to 35% for almost all of the applications. We also describe our coherence protocol and present some preliminary results on the parallel performance of several applications running on our workstation cluster. Our experience so far indicates that once the cost of checking memory accesses is reduced using our techniques, the Shasta approach is an attractive software solution for supporting a shared address space with fine-grain access to data.
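
The abstract mentions a method that checks loads using only the value loaded. The following is a minimal sketch of one way such a check could work, under the assumption that lines not valid locally are filled with a reserved flag value. The names `FLAG`, `LINE_WORDS`, `Node`, and `fetch_line` are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: a load check that inspects only the loaded value.
# Assumption: lines that are not valid locally are filled with a reserved FLAG.
FLAG = 0xDEADBEEF   # illustrative flag value, not from the paper
LINE_WORDS = 4      # illustrative line size in words


class Node:
    def __init__(self, n_lines, home):
        self.mem = [FLAG] * (n_lines * LINE_WORDS)  # every line starts invalid
        self.valid = [False] * n_lines              # per-line state table
        self.home = home                            # stand-in for remote memory
        self.misses = 0

    def fetch_line(self, line):
        # Slow path: copy the whole line from the simulated home node.
        base = line * LINE_WORDS
        self.mem[base:base + LINE_WORDS] = self.home[base:base + LINE_WORDS]
        self.valid[line] = True
        self.misses += 1

    def load(self, addr):
        value = self.mem[addr]
        if value == FLAG:                  # fast check on the value alone
            line = addr // LINE_WORDS
            if not self.valid[line]:       # confirm with the state table
                self.fetch_line(line)
                value = self.mem[addr]
        return value
```

In this sketch the common case (valid data that differs from the flag) costs a single comparison; the state table is consulted only when the loaded value happens to equal the flag.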

363 citations

Proceedings Article•10.1145/237090.237144•

Synchronization and communication in the T3E multiprocessor

Steven L. Scott¹•Institutions (1)

Cray¹

1 Sep 1996

TL;DR: The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers), which provide a rich set of atomic memory operations and a flexible, user-level messaging facility.

Abstract: This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale behind changes made for the T3E. We include performance measurements for various aspects of communication and synchronization. The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers). E-registers are used as the source or target for all remote communication. They provide a highly pipelined interface to global memory that allows dozens of requests per processor to be outstanding. Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. The T3E also provides a set of virtual hardware barrier/eureka networks that can be arbitrarily embedded into the 3D torus interconnect.

317 citations

Proceedings Article•10.1145/232973.232984•

Missing the Memory Wall: The Case for Processor/Memory Integration

Ashley Saulsbury¹, Fong Pong², Andreas Nowatzyk²•Institutions (2)

Swedish Institute of Computer Science¹, Sun Microsystems²

1 May 1996

TL;DR: It is shown that processor memory integration can be used to build competitive, scalable and cost-effective MP systems and results from execution driven uni- and multi-processor simulations show that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor.

Abstract: Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance. This paper argues for an integrated system approach that uses less-powerful CPUs that are tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next generation 0.25µm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor memory integration can be used to build competitive, scalable and cost-effective MP systems. We present results from execution driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct mapped instruction caches with long lines are very effective, as are column buffer data caches augmented with a victim cache.

245 citations

Proceedings Article•10.1109/HICSS.1996.495480•

Parallel and Distributed Simulation

Alois Ferscha¹•Institutions (1)

University of Vienna¹

3 Jan 1996

TL;DR: The DEVS formalism for system modeling, and a DEVS implementation based on object oriented technology enabling a full exploitation of a parallel simulation model execution are proposed and presented in the first paper.

Abstract: Parallel and distributed simulation (PDS) over the one and a half decade of its existence has turned out to be more foundational than merely solving the causality preservation dilemma in partially ordered event structures as they occur in parallel and distributed simulation executions. Today’s availability of parallel and distributed computing and communication technology has given a new relevance to the field that could not have been foreseen in its early days. The potential improvement of elapsed time for large simulation experiments via the involvement of a set of individual processing nodes of a parallel (shared or distributed memory) computer or distributed system is more promising today than at any time in history of the field. Accelerating simulation experiments for large system models is naturally the outstanding PDS research goal. But above this, many scientists, also from within the core of classical PDS field, have seen impact of the PDS theory and methodology also in nonstandard application areas such as parallel program execution. The intent of this minitrack was to provide a forum for the exploration of new high performance simulation concepts and techniques, as well as their successful application on today’s and tomorrow’s parallel execution platforms. Twenty-six contributions have been submitted by authors from France, Germany, Japan, Poland, Sweden, UK and the USA, after more than seventy authors announced papers via abstracts. Eighty-four reviewers with their expertise helped to select ten full papers and two short papers for publication in the proceedings, and for oral presentation at the conference. Each paper was reviewed by at least three, at most six, and on average by 3.96 reviewers. Due to the availability of papers in electronic format, the reviewing process could be managed almost exclusively by email.
A conceptual framework and an environment for high performance simulation is presented in the first paper. Zeigler, Moon, D. Kim and J.G. Kim propose the DEVS formalism for system modeling, and a DEVS implementation based on object oriented technology enabling a full exploitation of a parallel simulation model execution. DEVS modeling is illustrated in the context of wildland fire simulation, performance comparisons are conducted for a watershed model executing sequentially on a Sparc-1000 and in parallel on a CM-5. The paper by Konas presents potentials of simulation at the confluence of object oriented system design and parallel processing: modularity, extensibility and reusability provide a natural and well structured approach to the construction of complex simulation models, while at the same time promoting a higher execution efficiency on a parallel platform. Young and Wilsey propose a new distributed fossil collection technique for the Time Warp distributed discrete event simulation protocol. Basically, every fossil collector associated with a logical process (LP) by observing event arrival times establishes a statistical model for rollback distances to determine in conjunction with a user defined rise factor that controls the aggressiveness of fossil collectors a probabilistic GVT bound. The approach appears beneficial for the reduction of the amount of used memory over GVT based fossil collection, but requires additional checkpointing for possible “catastrophic” rollbacks, i.e. restoration of states that have been fossil collected due to an overestimation of the actual GVT. Rönngren, Barriga and Ayani have developed a benchmark suite for the performance evaluation of parallel simulation kernels on different architectures and the scalability analysis of certain simulation problems.
It appears particularly hard to isolate performance influences stemming from the kernel as such or from the event structure underlying the simulation model executed by the kernel. Trying to abstract as much as possible from the latter performance impact, the authors construct a synthetic benchmark scalable

216 citations

Patent•

Operating system for a non-uniform memory access multiprocessor system

Jeffrey S. Kimmel¹, Robert A. Alfieri¹, Miles A. de Forest¹, William K. Mcgrath¹, Michael J. Mcleod¹, Mark A. O'Connell¹, Guy A. Simpson¹•Institutions (1)

EMC Corporation¹

20 Jun 1996

TL;DR: An operating system for a non-uniform memory access (NUMA) multiprocessor system that utilizes a software abstraction of the NUMA system hardware representing a hierarchical tree structure to maintain the most efficient level of affinity and to maintain balanced processor and memory loads is presented in this paper.

Abstract: An operating system for a non-uniform memory access (NUMA) multiprocessor system that utilizes a software abstraction of the NUMA system hardware representing a hierarchical tree structure to maintain the most efficient level of affinity and to maintain balanced processor and memory loads. The hierarchical tree structure includes leaf nodes representing the job processors, a root node representing at least one system resource shared by all the job processors, and a plurality of intermediate level nodes representing resources shared by different combinations of the job processors. The operating system includes a medium term scheduler for monitoring the progress of active thread groups distributed throughout the system and for assisting languishing thread groups, and a plurality of dispatchers each associated with one of the job processors for monitoring the status of the associated job processor and for obtaining thread groups for the associated job processor to execute. The operating system further includes a memory manager for allocating virtual and physical memory using a plurality of memory pools and frame treasuries.
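
The hierarchical tree described in this abstract can be illustrated with a small sketch. It assumes a simple parent-pointer tree and a hypothetical helper, `closest_shared_resource`, that finds the lowest node shared by two job processors; the topology and all names are illustrative and do not come from the patent.

```python
# Hypothetical sketch of a NUMA hardware abstraction as a tree:
# leaves are job processors, the root is a resource shared by all of them.
class TreeNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def ancestors(self):
        # Path from this node up to the root, starting with the node itself.
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path


def closest_shared_resource(a, b):
    # The lowest tree node shared by both processors approximates the
    # tightest level of affinity at which they can cooperate.
    seen = {id(n) for n in a.ancestors()}
    for node in b.ancestors():
        if id(node) in seen:
            return node
    return None


# Illustrative topology: two boards, three processors.
root = TreeNode("system")
board0 = TreeNode("board0", root)
board1 = TreeNode("board1", root)
cpu0 = TreeNode("cpu0", board0)
cpu1 = TreeNode("cpu1", board0)
cpu2 = TreeNode("cpu2", board1)
```

Here `cpu0` and `cpu1` share `board0`, while `cpu0` and `cpu2` meet only at the root, so a scheduler built on such a tree could prefer the former pair when it moves related thread groups.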

211 citations

Patent•

Transparent Local And Distributed Memory Management System

Blaine Garst, Ali T. Ozer, Bertrand Serlet, Trey Matteson

31 Jan 1996

TL;DR: In this paper, an autorelease pool is created at the beginning of a new duty cycle, which retains the newly allocated memory space during the duty cycle and is automatically disposed of at the end of a duty cycle.

Abstract: The present invention discloses a system for transparent local and distributed memory management. The invention overcomes the prior art's requirement of keeping track of whether a memory space allocated to a new object or a new program or data structure can be reclaimed. According to the present invention an autorelease pool is created at the beginning of a new duty cycle. The autorelease pool retains the newly allocated memory space during the duty cycle. The autorelease pool is automatically disposed of at the end of the duty cycle. As a result of disposing the autorelease pool, the newly allocated memory space is reclaimed (i.e., deallocated). The present invention is useful in distributed networks where different programming conventions on remote and local machines made the prior art's memory management task particularly difficult. The present invention is also useful in an object-oriented programming environment.
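
The duty-cycle behavior described in this abstract can be sketched as a scoped pool. This is a minimal illustration only, assuming a Python context manager stands in for one duty cycle; `AutoreleasePool`, `Buffer`, and `release` are hypothetical names, not the patent's interfaces.

```python
# Hypothetical sketch: a pool that retains objects for one duty cycle
# and reclaims them all when the cycle ends.
class AutoreleasePool:
    def __init__(self):
        self._objects = []

    def add(self, obj):
        # Retain the newly allocated object for the rest of the duty cycle.
        self._objects.append(obj)
        return obj

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # End of the duty cycle: dispose of the pool and everything in it.
        for obj in self._objects:
            obj.release()
        self._objects.clear()
        return False


class Buffer:
    live = 0  # number of buffers currently allocated

    def __init__(self):
        Buffer.live += 1
        self.released = False

    def release(self):
        if not self.released:
            self.released = True
            Buffer.live -= 1
```

A caller would wrap one duty cycle in `with AutoreleasePool() as pool:` and register allocations through `pool.add(...)`, so no per-object bookkeeping is needed inside the cycle.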

188 citations

Patent•

Remote checkpoint memory system and protocol for fault-tolerant computer system

Jack J. Stiffler

27 Nov 1996

TL;DR: In this paper, a mechanism for maintaining a consistent, periodically updated state in main memory without constraining normal computer operation is provided, thereby enabling a computer system to recover from faults without loss of data or processing continuity.

Abstract: A mechanism for maintaining a consistent, periodically updated state in main memory without constraining normal computer operation is provided, thereby enabling a computer system to recover from faults without loss of data or processing continuity. In this invention, a first computer includes a processor and input/output elements connected to a main memory subsystem including a primary element. A second computer has a remote checkpoint memory element, which may include one or more buffer memories and a shadow memory, which is connected to the main memory subsystem of the first computer. During normal processing, an image of data written to the primary memory element is captured by the remote checkpoint memory element. When a new checkpoint is desired (thereby establishing a consistent state in main memory to which all executing applications can safely return following a fault), the data previously captured is used to establish a new checkpointed state in the second computer. In case of failure of the first computer, the second computer can be restarted to operate from the last checkpoint established for the first computer. This structure and protocol can guarantee a consistent state in main memory, thus enabling fault-tolerant operation.
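
The capture-then-commit protocol in this abstract can be sketched as follows. The sketch assumes a write buffer and a shadow memory modeled as a Python list and dictionary; the class and method names are hypothetical and the real mechanism is hardware, not software.

```python
# Hypothetical sketch: a remote checkpoint memory that captures writes
# and commits them to a shadow copy only when a checkpoint is taken.
class CheckpointMemory:
    def __init__(self):
        self.buffer = []   # writes captured since the last checkpoint
        self.shadow = {}   # last consistent, checkpointed state

    def capture(self, addr, value):
        self.buffer.append((addr, value))

    def checkpoint(self):
        # Apply captured writes in order, establishing a new consistent state.
        for addr, value in self.buffer:
            self.shadow[addr] = value
        self.buffer.clear()

    def recover(self):
        # After a fault, discard uncommitted writes and restart from the shadow.
        self.buffer.clear()
        return dict(self.shadow)


class PrimaryComputer:
    def __init__(self, remote):
        self.memory = {}
        self.remote = remote

    def write(self, addr, value):
        self.memory[addr] = value
        self.remote.capture(addr, value)  # image of every write goes remote
```

Writes made after the last checkpoint never reach the shadow copy, which is why recovery returns to a consistent state rather than to the most recent one.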

160 citations

Patent•

Method for sharing variable-grained memory of workstations by sending particular block including line and size of the block to exchange shared data structures

Daniel J. Scales, Kourosh Gharachorloo

17 Jul 1996

TL;DR: In this article, a software-implemented method enables data sharing between the workstations using variable-sized quantities of data, with access information that includes the size of a particular block and the identity of the workstations having a copy of the block.

Abstract: In a distributed shared memory system, workstations are connected to each other by a network. Each workstation includes a processor, a memory having addresses, and an input/output interface to interconnect the workstations. A software implemented method enables data sharing between the workstations using variable sized quantities of data. A set of the addresses of the memories are designated as virtual shared addresses to store shared data. A portion of the virtual shared addresses are allocated to store a shared data structure as one or more blocks. The shared data structure is accessible by programs executing in any of the processors. The size of a particular allocated block can vary for different shared data structures. Each block includes an integer number of lines, and each line includes a predetermined number of bytes of shared data. Access information of a particular block is stored in the memory of a home one of the workstations. The access information includes the size of the particular block and an identity of workstations having a copy of the block.
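
The block and line bookkeeping described in this abstract can be sketched as a small directory kept at a home workstation. The line size of 64 bytes and the names `AccessInfo`, `HomeDirectory`, `allocate`, and `record_copy` are assumptions for illustration; the abstract only specifies "a predetermined number of bytes" per line.

```python
# Hypothetical sketch: per-block access information held at a home workstation.
from dataclasses import dataclass, field

LINE_BYTES = 64  # illustrative; the abstract gives no concrete line size


@dataclass
class AccessInfo:
    size_lines: int                               # block size, in whole lines
    sharers: set = field(default_factory=set)     # workstations holding a copy


class HomeDirectory:
    def __init__(self):
        self.blocks = {}  # block start address -> AccessInfo

    def allocate(self, start, n_bytes, lines_per_block):
        # Carve a shared data structure into blocks of a per-structure size.
        block_bytes = lines_per_block * LINE_BYTES
        n_blocks = -(-n_bytes // block_bytes)     # ceiling division
        for i in range(n_blocks):
            self.blocks[start + i * block_bytes] = AccessInfo(lines_per_block)
        return n_blocks

    def record_copy(self, block_addr, workstation):
        self.blocks[block_addr].sharers.add(workstation)
```

Because `lines_per_block` is chosen per allocation, two data structures in the same program can use different block sizes, which is the variable granularity the abstract describes.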

Dissertation•

Lazy release consistency for distributed shared memory

P. Keleher¹•Institutions (1)

Rice University¹

3 Oct 1996

TL;DR: A new class of protocols that has lower communication requirements than previous DSM protocols, and can consequently achieve higher performance, is presented.

Abstract: A software distributed shared memory (DSM) system allows shared memory parallel programs to execute on networks of workstations. This thesis presents a new class of protocols that has lower communication requirements than previous DSM protocols, and can consequently achieve higher performance. The lazy release consistent protocols achieve this reduction in communication by piggybacking consistency information on top of existing synchronization transfers. Some of the protocols also improve performance by speculatively moving data. We evaluate the impact of these features by comparing the performance of a software DSM using lazy protocols with that of a DSM using previous eager protocols. We found that seven of our eight applications performed better on the lazy system, and four of the applications showed performance speedups of at least 18%. As part of this comparison, we show that the cost of executing the slightly more complex code of the lazy protocols is far less important than the reduction in communication requirements. We also compare the lazy performance with that of a hardware supported shared memory system that uses processors and caches similar to those of the workstations running our DSM. Our DSM system was able to approach, and in one case even surpass, the performance of the hardware system for applications with coarse-grained parallelism, but the hardware system performed significantly better for programs with fine-grained parallelism. Overall, the results indicate that DSMs using lazy protocols have become a viable alternative for high-performance parallel processing.

Patent•

High memory capacity dimm with data and state memory

James Laudon, Daniel E. Lenoski, John Manton

14 May 1996

TL;DR: In this article, a high memory capacity dual in-line memory module (DIMM) for use in a directory-based, distributed shared memory multiprocessor computer system is presented.

Abstract: A high memory capacity dual in-line memory module (DIMM) for use in a directory-based, distributed shared memory multiprocessor computer system includes a data memory for storing data and a state memory for storing state or directory information corresponding to at least a portion of the data. The DIMM allows the data and the state information to be accessed independently. The DIMM can be configured in a plurality of storage capacities.

Patent•

Single chip multiprocessor architecture with internal task switching synchronization bus

Michael D. Rostoker¹, Douglas B. Boyle¹•Institutions (1)

LSI Corporation¹

3 May 1996

TL;DR: A plurality of processors which can be the same or different are formed on a single integrated circuit chip together with a memory controller and an I/O controller, and are interconnected by a data transfer bus.

Abstract: A plurality of processors which can be the same or different are formed on a single integrated circuit chip together with a memory controller and an I/O controller, and are interconnected by a data transfer bus. The processors can have larger word lengths and operate at higher speeds than comparable single chip processors due to reduced latency and signal path lengths. The processors are further interconnected by a processor synchronization bus which enables one processor to cause another processor to perform a task by generating an interrupt and passing the required parameters. The parameters can be passed via shared memory, or via a bidirectional data section of the processor synchronization bus. A processor running a large scale CAD or similar application can cause a smaller processor to perform I/O tasks in native code. A multiprocessor system can be configured as including a Single-Chip module (SCM), a Multi-Chip Module (MCM), Board-Level Product (BPL), or as a box-level product which includes a power supply.

Patent•

Replicated resource management system for managing resources in a distributed application and maintaining a relativistic view of state

Jason Jeffords¹, Roger Dev¹•Institutions (1)

University of Rochester¹

11 Jan 1996

TL;DR: In this article, the distributed memory space is divided into a plurality of memory pools, each pool containing a collection of resource objects, and each object sending its state vector to other objects, each object maintaining a state matrix of the state vectors.

Abstract: A method and apparatus for accessing resource objects contained in a distributed memory space in a communications network, including dividing the distributed memory space into a plurality of memory pools, each pool containing a collection of resource objects, providing a plurality of resource manager objects, each resource manager object having an associated set of memory pools and a registry of network unique identifiers for the resource objects in those pools, and accessing a given resource object via its network identifier. Another aspect of the invention is to provide a relativistic view of state of a plurality of objects, each object generating a state vector representing that object's view of its own state and the state of all other objects, each object sending its state vector to other objects, and each object maintaining a state matrix of the state vectors.

Patent•

Affinity scheduling of data within multi-processor computer systems

Vernon K. Boland¹•Institutions (1)

NCR Corporation¹

17 Dec 1996

TL;DR: In this article, an improved affinity scheduling system for assigning processes to processors within a multiprocessor computer system which includes a plurality of processors and cache memories associated with each processor is presented.

Abstract: An improved affinity scheduling system for assigning processes to processors within a multiprocessor computer system which includes a plurality of processors and cache memories associated with each processor. The affinity scheduler affinitizes processes to processors so that processes which frequently modify the same data are affined to the same local processor—the processor whose cache memory includes the data being modified by the processes. The scheduler monitors the scheduling and execution of processes to identify processes which frequently modify data residing in the cache memory of a non-local processor. When a process is identified which requires access to data residing in the cache memory of a non-local processor with greater frequency than the process requires access to data residing in the cache memory of its affined local processor, the affinity of the process is changed to the non-local processor.
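
The re-affinitization rule in this abstract can be sketched as a counter comparison. This is an illustrative model only: the `Process` class, the access counters, and the `rebalance` function are hypothetical, and how a real scheduler would observe which cache holds the data is not covered here.

```python
# Hypothetical sketch: move a process to the processor whose cache it
# touches more often than the cache of its current (affined) processor.
from collections import Counter


class Process:
    def __init__(self, pid, affined_cpu):
        self.pid = pid
        self.affined_cpu = affined_cpu
        self.accesses = Counter()  # cpu whose cache held the data -> count

    def record_access(self, owner_cpu):
        self.accesses[owner_cpu] += 1


def rebalance(process):
    # Change affinity only when a non-local cache is accessed strictly more
    # often than the local one during the monitoring window.
    if process.accesses:
        best_cpu, best_count = process.accesses.most_common(1)[0]
        local_count = process.accesses[process.affined_cpu]
        if best_cpu != process.affined_cpu and best_count > local_count:
            process.affined_cpu = best_cpu
    process.accesses.clear()  # start a fresh monitoring window
    return process.affined_cpu
```

Requiring a strictly greater count means that a tie leaves the process where it is, which avoids needless migrations in this simplified model.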

Proceedings Article•10.1109/HPCA.1996.501170•

Improving release-consistent shared virtual memory using automatic update

Liviu Iftode¹, Cezary Dubnicki¹, Edward W. Felten¹, Kai Li¹•Institutions (1)

Princeton University¹

3 Feb 1996

TL;DR: This paper proposes a new lazy release consistency based protocol, called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications and shows that the AURC approach can substantially improve the performance of LRC.

Abstract: Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" approach and the hardware approach that uses directory-based caches. Automatic update is a simple communication mechanism, implemented in the SHRIMP multicomputer, that forwards local writes to remote memory transparently. In this paper we propose a new lazy release consistency based protocol, called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications. We compare the performance of this protocol against a software-only LRC implementation on several Splash-2 applications and show that the AURC approach can substantially improve the performance of LRC. For 16 processors, the average speedup has increased from 5.9 under LRC to 8.3 under AURC.
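
To put the reported 16-processor speedups in context, the short calculation below derives the relative gain and the parallel efficiency from the two figures quoted in the abstract (5.9 and 8.3). The helper names are hypothetical, and efficiency is taken here in its usual sense of speedup divided by processor count.

```python
# Worked arithmetic on the speedups quoted in the abstract (16 processors).
def relative_gain(new_speedup, old_speedup):
    # How many times faster the new protocol is than the old one.
    return new_speedup / old_speedup


def parallel_efficiency(speedup, processors):
    # Fraction of ideal linear speedup that is actually achieved.
    return speedup / processors


LRC_SPEEDUP = 5.9
AURC_SPEEDUP = 8.3
PROCESSORS = 16

gain = relative_gain(AURC_SPEEDUP, LRC_SPEEDUP)               # about 1.41
lrc_eff = parallel_efficiency(LRC_SPEEDUP, PROCESSORS)        # about 0.37
aurc_eff = parallel_efficiency(AURC_SPEEDUP, PROCESSORS)      # about 0.52
```

On these figures AURC runs roughly 1.4 times as fast as LRC on average, raising parallel efficiency from about 37% to about 52%.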

Proceedings Article•10.1145/237090.237181•

An integrated compile-time/run-time software distributed shared memory system

Sandhya Dwarkadas¹, Alan L. Cox¹, Willy Zwaenepoel¹•Institutions (1)

Rice University¹

1 Sep 1996

TL;DR: An integrated compile-time and run-time software DSM system to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, to retain its ease of programming, and to retain the broader class of applications it supports.

Abstract: On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into message passing programs, but efficient execution is limited to those programs for which precise analysis can be carried out. Shared memory is easier to program than message passing and its domain is not constrained by the limitations of parallelizing compilers, but it lags in performance. Our goal is to close that performance gap while retaining the benefits of shared memory. In other words, our goal is (1) to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, (2) to retain its ease of programming, and (3) to retain the broader class of applications it supports. To this end we have designed and implemented an integrated compile-time and run-time software DSM system. The programming model remains identical to the original pure run-time DSM system. No user intervention is required to obtain the benefits of our system. The compiler computes data access patterns for the individual processors. It then performs a source-to-source transformation, inserting in the program calls to inform the run-time system of the computed data access patterns. The run-time system uses this information to aggregate communication, to aggregate data and synchronization into a single message, to eliminate consistency overhead, and to replace global synchronization with point-to-point synchronization wherever possible. We extended the Parascope programming environment to perform the required analysis, and we augmented the TreadMarks run-time DSM library to take advantage of the analysis. We used six Fortran programs to assess the performance benefits: Jacobi, 3D-FFT, Integer Sort, Shallow, Gauss, and Modified Gramm-Schmidt, each with two different data set sizes. The experiments were run on an 8-node IBM SP/2 using user-space communication. Compiler optimization in conjunction with the augmented run-time system achieves substantial execution time improvements in comparison to the base TreadMarks, ranging from 4% to 59% on 8 processors. Relative to message passing implementations of the same applications, the compile-time run-time system is 0-29% slower than message passing, while the base run-time system is 5-212% slower. For the five programs that XHPF could parallelize (all except IS), the execution times achieved by the compiler optimized shared memory programs are within 9% of XHPF.

Patent•

Messaging in distributed memory multiprocessing system having shell circuitry for atomic control of message storage queue's tail pointer structure in local memory

Richard E. Kessler¹, Steven M. Oberlin¹, Steven L. Scott¹•Institutions (1)

Cray¹

13 Mar 1996

TL;DR: A message queue in the local memory of the destination processing element stores the transmitted message, and a control word stored in that local memory includes a limit field designating the size of the message queue and a tail field indicating an index into the queue.

Abstract: A messaging facility in a multiprocessor computer system includes assembly circuitry in a source processing element for assembling a message to be sent from the source processing element to a destination processing element based on information provided from a processor in the source processing element. A network router transmits the assembled message from the source processing element to the destination processing element via an interconnect network. A message queue in a local memory of the destination processing element stores the transmitted message. A control word stored in the local memory of the destination processing element includes a limit field designating a size of the message queue and a tail field designating an index into the corresponding message queue to indicate a location in the message queue where the transmitted message is to be stored. Shell circuitry in the destination processing element atomically reads and updates the tail field.
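
The queue discipline described in this abstract can be sketched in software. The following is a minimal illustration only, not the patented circuitry: the class and method names (`ControlWord`, `MessageQueue`, `reserve_slot`, `deliver`) are hypothetical, a lock stands in for the shell circuitry's atomic read-and-update of the tail field, and rejecting a message when the tail reaches the limit is an assumption, since the abstract does not say how a full queue is handled.

```python
import threading
from dataclasses import dataclass


@dataclass
class ControlWord:
    """Hypothetical software model of the control word in local memory."""
    limit: int      # size of the message queue, in slots
    tail: int = 0   # index of the slot where the next message is stored


class MessageQueue:
    """Software stand-in for a destination element's message queue."""

    def __init__(self, limit: int) -> None:
        self.control = ControlWord(limit=limit)
        self.slots = [None] * limit
        # The lock models the atomic read-and-update done by shell circuitry.
        self._lock = threading.Lock()

    def reserve_slot(self):
        """Atomically read the tail and advance it; None if the queue is full."""
        with self._lock:
            index = self.control.tail
            if index >= self.control.limit:  # assumed full-queue policy
                return None
            self.control.tail = index + 1
            return index

    def deliver(self, message) -> bool:
        """Store an arriving message in the slot designated by the tail."""
        index = self.reserve_slot()
        if index is None:
            return False
        self.slots[index] = message
        return True
```

Because the tail is read and advanced in one indivisible step, two messages arriving at the same time can never be assigned the same slot.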

Proceedings Article•10.1145/232973.233000•

Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors

Mark Horowitz¹, Margaret Martonosi², Todd C. Mowry³, Michael D. Smith⁴•Institutions (4)

Stanford University¹, Princeton University², University of Toronto³, Harvard University⁴

1 May 1996

TL;DR: It is demonstrated that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions.

Abstract: Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, we propose a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. We describe two different implementations of informing memory operations (one based on a cache-outcome condition code and another based on low-overhead traps) and find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.

Patent•

Advanced parallel array processor computer package

Michael Charles Dapp¹, James Warren Dieffenderfer¹, Richard Ernest Miles¹, Richard Edward Nier¹, Vincent John Smoral¹, James Robert Stupp¹•Institutions (1)

IBM¹

30 Sep 1996

Journal Article•10.1006/JPDC.1996.0011•

Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines

Sandeep K. S. Gupta¹, Sandeep Kaushik², Chua-Huang Huang³, P. Sadayappan³•Institutions (3)

Duke University¹, Intel², Ohio State University³

01 Feb 1996-Journal of Parallel and Distributed Computing

TL;DR: These results show that using the virtual processor approach, efficient code can be generated for execution of array statements involving block-cyclically distributed arrays.
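
For readers unfamiliar with block-cyclic distribution, the standard index mapping can be sketched as follows. This illustrates only the data layout, not the paper's virtual processor approach or its code generation; the function names and the parameters `block` (block size) and `nprocs` (processor count) are hypothetical, and indices are zero-based.

```python
def block_cyclic_owner(i: int, block: int, nprocs: int) -> int:
    """Processor that owns global index i under a block-cyclic layout."""
    return (i // block) % nprocs


def block_cyclic_local_index(i: int, block: int, nprocs: int) -> int:
    """Position of global index i within its owner's local array."""
    # Full rounds of (block * nprocs) elements each contribute one local block.
    return (i // (block * nprocs)) * block + i % block


def local_to_global(p: int, l: int, block: int, nprocs: int) -> int:
    """Inverse mapping: global index of local element l on processor p."""
    return (l // block) * block * nprocs + p * block + l % block
```

With `block=2` and `nprocs=3`, global indices 0-1 go to processor 0, 2-3 to processor 1, 4-5 to processor 2, and 6-7 wrap around to processor 0 again as its local elements 2 and 3.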

Proceedings Article•10.1145/232973.232980•

MGS: A Multigrain Shared Memory System

Donald Yeung¹, John Kubiatowicz¹, Anant Agarwal¹•Institutions (1)

Massachusetts Institute of Technology¹

1 May 1996

TL;DR: This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS, and finds that unmodified shared memory applications can exploit multigrain sharing.

Abstract: Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. We call these systems Distributed Scalable Shared-memory Multiprocessors (DSSMPs). This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS. Multigrain shared memory enables the collaboration of hardware and software shared memory, and is effective at exploiting a form of locality called multigrain locality. The system provides efficient support for fine-grain cache-line sharing, and resorts to coarse-grain page-level sharing only when locality is violated. A framework for characterizing application performance on DSSMPs is also introduced. Using MGS, an in-depth study of several shared memory applications is conducted to understand the behavior of DSSMPs. We find that unmodified shared memory applications can exploit multigrain sharing. Keeping the number of processors fixed, applications execute up to 85% faster when each DSSMP node is a multiprocessor as opposed to a uniprocessor. We also show that tightly-coupled multiprocessors hold a significant performance advantage over DSSMPs on unmodified applications. However, a best-effort implementation of a kernel from one of the applications allows a DSSMP to almost match the performance of a tightly-coupled multiprocessor.

Journal Article•10.1006/JPDC.1996.0136•

Analyses and Optimizations for Shared Address Space Programs

Arvind Krishnamurthy¹, Katherine Yelick¹•Institutions (1)

University of California, Berkeley¹

01 Nov 1996-Journal of Parallel and Distributed Computing

TL;DR: Compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space are presented; programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts benefit from more accurate analysis and therefore better optimization.

Journal Article•10.1007/BF03356758•

Interprocedural array region analyses

Béatrice Creusillet¹, François Irigoin¹•Institutions (1)

Mines ParisTech¹

1 Dec 1996

TL;DR: Exact array region analysis is introduced, which exactly represents the effects of statements and procedures upon array variables, and two new types of array region analyses are introduced: in and out regions.

Abstract: Many program optimizations require exact knowledge of the sets of array elements that are referenced in or that flow between statements or procedures. Some examples are array privatization, generation of communications in distributed memory machines, or compile-time optimization of cache behavior in hierarchical memory machines. Exact array region analysis is introduced in this article. These regions exactly represent the effects of statements and procedures upon array variables. To represent the flow of these data, we also introduce two new types of array region analyses: in and out regions. The intraprocedural propagation is presented, as well as a general linear framework for interprocedural analyses, which handles array reshapes. The intra- and inter-procedural propagation of array regions is implemented in PIPS, the interprocedural parallelizer of Fortran programs developed at École des Mines de Paris.

Journal Article•10.1002/(SICI)1097-024X(199603)26:3<357::AID-SPE15>3.0.CO;2-#•

Vmalloc: A General and Efficient Memory Allocator

Kiem-Phong Vo¹•Institutions (1)

Bell Labs¹

01 Mar 1996-Software - Practice and Experience

TL;DR: The new library Vmalloc generalizes malloc to give programmers more control over memory allocation, and a performance study shows that Vmalloc is competitive with the best currently popular malloc implementations.

Abstract: Despite its popularity, malloc's shortcomings frequently cause programmers to code around it. The new library Vmalloc generalizes malloc to give programmers more control over memory allocation. Vmalloc introduces the idea of organizing memory into separate regions, each with a discipline to get raw memory and a method to manage allocation. Applications can write their own disciplines to manipulate arbitrary types of memory or just to better organize memory in a region by creating new regions out of its memory. The provided set of allocation methods includes general purpose allocation, fast special cases and aids for memory debugging or profiling. A compatible malloc interface enables current applications to select allocation methods using environment variables so they can tune for performance or perform other tasks such as profiling memory usage, generating traces of allocation calls or debugging memory errors. A performance study comparing Vmalloc and currently popular malloc implementations shows that Vmalloc is competitive with the best of these allocators. Applications can gain further performance improvement by using the right mixture of regions with different Vmalloc methods.

Journal Article•10.1109/59.485987•

A heuristic method for reactive power planning

Jose Roberto Sanches Mantovani¹, Ariovaldo V. Garcia•Institutions (1)

Sao Paulo State University¹

01 Feb 1996-IEEE Transactions on Power Systems

TL;DR: An approach for solving power system reactive power planning problems is presented, which is based on binary search techniques and the use of a special heuristic to obtain a discrete solution.

Abstract: An approach for solving power system reactive power planning problems is presented, which is based on binary search techniques and the use of a special heuristic to obtain a discrete solution. Two versions were developed, one to run on conventional (sequential) computers and the other to run on a distributed memory (hypercube) machine. This latter parallel processing version employs an asynchronous programming model. Once the set of candidate buses has been defined, the program gives the location and size of the reactive sources needed (if any) in keeping with operating and security constraints.

Proceedings Article•10.2514/6.1996-4045•

Aerodynamic shape optimization of supersonic aircraft configurations via an adjoint formulation on distributed memory parallel computers

James Reuther, Mark J. Rimlinger, Juan J. Alonso, Antony Jameson

4 Sep 1996

TL;DR: This work describes the application of a control-theory-based aerodynamic shape optimization method to the problem of supersonic aircraft design.

Patent•

Memory space management method, data transfer method, and computer device for distributed computer system

Toshio Okamoto¹, Yoshiyuki Tsuda¹•Institutions (1)

Toshiba¹

8 Jul 1996

TL;DR: A scheme is presented for realizing high-speed data transfer between memory spaces shared among computers in a distributed computer system, without requiring complicated and inefficient communication protocol processing at the computer side.

Abstract: A scheme for realizing a high-speed data transfer between memory spaces shared among computers in a distributed computer system, without requiring a complicated and inefficient communication protocol processing at the computer side. One region which is at least a part of a virtual memory space or a real memory space managed by one computer and another region which is at least a part of a virtual memory space or a real memory space managed by another computer are shared between these two computers, and a dedicated virtual connection is set up between these two shared regions. Then, a data transfer between these two shared regions is carried out by using the dedicated virtual connection. A virtual connection identifier of the dedicated virtual connection is registered into a corresponding page table entry in the page table, so that this virtual connection identifier can be obtained at a time of the data transfer by referring to the page table alone.

Book Chapter•10.1007/3-540-61626-8_58•

Compiler Reduction of Invalidation Traffic in Virtual Shared Memory Systems

Michael O'Boyle¹, Rupert W. Ford¹, Andy Nisbet¹•Institutions (1)

University of Manchester¹

26 Aug 1996

TL;DR: This paper presents new compiler analysis for the elimination of invalidation traffic in virtual shared memory, using a hybrid distributed invalidation coherence scheme that aggressively exploits the SPMD execution model and uses array section analysis to accurately determine only those instances when invalidation is necessary, thus avoiding the additional read misses of previous schemes.

Abstract: This paper presents new compiler analysis for the elimination of invalidation traffic in virtual shared memory, using a hybrid distributed invalidation coherence scheme. The invalidation and acknowledgement messages are removed; this reduces both network invalidation traffic and the latency of a write fault. It aggressively exploits the SPMD execution model and uses array section analysis to accurately determine only those instances when invalidation is necessary, thus avoiding the additional read misses of previous schemes. Equations determining precisely what data should be invalidated are presented and translated into a form amenable to compiler analysis. Preliminary experimental results on a 30 node prototype architecture demonstrate the performance attainable using this scheme.
