TL;DR: This work discusses the experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system, which allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory.
Abstract: Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value.
TL;DR: In this paper, the authors look at a number of distributed systems that have attempted to paper over the distinction between local and remote objects, and show that such systems fail to support basic requirements of robustness and reliability.
Abstract: We argue that objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space. These differences are required because distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure. We look at a number of distributed systems that have attempted to paper over the distinction between local and remote objects, and show that such systems fail to support basic requirements of robustness and reliability. These failures have been masked in the past by the small size of the distributed systems that have been built. In the enterprise-wide distributed systems foreseen in the near future, however, such a masking will be impossible. We conclude by discussing what is required of both systems-level and application-level programmers and designers if one is to take distribution seriously.
TL;DR: It is predicted that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips, and pin bandwidth limitations will make more complex on-chip caches cost-effective.
Abstract: This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches---implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.
TL;DR: The primary focus of this paper is to describe the techniques used in Shasta to reduce the checking overhead for supporting fine granularity sharing in software, including careful layout of the shared address space and scheduling the checking code for efficient execution on modern processors.
Abstract: This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. In addition, the system allows the coherence granularity to vary across different shared data structures in a single application. Shasta implements the shared address space by transparently rewriting the application executable to intercept loads and stores. For each shared load or store, the inserted code checks to see if the data is available locally and communicates with other processors if necessary. The system uses numerous techniques to reduce the run-time overhead of these checks. Since Shasta is implemented entirely in software, it also provides tremendous flexibility in supporting different types of cache coherence protocols. We have implemented an efficient cache coherence protocol that incorporates a number of optimizations, including support for multiple communication granularities and use of relaxed memory models. This system is fully functional and runs on a cluster of Alpha workstations. The primary focus of this paper is to describe the techniques used in Shasta to reduce the checking overhead for supporting fine granularity sharing in software. These techniques include careful layout of the shared address space, scheduling the checking code for efficient execution on modern processors, using a simple method that checks loads using only the value loaded, reducing the extra cache misses caused by the checking code, and combining the checks for multiple loads and stores. To characterize the effect of these techniques, we present detailed performance results for the SPLASH-2 applications running on an Alpha processor. Without our optimizations, the checking overheads are excessively high, exceeding 100% for several applications. 
However, our techniques are effective in reducing these overheads to a range of 5% to 35% for almost all of the applications. We also describe our coherence protocol and present some preliminary results on the parallel performance of several applications running on our workstation cluster. Our experience so far indicates that once the cost of checking memory accesses is reduced using our techniques, the Shasta approach is an attractive software solution for supporting a shared address space with fine-grain access to data.
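The idea of checking a load using only the value loaded can be pictured with a small simulation. This is a minimal sketch, not Shasta's actual inserted code: it assumes that invalid lines are filled with a reserved flag value, so that a load whose result differs from the flag needs no further check. The names `FLAG`, `LINE_WORDS`, `Node` and `fetch_line` are hypothetical and chosen for illustration only.

```python
FLAG = -253        # hypothetical sentinel stored in every word of an invalid line
LINE_WORDS = 4     # hypothetical coherence granularity, in words


class Node:
    """One processor's view of the shared address space (simulated)."""

    def __init__(self, n_lines, fetch_line):
        self.mem = [FLAG] * (n_lines * LINE_WORDS)
        self.valid = [False] * n_lines
        self.fetch_line = fetch_line   # callback standing in for remote communication
        self.misses = 0

    def load(self, addr):
        value = self.mem[addr]
        if value != FLAG:
            # Fast path: the loaded value alone shows the line is valid.
            return value
        # Slow path: consult the state table. The data may genuinely equal FLAG.
        line = addr // LINE_WORDS
        if not self.valid[line]:
            self.misses += 1
            base = line * LINE_WORDS
            self.mem[base:base + LINE_WORDS] = self.fetch_line(line)
            self.valid[line] = True
        return self.mem[addr]


# Usage: the "home" copy of the data lives elsewhere and is fetched on a miss.
home = list(range(100, 108))
node = Node(2, lambda line: home[line * LINE_WORDS:(line + 1) * LINE_WORDS])
first = node.load(5)    # misses and fetches line 1
second = node.load(6)   # same line, served by the fast path
```

In this sketch the common case costs one comparison against the loaded value; the state table is touched only when the value happens to equal the flag.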
TL;DR: The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers), which provide a rich set of atomic memory operations and a flexible, user-level messaging facility.
Abstract: This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale behind changes made for the T3E. We include performance measurements for various aspects of communication and synchronization. The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers). E-registers are used as the source or target for all remote communication. They provide a highly pipelined interface to global memory that allows dozens of requests per processor to be outstanding. Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. The T3E also provides a set of virtual hardware barrier/eureka networks that can be arbitrarily embedded into the 3D torus interconnect.
TL;DR: It is shown that processor memory integration can be used to build competitive, scalable and cost-effective MP systems and results from execution driven uni- and multi-processor simulations show that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor.
Abstract: Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance. This paper argues for an integrated system approach that uses less-powerful CPUs that are tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next generation 0.25µm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor memory integration can be used to build competitive, scalable and cost-effective MP systems. We present results from execution driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct mapped instruction caches with long lines are very effective, as are column buffer data caches augmented with a victim cache.
TL;DR: The first paper proposes the DEVS formalism for system modeling, and a DEVS implementation based on object oriented technology that enables full exploitation of parallel simulation model execution.
Abstract: Parallel and distributed simulation (PDS), over the decade and a half of its existence, has turned out to be more foundational than merely solving the causality preservation dilemma in partially ordered event structures as they occur in parallel and distributed simulation executions. Today’s availability of parallel and distributed computing and communication technology has given a new relevance to the field that could not have been foreseen in its early days. The potential improvement of elapsed time for large simulation experiments via the involvement of a set of individual processing nodes of a parallel (shared or distributed memory) computer or distributed system is more promising today than at any time in the history of the field. Accelerating simulation experiments for large system models is naturally the outstanding PDS research goal. But beyond this, many scientists, also from within the core of the classical PDS field, have seen the impact of PDS theory and methodology in nonstandard application areas such as parallel program execution. The intent of this minitrack was to provide a forum for the exploration of new high performance simulation concepts and techniques, as well as their successful application on today’s and tomorrow’s parallel execution platforms. Twenty-six contributions were submitted by authors from France, Germany, Japan, Poland, Sweden, the UK and the USA, after more than seventy authors announced papers via abstracts. Eighty-four reviewers helped with their expertise to select ten full papers and two short papers for publication in the proceedings and for oral presentation at the conference. Each paper was reviewed by at least three, at most six, and on average 3.96 reviewers. Due to the availability of papers in electronic format, the reviewing process could be managed almost exclusively by email. 
A conceptual framework and an environment for high performance simulation is presented in the first paper. Zeigler, Moon, D. Kim and J.G. Kim propose the DEVS formalism for system modeling, and a DEVS implementation based on object oriented technology enabling a full exploitation of a parallel simulation model execution. DEVS modeling is illustrated in the context of wildland fire simulation; performance comparisons are conducted for a watershed model executing sequentially on a Sparc-1000 and in parallel on a CM-5. The paper by Konas presents the potential of simulation at the confluence of object oriented system design and parallel processing: modularity, extensibility and reusability provide a natural and well structured approach to the construction of complex simulation models, while at the same time promoting a higher execution efficiency on a parallel platform. Young and Wilsey propose a new distributed fossil collection technique for the Time Warp distributed discrete event simulation protocol. Basically, every fossil collector associated with a logical process (LP) observes event arrival times to establish a statistical model of rollback distances and, in conjunction with a user defined risk factor that controls the aggressiveness of fossil collectors, determines a probabilistic GVT bound. The approach appears beneficial for reducing the amount of memory used compared with GVT based fossil collection, but requires additional checkpointing for possible “catastrophic” rollbacks, i.e. restoration of states that have been fossil collected due to an overestimation of the actual GVT. Rönngren, Barriga and Ayani have developed a benchmark suite for the performance evaluation of parallel simulation kernels on different architectures and the scalability analysis of certain simulation problems. 
It appears particularly hard to isolate performance influences stemming from the kernel as such or from the event structure underlying the simulation model executed by the kernel. Trying to abstract as much as possible from the latter performance impact, the authors construct a synthetic benchmark scalable
TL;DR: An operating system for a non-uniform memory access (NUMA) multiprocessor system that utilizes a software abstraction of the NUMA system hardware representing a hierarchical tree structure to maintain the most efficient level of affinity and to maintain balanced processor and memory loads is presented in this paper.
Abstract: An operating system for a non-uniform memory access (NUMA) multiprocessor system that utilizes a software abstraction of the NUMA system hardware representing a hierarchical tree structure to maintain the most efficient level of affinity and to maintain balanced processor and memory loads. The hierarchical tree structure includes leaf nodes representing the job processors, a root node representing at least one system resource shared by all the job processors, and a plurality of intermediate level nodes representing resources shared by different combinations of the job processors. The operating system includes a medium term scheduler for monitoring the progress of active thread groups distributed throughout the system and for assisting languishing thread groups, and a plurality of dispatchers each associated with one of the job processors for monitoring the status of the associated job processor and for obtaining thread groups for the associated job processor to execute. The operating system further includes a memory manager for allocating virtual and physical memory using a plurality of memory pools and frame treasuries.
TL;DR: In this paper, an autorelease pool is created at the beginning of a new duty cycle, which retains the newly allocated memory space during the duty cycle and is automatically disposed of at the end of a duty cycle.
Abstract: The present invention discloses a system for transparent local and distributed memory management. The invention overcomes the prior art's requirement of keeping track of whether a memory space allocated to a new object or a new program or data structure can be reclaimed. According to the present invention an autorelease pool is created at the beginning of a new duty cycle. The autorelease pool retains the newly allocated memory space during the duty cycle. The autorelease pool is automatically disposed of at the end of the duty cycle. As a result of disposing the autorelease pool, the newly allocated memory space is reclaimed (i.e., deallocated). The present invention is useful in distributed networks where different programming conventions on remote and local machines made the prior art's memory management task particularly difficult. The present invention is also useful in an object-oriented programming environment.
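The pool lifecycle described in this abstract (create at the start of a duty cycle, retain allocations during it, dispose at the end) can be sketched with a context manager. This is a minimal illustration under stated assumptions, not the patented mechanism: `Buffer`, `AutoreleasePool` and `duty_cycle` are hypothetical names, and "reclaiming" is modeled by calling a `release` method on each retained object.

```python
from contextlib import contextmanager


class Buffer:
    """Hypothetical resource standing in for newly allocated memory."""

    def __init__(self, size):
        self.size = size
        self.released = False

    def release(self):
        self.released = True


class AutoreleasePool:
    def __init__(self):
        self._objects = []

    def add(self, obj):
        # Retain the object for the remainder of the duty cycle.
        self._objects.append(obj)
        return obj

    def drain(self):
        # Reclaim everything that was allocated during the cycle.
        for obj in self._objects:
            obj.release()
        self._objects.clear()


@contextmanager
def duty_cycle():
    pool = AutoreleasePool()      # created at the beginning of the duty cycle
    try:
        yield pool
    finally:
        pool.drain()              # disposed of at the end, even on error


with duty_cycle() as pool:
    buf = pool.add(Buffer(16))
    alive_during_cycle = not buf.released
```

The caller never tracks whether `buf` can be reclaimed; leaving the `with` block does it.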
TL;DR: In this paper, a mechanism for maintaining a consistent, periodically updated state in main memory without constraining normal computer operation is provided, thereby enabling a computer system to recover from faults without loss of data or processing continuity.
Abstract: A mechanism for maintaining a consistent, periodically updated state in main memory without constraining normal computer operation is provided, thereby enabling a computer system to recover from faults without loss of data or processing continuity. In this invention, a first computer includes a processor and input/output elements connected to a main memory subsystem including a primary element. A second computer has a remote checkpoint memory element, which may include one or more buffer memories and a shadow memory, which is connected to the main memory subsystem of the first computer. During normal processing, an image of data written to the primary memory element is captured by the remote checkpoint memory element. When a new checkpoint is desired (thereby establishing a consistent state in main memory to which all executing applications can safely return following a fault), the data previously captured is used to establish a new checkpointed state in the second computer. In case of failure of the first computer, the second computer can be restarted to operate from the last checkpoint established for the first computer. This structure and protocol can guarantee a consistent state in main memory, thus enabling fault-tolerant operation.
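The capture, checkpoint and restart steps in this abstract can be sketched in software. The following is a simplified model, not the patented hardware: `CheckpointedMemory` and its method names are hypothetical, and the buffer and shadow memories are modeled as a Python list and dictionary.

```python
class CheckpointedMemory:
    """Simulated primary memory with a remote buffer and shadow copy."""

    def __init__(self):
        self.primary = {}   # main memory of the first computer
        self.buffer = []    # writes captured since the last checkpoint
        self.shadow = {}    # last consistent state, held by the second computer

    def write(self, addr, value):
        # Normal processing: the write proceeds and its image is captured.
        self.primary[addr] = value
        self.buffer.append((addr, value))

    def checkpoint(self):
        # Establish a new consistent state from the captured writes.
        for addr, value in self.buffer:
            self.shadow[addr] = value
        self.buffer.clear()

    def recover(self):
        # After a failure, restart from the last checkpointed state.
        return dict(self.shadow)


mem = CheckpointedMemory()
mem.write(0, 1)
mem.checkpoint()
mem.write(0, 2)     # not yet checkpointed
mem.write(1, 5)     # not yet checkpointed
state_after_fault = mem.recover()
```

Writes made after the last checkpoint are deliberately absent from the recovered state, which is what makes that state consistent.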
TL;DR: In this article, a software implemented method enables data sharing between workstations using variable sized quantities of data; the access information stored for each block includes the size of the block and the identity of the workstations having a copy of the block.
Abstract: In a distributed shared memory system, workstations are connected to each other by a network. Each workstation includes a processor, a memory having addresses, and an input/output interface to interconnect the workstations. A software implemented method enables data sharing between the workstations using variable sized quantities of data. A set of the addresses of the memories are designated as virtual shared addresses to store shared data. A portion of the virtual shared addresses are allocated to store a shared data structure as one or more blocks. The shared data structure is accessible by programs executing in any of the processors. The size of a particular allocated block can vary for different shared data structures. Each block includes an integer number of lines, and each line includes a predetermined number of bytes of shared data. Access information of a particular block is stored in the memory of a home one of the workstations. The access information includes the size of the particular block and an identity of workstations having a copy of the block.
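The per-block access information kept at the home workstation can be sketched as a small directory. This is an illustrative sketch only: `LINE_BYTES`, `BlockInfo` and `HomeDirectory` are hypothetical names, and the 64-byte line size is an assumption (the abstract says only that the number of bytes per line is predetermined).

```python
from dataclasses import dataclass, field

LINE_BYTES = 64   # assumed line size for illustration


@dataclass
class BlockInfo:
    n_lines: int                                 # block size, an integer number of lines
    sharers: set = field(default_factory=set)    # workstations holding a copy

    @property
    def size_bytes(self):
        return self.n_lines * LINE_BYTES


class HomeDirectory:
    """Access information stored at the home workstation, keyed by block base address."""

    def __init__(self):
        self.blocks = {}

    def allocate(self, base_addr, n_lines):
        self.blocks[base_addr] = BlockInfo(n_lines)

    def record_copy(self, base_addr, workstation):
        self.blocks[base_addr].sharers.add(workstation)

    def lookup(self, addr):
        # Find the block containing addr; blocks may have different sizes.
        for base, info in self.blocks.items():
            if base <= addr < base + info.size_bytes:
                return base, info
        return None


directory = HomeDirectory()
directory.allocate(0, 2)        # a 2-line block for one data structure
directory.allocate(128, 1)      # a 1-line block for another
directory.record_copy(0, "ws1")
hit = directory.lookup(100)
```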
TL;DR: A new class of protocols that has lower communication requirements than previous DSM protocols, and can consequently achieve higher performance, is presented.
Abstract: A software distributed shared memory (DSM) system allows shared memory parallel programs to execute on networks of workstations. This thesis presents a new class of protocols that has lower communication requirements than previous DSM protocols, and can consequently achieve higher performance. The lazy release consistent protocols achieve this reduction in communication by piggybacking consistency information on top of existing synchronization transfers. Some of the protocols also improve performance by speculatively moving data.
We evaluate the impact of these features by comparing the performance of a software DSM using lazy protocols with that of a DSM using previous eager protocols. We found that seven of our eight applications performed better on the lazy system, and four of the applications showed performance speedups of at least 18%. As part of this comparison, we show that the cost of executing the slightly more complex code of the lazy protocols is far less important than the reduction in communication requirements. We also compare the lazy performance with that of a hardware supported shared memory system that uses processors and caches similar to those of the workstations running our DSM. Our DSM system was able to approach, and in one case even surpass, the performance of the hardware system for applications with coarse-grained parallelism, but the hardware system performed significantly better for programs with fine-grained parallelism.
Overall, the results indicate that DSMs using lazy protocols have become a viable alternative for high-performance parallel processing.
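Piggybacking consistency information on synchronization transfers can be illustrated with a toy model of a single lock. This is a heavily simplified sketch, not the thesis's protocol: real lazy release consistent implementations track intervals and timestamps, whereas here the lock simply carries a log of write notices. `Processor`, `Lock` and their fields are hypothetical names.

```python
class Processor:
    def __init__(self, name, valid_pages=()):
        self.name = name
        self.valid_pages = set(valid_pages)
        self.dirty = set()
        self.seen = 0     # how many of the lock's write notices have been applied

    def write(self, page):
        self.valid_pages.add(page)
        self.dirty.add(page)


class Lock:
    def __init__(self):
        self.notices = []   # (writer, page) pairs, appended at each release

    def acquire(self, proc):
        # Consistency information travels with the lock grant: only the
        # acquirer learns which of its pages are stale.
        new = self.notices[proc.seen:]
        for writer, page in new:
            if writer != proc.name:
                proc.valid_pages.discard(page)
        proc.seen = len(self.notices)
        return new

    def release(self, proc):
        # Nothing is broadcast here; notices wait for the next acquirer.
        self.notices.extend((proc.name, page) for page in sorted(proc.dirty))
        proc.dirty.clear()
        proc.seen = len(self.notices)


a = Processor("a")
b = Processor("b", valid_pages={3})
c = Processor("c", valid_pages={3})
lock = Lock()

lock.acquire(a)
a.write(3)
lock.release(a)
received = lock.acquire(b)   # b is told about page 3 as part of the acquire
```

In an eager protocol the release by `a` would have contacted both `b` and `c`; in this sketch `c` receives nothing until it acquires the lock itself.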
TL;DR: In this article, a high memory capacity dual in-line memory module (DIMM) for use in a directory-based, distributed shared memory multiprocessor computer system is presented.
Abstract: A high memory capacity dual in-line memory module (DIMM) for use in a directory-based, distributed shared memory multiprocessor computer system includes a data memory for storing data and a state memory for storing state or directory information corresponding to at least a portion of the data. The DIMM allows the data and the state information to be accessed independently. The DIMM can be configured in a plurality of storage capacities.
TL;DR: A plurality of processors which can be the same or different are formed on a single integrated circuit chip together with a memory controller and an I/O controller, and are interconnected by a data transfer bus.
Abstract: A plurality of processors which can be the same or different are formed on a single integrated circuit chip together with a memory controller and an I/O controller, and are interconnected by a data transfer bus. The processors can have larger word lengths and operate at higher speeds than comparable single chip processors due to reduced latency and signal path lengths. The processors are further interconnected by a processor synchronization bus which enables one processor to cause another processor to perform a task by generating an interrupt and passing the required parameters. The parameters can be passed via shared memory, or via a bidirectional data section of the processor synchronization bus. A processor running a large scale CAD or similar application can cause a smaller processor to perform I/O tasks in native code. A multiprocessor system can be configured as including a Single-Chip Module (SCM), a Multi-Chip Module (MCM), a Board-Level Product (BLP), or as a box-level product which includes a power supply.
TL;DR: In this article, the distributed memory space is divided into a plurality of memory pools, each pool containing a collection of resource objects, and each object sending its state vector to other objects, each object maintaining a state matrix of the state vectors.
Abstract: A method and apparatus for accessing resource objects contained in a distributed memory space in a communications network, including dividing the distributed memory space into a plurality of memory pools, each pool containing a collection of resource objects, providing a plurality of resource manager objects, each resource manager object having an associated set of memory pools and a registry of network unique identifiers for the resource objects in those pools, and accessing a given resource object via its network identifier. Another aspect of the invention is to provide a relativistic view of state of a plurality of objects, each object generating a state vector representing that object's view of its own state and the state of all other objects, each object sending its state vector to other objects, and each object maintaining a state matrix of the state vectors.
TL;DR: In this article, an improved affinity scheduling system for assigning processes to processors within a multiprocessor computer system which includes a plurality of processors and cache memories associated with each processor is presented.
Abstract: An improved affinity scheduling system for assigning processes to processors within a multiprocessor computer system which includes a plurality of processors and cache memories associated with each processor. The affinity scheduler affinitizes processes to processors so that processes which frequently modify the same data are affined to the same local processor—the processor whose cache memory includes the data being modified by the processes. The scheduler monitors the scheduling and execution of processes to identify processes which frequently modify data residing in the cache memory of a non-local processor. When a process is identified which requires access to data residing in the cache memory of a non-local processor with greater frequency than the process requires access to data residing in the cache memory of its affined local processor, the affinity of the process is changed to the non-local processor.
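The re-affinitization rule in this abstract (move a process when it touches a non-local processor's cache more often than its local one) can be sketched as a counter comparison. This is a minimal sketch under assumptions: `ProcessAffinity` is a hypothetical name, access counts are assumed to be available to the scheduler, and resetting the counts after a move is a design choice of the sketch rather than something the abstract states.

```python
from collections import Counter


class ProcessAffinity:
    """Tracks which processor's cache holds the data a process modifies."""

    def __init__(self, local_cpu):
        self.local_cpu = local_cpu
        self.accesses = Counter()

    def record_access(self, cache_owner_cpu):
        self.accesses[cache_owner_cpu] += 1

    def reevaluate(self):
        local_count = self.accesses[self.local_cpu]
        # most_common() is sorted by count, so the first non-local entry
        # is the busiest non-local processor.
        for cpu, count in self.accesses.most_common():
            if cpu != self.local_cpu and count > local_count:
                self.local_cpu = cpu      # change the affinity
                self.accesses.clear()     # start fresh statistics (sketch's choice)
                break
        return self.local_cpu


proc = ProcessAffinity(local_cpu=0)
for cpu in (0, 0, 0, 1, 1):
    proc.record_access(cpu)
before = proc.reevaluate()    # local accesses still dominate
for cpu in (1, 1):
    proc.record_access(cpu)
after = proc.reevaluate()     # non-local accesses now dominate
```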
TL;DR: This paper proposes a new lazy release consistency based protocol, called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications and shows that the AURC approach can substantially improve the performance of LRC.
Abstract: Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" approach and the hardware approach that uses directory-based caches. Automatic update is a simple communication mechanism, implemented in the SHRIMP multicomputer, that forwards local writes to remote memory transparently. In this paper we propose a new lazy release consistency based protocol, called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications. We compare the performance of this protocol against a software-only LRC implementation on several Splash-2 applications and show that the AURC approach can substantially improve the performance of LRC. For 16 processors, the average speedup has increased from 5.9 under LRC to 8.3 under AURC.
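The reported 16-processor speedups can be put in relative terms with a few lines of arithmetic. The two speedup figures and the processor count come from the abstract; the derived quantities (speedup ratio and parallel efficiency) are computed here for illustration and are not reported in the source.

```python
N_PROCS = 16
SPEEDUP_LRC = 5.9
SPEEDUP_AURC = 8.3


def speedup_ratio(old, new):
    """How many times faster the new configuration runs than the old one."""
    return new / old


def efficiency(speedup, n_procs):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_procs


ratio = speedup_ratio(SPEEDUP_LRC, SPEEDUP_AURC)
eff_lrc = efficiency(SPEEDUP_LRC, N_PROCS)
eff_aurc = efficiency(SPEEDUP_AURC, N_PROCS)

print(f"AURC/LRC speedup ratio: {ratio:.2f}")
print(f"parallel efficiency: LRC {eff_lrc:.1%}, AURC {eff_aurc:.1%}")
```

On these figures AURC runs about 1.41 times as fast as LRC on average, and parallel efficiency rises from roughly 37% to roughly 52%.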
TL;DR: An integrated compile-time and run-time software DSM system to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, to retain its ease of programming, and to retain the broader class of applications it supports.
Abstract: On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into message passing programs, but efficient execution is limited to those programs for which precise analysis can be carried out. Shared memory is easier to program than message passing and its domain is not constrained by the limitations of parallelizing compilers, but it lags in performance. Our goal is to close that performance gap while retaining the benefits of shared memory. In other words, our goal is (1) to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, (2) to retain its ease of programming, and (3) to retain the broader class of applications it supports. To this end we have designed and implemented an integrated compile-time and run-time software DSM system. The programming model remains identical to the original pure run-time DSM system. No user intervention is required to obtain the benefits of our system. The compiler computes data access patterns for the individual processors. It then performs a source-to-source transformation, inserting in the program calls to inform the run-time system of the computed data access patterns. The run-time system uses this information to aggregate communication, to aggregate data and synchronization into a single message, to eliminate consistency overhead, and to replace global synchronization with point-to-point synchronization wherever possible. We extended the Parascope programming environment to perform the required analysis, and we augmented the TreadMarks run-time DSM library to take advantage of the analysis. We used six Fortran programs to assess the performance benefits: Jacobi, 3D-FFT, Integer Sort, Shallow, Gauss, and Modified Gram-Schmidt, each with two different data set sizes. 
The experiments were run on an 8-node IBM SP/2 using user-space communication. Compiler optimization in conjunction with the augmented run-time system achieves substantial execution time improvements in comparison to the base TreadMarks, ranging from 4% to 59% on 8 processors. Relative to message passing implementations of the same applications, the compile-time run-time system is 0-29% slower than message passing, while the base run-time system is 5-212% slower. For the five programs that XHPF could parallelize (all except IS), the execution times achieved by the compiler optimized shared memory programs are within 9% of XHPF.
TL;DR: In this paper, a message queue in a local memory of the destination processing element stores the transmitted message and a control word stored in the local memory includes a limit field designating the size of the message queue and a tail field indicating an index into the corresponding message queue.
Abstract: A messaging facility in a multiprocessor computer system includes assembly circuitry in a source processing element for assembling a message to be sent from the source processing element to a destination processing element based on information provided from a processor in the source processing element. A network router transmits the assembled message from the source processing element to the destination processing element via an interconnect network. A message queue in a local memory of the destination processing element stores the transmitted message. A control word stored in the local memory of the destination processing element includes a limit field designating a size of the message queue and a tail field designating an index into the corresponding message queue to indicate a location in the message queue where the transmitted message is to be stored. Shell circuitry in the destination processing element atomically reads and updates the tail field.
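The role of the control word's limit and tail fields can be sketched in software. This is an illustrative model, not the patented circuitry: `MessageQueue` is a hypothetical name, a `threading.Lock` stands in for the atomic read-and-update performed by the shell circuitry, and rejecting a message once the tail reaches the limit is an assumption of the sketch (the abstract does not say what happens when the queue is full).

```python
import threading


class MessageQueue:
    """Message queue in the destination's local memory, governed by a control word."""

    def __init__(self, limit):
        self.limit = limit            # limit field: size of the queue
        self.tail = 0                 # tail field: index of the next free slot
        self.slots = [None] * limit
        self._atomic = threading.Lock()

    def _claim_slot(self):
        # Atomically read the tail and update it, so that two arriving
        # messages can never be assigned the same slot.
        with self._atomic:
            if self.tail >= self.limit:
                return None           # assumed behaviour when the queue is full
            index = self.tail
            self.tail += 1
            return index

    def deliver(self, message):
        index = self._claim_slot()
        if index is None:
            return False
        self.slots[index] = message
        return True


queue = MessageQueue(limit=5)
results = []


def sender(i):
    results.append(queue.deliver(f"msg-{i}"))


threads = [threading.Thread(target=sender, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With eight senders and a limit of five, exactly five messages are stored, each in its own slot, regardless of arrival order.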
TL;DR: It is demonstrated that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions.
Abstract: Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, we propose a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. We describe two different implementations of informing memory operations---one based on a cache-outcome condition code and another based on low-overhead traps---and find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.
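The behaviour of an informing memory operation, a load that transfers control to a handler only when it misses in the cache, can be mimicked in a small simulation. This sketch is purely illustrative and is not one of the paper's two hardware implementations: `DirectMappedCache`, `informing_load` and the miss-profiling handler are hypothetical, and the cache geometry is chosen arbitrarily.

```python
class DirectMappedCache:
    """Tiny direct-mapped cache model that tracks tags only."""

    def __init__(self, n_lines, line_bytes=32):
        self.n_lines = n_lines
        self.line_bytes = line_bytes
        self.tags = [None] * n_lines

    def access(self, addr):
        block = addr // self.line_bytes
        index = block % self.n_lines
        hit = self.tags[index] == block
        self.tags[index] = block      # fill the line on a miss
        return hit


def informing_load(cache, memory, addr, miss_handler):
    # The handler plays the role of the branch-and-link target:
    # it runs only if the reference misses in the cache.
    if not cache.access(addr):
        miss_handler(addr)
    return memory[addr]


memory = {0: 10, 4: 11, 64: 12}
cache = DirectMappedCache(n_lines=2)
miss_log = []                          # a simple software miss profiler

values = [informing_load(cache, memory, a, miss_log.append) for a in (0, 4, 64, 0)]
```

Here addresses 0 and 64 map to the same cache line, so the final load of address 0 misses again and the handler sees three misses in total; a real use might prefetch or update coherence state instead of logging.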
TL;DR: These results show that using the virtual processor approach, efficient code can be generated for execution of array statements involving block-cyclically distributed arrays.
TL;DR: This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS, and finds that unmodified shared memory applications can exploit multigrain sharing.
Abstract: Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. We call these systems Distributed Scalable Shared-memory Multiprocessors (DSSMPs). This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS. Multigrain shared memory enables the collaboration of hardware and software shared memory, and is effective at exploiting a form of locality called multigrain locality. The system provides efficient support for fine-grain cache-line sharing, and resorts to coarse-grain page-level sharing only when locality is violated. A framework for characterizing application performance on DSSMPs is also introduced. Using MGS, an in-depth study of several shared memory applications is conducted to understand the behavior of DSSMPs. We find that unmodified shared memory applications can exploit multigrain sharing. Keeping the number of processors fixed, applications execute up to 85% faster when each DSSMP node is a multiprocessor as opposed to a uniprocessor. We also show that tightly-coupled multiprocessors hold a significant performance advantage over DSSMPs on unmodified applications. However, a best-effort implementation of a kernel from one of the applications allows a DSSMP to almost match the performance of a tightly-coupled multiprocessor.
TL;DR: Compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space are presented, and programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts will benefit from more accurate analysis and therefore better optimization.
TL;DR: Exact array region analysis is introduced, which exactly represents the effects of statements and procedures upon array variables, and two new types of array region analyses are introduced: in and out regions.
Abstract: Many program optimizations require exact knowledge of the sets of array elements that are referenced in or that flow between statements or procedures. Some examples are array privatization, generation of communications in distributed memory machines, or compile-time optimization of cache behavior in hierarchical memory machines. Exact array region analysis is introduced in this article. These regions exactly represent the effects of statements and procedures upon array variables. To represent the flow of these data, we also introduce two new types of array region analyses: in and out regions. The intraprocedural propagation is presented, as well as a general linear framework for interprocedural analyses, which handles array reshapes. The intra- and interprocedural propagation of array regions is implemented in PIPS, the interprocedural parallelizer of Fortran programs developed at Ecole des mines de Paris.
TL;DR: The new library Vmalloc generalizes malloc to give programmers more control over memory allocation, and a performance study shows that Vmalloc is competitive with the best currently popular malloc implementations.
Abstract: Despite its popularity, malloc's shortcomings frequently cause programmers to code around it. The new library Vmalloc generalizes malloc to give programmers more control over memory allocation. Vmalloc introduces the idea of organizing memory into separate regions, each with a discipline to get raw memory and a method to manage allocation. Applications can write their own disciplines to manipulate arbitrary types of memory or just to better organize memory in a region by creating new regions out of its memory. The provided set of allocation methods includes general-purpose allocation, fast special cases, and aids for memory debugging or profiling. A compatible malloc interface enables current applications to select allocation methods using environment variables so they can tune for performance or perform other tasks such as profiling memory usage, generating traces of allocation calls, or debugging memory errors. A performance study comparing Vmalloc and currently popular malloc implementations shows that Vmalloc is competitive with the best of these allocators. Applications can gain further performance improvement by using the right mixture of regions with different Vmalloc methods.
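The split between a discipline (where raw memory comes from) and a method (how it is carved up) can be sketched as follows. This is an illustrative model only and does not reproduce Vmalloc's actual C interface: `Region`, `heap_discipline`, and `BumpMethod` are hypothetical names, and the bump-pointer policy is just one assumed example of a "fast special case".

```python
class Region:
    """Hypothetical region: a discipline supplies raw memory, a method manages it."""

    def __init__(self, discipline, method):
        self.discipline = discipline  # callable: size -> bytearray of raw memory
        self.method = method          # object exposing alloc(region, size)
        self.segments = []            # raw segments obtained from the discipline

    def alloc(self, size):
        return self.method.alloc(self, size)


def heap_discipline(size):
    # Assumed discipline: take raw memory from the ordinary Python heap.
    return bytearray(size)


class BumpMethod:
    """Assumed fast special case: bump-pointer allocation with no per-block free."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size
        self.offset = 0

    def alloc(self, region, size):
        if not region.segments or self.offset + size > len(region.segments[-1]):
            # Current segment is exhausted: ask the discipline for more raw memory.
            region.segments.append(region.discipline(max(size, self.segment_size)))
            self.offset = 0
        start = self.offset
        self.offset += size
        return memoryview(region.segments[-1])[start:start + size]


region = Region(heap_discipline, BumpMethod(segment_size=16))
a, b, c = region.alloc(8), region.alloc(8), region.alloc(8)
a[:] = b"x" * 8
```

Swapping in a different discipline (for example, one backed by a memory-mapped file) or a different method would leave the `Region` interface unchanged, which is the kind of flexibility the abstract describes.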
TL;DR: An approach for solving power system reactive power planning problems is presented, which is based on binary search techniques and the use of a special heuristic to obtain a discrete solution.
Abstract: An approach for solving power system reactive power planning problems is presented, which is based on binary search techniques and the use of a special heuristic to obtain a discrete solution. Two versions were developed, one to run on conventional (sequential) computers and the other to run on a distributed memory (hypercube) machine. This latter parallel processing version employs an asynchronous programming model. Once the set of candidate buses has been defined, the program gives the location and size of the reactive sources needed (if any) in keeping with operating and security constraints.
TL;DR: In this paper, a scheme for realizing a high speed data transfer between memory spaces shared among computers in a distributed computer system, without requiring a complicated and inefficient communication protocol processing at the computer side, is presented.
Abstract: A scheme for realizing a high speed data transfer between memory spaces shared among computers in a distributed computer system, without requiring a complicated and inefficient communication protocol processing at the computer side. One region which is at least a part of a virtual memory space or a real memory space managed by one computer and another region which is at least a part of a virtual memory space or a real memory space managed by another computer are shared between these two computers, and a dedicated virtual connection is set up between these two shared regions. Then, a data transfer between these two shared regions is carried out by using the dedicated virtual connection. A virtual connection identifier of the dedicated virtual connection is registered into a corresponding page table entry in the page table, so that this virtual connection identifier can be obtained at a time of the data transfer by referring to the page table alone.
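The final step, obtaining the virtual connection identifier from the page table alone, can be sketched as a simple lookup. This is a minimal model under stated assumptions: the `PageTableEntry` fields, the 4096-byte `PAGE_SIZE`, and the dictionary-based page table are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Dict, Optional

PAGE_SIZE = 4096  # assumed page size for illustration


@dataclass
class PageTableEntry:
    frame: int                    # physical frame backing the page
    vc_id: Optional[int] = None   # virtual connection identifier; None if not shared


def lookup_vc(page_table: Dict[int, PageTableEntry], vaddr: int) -> Optional[int]:
    """Return the virtual connection identifier for vaddr using only the page table."""
    entry = page_table.get(vaddr // PAGE_SIZE)
    return None if entry is None else entry.vc_id


page_table = {
    2: PageTableEntry(frame=7, vc_id=42),  # page in a shared region
    3: PageTableEntry(frame=9),            # ordinary private page
}
```

Because the identifier sits in the same entry that address translation already consults, no separate connection table or protocol state has to be searched at transfer time.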
TL;DR: This paper presents new compiler analysis for the elimination of invalidation traffic in virtual shared memory, using a hybrid distributed invalidation coherence scheme that aggressively exploits the SPMD execution model and uses array section analysis to accurately determine only those instances when invalidation is necessary, thus avoiding the additional read misses of previous schemes.
Abstract: This paper presents new compiler analysis for the elimination of invalidation traffic in virtual shared memory, using a hybrid distributed invalidation coherence scheme. The invalidation and acknowledgement messages are removed; this reduces both network invalidation traffic and the latency of a write fault. The scheme aggressively exploits the SPMD execution model and uses array section analysis to accurately determine only those instances when invalidation is necessary, thus avoiding the additional read misses of previous schemes. Equations determining precisely what data should be invalidated are presented and translated into a form amenable to compiler analysis. Preliminary experimental results on a 30-node prototype architecture demonstrate the performance attainable using this scheme.