Top 497 papers published in the topic of Distributed memory in 1994

Fast folding and comparison of RNA secondary structures

Ivo L. Hofacker¹, Walter Fontana², Peter F. Stadler¹,², L. S. Bonhoeffer³, Manfred Tacker¹, Peter Schuster¹,²,⁴•Institutions (4)

University of Vienna¹, Santa Fe Institute², University of Oxford³, Institute of Molecular Biotechnology⁴

01 Feb 1994-Monatshefte Fur Chemie

TL;DR: The Vienna RNA package, presented in this paper, is based on dynamic programming algorithms and aims to predict structures with minimum free energies and to compute equilibrium partition functions and base pairing probabilities.

Abstract: Computer codes for computation and comparison of RNA secondary structures, the Vienna RNA package, are presented, that are based on dynamic programming algorithms and aim at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities. An efficient heuristic for the inverse folding problem of RNA is introduced. In addition we present compact and efficient programs for the comparison of RNA secondary structures based on tree editing and alignment. All computer codes are written in ANSI C. They include implementations of modified algorithms on parallel computers with distributed memory. Performance analysis carried out on an Intel Hypercube shows that parallel computing becomes gradually more and more efficient the longer the sequences are.
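
The dynamic programming idea behind structure prediction can be illustrated with a much simpler relative of the minimum free energy recursion. The sketch below is the classic Nussinov base-pair maximization, not the Vienna RNA package's energy model or its ANSI C code. The function name `max_base_pairs`, the set of allowed pairs and the `min_loop` parameter are assumptions made for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP).

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed hairpin constraint, not taken from the paper).
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the maximum number of pairs within seq[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving > min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

The table is filled by increasing subsequence length, so every entry depends only on shorter subsequences that are already complete. That dependency pattern is what makes such recursions candidates for parallel evaluation along anti-diagonals.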

2,473 citations

Journal Article•10.1002/CPE.4330060203•

Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems

Stephen T. Barnard, Horst D. Simon

01 Apr 1994-Concurrency and Computation: Practice and Experience

TL;DR: Recursive spectral bisection (RSB) is effective for partitioning unstructured meshes but expensive in its simplest form; this paper introduces a multilevel version of RSB that attains about an order-of-magnitude improvement in run time on typical examples.

Abstract: If problems involving unstructured meshes are to be solved efficiently on distributed-memory parallel computers, the meshes must be partitioned and distributed across processors in a way that balances the computational load and minimizes communication. The recursive spectral bisection method (RSB) has been shown to be very effective for such partitioning problems compared to alternative methods, but RSB in its simplest form is expensive. Here a multilevel version of RSB is introduced that attains about an order-of-magnitude improvement in run time on typical examples.

Introduction: Unstructured meshes are used in several large-scale scientific and engineering problems, including finite-volume methods for computational fluid dynamics and finite-element methods for structural analysis. If unstructured problems such as these are to be solved on distributed-memory parallel computers, their data structures must be partitioned and distributed across processors; if they are to be solved efficiently, the partitioning must maximize load balance and minimize interprocessor communication. Recently, the recursive spectral bisection method (RSB) [1] has been shown to be very effective for such partitioning problems compared to alternative methods. Unfortunately, RSB in its simplest form is expensive. We shall describe a multilevel version of RSB that attains about an order-of-magnitude improvement in run time on typical examples.
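
For readers unfamiliar with spectral bisection, one bisection step can be sketched as follows. This is the plain, single-level method on a tiny graph, not the multilevel algorithm the paper introduces. It approximates the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the graph Laplacian) by shifted power iteration, which is a simplification chosen here to avoid dependencies. The function name `spectral_bisection`, the iteration count and the starting vector are assumptions, and the graph is assumed to be connected with at least two nodes.

```python
def spectral_bisection(n, edges, iters=200):
    """Split nodes 0..n-1 into two halves using an approximate Fiedler vector."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # 2 * max degree bounds the largest Laplacian eigenvalue, so the
    # matrix (shift*I - L) has only non-negative eigenvalues.
    shift = 2 * max(len(a) for a in adj)
    mean = (n - 1) / 2
    x = [i - mean for i in range(n)]  # deterministic start, orthogonal to all-ones
    for _ in range(iters):
        # y = (shift*I - L) x, with (L x)[i] = deg(i)*x[i] - sum over neighbours
        y = [(shift - len(adj[i])) * x[i] + sum(x[j] for j in adj[i])
             for i in range(n)]
        avg = sum(y) / n
        y = [v - avg for v in y]  # project out the constant eigenvector
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Splitting at the median of the vector keeps the two parts balanced.
    order = sorted(range(n), key=lambda i: x[i])
    half = n // 2
    return sorted(order[:half]), sorted(order[half:])


if __name__ == "__main__":
    # Two triangles joined by the single edge (2, 3).
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    print(spectral_bisection(6, edges))
```

Applying the same step recursively to each half yields 4, 8, ... parts. The cost of computing the eigenvector at every level is what motivates coarsening the graph first.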

616 citations

Proceedings Article•10.1145/195473.195575•

Fine-grain access control for distributed shared memory

Ioannis T. Schoinas¹, Babak Falsafi¹, Alvin R. Lebeck¹, Steven K. Reinhardt¹, James R. Larus¹, David A. Wood¹•Institutions (1)

University of Wisconsin-Madison¹

1 Nov 1994

TL;DR: In this paper, the authors discuss implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions, and incorporate three techniques that require no additional hardware into Blizzard.

Abstract: This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing. This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is a hybrid. The software technique ranged from slightly faster to two times slower than the ECC approach. Blizzard's performance is roughly comparable to a hardware shared-memory machine. These results argue that clusters of workstations or personal computers with networks comparable to the CM-5's will be able to support the same shared-memory interfaces as supercomputers.
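
The first technique, a software lookup before each shared-memory reference, can be pictured with a small model. The sketch below is not Blizzard's implementation, which rewrites the program's executable. It only shows the kind of check that such inserted code might perform. The class `SharedRegion`, the three block states, the 32-byte block size and the fault handler are all hypothetical.

```python
INVALID, READ_ONLY, READ_WRITE = range(3)
BLOCK_SIZE = 32  # bytes per access-control block (an assumed value)


class SharedRegion:
    """Toy shared-memory region with a per-block access-state table."""

    def __init__(self, size):
        self.data = bytearray(size)
        nblocks = (size + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.state = [INVALID] * nblocks
        self.faults = []  # log of (kind, block) for every handler invocation

    def _handle_fault(self, kind, block):
        # Stand-in for the coherence protocol: a real system would fetch the
        # block or request write permission from a remote node here.
        self.faults.append((kind, block))
        self.state[block] = READ_WRITE if kind == "write" else READ_ONLY

    def load(self, addr):
        block = addr // BLOCK_SIZE
        if self.state[block] == INVALID:  # the inserted lookup
            self._handle_fault("read", block)
        return self.data[addr]

    def store(self, addr, value):
        block = addr // BLOCK_SIZE
        if self.state[block] != READ_WRITE:  # the inserted lookup
            self._handle_fault("write", block)
        self.data[addr] = value


if __name__ == "__main__":
    region = SharedRegion(128)
    region.load(5)       # first touch of block 0: read fault
    region.store(7, 9)   # block 0 is read-only: write fault
    region.store(8, 1)   # block 0 is now writable: no fault
    print(region.faults)
```

Every access pays for the table lookup, even when no fault occurs. That is the overhead the abstract compares against the ECC-based approach.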

277 citations

Patent•

Message passing system for distributed shared memory multiprocessor system and message passing method using the same

Shigeki Yamada¹, Katsumi Maruyama¹, Minoru Kubota¹, Satoshi Tanaka¹•Institutions (1)

Nippon Telegraph and Telephone¹

3 Oct 1994

TL;DR: A multiprocessor system is described in which each processor module comprises a processor, a distributed shared memory, a distributed memory coupler, and a distributed memory protector; the distributed shared memories are assigned global addresses common to all processor modules, and each module's distributed shared memory shares its addresses with that of the processor module that is the destination of data transfer.

Abstract: In a multiprocessor system, each processor module comprises a processor, a distributed shared memory, a distributed memory coupler for controlling copying between distributed shared memories and a distributed memory protector for protecting said distributed shared memory against illegal access. The distributed shared memories are assigned global addresses common to all the processor modules, and the distributed shared memory of each processor module has its addresses shared with the distributed shared memory of each processor module which is the destination of data transfer. Message buffers and message control areas on the distributed shared memory are divided into areas specified by a combination of sending and receiving processor modules. A processing request area on the distributed shared memory is divided corresponding to each receiving processor module and arranged accordingly. The processing request area on the receiver's side distributed shared memory has a FIFO structure. The sender's side distributed memory coupler stores identifying information of the destination processor module between the processor module communication and, upon occurrence of a write into the distributed shared memory, sends a write address and write data to the destination processor module. The receiver's side distributed memory coupler copies the received write data into the distributed shared memory of the processor module to which the distributed shared memory coupler belongs, by receiving write address and write data from the sender's side distributed memory coupler.

252 citations

Proceedings Article•10.1145/191995.192030•

A performance study of software and hardware data prefetching schemes

Tien-Fu Chen¹, Jean-Loup Baer²•Institutions (2)

National Chung Cheng University¹, University of Washington²

1 Apr 1994

TL;DR: Qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references, and an approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with least overhead.

Abstract: Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be either hardware-based or software-directed or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes. In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with least overhead.

242 citations

Journal Article•10.1006/JPDC.1994.1104•

Communication optimizations for irregular scientific computations on distributed memory architectures

Raja Das¹, Mustafa Uysal¹, Joel H. Saltz¹, Yuan-Shin Hwang¹•Institutions (1)

University of Maryland, College Park¹

01 Sep 1994-Journal of Parallel and Distributed Computing

TL;DR: A detailed performance and scalability analysis of the communication primitives is presented, carried out using a workload generator, kernels from real applications, and a large unstructured adaptive application.

216 citations

Patent•

Multi-processor with crossbar link of processors and memories and method of operation

Robert J. Gove¹, Karl M. Guttag¹, Keith Balmer¹, Nicholas Ing-Simmons¹•Institutions (1)

Texas Instruments¹

21 Jun 1994

TL;DR: In this article, a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor is described, with several individual processors all having communication links to several memories without restriction.

Abstract: There is disclosed a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links and the entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

180 citations

Journal Article•10.1016/0167-8191(94)90033-7•

The design of a standard message passing interface for distributed memory concurrent computers

David W. Walker¹•Institutions (1)

Oak Ridge National Laboratory¹

1 Apr 1994

TL;DR: Presents an overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers, which includes point-to-point and collective communication routines as well as support for process groups, communication contexts, and application topologies.

Abstract: This paper presents an overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies. While making use of new ideas where appropriate, the MPI standard is based largely on current practice.

145 citations

Journal Article•10.1016/0167-8191(94)90080-9•

Message-passing multi-cell molecular dynamics on the Connection Machine 5

D. M. Beazley¹, Peter S. Lomdahl¹•Institutions (1)

Los Alamos National Laboratory¹

1 Feb 1994

TL;DR: Presents a scalable algorithm for short-range molecular dynamics simulations on distributed memory MIMD multicomputers, based on a message-passing multi-cell approach and implemented on the CM-5.

Abstract: We present a new scalable algorithm for short-range molecular dynamics simulations on distributed memory MIMD multicomputers based on a message-passing multi-cell approach. We have implemented the algorithm on the Connection Machine 5 (CM-5) and demonstrate that meso-scale molecular dynamics with more than 10⁸ particles is now possible on massively parallel MIMD computers. Typical runs show single particle update-times of 0.15 μs in 2 dimensions (2D) and approximately 1 μs in 3 dimensions (3D) on a 1024 node CM-5 without vector units, corresponding to more than 1.8 Gflops overall performance. We also present a scaling equation which agrees well with actually observed timings.

110 citations

Proceedings Article•10.5555/1267638.1267646•

Software write detection for a distributed shared memory

Matthew J. Zekauskas¹, Wayne A. Sawdon¹, Brian N. Bershad²•Institutions (2)

Carnegie Mellon University¹, University of Washington²

14 Nov 1994

TL;DR: A new method for write detection that relies on the compiler and runtime system to detect writes to shared data without invoking the operating system, and has low average write latency and supports fine-grained sharing with low overhead.

Abstract: Most software-based distributed shared memory (DSM) systems rely on the operating system's virtual memory interface to detect writes to shared data. Strategies based on virtual memory page protection create two problems for a DSM system. First, writes can have high overhead since they are detected with a page fault. As a result, a page must be written many times to amortize the cost of that fault. Second, the size of a virtual memory page is too big to serve as a unit of coherency, inducing false sharing. Mechanisms to handle false sharing can increase runtime overhead and may cause data to be unnecessarily communicated between processors. In this paper, we present a new method for write detection that solves these problems. Our method relies on the compiler and runtime system to detect writes to shared data without invoking the operating system. We measure and compare implementations of a distributed shared memory system using both strategies, virtual memory and compiler/runtime, running a range of applications on a small scale distributed memory multicomputer. We show that the new method has low average write latency and supports fine-grained sharing with low overhead. Further, we show that the dominant cost of write detection with either strategy is due to the mechanism used to handle fine-grain sharing.
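
Compiler-assisted write detection of this kind is often explained in terms of dirty bits: the compiler emits a small amount of bookkeeping alongside each store to shared data, and the runtime later scans that bookkeeping instead of relying on page faults. The sketch below illustrates that general idea only. It is not the paper's system, and the class `DirtyTracker`, the 16-element line size and the method names are assumptions.

```python
LINE_SIZE = 16  # elements per coherency unit (an assumed, sub-page granularity)


class DirtyTracker:
    """Toy shared array with one dirty bit per LINE_SIZE-element line."""

    def __init__(self, size):
        self.memory = [0] * size
        self.dirty = [False] * ((size + LINE_SIZE - 1) // LINE_SIZE)

    def shared_store(self, index, value):
        # What compiler-inserted code might do for each store to shared data:
        # perform the write, then mark the enclosing line dirty. No operating
        # system call or page fault is involved.
        self.memory[index] = value
        self.dirty[index // LINE_SIZE] = True

    def collect_updates(self):
        # At a synchronization point the runtime gathers only the dirty lines
        # and clears their bits; clean lines are never communicated.
        updates = {}
        for line, is_dirty in enumerate(self.dirty):
            if is_dirty:
                start = line * LINE_SIZE
                updates[line] = self.memory[start:start + LINE_SIZE]
                self.dirty[line] = False
        return updates


if __name__ == "__main__":
    tracker = DirtyTracker(64)
    tracker.shared_store(3, 7)
    tracker.shared_store(40, 1)
    print(sorted(tracker.collect_updates()))
```

Because the tracking unit is much smaller than a virtual memory page, two processors writing to different lines of the same page do not interfere. This is the false-sharing problem the abstract attributes to page-based detection.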

109 citations

Patent•

Reconfigurable multi-processor operating in SIMD mode with one processor fetching instructions for use by remaining processors

Robert J. Gove¹, Keith Balmer¹, Nicholas Ing-Simmons¹, Karl M. Guttag¹•Institutions (1)

Texas Instruments¹

22 Jun 1994

Abstract: There is disclosed a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links and an inter-processor communication link allows the processors to communicate with each other for the purpose of establishing operational modes. A parameter memory, accessible via the crossbar switch, is used in conjunction with the communication link for control purposes. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

Patent•

System and method of memory access in apparatus having plural processors and plural memories

Robert J. Gove¹, Keith Balmer¹, Nicholas Ing-Simmons¹, Karl M. Guttag¹•Institutions (1)

Texas Instruments¹

22 Jun 1994

TL;DR: A multi-processor system and method arranged, in one embodiment, as an image and graphics processor is described, in which a crossbar switch establishes the links between several individual processors and several memories, and the entire image processor is contained on a single silicon chip.

Abstract: There is disclosed a multi-processor system and method arranged, in one embodiment, as an image and graphics processor. The image processor is structured with several individual processors all having communication links to several memories. A crossbar switch serves to establish the processor memory links. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

Proceedings Article•10.5555/1267638.1267647•

The design and evaluation of a shared object system for distributed memory machines

Daniel J. Scales¹, Monica S. Lam¹•Institutions (1)

Stanford University¹

14 Nov 1994

TL;DR: This paper provides an extensive analysis on several complex scientific algorithms written in SAM on a variety of hardware platforms and finds that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware.

Abstract: This paper describes the design and evaluation of SAM, a shared object system for distributed memory machines. SAM is a portable run-time system that provides a global name space and automatic caching of shared data. SAM incorporates mechanisms to address the problem of high communication overheads on distributed memory machines; these mechanisms include tying synchronization to data access, chaotic access to data, prefetching of data, and pushing of data to remote processors. SAM has been implemented on the CM-5, Intel iPSC/860 and Paragon, IBM SP1, and networks of workstations running PVM. SAM applications run on all these platforms without modification. This paper provides an extensive analysis on several complex scientific algorithms written in SAM on a variety of hardware platforms. We find that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware. Our experience suggests that SAM is successful in allowing programmers to use distributed memory machines effectively with much less programming effort than required today.

Proceedings Article•10.1109/HICSS.1994.323177•

Performance optimizations, implementation, and verification of the SGI Challenge multiprocessor

M. Galles, E. Williams

1 Jan 1994

TL;DR: The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system.

Abstract: This paper presents the architecture, implementation, and performance results for the SGI Challenge symmetric multiprocessor system. Novel aspects of the architecture are highlighted, as well as key design trade-offs targeted at increasing performance and reducing complexity. Multiprocessor design verification techniques and their impact are also presented. The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system. Hardware cache coherence mechanisms maintain a consistent view of shared memory for all processors, with no software overhead and minimal impact on processor performance. HDL simulation with random, self checking vector generation and a lightweight operating system on full processor models contributed to a concept to customer shipment cycle of 26 months.

Proceedings Article•10.5555/602770.602793•

Run-time and compile-time support for adaptive irregular problems

S.D. Sharma¹, Ravi Ponnusamy¹, Bongki Moon¹, Yuan-Shin Hwang¹, Raja Das¹, Joel H. Saltz¹•Institutions (1)

University of Maryland, College Park¹

14 Nov 1994

TL;DR: Describes CHAOS, a library of efficient runtime primitives that provides support for dynamic data partitioning, efficient preprocessing and fast data migration in adaptive irregular problems, and its use in parallelizing kernels from two adaptive applications.

Abstract: In adaptive irregular problems, data arrays are accessed via indirection arrays, and data access patterns change during computation. Parallelizing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This paper describes CHAOS, a library of efficient runtime primitives that provides such support. To demonstrate the effectiveness of the runtime support, two adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a code for simulating gas flows (DSMC). We have also proposed minor extensions to Fortran D which would enable compilers to parallelize irregular forall loops in such adaptive applications by embedding calls to primitives provided by a runtime library. We have implemented our proposed extensions in the Syracuse Fortran 90D/HPF prototype compiler, and have used the compiler to parallelize kernels from two adaptive applications.

Journal Article•10.1006/JPDC.1994.1039•

Compiling Fortran 90D/HPF for distributed memory MIMD computers

Zeki Bozkus¹, Alok Choudhary¹, Geoffrey C. Fox¹, Tomasz Haupt¹, Sanjay Ranka¹, Min-You Wu¹•Institutions (1)

Syracuse University¹

01 Apr 1994-Journal of Parallel and Distributed Computing

TL;DR: Describes an advanced compiler that can generate efficient parallel programs when the source programming language naturally represents an application's parallelism; Fortran 90D/HPF is such a language.

Proceedings Article•10.1145/191995.192021•

Software versus hardware shared-memory implementation: a case study

Alan L. Cox¹, Sandhya Dwarkadas¹, P. Keleher¹, Honghui Lu¹, Ramakrishnan Rajamony¹, Willy Zwaenepoel¹•Institutions (1)

Rice University¹

1 Apr 1994

TL;DR: The results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases.

Abstract: We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on an SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approach, but synchronization remains a bottleneck.

Journal Article•10.1142/S0129626494000235•

Toward automatic distribution

Paul Feautrier

01 Sep 1994-Parallel Processing Letters

TL;DR: The resulting structure satisfies the “owner computes” rule and is reminiscent of two-level distribution schemes, like HPF’s and directives, or the CM-2 virtual processor system.

Abstract: This paper considers the problem of distributing data and code among the processors of a distributed memory supercomputer. Provided that the source program is amenable to detailed dataflow analysis, one may determine a placement function by an incremental analogue of Gaussian elimination. Such a function completely characterizes the distribution by giving the identity of the virtual processor on which each elementary calculation is done. One has then to “realize” the virtual processors on the PE. The resulting structure satisfies the “owner computes” rule and is reminiscent of two-level distribution schemes, like HPF’s and directives, or the CM-2 virtual processor system.

Journal Article•10.1109/38.291531•

Communication costs for parallel volume-rendering algorithms

Ulrich Neumann¹•Institutions (1)

University of North Carolina at Chapel Hill¹

01 Jul 1994-IEEE Computer Graphics and Applications

TL;DR: The article enumerates and classifies parallel volume-rendering algorithms suitable for multicomputers with distributed memory and a communication network, and determines the communication costs for classes of parallel algorithms by considering their inherent communication requirements.

Abstract: The computational expense of volume rendering motivates the development of parallel implementations on multicomputers. Parallelism achieves higher frame rates, which provide more natural viewing control and enhanced comprehension of 3D structure. Although many parallel implementations exist, we have no framework to compare their relative merits independent of host hardware. The article attempts to establish that framework by enumerating and classifying parallel volume-rendering algorithms suitable for multicomputers with distributed memory and a communication network. It determines the communication costs for classes of parallel algorithms by considering their inherent communication requirements.

Book Chapter•10.1007/BFB0025891•

Cid: A Parallel, Shared-Memory C for Distributed-Memory Machines

Rishiyur S. Nikhil

8 Aug 1994

TL;DR: Cid is a parallel, “shared-memory” superset of C for distributed-memory machines that uses available C compilers and packet-transport primitives, and links with existing libraries.

Abstract: Cid is a parallel, “shared-memory” superset of C for distributed-memory machines. A major objective is to keep the entry cost low. For users, the language should be easily comprehensible to a C programmer. For implementors, it should run on standard hardware (including workstation farms); it should not require major new compilation techniques (which may not even be widely applicable); and it should be compatible with existing code, run-time systems and tools. Cid is implemented with a simple pre-processor and a library, uses available C compilers and packet-transport primitives, and links with existing libraries.

Proceedings Article•10.1145/191995.192026•

Exploring the design space for a shared-cache multiprocessor

Basem A. Nayfeh¹, Kunle Olukotun¹•Institutions (1)

Stanford University¹

1 Apr 1994

TL;DR: This paper investigates the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors, and shows that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system.

Abstract: In the near future, semiconductor technology will allow the integration of multiple processors on a chip or multichip-module (MCM). In this paper we investigate the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors. We study the performance of a cluster-based multiprocessor architecture in which processors within a cluster are tightly coupled via a shared cluster cache for various processor-cache configurations. Our results show that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system, without increasing the number of invalidations. Combining these results with cost estimates for shared cluster cache implementations leads to two conclusions: 1) For a four cluster multiprocessor with single chip clusters, two processors per cluster with a smaller cache provides higher performance and better cost/performance than a single processor with a larger cache and 2) this four cluster configuration can be scaled linearly in performance by adding processors to each cluster using MCM packaging techniques.

Journal Article•10.1006/JPDC.1994.1108•

Scalability issues affecting the design of a dense linear algebra library

Jack Dongarra¹,², Robert A. van de Geijn, David W. Walker²•Institutions (2)

University of Tennessee¹, Oak Ridge National Laboratory²

01 Sep 1994-Journal of Parallel and Distributed Computing

TL;DR: This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers, and shows that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor.

Book•

Advanced Topics in Dataflow Computing and Multithreading

Lubomir Bic, Guang R. Gao, Jean-Luc Gaudiot

1 Aug 1994

TL;DR: Examines recent advances in design, modeling, and implementation of dataflow and multithreaded computers and introduces the reader to dataflow concepts that show how functional programming ideas can be harnessed to exploit the power of parallel computing.

Abstract: From the Publisher: Examines recent advances in design, modeling, and implementation of dataflow and multithreaded computers. The text contains reports concerning many of the world's leading projects engaged in the continuing evolution and application of dataflow concepts. It covers the broad range of dataflow principles in program representation - from language design to processor architecture - and compiler optimization techniques. The book includes papers on massively parallel distributed memory and multithreaded architecture design, synchronization and pipelined design, and superpipelined data-driven VLSI processors. Other sections discuss stream data types, the development of well-structured software, and parallelization of dataflow programs. It also details an analytical model for the behavior of dataflow graphs, compares a centralized work distribution scheme with a distributed scheme, and presents a comprehensive approach to understanding workload management schemes. Altogether, the text introduces the reader to dataflow concepts that show how functional programming ideas can be harnessed to exploit the power of parallel computing.

Patent•

Multiprocessor system with distributed memory

John Joseph Coleman¹, Ronald Gerald Coleman¹, Owen Keith Monroe¹, Robert Frederick Stucke¹, Elizabeth Anne Vanderbeck¹, Stephen E. Bello¹, John R. Hattersley¹, Kien A. Hua¹, David Raymond Pruett¹, Gerald Franklin Rollo¹•Institutions (1)

IBM¹

8 Nov 1994

TL;DR: Describes a parallel computer system in which a plurality of high-level processors are joined by a cross-point or cross-bar switch, with an adapter between each processor and the switch performing the protocol processing that drives the switch, transfers pages, and schedules transmissions between the processors.

Abstract: A parallel computer system is disclosed comprising a plurality of high level processors joined together using a cross-point or cross-bar switch. The system includes an adapter between each processor and the switch. Protocol processing to drive the switch, transfer pages and schedule transmissions between the processors is performed by the adapter. The protocol uses the notion of typed or tagged buffer management that allows a client to bind the semantics of a message being sent or received. These semantics specify behaviors in the protocol when message packets depart or when they arrive.

Journal Article•10.1159/000154205•

Parallelization of general-linkage analysis problems.

Sandhya Dwarkadas¹, Alejandro A. Schäffer, Robert W. Cottingham, Alan L. Cox, P. Keleher, Willy Zwaenepoel•Institutions (1)

Rice University¹

01 May 1994-Human Heredity

TL;DR: A parallel implementation of a genetic-linkage analysis program that achieves good speed improvement, even for analyses on a single pedigree and with a single starting recombination fraction vector is described.

Abstract: We describe a parallel implementation of a genetic-linkage analysis program that achieves good speed improvement, even for analyses on a single pedigree and with a single starting recombination fraction vector. Our parallel implementation has been run on three different platforms: an Ethernet network of workstations, a higher-bandwidth asynchronous transfer mode (ATM) network of workstations, and a shared-memory multiprocessor. The same program, written in a shared-memory programming style, is used on all platforms. On the workstation networks, the hardware does not provide shared memory, so the program executes on a distributed shared memory system that implements shared memory in software. These three platforms represent different points on the price/performance scale. Ethernet networks are cheap and omnipresent, ATM networks are an emerging technology that offers higher bandwidth, and shared-memory multiprocessors offer the best performance because communication is implemented entirely by hardware. On 8 processors and for the longer runs, we achieve speedups between 3.5 and 5 on the Ethernet network and between 4.8 and 6 on the ATM network. On the shared-memory multiprocessor, we achieve speedups in the 5.5-6.5 range for all runs.
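The reported speedups can be read as parallel efficiencies (speedup divided by processor count). The sketch below is only an illustration of that arithmetic and is not taken from the paper; the function name `efficiency` is hypothetical, and applying the processor count of 8 to the shared-memory runs is an assumption, since the abstract states it explicitly only for the workstation networks.

```python
def efficiency(speedup: float, processors: int) -> float:
    """Parallel efficiency: the fraction of ideal linear speedup achieved."""
    return speedup / processors


# Speedup ranges quoted in the abstract.
# ASSUMPTION: 8 processors for every platform, including shared memory.
PROCESSORS = 8
reported_speedups = {
    "Ethernet": (3.5, 5.0),
    "ATM": (4.8, 6.0),
    "shared-memory multiprocessor": (5.5, 6.5),
}

for platform, (low, high) in reported_speedups.items():
    low_eff = efficiency(low, PROCESSORS)
    high_eff = efficiency(high, PROCESSORS)
    print(f"{platform}: efficiency {low_eff:.0%} to {high_eff:.0%}")
```

Under that assumption, an efficiency well below 100% on the Ethernet network is consistent with the abstract's point that the software distributed shared memory pays for communication that the multiprocessor handles in hardware.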

Proceedings Article•10.1145/191995.192019•

Evaluating the memory overhead required for COMA architectures

Truman Joe¹, John L. Hennessy¹•Institutions (1)

Stanford University¹

1 Apr 1994

TL;DR: Simulation data shows that the frequency of data reshuffling is sensitive to the allocation policy and associativity of the memory but is relatively unaffected by the block size chosen, and that data replication in the attraction memory is important for good performance, but most gains can be achieved through replication in the processor caches.

Abstract: Cache only memory architectures (COMA) have an inherent memory overhead due to the organization of main memory as a large cache called an attraction memory. This overhead consists of memory left unallocated for performance reasons as well as additional physical memory required due to the cache organization of memory. In this work, we examine the effect of data reshuffling and data replication on the memory overhead. Data reshuffling occurs when space needs to be allocated to store a remote memory line in the local memory. Data that is reshuffled is sent between memories via replacement messages. A simple mathematical model predicts the frequency of data reshuffling as a function of the attraction memory parameters. Simulation data shows that the frequency of data reshuffling is sensitive to the allocation policy and associativity of the memory but is relatively unaffected by the block size chosen. The simulation data also shows that data replication in the attraction memory is important for good performance, but most gains can be achieved through replication in the processor caches.

Patent•

Run-time dynamically adaptive computer process for facilitating communication between computer programs

Daniel P. Schiavone¹•Institutions (1)

Ball Corporation¹

4 Nov 1994

TL;DR: Presents a dynamic interface between two dissimilar software programs that must communicate with each other, whether running on one or a plurality of computers; it can provide bi-directional, non-intrusive data manipulation and communications between software programs on a distributed computing platform or across platforms on a distributed network.

Abstract: The present invention provides a dynamic interface between two dissimilar software programs that must communicate with each other, whether running on one or a plurality of computers. The invention can provide bi-directional, non-intrusive data manipulation and communications between software programs on a distributed computing platform or across platforms on a distributed network. The invention includes user-defined template files, a user-defined equality file, first and second blocks of shared memory, a master interface, and a slave interface. The template files define the output and input data of their respective programs and map the output and input data to blocks of memory. The equality file equates the input data and output data of one program with the output data and input data, respectively, of the other computer program. The master interface takes data from the master side block of memory, reconfigures the data based on the contents of the equality file to match the input data requirements of the second computer program, and sends the reconfigured data to the slave interface to be loaded into the slave side block of shared memory. The second computer program accesses the reconfigured data from the slave side of shared memory.

Proceedings Article•10.1109/IPPS.1994.288261•

Processor mapping techniques toward efficient data redistribution

E. T. Kalns¹, Lionel M. Ni¹•Institutions (1)

Michigan State University¹

1 Apr 1994

TL;DR: This paper presents a technique for data-processor mapping, applicable to data redistribution, that minimizes the total amount of data that must be communicated among processors.

Abstract: Run-time data redistribution can affect algorithm performance in distributed-memory machines. Redistribution of data can be performed between algorithm phases when a different data decomposition is expected to deliver increased performance for a subsequent phase of computation. Additionally, data redistribution can occur at subprogram boundaries. Redistribution, however, represents increased program overhead as algorithm computation is necessarily discontinued while data are exchanged among processor memories. In this paper, we present a technique for data-processor mapping, applicable to data redistribution, that minimizes the total amount of data that must be communicated among processors. The mapping technique is architecture-independent and represents our initial work toward achieving efficient redistribution in distributed-memory machines.
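The idea that the choice of data-processor mapping changes how much data a redistribution moves can be illustrated with a small sketch. This is not the authors' technique: it is a brute-force search over relabelings of the destination processors, practical only for small processor counts, and every name in it (`block_owner`, `cyclic2_owner`, `moved`, `best_relabeling`) is hypothetical. It assumes a one-dimensional array redistributed from a BLOCK layout to a CYCLIC(2) layout.

```python
from itertools import permutations


def block_owner(i: int, n: int, p: int) -> int:
    """Owner of element i when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling division
    return i // block


def cyclic2_owner(i: int, n: int, p: int) -> int:
    """Owner of element i under CYCLIC(2): chunks of 2 dealt round-robin."""
    return (i // 2) % p


def moved(n, p, src, dst, relabel):
    """Count elements whose owner changes when logical destination
    processor d is placed on physical processor relabel[d]."""
    return sum(1 for i in range(n) if src(i, n, p) != relabel[dst(i, n, p)])


def best_relabeling(n, p, src, dst):
    """Exhaustively try every relabeling (illustration only, O(p!))."""
    identity = tuple(range(p))
    best = min(permutations(range(p)),
               key=lambda r: moved(n, p, src, dst, r))
    return moved(n, p, src, dst, identity), moved(n, p, src, dst, best), best


baseline, optimized, mapping = best_relabeling(16, 4, block_owner, cyclic2_owner)
print(baseline, optimized, mapping)  # 12 elements move by default, 8 after relabeling
```

For 16 elements on 4 processors, the default labeling moves 12 elements, while the best relabeling keeps 8 in place and moves only 8. The exhaustive search is used here purely for clarity; it says nothing about how the paper computes its mapping.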

Proceedings Article•10.1109/HICSS.1994.323149•

The S3.mp scalable shared memory multiprocessor

Andreas Nowatzyk¹, Gunes Aybay, Michael C. Browne, Edmund J. Kelly, D. Lee, Michael W. Parkin•Institutions (1)

Sun Microsystems¹

1 Jan 1994

TL;DR: S3.mp is a research project to demonstrate a low-overhead, high-throughput communication system based on cache-coherent distributed shared memory (DSM); it uses distributed directories and point-to-point messages sent over a packet-switched interconnect fabric to achieve scalability over a wide range of configurations.

Abstract: S3.mp (Sun's Scalable Shared memory MultiProcessor) is a research project to demonstrate a low overhead, high throughput communication system that is based on cache coherent distributed shared memory (DSM). S3.mp uses distributed directories and point-to-point messages that are sent over a packet switched interconnect fabric to achieve scalability over a wide range of configurations. S3.mp uses a new CMOS serial link technology that achieves transmission rates >1 Gbit/sec and that is directly integrated into a packet router chip. Unlike other DSM systems, S3.mp can be spatially distributed over a local area via fiber optic links. This capability allows S3.mp to interconnect clusters of workstations to form multiprocessor workgroups that efficiently share memory, processors and I/O devices. Multichip module technology, the integrated arbitrary topology router, fast serial links, and a DSM system that is integrated into the memory controller allow compact, massively parallel S3.mp systems.

Proceedings Article•10.1145/181014.181081•

Experiences with parallel N-body simulation

Pangfeng Liu¹, Sandeep N. Bhatt²•Institutions (2)

Rutgers University¹, Telcordia Technologies²

1 Aug 1994

TL;DR: This paper describes the experiences developing high-performance code for astrophysical N-body simulations and uses a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures.

Abstract: This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational work-load among processors, these issues pose interesting challenges and tradeoffs for high-performance implementation. Our implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally justified. The code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM-5. The overall performance on instances with 10 million bodies is typically over 30% of the peak machine rate. Preliminary timings compare favorably with other approaches.
