Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 1994
Journal Article•10.1007/BF00818163•
Fast folding and comparison of RNA secondary structures

Ivo L. Hofacker1, Walter Fontana2, Peter F. Stadler1,2, L. S. Bonhoeffer3, Manfred Tacker1, Peter Schuster1,2,4 •
University of Vienna1, Santa Fe Institute2, University of Oxford3, Institute of Molecular Biotechnology4
01 Feb 1994-Monatshefte Fur Chemie
TL;DR: The Vienna RNA package as mentioned in this paper is based on dynamic programming algorithms and aims at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities.
Abstract: Computer codes for computation and comparison of RNA secondary structures, the Vienna RNA package, are presented, that are based on dynamic programming algorithms and aim at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities. An efficient heuristic for the inverse folding problem of RNA is introduced. In addition we present compact and efficient programs for the comparison of RNA secondary structures based on tree editing and alignment. All computer codes are written in ANSI C. They include implementations of modified algorithms on parallel computers with distributed memory. Performance analysis carried out on an Intel Hypercube shows that parallel computing becomes gradually more and more efficient the longer the sequences are.
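The dynamic programming idea behind secondary-structure prediction can be illustrated with a much simpler stand-in. The sketch below is not the Vienna RNA package's energy-based algorithm; it swaps in Nussinov-style base-pair maximization, which fills the same kind of table over subsequences but counts pairs instead of summing free energies. The function name, the allowed pair set, and the minimum loop length of 3 are assumptions made for illustration.

```python
# Nussinov-style base-pair maximization: a simplified stand-in for
# minimum-free-energy folding (counts pairs, uses no energy model).
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed value of 3).
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: base j stays unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    # case: k pairs with j, splitting the interval in two
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

For example, `max_base_pairs("GGGAAAUCC")` finds the three nested pairs of a small hairpin. The table has O(n^2) entries and each takes O(n) work, which is why long sequences are costly to fold.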

2,473 citations

Journal Article•10.1002/CPE.4330060203•
Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems

Stephen T. Barnard, Horst D. Simon
01 Apr 1994-Concurrency and Computation: Practice and Experience
TL;DR: In this paper, a multilevel version of RSB is introduced that attains about an order-of-magnitude improvement in run time on typical examples, and it is shown that RSB in its simplest form is expensive.
Abstract: Summary: If problems involving unstructured meshes are to be solved efficiently on distributed-memory parallel computers, the meshes must be partitioned and distributed across processors in a way that balances the computational load and minimizes communication. The recursive spectral bisection method (RSB) has been shown to be very effective for such partitioning problems compared to alternative methods, but RSB in its simplest form is expensive. Here a multilevel version of RSB is introduced that attains about an order-of-magnitude improvement in run time on typical examples. 1. Introduction: Unstructured meshes are used in several large-scale scientific and engineering problems, including finite-volume methods for computational fluid dynamics and finite-element methods for structural analysis. If unstructured problems such as these are to be solved on distributed-memory parallel computers, their data structures must be partitioned and distributed across processors; if they are to be solved efficiently, the partitioning must maximize load balance and minimize interprocessor communication. Recently, the recursive spectral bisection method (RSB) [1] has been shown to be very effective for such partitioning problems compared to alternative methods. Unfortunately, RSB in its simplest form is expensive. We shall describe a multilevel version of RSB that attains about an order-of-magnitude improvement in run time on typical examples.
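To make a single bisection step concrete, here is a minimal sketch of plain (single-level) spectral bisection, not the multilevel method the paper introduces. It approximates the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue) by shifted power iteration, which is an assumption chosen to keep the example dependency-free, and then splits the vertices at the median. All function names are hypothetical.

```python
def fiedler_vector(n, edges, iterations=200):
    """Approximate the Fiedler vector of an undirected, connected graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # shift >= largest Laplacian eigenvalue, so (shift*I - L) has
    # non-negative eigenvalues and power iteration favors small ones of L.
    shift = 2.0 * max(len(neighbors) for neighbors in adj)
    x = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the constant eigenvector
        # y = (shift*I - L) x, where (L x)[i] = deg(i)*x[i] - sum of neighbors
        y = [shift * x[i] - (len(adj[i]) * x[i] - sum(x[j] for j in adj[i]))
             for i in range(n)]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return x


def spectral_bisect(n, edges):
    """Split vertices into two equal halves by sorted Fiedler value."""
    x = fiedler_vector(n, edges)
    order = sorted(range(n), key=lambda i: x[i])
    half = n // 2
    return set(order[:half]), set(order[half:])


def cut_size(edges, part):
    """Number of edges with exactly one endpoint in part."""
    return sum(1 for u, v in edges if (u in part) != (v in part))
```

On two triangles joined by one edge, `spectral_bisect` separates the triangles and cuts only the bridge. Applying the split recursively to each half gives the "recursive" part of RSB; the eigenvector computation is the expensive step that a multilevel scheme aims to speed up.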

616 citations

Proceedings Article•10.1145/195473.195575•
Fine-grain access control for distributed shared memory

Ioannis T. Schoinas1, Babak Falsafi1, Alvin R. Lebeck1, Steven K. Reinhardt1, James R. Larus1, David A. Wood1 •
University of Wisconsin-Madison1
1 Nov 1994
TL;DR: In this paper, the authors discuss implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions, and incorporate three techniques that require no additional hardware into Blizzard.
Abstract: This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing. This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is a hybrid. The software technique ranged from slightly faster to two times slower than the ECC approach. Blizzard's performance is roughly comparable to a hardware shared-memory machine. These results argue that clusters of workstations or personal computers with networks comparable to the CM-5's will be able to support the same shared-memory interfaces as supercomputers.
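The first technique, a software lookup before each shared-memory reference, can be pictured roughly as follows. This is a sketch under stated assumptions rather than Blizzard's implementation: the 32-byte block size, the three access states, the `AccessTable` class, and the `demo_handler` that simply grants permission are all hypothetical stand-ins for what a real system would do by rewriting the executable and running a coherence protocol.

```python
from enum import Enum


class Access(Enum):
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2


BLOCK_SIZE = 32  # bytes per cache block (assumed value)
faults = []      # log of (block, is_write) faults, for illustration only


class AccessTable:
    """Per-block access states checked in software before each reference."""

    def __init__(self, num_blocks, on_fault):
        self.state = [Access.INVALID] * num_blocks
        self.memory = bytearray(num_blocks * BLOCK_SIZE)
        self.on_fault = on_fault

    def load(self, addr):
        block = addr // BLOCK_SIZE
        if self.state[block] is Access.INVALID:  # the inserted lookup
            self.on_fault(self, block, write=False)
        return self.memory[addr]

    def store(self, addr, value):
        block = addr // BLOCK_SIZE
        if self.state[block] is not Access.READ_WRITE:  # the inserted lookup
            self.on_fault(self, block, write=True)
        self.memory[addr] = value


def demo_handler(table, block, write):
    """Stand-in for a coherence protocol: record the fault, grant access."""
    faults.append((block, write))
    table.state[block] = Access.READ_WRITE if write else Access.READ_ONLY
```

The lookup cost is paid on every shared reference, including those that never fault, which is consistent with the abstract's report that the software technique ranged from slightly faster to two times slower than the ECC approach.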

277 citations

Patent•
Message passing system for distributed shared memory multiprocessor system and message passing method using the same

Shigeki Yamada1, Katsumi Maruyama1, Minoru Kubota1, Satoshi Tanaka1•
Nippon Telegraph and Telephone1
3 Oct 1994
TL;DR: In this patent, each processor module of a multiprocessor system comprises a processor, a distributed shared memory, a distributed memory coupler, and a distributed memory protector; the distributed shared memories are assigned global addresses common to all the processor modules, and each module shares its addresses with the distributed shared memory of the module that is the destination of a data transfer.
Abstract: In a multiprocessor system, each processor module comprises a processor, a distributed shared memory, a distributed memory coupler for controlling copying between distributed shared memories and a distributed memory protector for protecting said distributed shared memory against illegal access. The distributed shared memories are assigned global addresses common to all the processor modules, and the distributed shared memory of each processor module has its addresses shared with the distributed shared memory of each processor module which is the destination of data transfer. Message buffers and message control areas on the distributed shared memory are divided into areas specified by a combination of sending and receiving processor modules. A processing request area on the distributed shared memory is divided corresponding to each receiving processor module and arranged accordingly. The processing request area on the receiver's side distributed shared memory has a FIFO structure. The sender's side distributed memory coupler stores identifying information of the destination processor module between the processor module communication and, upon occurrence of a write into the distributed shared memory, sends a write address and write data to the destination processor module. The receiver's side distributed memory coupler copies the received write data into the distributed shared memory of the processor module to which the distributed shared memory coupler belongs, by receiving write address and write data from the sender's side distributed memory coupler.

252 citations

Proceedings Article•10.1145/191995.192030•
A performance study of software and hardware data prefetching schemes

Tien-Fu Chen1, Jean-Loup Baer2•
National Chung Cheng University1, University of Washington2
1 Apr 1994
TL;DR: Qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references, and an approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with least overhead.
Abstract: Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be either hardware-based or software-directed or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes. In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with least overhead.

242 citations

Journal Article•10.1006/JPDC.1994.1104•
Communication optimizations for irregular scientific computations on distributed memory architectures

Raja Das1, Mustafa Uysal1, Joel H. Saltz1, Yuan-Shin Hwang1•
University of Maryland, College Park1
01 Sep 1994-Journal of Parallel and Distributed Computing
TL;DR: A detailed performance and scalability analysis of the communication primitives is presented, carried out using a workload generator, kernels from real applications, and a large unstructured adaptive application.

216 citations

Patent•
Multi-processor with crossbar link of processors and memories and method of operation

Robert J. Gove1, Karl M. Guttag1, Keith Balmer1, Nicholas Ing-Simmons1•
Texas Instruments1
21 Jun 1994
TL;DR: In this article, a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor is described, with several individual processors all having communication links to several memories without restriction.
Abstract: There is disclosed a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links, and the entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

180 citations

Journal Article•10.1016/0167-8191(94)90033-7•
The design of a standard message passing interface for distributed memory concurrent computers

David W. Walker1•
Oak Ridge National Laboratory1
1 Apr 1994
TL;DR: An overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers, which includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies is presented.
Abstract: This paper presents an overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies. While making use of new ideas where appropriate, the MPI standard is based largely on current practice.

145 citations

Journal Article•10.1016/0167-8191(94)90080-9•
Message-passing multi-cell molecular dynamics on the Connection Machine 5

D. M. Beazley1, Peter S. Lomdahl1•
Los Alamos National Laboratory1
1 Feb 1994
TL;DR: In this article, a scalable message-passing multi-cell algorithm is proposed for short-range molecular dynamics simulations on distributed memory MIMD multicomputers.
Abstract: We present a new scalable algorithm for short-range molecular dynamics simulations on distributed memory MIMD multicomputers based on a message-passing multi-cell approach. We have implemented the algorithm on the Connection Machine 5 (CM-5) and demonstrate that meso-scale molecular dynamics with more than 10^8 particles is now possible on massively parallel MIMD computers. Typical runs show single particle update-times of 0.15 μs in 2 dimensions (2D) and approximately 1 μs in 3 dimensions (3D) on a 1024 node CM-5 without vector units, corresponding to more than 1.8 Gflops overall performance. We also present a scaling equation which agrees well with actually observed timings.
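The cell idea underlying short-range molecular dynamics can be shown on a single processor. The sketch below is an illustration only, not the paper's message-passing algorithm: it bins 2D particles into square cells whose side equals the interaction cutoff, so each particle only needs to be tested against particles in its own and adjacent cells. The function name, the non-periodic boundaries, and the tuple-based particle layout are assumptions.

```python
from collections import defaultdict
from itertools import product


def neighbor_pairs(points, cutoff):
    """Return all index pairs (i, j), i < j, within cutoff of each other.

    points is a list of (x, y) tuples; boundaries are non-periodic.
    """
    # Bin every particle into a cell of side length cutoff.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cutoff), int(y // cutoff))].append(idx)

    pairs = set()
    for (cx, cy), members in cells.items():
        # Only the 3x3 block of surrounding cells can hold neighbors.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in cells.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j:
                        (xi, yi), (xj, yj) = points[i], points[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= cutoff ** 2:
                            pairs.add((i, j))
    return pairs
```

In a distributed-memory setting, blocks of cells would be assigned to different processors and only particles in boundary cells would need to be exchanged, which is what makes the short-range case scale.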

110 citations

Proceedings Article•10.5555/1267638.1267646•
Software write detection for a distributed shared memory

Matthew J. Zekauskas1, Wayne A. Sawdon1, Brian N. Bershad2•
Carnegie Mellon University1, University of Washington2
14 Nov 1994
TL;DR: A new method for write detection that relies on the compiler and runtime system to detect writes to shared data without invoking the operating system, and has low average write latency and supports fine-grained sharing with low overhead.
Abstract: Most software-based distributed shared memory (DSM) systems rely on the operating system's virtual memory interface to detect writes to shared data. Strategies based on virtual memory page protection create two problems for a DSM system. First, writes can have high overhead since they are detected with a page fault. As a result, a page must be written many times to amortize the cost of that fault. Second, the size of a virtual memory page is too big to serve as a unit of coherency, inducing false sharing. Mechanisms to handle false sharing can increase runtime overhead and may cause data to be unnecessarily communicated between processors. In this paper, we present a new method for write detection that solves these problems. Our method relies on the compiler and runtime system to detect writes to shared data without invoking the operating system. We measure and compare implementations of a distributed shared memory system using both strategies, virtual memory and compiler/runtime, running a range of applications on a small scale distributed memory multicomputer. We show that the new method has low average write latency and supports fine-grained sharing with low overhead. Further, we show that the dominant cost of write detection with either strategy is due to the mechanism used to handle fine-grain sharing.
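One way to picture compiler/runtime write detection is a dirty flag set by code that accompanies every store to shared data. The sketch below is a hypothetical illustration, not the paper's implementation: the `SharedRegion` class, the 4-word block size, and the use of one boolean dirty flag per block are all assumptions. The `write` method plays the role of what a compiler would emit around a shared store.

```python
BLOCK_WORDS = 4  # words per coherence block (assumed value)


class SharedRegion:
    """Shared data with per-block dirty flags maintained in software."""

    def __init__(self, num_words):
        self.words = [0] * num_words
        num_blocks = (num_words + BLOCK_WORDS - 1) // BLOCK_WORDS
        self.dirty = [False] * num_blocks

    def write(self, index, value):
        # The store itself, plus the bookkeeping a compiler would insert;
        # no page fault or operating system call is involved.
        self.words[index] = value
        self.dirty[index // BLOCK_WORDS] = True

    def collect_updates(self):
        """Return {block: contents} for modified blocks and reset flags."""
        updates = {}
        for block, is_dirty in enumerate(self.dirty):
            if is_dirty:
                start = block * BLOCK_WORDS
                updates[block] = self.words[start:start + BLOCK_WORDS]
                self.dirty[block] = False
        return updates
```

Because the flag granularity is a few words rather than a virtual memory page, two processors writing to different parts of the same page would not be reported as sharing, which is the false-sharing problem the abstract describes.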

109 citations

Patent•
Reconfigurable multi-processor operating in SIMD mode with one processor fetching instructions for use by remaining processors

Robert J. Gove1, Keith Balmer1, Nicholas Ing-Simmons1, Karl M. Guttag1•
Texas Instruments1
22 Jun 1994
TL;DR: In this article, a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor is described, with several individual processors all having communication links to several memories without restriction.
Abstract: There is disclosed a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links and an inter-processor communication link allows the processors to communicate with each other for the purpose of establishing operational modes. A parameter memory, accessible via the crossbar switch, is used in conjunction with the communication link for control purposes. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.
Patent•
System and method of memory access in apparatus having plural processors and plural memories

Robert J. Gove1, Keith Balmer1, Nicholas Ing-Simmons1, Karl M. Guttag1•
Texas Instruments1
22 Jun 1994
TL;DR: In this patent, a multi-processor system and method arranged, in one embodiment, as an image and graphics processor is described, in which a crossbar switch establishes the links between several processors and several memories, all contained on a single silicon chip.
Abstract: There is disclosed a multi-processor system and method arranged, in one embodiment, as an image and graphics processor. The image processor is structured with several individual processors all having communication links to several memories. A crossbar switch serves to establish the processor memory links. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.
Proceedings Article•10.5555/1267638.1267647•
The design and evaluation of a shared object system for distributed memory machines

Daniel J. Scales1, Monica S. Lam1•
Stanford University1
14 Nov 1994
TL;DR: This paper provides an extensive analysis on several complex scientific algorithms written in SAM on a variety of hardware platforms and finds that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware.
Abstract: This paper describes the design and evaluation of SAM, a shared object system for distributed memory machines. SAM is a portable run-time system that provides a global name space and automatic caching of shared data. SAM incorporates mechanisms to address the problem of high communication overheads on distributed memory machines; these mechanisms include tying synchronization to data access, chaotic access to data, prefetching of data, and pushing of data to remote processors. SAM has been implemented on the CM-5, Intel iPSC/860 and Paragon, IBM SP1, and networks of workstations running PVM. SAM applications run on all these platforms without modification. This paper provides an extensive analysis on several complex scientific algorithms written in SAM on a variety of hardware platforms. We find that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware. Our experience suggests that SAM is successful in allowing programmers to use distributed memory machines effectively with much less programming effort than required today.
Proceedings Article•10.1109/HICSS.1994.323177•
Performance optimizations, implementation, and verification of the SGI Challenge multiprocessor

M. Galles, E. Williams
1 Jan 1994
TL;DR: The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system.
Abstract: This paper presents the architecture, implementation, and performance results for the SGI Challenge symmetric multiprocessor system. Novel aspects of the architecture are highlighted, as well as key design trade-offs targeted at increasing performance and reducing complexity. Multiprocessor design verification techniques and their impact are also presented. The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system. Hardware cache coherence mechanisms maintain a consistent view of shared memory for all processors, with no software overhead and minimal impact on processor performance. HDL simulation with random, self-checking vector generation and a lightweight operating system on full processor models contributed to a concept-to-customer-shipment cycle of 26 months.
Proceedings Article•10.5555/602770.602793•
Run-time and compile-time support for adaptive irregular problems

S.D. Sharma1, Ravi Ponnusamy1, Bongki Moon1, Yuan-Shin Hwang1, Raja Das1, Joel H. Saltz1 •
University of Maryland, College Park1
14 Nov 1994
TL;DR: CHAOS is described, a library of efficient runtime primitives that provides support for dynamic data partitioning, efficient preprocessing and fast data migration in adaptive irregular problems and is used to parallelize kernels from two adaptive applications.
Abstract: In adaptive irregular problems, data arrays are accessed via indirection arrays, and data access patterns change during computation. Parallelizing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This paper describes CHAOS, a library of efficient runtime primitives that provides such support. To demonstrate the effectiveness of the runtime support, two adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a code for simulating gas flows (DSMC). We have also proposed minor extensions to Fortran D which would enable compilers to parallelize irregular forall loops in such adaptive applications by embedding calls to primitives provided by a runtime library. We have implemented our proposed extensions in the Syracuse Fortran 90D/HPF prototype compiler, and have used the compiler to parallelize kernels from two adaptive applications.
Journal Article•10.1006/JPDC.1994.1039•
Compiling Fortran 90D/HPF for distributed memory MIMD computers

Zeki Bozkus1, Alok Choudhary1, Geoffrey C. Fox1, Tomasz Haupt1, Sanjay Ranka1, Min-You Wu1 •
Syracuse University1
01 Apr 1994-Journal of Parallel and Distributed Computing
TL;DR: This paper describes an advanced compiler that can generate efficient parallel programs when the source programming language naturally represents an application's parallelism; Fortran 90D/HPF is such a language.
Proceedings Article•10.1145/191995.192021•
Software versus hardware shared-memory implementation: a case study

Alan L. Cox1, Sandhya Dwarkadas1, P. Keleher1, Honghui Lu1, Ramakrishnan Rajamony1, Willy Zwaenepoel1 •
Rice University1
1 Apr 1994
TL;DR: The results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases.
Abstract: We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on a SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approach, but synchronization remains a bottleneck.
Journal Article•10.1142/S0129626494000235•
Toward automatic distribution

Paul Feautrier
01 Sep 1994-Parallel Processing Letters
TL;DR: The resulting structure satisfies the “owner computes” rule and is reminiscent of two-level distribution schemes, like HPF’s and directives, or the CM-2 virtual processor system.
Abstract: This paper considers the problem of distributing data and code among the processors of a distributed memory supercomputer. Provided that the source program is amenable to detailed dataflow analysis, one may determine a placement function by an incremental analogue of Gaussian elimination. Such a function completely characterizes the distribution by giving the identity of the virtual processor on which each elementary calculation is done. One has then to “realize” the virtual processors on the PE. The resulting structure satisfies the “owner computes” rule and is reminiscent of two-level distribution schemes, like HPF’s and directives, or the CM-2 virtual processor system.
Journal Article•10.1109/38.291531•
Communication costs for parallel volume-rendering algorithms

Ulrich Neumann1•
University of North Carolina at Chapel Hill1
01 Jul 1994-IEEE Computer Graphics and Applications
TL;DR: The article enumerates and classifies parallel volume-rendering algorithms suitable for multicomputers with distributed memory and a communication network, and determines the communication costs for classes of parallel algorithms by considering their inherent communication requirements.
Abstract: The computational expense of volume rendering motivates the development of parallel implementations on multicomputers. Parallelism achieves higher frame rates, which provide more natural viewing control and enhanced comprehension of 3D structure. Although many parallel implementations exist, we have no framework to compare their relative merits independent of host hardware. The article attempts to establish that framework by enumerating and classifying parallel volume-rendering algorithms suitable for multicomputers with distributed memory and a communication network. It determines the communication costs for classes of parallel algorithms by considering their inherent communication requirements.
Book Chapter•10.1007/BFB0025891•
Cid: A Parallel, Shared-Memory C for Distributed-Memory Machines

Rishiyur S. Nikhil
8 Aug 1994
TL;DR: Cid is a parallel, “shared-memory” superset of C for distributed-memory machines that uses available C compilers and packet-transport primitives, and links with existing libraries.
Abstract: Cid is a parallel, “shared-memory” superset of C for distributed-memory machines. A major objective is to keep the entry cost low. For users, the language should be easily comprehensible to a C programmer. For implementors, it should run on standard hardware (including workstation farms); it should not require major new compilation techniques (which may not even be widely applicable); and it should be compatible with existing code, run-time systems and tools. Cid is implemented with a simple pre-processor and a library, uses available C compilers and packet-transport primitives, and links with existing libraries.
Proceedings Article•10.1145/191995.192026•
Exploring the design space for a shared-cache multiprocessor

Basem A. Nayfeh1, Kunle Olukotun1•
Stanford University1
1 Apr 1994
TL;DR: This paper investigates the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors, and shows that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system.
Abstract: In the near future, semiconductor technology will allow the integration of multiple processors on a chip or multichip-module (MCM). In this paper we investigate the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors. We study the performance of a cluster-based multiprocessor architecture in which processors within a cluster are tightly coupled via a shared cluster cache for various processor-cache configurations. Our results show that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system, without increasing the number of invalidations. Combining these results with cost estimates for shared cluster cache implementations leads to two conclusions: 1) For a four cluster multiprocessor with single chip clusters, two processors per cluster with a smaller cache provides higher performance and better cost/performance than a single processor with a larger cache and 2) this four cluster configuration can be scaled linearly in performance by adding processors to each cluster using MCM packaging techniques.
Journal Article•10.1006/JPDC.1994.1108•
Scalability issues affecting the design of a dense linear algebra library

Jack Dongarra1,2, Robert A. van de Geijn, David W. Walker2•
University of Tennessee1, Oak Ridge National Laboratory2
01 Sep 1994-Journal of Parallel and Distributed Computing
TL;DR: This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers, and shows that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor.
Book•
Advanced Topics in Dataflow Computing and Multithreading

Lubomir Bic, Guang R. Gao, Jean-Luc Gaudiot
1 Aug 1994
TL;DR: Examines recent advances in design, modeling, and implementation of dataflow and multithreaded computers and introduces the reader to dataflow concepts that show how functional programming ideas can be harnessed to exploit the power of parallel computing.
Abstract: From the Publisher: Examines recent advances in design, modeling, and implementation of dataflow and multithreaded computers. The text contains reports concerning many of the world's leading projects engaged in the continuing evolution and application of dataflow concepts. It covers the broad range of dataflow principles in program representation - from language design to processor architecture - and compiler optimization techniques. The book includes papers on massively parallel distributed memory and multithreaded architecture design, synchronization and pipelined design, and superpipelined data-driven VLSI processors. Other sections discuss stream data types, the development of well-structured software, and parallelization of dataflow programs. It also details an analytical model for the behavior of dataflow graphs, compares a centralized work distribution scheme with a distributed scheme, and presents a comprehensive approach to understanding workload management schemes. Altogether, the text introduces the reader to dataflow concepts that show how functional programming ideas can be harnessed to exploit the power of parallel computing.
Patent•
Multiprocessor system with distributed memory

John Joseph Coleman1, Ronald Gerald Coleman1, Owen Keith Monroe1, Robert Frederick Stucke1, Elizabeth Anne Vanderbeck1, Stephen E. Bello1, John R. Hattersley1, Kien A. Hua1, David Raymond Pruett1, Gerald Franklin Rollo1 •
IBM1
8 Nov 1994
TL;DR: Describes a parallel computer system consisting of a plurality of high level processors joined together using a cross-point or cross-bar switch, in which an adapter between each processor and the switch performs the protocol processing needed to drive the switch, transfer pages, and schedule transmissions between the processors.
Abstract: A parallel computer system is disclosed comprising a plurality of high level processors joined together using a cross-point or cross-bar switch. The system includes an adapter between each processor and the switch. Protocol processing to drive the switch, transfer pages and schedule transmissions between the processors is performed by the adapter. The protocol uses the notion of typed or tagged buffer management that allows a client to bind the semantics of a message being sent or received. These semantics specify behaviors in the protocol when message packets depart or when they arrive.
Journal Article•10.1159/000154205•
Parallelization of general-linkage analysis problems.

Sandhya Dwarkadas1, Alejandro A. Schäffer, Robert W. Cottingham, Alan L. Cox, P. Keleher, Willy Zwaenepoel •
Rice University1
01 May 1994-Human Heredity
TL;DR: Describes a parallel implementation of a genetic-linkage analysis program that achieves good speed improvement, even for analyses on a single pedigree and with a single starting recombination fraction vector.
Abstract: We describe a parallel implementation of a genetic-linkage analysis program that achieves good speed improvement, even for analyses on a single pedigree and with a single starting recombination fraction vector. Our parallel implementation has been run on three different platforms: an Ethernet network of workstations, a higher-bandwidth asynchronous transfer mode (ATM) network of workstations, and a shared-memory multiprocessor. The same program, written in a shared-memory programming style, is used on all platforms. On the workstation networks, the hardware does not provide shared memory, so the program executes on a distributed shared memory system that implements shared memory in software. These three platforms represent different points on the price/performance scale. Ethernet networks are cheap and omnipresent, ATM networks are an emerging technology that offers higher bandwidth, and shared-memory multiprocessors offer the best performance because communication is implemented entirely by hardware. On 8 processors and for the longer runs, we achieve speedups between 3.5 and 5 on the Ethernet network and between 4.8 and 6 on the ATM network. On the shared-memory multiprocessor, we achieve speedups in the 5.5-6.5 range for all runs.
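The speedup ranges quoted in this abstract can be read as parallel efficiencies (speedup divided by processor count). The sketch below is illustrative only: the function and variable names are hypothetical, and it assumes 8 processors for all three platforms, although the abstract states the processor count explicitly only for the two workstation networks.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Return speedup divided by processor count (1.0 would be ideal)."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Speedup ranges (low, high) as quoted in the abstract.
REPORTED_SPEEDUPS = {
    "Ethernet network": (3.5, 5.0),
    "ATM network": (4.8, 6.0),
    "shared-memory multiprocessor": (5.5, 6.5),
}

# Assumption: 8 processors on every platform (stated only for the networks).
PROCESSORS = 8

efficiency_ranges = {
    platform: (
        parallel_efficiency(low, PROCESSORS),
        parallel_efficiency(high, PROCESSORS),
    )
    for platform, (low, high) in REPORTED_SPEEDUPS.items()
}

for platform, (low, high) in efficiency_ranges.items():
    print(f"{platform}: efficiency between {low:.3f} and {high:.3f}")
```

Under that assumption, the ordering of the efficiency ranges mirrors the ordering the abstract gives for the three platforms, with the Ethernet network lowest and the shared-memory multiprocessor highest.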
Proceedings Article•10.1145/191995.192019•
Evaluating the memory overhead required for COMA architectures

Truman Joe1, John L. Hennessy1•
Stanford University1
1 Apr 1994
TL;DR: Simulation data shows that the frequency of data reshuffling is sensitive to the allocation policy and associativity of the memory but is relatively unaffected by the block size chosen, and that data replication in the attraction memory is important for good performance, but most gains can be achieved through replication in the processor caches.
Abstract: Cache only memory architectures (COMA) have an inherent memory overhead due to the organization of main memory as a large cache called an attraction memory. This overhead consists of memory left unallocated for performance reasons as well as additional physical memory required due to the cache organization of memory. In this work, we examine the effect of data reshuffling and data replication on the memory overhead. Data reshuffling occurs when space needs to be allocated to store a remote memory line in the local memory. Data that is reshuffled is sent between memories via replacement messages. A simple mathematical model predicts the frequency of data reshuffling as a function of the attraction memory parameters. Simulation data shows that the frequency of data reshuffling is sensitive to the allocation policy and associativity of the memory but is relatively unaffected by the block size chosen. The simulation data also shows that data replication in the attraction memory is important for good performance, but most gains can be achieved through replication in the processor caches.
Patent•
Run-time dynamically adaptive computer process for facilitating communication between computer programs

Daniel P. Schiavone1•
Ball Corporation1
4 Nov 1994
TL;DR: Presents a dynamic interface between two dissimilar software programs that must communicate with each other, whether running on one or a plurality of computers, providing bi-directional, non-intrusive data manipulation and communications between software programs on a distributed computing platform or across platforms on a distributed network.
Abstract: The present invention provides a dynamic interface between two dissimilar software programs that must communicate with each other, whether running on one or a plurality of computers. The invention can provide bi-directional, non-intrusive data manipulation and communications between software programs on a distributed computing platform or across platforms on a distributed network. The invention includes user-defined template files, a user-defined equality file, first and second blocks of shared memory, a master interface, and a slave interface. The template files define the output and input data of their respective programs and map the output and input data to blocks of memory. The equality file equates the input data and output data of one program with the output data and input data, respectively, of the other computer program. The master interface takes data from the master side block of memory, reconfigures the data based on the contents of the equality file to match the input data requirements of the second computer program, and sends the reconfigured data to the slave interface to be loaded into the slave side block of shared memory. The second computer program accesses the reconfigured data from the slave side of shared memory.
Proceedings Article•10.1109/IPPS.1994.288261•
Processor mapping techniques toward efficient data redistribution

E. T. Kalns1, Lionel M. Ni1•
Michigan State University1
1 Apr 1994
TL;DR: This paper presents a technique for data-processor mapping, applicable to data redistribution, that minimizes the total amount of data that must be communicated among processors.
Abstract: Run-time data redistribution can affect algorithm performance in distributed-memory machines. Redistribution of data can be performed between algorithm phases when a different data decomposition is expected to deliver increased performance for a subsequent phase of computation. Additionally, data redistribution can occur at subprogram boundaries. Redistribution, however, represents increased program overhead as algorithm computation is necessarily discontinued while data are exchanged among processor memories. In this paper, we present a technique for data-processor mapping, applicable to data redistribution, that minimizes the total amount of data that must be communicated among processors. The mapping technique is architecture-independent and represents our initial work toward achieving efficient redistribution in distributed-memory machines.
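To illustrate why the choice of processor mapping matters for redistribution, here is a minimal sketch. It is not the authors' technique: it brute-forces every assignment of logical processors to physical processors for a tiny 1-D array and keeps the one that leaves the most elements in place. All function names, the array size, and the BLOCK to CYCLIC(2) scenario are assumptions chosen for illustration, and the array length is assumed to be divisible by the processor count.

```python
from itertools import permutations


def block_owner(i: int, n: int, p: int) -> int:
    """Owner of element i under a BLOCK distribution (assumes p divides n)."""
    return i // (n // p)


def block_cyclic_owner(i: int, b: int, p: int) -> int:
    """Logical owner of element i under a CYCLIC(b) distribution."""
    return (i // b) % p


def overlap_matrix(n: int, p: int, b: int) -> list[list[int]]:
    """c[phys][logical] counts elements held by physical processor `phys`
    before redistribution and wanted by logical processor `logical` after."""
    c = [[0] * p for _ in range(p)]
    for i in range(n):
        c[block_owner(i, n, p)][block_cyclic_owner(i, b, p)] += 1
    return c


def best_mapping(c: list[list[int]]) -> tuple[tuple[int, ...], int]:
    """Exhaustively search logical-to-physical mappings (small p only).

    perm[q] is the physical processor that hosts logical processor q.
    Returns the mapping that keeps the most elements local, with that count.
    """
    p = len(c)
    best_perm, best_kept = None, -1
    for perm in permutations(range(p)):
        kept = sum(c[perm[q]][q] for q in range(p))
        if kept > best_kept:
            best_perm, best_kept = perm, kept
    return best_perm, best_kept


N, P, B = 16, 4, 2  # hypothetical sizes
c = overlap_matrix(N, P, B)
identity_kept = sum(c[q][q] for q in range(P))
mapping, kept = best_mapping(c)
print("elements moved with identity mapping:", N - identity_kept)
print("elements moved with best mapping:", N - kept, "using", mapping)
```

In this toy case the identity mapping moves 12 of 16 elements, while the best mapping moves 8. Exhaustive search is factorial in the processor count, so a realistic implementation would need a more efficient assignment method.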
Proceedings Article•10.1109/HICSS.1994.323149•
The S3.mp scalable shared memory multiprocessor

Andreas Nowatzyk1, Gunes Aybay, Michael C. Browne, Edmund J. Kelly, D. Lee, Michael W. Parkin •
Sun Microsystems1
1 Jan 1994
TL;DR: S3.mp is a research project demonstrating a low overhead, high throughput communication system based on cache coherent distributed shared memory (DSM); it uses distributed directories and point-to-point messages sent over a packet switched interconnect fabric to achieve scalability over a wide range of configurations.
Abstract: S3.mp (Sun's Scalable Shared memory MultiProcessor) is a research project to demonstrate a low overhead, high throughput communication system that is based on cache coherent distributed shared memory (DSM). S3.mp uses distributed directories and point-to-point messages that are sent over a packet switched interconnect fabric to achieve scalability over a wide range of configurations. S3.mp uses a new CMOS serial link technology that achieves transmission rates >1 Gbit/sec and that is directly integrated into a packet router chip. Unlike other DSM systems, S3.mp can be spatially distributed over a local area via fiber optic links. This capability allows S3.mp to interconnect clusters of workstations to form multiprocessor workgroups that efficiently share memory, processors and I/O devices. Multichip module technology, the integrated arbitrary topology router, fast serial links, and a DSM system that is integrated into the memory controller allow compact, massively parallel S3.mp systems.
Proceedings Article•10.1145/181014.181081•
Experiences with parallel N-body simulation

Pangfeng Liu1, Sandeep N. Bhatt2•
Rutgers University1, Telcordia Technologies2
1 Aug 1994
TL;DR: This paper describes the authors' experiences developing high-performance code for astrophysical N-body simulations, using a technique for implicitly representing a dynamic global tree across multiple processors that substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures.
Abstract: This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational work-load among processors, these issues pose interesting challenges and tradeoffs for high-performance implementation. Our implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally justified. The code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM-5. The overall performance on instances with 10 million bodies is typically over 30% of the peak machine rate. Preliminary timings compare favorably with other approaches.