Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 1991
Journal Article•10.1016/0956-0521(91)90014-V•
Partitioning of unstructured problems for parallel processing

Horst D. Simon1•
Ames Research Center1
01 Jan 1991-Computing Systems in Engineering
TL;DR: Numerical comparisons on large-scale two- and three-dimensional problems demonstrate the superiority of the new spectral bisection algorithm.

884 citations

Proceedings Article•10.1145/121132.121159•
Implementation and performance of Munin

John B. Carter1, John K. Bennett1, Willy Zwaenepoel1•
Rice University1
1 Sep 1991
TL;DR: This work evaluates the implementation of Munin and describes the execution of two Munin programs that achieve performance within ten percent of message passing implementations of the same programs.
Abstract: Munin is a distributed shared memory (DSM) system that allows shared memory parallel programs to be executed efficiently on distributed memory multiprocessors. Munin is unique among existing DSM systems in its use of multiple consistency protocols and in its use of release consistency. In Munin, shared program variables are annotated with their expected access pattern, and these annotations are then used by the runtime system to choose a consistency protocol best suited to that access pattern. Release consistency allows Munin to mask network latency and reduce the number of messages required to keep memory consistent. Munin's multiprotocol release consistency is implemented in software using a delayed update queue that buffers and merges pending outgoing writes. A sixteen-processor prototype of Munin is currently operational. We evaluate its implementation and describe the execution of two Munin programs that achieve performance within ten percent of message passing implementations of the same programs. Munin achieves this level of performance with only minor annotations to the shared memory programs.

772 citations
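The Munin abstract above mentions a delayed update queue that buffers and merges pending outgoing writes until a release. A minimal sketch of that idea follows. It is an illustration only, not Munin's code: the class name, the `send` callback, and the choice to merge by keeping the latest value per address are all assumptions.

```python
class DelayedUpdateQueue:
    """Toy model of a delayed update queue (illustrative, not Munin's code)."""

    def __init__(self, send):
        self._pending = {}   # address -> latest buffered value
        self._send = send    # hypothetical callback that transmits one batch

    def write(self, address, value):
        # Buffer the write locally. A later write to the same address
        # overwrites the earlier one, so the two merge into one update.
        self._pending[address] = value

    def release(self):
        # At a release (for example an unlock), flush all buffered
        # writes as a single outgoing message.
        if self._pending:
            self._send(dict(self._pending))
            self._pending.clear()


messages = []
queue = DelayedUpdateQueue(messages.append)
queue.write(0x10, 1)
queue.write(0x14, 2)
queue.write(0x10, 3)   # merged with the first write to 0x10
queue.release()
queue.release()        # nothing pending, so no message is sent
```

In this sketch three writes produce one message carrying two updates, which is the kind of message reduction the abstract attributes to release consistency.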

Journal Article•10.1145/103162.103164•
Paradigms for process interaction in distributed programs

Gregory R. Andrews1•
University of Arizona1
01 Mar 1991-ACM Computing Surveys
TL;DR: Several paradigms (examples or models) for process interaction in distributed computations are described and illustrated by solving problems including parallel sorting, file servers, computing the topology of a network, distributed termination detection, replicated databases, and parallel adaptive quadrature.
Abstract: Distributed computations are concurrent programs in which processes communicate by message passing. Such programs typically execute on network architectures such as networks of workstations or distributed memory parallel machines (i.e., multicomputers such as hypercubes). Several paradigms (examples or models) for process interaction in distributed computations are described. These include networks of filters, clients and servers, heartbeat algorithms, probe/echo algorithms, broadcast algorithms, token-passing algorithms, decentralized servers, and bags of tasks. These paradigms are applicable to numerous practical problems. They are illustrated by solving problems, including parallel sorting, file servers, computing the topology of a network, distributed termination detection, replicated databases, and parallel adaptive quadrature. Solutions to all problems are derived in a step-wise fashion from a general specification of the problem to a concrete solution. The derivations illustrate techniques for developing distributed algorithms.

270 citations

Proceedings Article•10.1145/109625.109637•
Scalable reader-writer synchronization for shared-memory multiprocessors

John Mellor-Crummey, Michael L. Scott
1 Apr 1991
TL;DR: Reader-writer locks that similarly exploit locality to achieve scalability are presented, with variants for reader preference, writer preference, and reader-writer fairness.
Abstract: Reader-writer synchronization relaxes the constraints of mutual exclusion to permit more than one process to inspect a shared object concurrently, as long as none of them changes its value. On uniprocessors, mutual exclusion and reader-writer locks are typically designed to de-schedule blocked processes; however, on shared-memory multiprocessors it is often advantageous to have processes busy wait. Unfortunately, implementations of busy-wait locks on shared-memory multiprocessors typically cause memory and network contention that degrades performance. Several researchers have shown how to implement scalable mutual exclusion locks that exploit locality in the memory hierarchies of shared-memory multiprocessors to eliminate contention for memory and for the processor-memory interconnect. In this paper we present reader-writer locks that similarly exploit locality to achieve scalability, with variants for reader preference, writer preference, and reader-writer fairness. Performance results on a BBN TC2000 multiprocessor demonstrate that our algorithms provide low latency and excellent scalability.

174 citations

Proceedings Article•10.1145/106972.106999•
Synchronization without contention

John Mellor-Crummey, Michael L. Scott
1 Apr 1991
TL;DR: Fast, simple algorithms for contention-free mutual exclusion, reader-writer control, and barrier synchronization are presented, based on widely available fetch-and-Φ instructions, that exploit local access to shared memory to avoid contention.
Abstract: Conventional wisdom holds that contention due to busy-wait synchronization is a major obstacle to scalability and acceptable performance in large shared-memory multiprocessors. We argue the contrary, and present fast, simple algorithms for contention-free mutual exclusion, reader-writer control, and barrier synchronization. These algorithms, based on widely available fetch-and-Φ instructions, exploit local access to shared memory to avoid contention. We compare our algorithms to previous approaches in both qualitative and quantitative terms, presenting their performance on the Sequent Symmetry and BBN Butterfly multiprocessors. Our results highlight the importance of local access to shared memory, provide a case against the construction of so-called "dance hall" machines, and suggest that special-purpose hardware support for synchronization is unlikely to be cost effective on machines with sequentially consistent memory.

172 citations

Journal Article•10.1016/0022-1031(91)90017-Z•
Illusory correlation in a simulated exemplar-based memory

Eliot R. Smith1•
Purdue University1
01 Mar 1991-Journal of Experimental Social Psychology
TL;DR: In this paper, a quantitative model of long-term memory, implemented as a computer simulation, reproduces several known properties of the illusory correlation effect, such as distinctiveness-biased encoding.

146 citations

Journal Article•10.1145/118544.118546•
Experimental comparison of memory management policies for NUMA multiprocessors

Richard P. LaRowe, Carla Schlatter Ellis1•
Duke University1
01 Nov 1991-ACM Transactions on Computer Systems
TL;DR: The results show that there are memory management policies implemented in the system that can improve the performance of programs written using the simpler uniform memory access (UMA) programming model, and there appears to be no single policy that can be considered the best over a set of test applications.
Abstract: Non-uniformity of memory access is an almost inevitable feature of memory architecture in shared memory multiprocessor designs that can scale to large numbers of processors. One implication of NUMA architectures is that the placement and movement of code and data become crucial to performance. As memory architectures become more complex and the nonuniformity becomes less well hidden, systems software must assume a larger role in providing memory management support for the programmer. This paper investigates the role of the operating system. We take an experimental approach to evaluating a wide range of memory management policies. The target NUMA environment is BBN's GP-1000 multiprocessor. Extensive local modifications have been made to the memory management subsystem of BBN's nX operating system to support multiple policy implementations. Policy comparisons are based on the measured performance of real parallel applications. Our results show that there are memory management policies implemented in our system that can improve the performance of programs written using the simpler uniform memory access (UMA) programming model. While achieving the level of performance of a highly tuned NUMA program is still a difficult problem, some examples come close. There appears to be no single policy that can be considered the best over our set of test applications. Investigations into the contributions made by individual policy features toward overall behavior of the workload provide some insight into the design of a set of effective policies.

123 citations

Patent•
Consistency protocols for shared memory multiprocessors

Pradeep S. Sindhu1, Cesar B. Douady1•
Xerox1
29 Nov 1991
TL;DR: A consistency protocol for shared memory multiprocessors is described that allows the caches to store multiple copies of read/write data at identical physical addresses for use by the respective processors.
Abstract: A shared memory multiprocessor having a packet switched bus, together with write back caches for connecting individual processors to that bus, employs a consistency protocol that permits the caches to store multiple copies of read/write data at identical physical addresses for use as needed by the respective processors. The protocol causes the hardware to automatically and transparently maintain the consistency of this data. To that end, the caches detect when a datum becomes shared by monitoring the traffic on the bus, thereby enabling them to broadcast an updating write on the bus whenever their respective processors issue a write to a shared address. If desired, this protocol may be extended to include an advisory invalidate for reducing the amount of address sharing that occurs, thereby enhancing the efficiency of the protocol. The protocol maintains a consistent view of memory for the processors, while permitting I/O devices to have direct access to the memory system.

104 citations

Proceedings Article•10.1109/ICDCS.1991.148677•
Implementing and programming causal distributed shared memory

Mustaque Ahamad1, Phillip W. Hutto1, Ranjit John1•
Georgia Institute of Technology1
20 May 1991
TL;DR: A simple owner protocol for implementing a causal distributed shared memory (DSM) is presented, and it is argued that this implementation is more efficient than comparable coherent DSM implementations.
Abstract: A simple owner protocol for implementing a causal distributed shared memory (DSM) is presented, and it is argued that this implementation is more efficient than comparable coherent DSM implementations. Moreover, it is shown that writing programs for causal memory is no more difficult than writing programs for atomic shared memory.

103 citations

Book•
Parallel Processing for Computer Graphics

Stuart Green
28 Aug 1991
TL;DR: Following a description of a multiprocessor solution for ray tracing and identification of the goals of the resulting system and its key components, Green identifies the areas of difficulty in programming distributed memory multiprocessors with the intention of simplifying applications programming on these complex machines.
Abstract: Computer graphics has emerged as one of the most important and appropriate uses of parallel computers and supercomputers. Green provides background information on both computer graphics and parallel computers and then goes on to describe how a distributed memory multiprocessor can be used to effectively implement the ray tracing algorithm. He emphasizes a general-purpose approach so that the techniques developed can be applied to other applications. Ray tracing is a technique for generating realistic images that has high computational complexity, making it well suited for implementation on parallel computers. Following a description of a multiprocessor solution for ray tracing and identification of the goals of the resulting system and its key components, Green identifies the areas of difficulty in programming distributed memory multiprocessors with the intention of simplifying applications programming on these complex machines. Contents: Introduction. Introduction to Computer Graphics. Parallel Processing and Computer Graphics. Design of a Multiprocessor Solution for Ray Tracing. Task Management. Dynamic Data Management. Static Data Management. Conclusions.

97 citations

Patent•
Scheduling system for multiprocessor operating system

Mark A. Borcherding1, Steve Schneider1, J. Charles Kuhlman1•
Texas Instruments1
30 Aug 1991
TL;DR: In this article, a multiprocessor, multitasking digital processing system (10), having a host processor (11) and a number of digital signal processors (15), is described.
Abstract: A multiprocessor, multitasking digital processing system (10), having a host processor (11) and a number of digital signal processors (15). Tasks are downloaded from the host processor (11) to the digital signal processors (15), and are scheduled under the control of a scheduler program residing on the signal processors. Each task provides the scheduler with its processing characteristics, so that resources can be efficiently allocated according to the memory and processing requirements of the task.
Proceedings Article•10.1109/CMPEUR.1991.257369•
Load balancing in distributed memory multiprocessors

Arndt Bode1•
Information Technology University1
13 May 1991
TL;DR: It is argued that multiuser distributed memory multiprocessors with dynamic mapping of the application onto the hardware structure are needed to make available the advantages of this type of architecture to a wider user community.
Abstract: It is argued that multiuser distributed memory multiprocessors with dynamic mapping of the application onto the hardware structure are needed to make available the advantages of this type of architecture to a wider user community. It is shown, based on an abstract model, that such architectures may be used efficiently. It is also shown that future developments in interconnection hardware will allow the fulfillment of the assumptions made in the model. Since a dynamic load balancing procedure will be one of the most important components in the systems software, the elements of its implementation are discussed and first results based on a testbed implementation are presented.
Patent•
Memory management method for coupled memory multiprocessor systems

David A. Orbits, Kenneth D. Abramson, H. Bruce Butts
20 Mar 1991
TL;DR: In this paper, a method of managing the memory of a CM multiprocessor computer system is described, where the data and stack pages of a process are transferred to the coupled memory region of the CPU module to which the process is assigned, when the pages are called for by the process.
Abstract: A method of managing the memory of a CM multiprocessor computer system is disclosed. A CM multiprocessor computer system includes: a plurality of CPU modules 11a . . . 11n to which processes are assigned; one or more optional global memories 13a . . . 13n; a storage medium 15a, 15b . . . 15n; and a global interconnect 12. Each of the CPU modules 11a . . . 11n includes a processor 21 and a coupled memory 23 accessible by the local processor without using the global interconnect 12. Processors have access to remote coupled memory regions via the global interconnect 12. Memory is managed by transferring, from said storage medium, the data and stack pages of a process to be run to the coupled memory region of the CPU module to which the process is assigned, when the pages are called for by the process. Other pages are transferred to global memory, if available. At prescribed intervals, the free memory of each coupled memory region and global memory is evaluated to determine if it is below a threshold. If below the threshold, a predetermined number of pages of the memory region are scanned. Infrequently used pages are placed on the end of a list of pages that can be replaced with pages stored in the storage medium. Pages associated with processes that are terminating are placed at the head of the list of replacement pages.
Proceedings Article•10.1145/107971.107987•
Analysis of task migration in shared-memory multiprocessor scheduling

Mark S. Squillante, Randolph Nelson
2 Apr 1991
TL;DR: The potential for significant improvements in system performance and the potential for unstable behavior under migratory scheduling policies are illustrated, and optimal policy thresholds are provided that yield the best performance and avoid this form of processor thrashing.
Abstract: In shared-memory multiprocessor systems it may be more efficient to schedule a task on one processor than on another. Due to the inevitability of idle processors in these environments, there exists an important tradeoff between keeping the workload balanced and scheduling tasks where they run most efficiently. The purpose of an adaptive task migration policy is to determine the appropriate balance between the extremes of this load sharing tradeoff. We make the observation that there are considerable differences between this load sharing problem in distributed and shared-memory multiprocessor systems, and we formulate a queueing theoretic model of task migration to study the problem. A detailed mathematical analysis of the model is developed, which includes the effects of increased contention for system resources induced by the task migration policy. Our objective is to provide a better understanding of task migration in shared-memory multiprocessor environments. In particular, we illustrate the potential for significant improvements in system performance, and we show that even when migration costs are large it may still be beneficial to migrate waiting tasks to idle processors. We further demonstrate the potential for unstable behavior under migratory scheduling policies, and we provide optimal policy thresholds that yield the best performance and avoid this form of processor thrashing.
Proceedings Article•10.1145/115952.115964•
Race-free interconnection networks and multiprocessor consistency

Anders Landin1, Erik Hagersten1, Seif Haridi1•
Swedish Institute of Computer Science1
1 Apr 1991
TL;DR: New conditions for sequential consistency are presented which show that sequential consistency can be maintained if all accesses in a multiprocessor can be ordered in an acyclic graph, and the requirements for maintaining processor consistency in race-free networks are investigated.
Abstract: Modern shared-memory multiprocessors require complex interconnection networks to provide sufficient communication bandwidth between processors. They also rely on advanced memory systems that allow multiple memory operations to be made in parallel. It is expensive to maintain a high consistency level in a machine based on a general network, but for special interconnection topologies, some of these costs can be reduced. We define and study one class of interconnection networks, race-free networks. New conditions for sequential consistency are presented which show that sequential consistency can be maintained if all accesses in a multiprocessor can be ordered in an acyclic graph. We show that this can be done in race-free networks without the need for a transaction to be globally performed before the next transaction can be issued. We also investigate what is required to maintain processor consistency in race-free networks. In a race-free network which maintains processor consistency, writes may be pipelined, and reads may bypass writes. The proposed methods reduce the latencies associated with processor write-misses to shared data.
Journal Article•10.1002/CPE.4330030303•
Execution time support for adaptive scientific algorithms on distributed

Harry Berryman1, Joel H. Saltz1, Jeffrey S. Scroggs1•
Langley Research Center1
01 Jun 1991-Concurrency and Computation: Practice and Experience
TL;DR: The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines.
Abstract: Optimizations are considered that are required for efficient execution of code segments that consist of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives an appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communications patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
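The abstract above describes computing on a global index set while the data lives in per-processor arrays, with the communication pattern derived at run time. The sketch below illustrates that idea in plain Python. It is not the PARTI API: the function names and the block distribution are assumptions chosen for illustration, and the "processors" are simulated as a list of local arrays.

```python
def block_owner(global_index, n_global, n_procs):
    """Map a global index to (owner, local_index) under a block distribution."""
    block = -(-n_global // n_procs)  # ceiling division: elements per processor
    return global_index // block, global_index % block


def distribute(values, n_procs):
    """Split a global array into per-processor local arrays (block distribution)."""
    block = -(-len(values) // n_procs)
    return [values[p * block:(p + 1) * block] for p in range(n_procs)]


def build_schedule(wanted, n_global, n_procs):
    """Group requested global indices by owner; this stands in for the
    communication pattern that is derived at run time."""
    schedule = {}
    for g in wanted:
        owner, local = block_owner(g, n_global, n_procs)
        schedule.setdefault(owner, []).append(local)
    return schedule


def gather(local_arrays, n_global, wanted):
    """Fetch values by global index from the simulated local arrays."""
    n_procs = len(local_arrays)
    result = []
    for g in wanted:
        owner, local = block_owner(g, n_global, n_procs)
        result.append(local_arrays[owner][local])
    return result


values = list(range(10, 20))           # global array of 10 elements
local_arrays = distribute(values, 3)   # blocks of 4, 4 and 2 elements
wanted = [0, 5, 9, 4]                  # irregular access by global index
schedule = build_schedule(wanted, len(values), 3)
gathered = gather(local_arrays, len(values), wanted)
```

In a real distributed memory setting each entry of `schedule` would correspond to one request message to the owning processor; here the lookup is done directly to keep the sketch self-contained.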
Proceedings Article•10.5555/949607.949616•
Fast rotation of volume data on data parallel architectures

Peter Schröder, James B. Salem
22 Oct 1991
TL;DR: The problem of rotating the volume in regard to the communication overhead associated with finely distributed memory is analyzed and an earlier technique (shear decomposition) is extended to 3D, and it is shown how this can be mapped onto a data-parallel architecture using only grid communication during the resampling associated with the rotation.
Abstract: Data parallel computer architectures hold great promise for high performance computing. Volume visualization (raytracing) is an application that can greatly benefit from these architectures. We describe an algorithm for rendering of orthographic views of volume data on such architectures. In particular the problem of rotating the volume in regard to the communication overhead associated with finely distributed memory is analyzed. We extend an earlier technique (shear decomposition) to 3D and show how this can be mapped onto a data parallel architecture using only grid communication during the resampling associated with the rotation. The rendering uses efficient parallel computation constructs that allow us to use sophisticated shading models and still maintain high speed throughout. This algorithm has been implemented on the Connection Machine® parallel supercomputer and is used in an interactive volume rendering application, with multiple frames per second performance.
Patent•
Adaptive memory management method for coupled memory multiprocessor systems

Kenneth D. Abramson, David A. Orbits, H. Bruce Butts
20 Mar 1991
TL;DR: In this paper, an adaptive memory management method for coupled memory multiprocessor computer systems is described, where the most referenced data and stack pages are placed in the coupled memory of the processor to which a specific process is assigned and lesser referenced pages are located in global memory or coupled memory regions of other processors.
Abstract: An adaptive memory management method for coupled memory multiprocessor computer systems is disclosed. In a coupled memory multiprocessor system all the data and stack pages of processes assigned to individual multiprocessors are, preferably, located in a memory region coupled to the assigned processor. When this becomes impossible, some data and stack pages are assigned to global memory or memory regions coupled to other processors. The present invention is a method of making certain that the most referenced data and stack pages are located in the coupled memory of the processor to which a specific process is assigned and lesser referenced pages are located in global memory or the coupled memory region of other processors. This result is accomplished by sampling the memory references made by the processors of the computer system and causing the most recently referenced pages in each coupled memory region to be maintained at the head of an active page list. References to remote data and stack pages are stored in a remote page hash table. Remote pages are pages stored in global memory or in coupled memory other than the coupled memory of the processor to which the process owning the pages is assigned. Any remote data and stack pages referenced more frequently than pages stored in a processor's coupled memory region are transferred to the processor's coupled memory region. If a processor's coupled memory region is tight, pages are transferred from the processor's coupled memory region to global memory or to the coupled memory region of another processor.
Journal Article•10.1016/0010-4655(91)90021-C•
Molecular-dynamics simulations on distributed memory machines

Steven Y. Liem1, David Brown1, Julian H. R. Clarke1•
University of Manchester1
01 Dec 1991-Computer Physics Communications
TL;DR: An efficient method of performing large scale molecular dynamics simulations on a distributed memory parallel computer which avoids redundant calculation of forces and incurs relatively low communication overheads is developed.
Patent•
Synchronous multiprocessor using tasks directly proportional in size to the individual processors rates

Syed V. Ahamed1, Victor B. Lawrence1•
Alcatel-Lucent1
7 Mar 1991
TL;DR: In this paper, the authors propose a multiprocessor architecture having advantages of both synchronous and asynchronous architectures, where each processor operates at a different rate (a), so that each processor processes a data unit (316) in a different amount of time.
Abstract: Multiprocessor architecture having advantages of both synchronous and asynchronous architectures. The multiprocessor (FIG. 10) comprises processors (300) operating in parallel and synchronously. Each processor operates at a different rate (a), so that each processor processes a data unit (316) in a different amount of time. An input distribution function (317) receives an input stream of data and distributes it to individual processors for processing, in amounts directly proportional to the operating rates of the individual processors, so that each processor processes all of the data distributed to it in the same amount of time as the other processors. Input data buffers (301) connected to processors operate synchronously with the connected processors, receiving and storing the distributed data and inputting it to the connected processors at rates synchronized with the processors' operating rates. Output data buffers (301) operating synchronously with the processors receive and store the processed data, and output it through an output gating function (318) into an output data stream at times and at rates also synchronized with the processors' operating rates.
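The distribution rule in the abstract above can be checked with a little arithmetic: if processor i runs at rate r_i and receives a share of the input proportional to r_i, every processor needs the same time, namely the total input divided by the sum of the rates. The sketch below works this through with exact fractions. The function name and the sample rates are assumptions for illustration and do not come from the patent.

```python
from fractions import Fraction


def proportional_shares(total_units, rates):
    """Split total_units among processors in proportion to their rates.

    Returns (shares, times), where times[i] = shares[i] / rates[i].
    """
    total_rate = sum(rates)
    shares = [Fraction(total_units * r, total_rate) for r in rates]
    times = [share / r for share, r in zip(shares, rates)]
    return shares, times


# Three hypothetical processors running at rates 1, 2 and 3 units per tick.
shares, times = proportional_shares(120, [1, 2, 3])
```

With these sample rates the shares are 20, 40 and 60 units, and each processor finishes in 120 / (1 + 2 + 3) = 20 ticks, which is the equal-completion-time property the abstract describes.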
Journal Article•10.1109/2.67191•
Address tracing for parallel machines

Craig B. Stunkel1, B. Janssens2, W.K. Fuchs2•
IBM1, University of Illinois at Urbana–Champaign2
01 Jan 1991-IEEE Computer
TL;DR: Recently implemented parallel system address-tracing methods based on several metrics are surveyed and the issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted.
Abstract: Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted. Five general categories of address-trace collection methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to shared memory and distributed memory multiprocessors are examined separately.
Journal Article•10.1109/59.141743•
Parallel implementation of a power system dynamic simulation methodology using the conjugate gradient method

I.C. Decker1, Djalma M. Falcão1, Eugenius Kaszkurewicz1•
Federal University of Rio de Janeiro1
07 May 1991-IEEE Transactions on Power Systems
TL;DR: The authors present results of tests with a parallel implementation of a power system dynamic simulation methodology for transient stability analysis in a parallel computer, showing considerable reduction in the simulation time.
Abstract: The authors present results of tests with a parallel implementation of a power system dynamic simulation methodology for transient stability analysis in a parallel computer. The test system is a planned configuration of the interconnected Brazilian South-Southeastern power system with 616 buses, 995 lines, and 88 generators. The parallel machine used in the computer simulation is a distributed memory multiprocessor arranged in a hypercube topology architecture. The nodes are based on the Inmos T800 processors with 4 Mbytes of local memory. The simulation methodology is based on the interlaced alternating implicit integration scheme in which the network equations are re-ordered such that the network admittance matrix appears in the block bordered diagonal form and is then solved by a combined application of the LU factorization and the conjugate gradient method. The results obtained show considerable reduction in the simulation time.
Patent•
Consistent packet switched memory bus for shared memory multiprocessors

Pradeep Sindhu1, Jean-Marc Frailong1, Jean A. Gastinel1•
Xerox1
29 Nov 1991
TL;DR: A packet switched bus for transferring data between a plurality of processors, I/O devices, cache memories and main memory employs a bus protocol which permits multiple copies of data to be updated under the control of different processors while still ensuring that all processors and all I/O devices have access to consistent values for all data at all times.
Abstract: A shared memory multiprocessor having a packet switched bus for transferring data between a plurality of processors, I/O devices, cache memories and main memory employs a bus protocol which permits multiple copies of data to be updated under the control of different processors while still ensuring that all processors and all I/O devices have access to consistent values for all data at all times.
A Fast Mach Network IPC Implementation

Joseph S. Barrera1•
Carnegie Mellon University1
1 Jan 1991
TL;DR: This paper describes an implementation of network Mach IPC optimized for clusters of processors connected by a fast network, such as workstations connected by an Ethernet or processors in a non-shared memory multiprocessor.
Abstract: This paper describes an implementation of network Mach IPC optimized for clusters of processors connected by a fast network, such as workstations connected by an Ethernet or processors in a non-shared memory multiprocessor. This work contrasts with earlier work, such as the netmsg server, which has emphasized connectivity (by using robust and widely available protocols such as TCP/IP) and configurability (with an entirely user-state implementation) at the expense of performance. The issues addressed by this work are support for low latency delivery of small and large messages, support for port capabilities and reference counting, and integration with the existing local Mach IPC implementation. Low latency for small messages requires careful buffer and control flow management; this work is compared with other fast RPC work described in the literature. Low latency for large messages, particularly for faster networks, requires an avoidance of copying, which can be achieved through virtual memory support; the modifications that were necessary to make Mach's virtual memory support inexpensive enough to be useful for this purpose are described. The distributed implementation of port capabilities, port reference counts, and port migration is discussed, and compared with that in the netmsg server. Finally, performance data is presented to quantify the speedup achieved with the described implementation.
Proceedings Article•10.1109/PACRIM.1991.160669•
An architecture of highly parallel computer AP 1000

Hiroaki Ishihata1, Takeshi Horie1, Satoshi Inano1, Toshiyuki Shimizu1, Sadayuki Kato1 •
Fujitsu1
9 May 1991
TL;DR: To reduce the overall communication latency, a message controller and a new routing scheme on the T-net have been developed to attain low-latency, high-throughput communication.
Abstract: A highly parallel computer with distributed memory called the AP1000 has been developed. The system consists of 64 to 1024 processing elements and three independent networks called the torus network (T-net), broadcast network (B-net), and synchronization network (S-net). The design goal for the AP1000 is to attain low-latency, high-throughput communication. To reduce the overall communication latency, a message controller and a new routing scheme on the T-net have been developed. The design concepts, architecture, and some results from performance tests for the AP1000 are presented.
Patent•
Multiprocessor system having shared memory divided into a plurality of banks with access queues corresponding to each bank

Makoto Hanawa1, Tadahiko Nishimukai2, Osamu Nishii1, Suzuki Makoto2•
Renesas Electronics1, Hitachi2
29 Oct 1991
TL;DR: A multiprocessor system has an address bus, a data bus, first and second processors, four access queues, and a shared memory divided into four banks, so that successive requests to one bank are buffered rather than stalling the bus.
Abstract: A multiprocessor system of the present invention has an address bus, a data bus, first and second processors, four access queues, and a shared memory divided into four banks. The four access queues are constituted by first-in first-out memories for buffering a plurality of access-request addresses transmitted through the address bus. Even if continuous access requests are addressed to one bank of the shared memory, a succeeding access request need not wait for a previous access request to be finished. Accordingly, the throughput of the system can be improved greatly.
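The queue-per-bank idea in this abstract can be sketched as a toy cycle-level model. This is only an illustration under stated assumptions: the class and function names (`BankedMemory`, `request`, `tick`, `run`), the choice of low-order address bits to select a bank, and the fixed three-cycle bank latency are all hypothetical and are not taken from the patent.

```python
from collections import deque


class BankedMemory:
    """Toy model: shared memory split into banks, one FIFO access queue per bank."""

    def __init__(self, num_banks=4, bank_latency=3):
        self.num_banks = num_banks
        self.bank_latency = bank_latency      # cycles per access (assumed value)
        self.queues = [deque() for _ in range(num_banks)]
        self.busy = [0] * num_banks           # cycles left for the access in flight
        self.completed = []                   # (cycle, address) in completion order
        self.cycle = 0

    def bank_of(self, address):
        # Assumed mapping: low-order address bits select the bank.
        return address % self.num_banks

    def request(self, address):
        # The requester only enqueues the address; it does not wait for the bank.
        self.queues[self.bank_of(address)].append(address)

    def tick(self):
        """Advance one cycle; each bank serves its own queue independently."""
        self.cycle += 1
        for b in range(self.num_banks):
            if self.busy[b] == 0 and self.queues[b]:
                self.busy[b] = self.bank_latency   # start the access at the queue head
            if self.busy[b] > 0:
                self.busy[b] -= 1
                if self.busy[b] == 0:
                    self.completed.append((self.cycle, self.queues[b].popleft()))


def run(addresses, **kwargs):
    """Issue all requests up front, then tick until every queue has drained."""
    mem = BankedMemory(**kwargs)
    for a in addresses:
        mem.request(a)
    while any(mem.queues):
        mem.tick()
    return mem.cycle, mem.completed
```

In this model, `run([0, 1, 2, 3])` spreads the requests over all four banks, while `run([0, 4, 8, 12])` sends them all to bank 0. In both cases `request` returns immediately, which is the property the abstract emphasizes; the queue only determines when each access is eventually served.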
Journal Article•10.1016/0743-7315(91)90039-C•
Optimal matrix transposition and bit reversal on hypercubes: all-to-all personalized communication

Alan Edelman1•
University of California, Berkeley1
02 Feb 1991-Journal of Parallel and Distributed Computing
TL;DR: Describes an optimal algorithm for the communication pattern in which the bits of the node address are exchanged with those of the local address, a pattern that typically arises in both matrix transposition and bit reversal for the fast Fourier transform.
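The communication pattern itself (not the paper's optimal hypercube routing, which is not reproduced here) can be sketched in a few lines. The sketch assumes each element is addressed by a d-bit node number and a d-bit local index, so that its global index is `node * 2**d + local`; the function names are hypothetical.

```python
def transpose_destination(node, local):
    """Transposition: the node bits and the local bits simply swap roles."""
    return local, node


def reverse_bits(x, width):
    """Reverse the lowest `width` bits of x."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def bit_reversal_destination(node, local, d):
    """Destination (node, local) of the element at global index node*2^d + local."""
    g = (node << d) | local
    rg = reverse_bits(g, 2 * d)          # reverse the full 2d-bit global index
    return rg >> d, rg & ((1 << d) - 1)


def is_all_to_all_personalized(d):
    """Check that every node sends exactly one element to every node."""
    n = 1 << d
    mappings = (transpose_destination,
                lambda p, l: bit_reversal_destination(p, l, d))
    for dest in mappings:
        for node in range(n):
            targets = sorted(dest(node, local)[0] for local in range(n))
            if targets != list(range(n)):
                return False
    return True
```

Under these assumptions, bit reversal sends element `(node, local)` to `(reverse_bits(local, d), reverse_bits(node, d))`, which is again an exchange of node bits with local bits, combined with a purely local reordering.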
Proceedings Article•10.5555/304238.304325•
Parallel simulation of timed Petri-nets

David M. Nicol1, Subhas C. Roy1•
College of William & Mary1
1 Dec 1991
TL;DR: A parallelized Petri-net simulator which has been implemented on an Intel iPSC/2 distributed memory multiprocessor is discussed, and a graphics-based front-end for the simulator, used to build timed Petri-net models, is described.
Abstract: The authors consider the problem of using a parallel computer to execute discrete-event simulation of timed Petri-nets. They first develop synchronization and simulation algorithms for this task, and discuss a parallelized Petri-net simulator which has been implemented on an Intel iPSC/2 distributed memory multiprocessor. A graphics-based front-end for the simulator, used to build timed Petri-net models, is described. Empirical studies of the simulator's performance on a variety of timed Petri-net models are described.
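For readers unfamiliar with the underlying task, a minimal sequential discrete-event simulation of a timed Petri-net might look like the sketch below. It is not the authors' parallel algorithm. The firing semantics are an assumption: an enabled transition removes its input tokens immediately and deposits its output tokens after a fixed delay. The sketch also assumes every transition has at least one input place and a positive delay; the function name `simulate` and the data layout are hypothetical.

```python
import heapq


def simulate(places, transitions, until):
    """Sequential discrete-event simulation of a timed Petri-net.

    places:      dict mapping place name -> initial token count
    transitions: list of (name, input_places, output_places, delay)
    Returns (marking at time `until`, log of (time, transition) completions).
    """
    marking = dict(places)
    events = []          # heap of (completion_time, sequence_number, transition_index)
    log = []
    seq = 0              # tie-breaker so simultaneous events pop in start order
    now = 0.0

    def start_enabled():
        # Start every transition that is enabled, repeating until none is left.
        nonlocal seq
        progress = True
        while progress:
            progress = False
            for i, (_, inputs, _, delay) in enumerate(transitions):
                if all(marking[p] >= inputs.count(p) for p in inputs):
                    for p in inputs:
                        marking[p] -= 1            # consume input tokens now
                    heapq.heappush(events, (now + delay, seq, i))
                    seq += 1
                    progress = True

    start_enabled()
    while events and events[0][0] <= until:
        now, _, i = heapq.heappop(events)
        name, _, outputs, _ = transitions[i]
        for p in outputs:
            marking[p] += 1                        # deposit output tokens on completion
        log.append((now, name))
        start_enabled()
    return marking, log


# A two-place cycle: one token alternates between "idle" and "busy".
net_places = {"idle": 1, "busy": 0}
net_transitions = [
    ("start", ["idle"], ["busy"], 2.0),
    ("finish", ["busy"], ["idle"], 3.0),
]
final_marking, firing_log = simulate(net_places, net_transitions, until=10.0)
```

A parallel version has to partition the places and transitions across processors and synchronize their event clocks, which is the problem the paper addresses.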
Proceedings Article•10.1109/DMCC.1991.633082•
Automatic Data Partitioning on Distributed Memory Multiprocessors

Manish Gupta1, Prithviraj Banerjee•
University of Illinois at Urbana–Champaign1
28 Apr 1991
TL;DR: A novel approach to the problem of automatic data partitioning by introducing the notion of constraints on data distribution, and showing how a parallelizing compiler can infer those constraints by looking at the data reference patterns in the source code of the program.
Abstract: An important problem facing numerous research projects on parallelizing compilers for distributed memory machines is that of automatically determining a suitable data partitioning scheme for a program. Most of the current projects leave this tedious problem almost entirely to the user. In this paper, we present a novel approach to the problem of automatic data partitioning. We introduce the notion of constraints on data distribution, and show how a parallelizing compiler can infer those constraints by looking at the data reference patterns in the source code of the program. We show how these constraints may be combined by the compiler to obtain a complete and consistent picture of the data distribution scheme, one that offers good performance in terms of the overall execution time. We illustrate our approach on an example routine, TRED2, from the EISPACK library, to demonstrate its applicability to real programs. Finally, we discuss briefly some other approaches that have recently been proposed for this problem, and argue why ours seems to be more general and powerful.
Patent•
Multiprocessor system having distributed shared resources and dynamic and selective global data replication

Maria Costa, Carlo Leonardi
25 Feb 1991
TL;DR: Global data pages are replicated only in the CPUs that actually need them; in the other CPUs the replication is directed to a predetermined trash page of local memory, which minimizes both the memory space required for replication and the system bus traffic for the replication and global data writes needed to assure global data consistency.
Abstract: Multiprocessor system having distributed shared resources and dynamic and selective global data replication in which a plurality of processors communicate with each other through a system bus. Each CPU is provided with a local memory storing data used locally and global data shareable by a plurality of processes operative in differing CPUs and therefore replicated in the local memory of each CPU. The global data replication is performed, at page level, only when a global data page is effectively needed by a plurality of processes operative in differing CPUs and in those CPUs where the page is needed, the replication in the other CPUs being performed in a predetermined trash page of the local memory so that memory space required for replication is minimized, as is traffic on the system bus for global data replication and global data writes required for assuring global data consistency.
