Showing papers on "Distributed memory" published in 1989
Journal Article•10.1145/75104.75105•
Memory coherence in shared virtual memory systems

[...]

Kai Li1, Paul Hudak2•
Princeton University1, Yale University2
01 Nov 1989-ACM Transactions on Computer Systems
TL;DR: Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely coupled multiprocessor.
Abstract: The memory coherence problem in designing and implementing a shared virtual memory on loosely coupled multiprocessors is studied in depth. Two classes of algorithms, centralized and distributed, for solving the problem are presented. A prototype shared virtual memory on an Apollo ring based on these algorithms has been implemented. Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely coupled multiprocessor.

1,319 citations

Journal Article•10.1016/0743-7315(89)90021-X•
Dynamic load balancing for distributed memory multiprocessors

[...]

George Cybenko1•
Tufts University1
01 Oct 1989-Journal of Parallel and Distributed Computing
TL;DR: This paper completely analyzes the hypercube network by explicitly computing the eigenstructure of its node adjacency matrix and shows that a diffusion approach to load balancing on a hypercube multiprocessor is inferior to another approach, called the dimension exchange method.

1,133 citations
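
The dimension exchange method named in the entry above can be illustrated with a small sketch. It assumes the commonly described form of the method: in step k, every node averages its load with the neighbor whose index differs in bit k. The function name `dimension_exchange` and the use of real-valued, infinitely divisible loads are assumptions made for illustration; this is not the paper's own code.

```python
def dimension_exchange(loads):
    """Run one sweep of dimension exchange on a hypercube.

    loads[i] is the work held by node i; len(loads) must be a power of two.
    Assumes loads are divisible (floats), which is an idealization.
    """
    n = len(loads)
    dim = n.bit_length() - 1
    if n == 0 or (1 << dim) != n:
        raise ValueError("number of nodes must be a power of two")
    loads = [float(x) for x in loads]
    for k in range(dim):
        for node in range(n):
            partner = node ^ (1 << k)  # neighbor across dimension k
            if node < partner:  # handle each pair once
                avg = (loads[node] + loads[partner]) / 2.0
                loads[node] = loads[partner] = avg
    return loads


if __name__ == "__main__":
    # All work starts on node 0 of a 3-dimensional hypercube (8 nodes).
    print(dimension_exchange([8, 0, 0, 0, 0, 0, 0, 0]))
```

With divisible loads, one sweep over all dimensions leaves every node holding the global average, which is the intuition behind comparing the method with neighbor-by-neighbor diffusion.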

Proceedings Article•
Time Warp on a Shared Memory Multiprocessor.

[...]

Richard M. Fujimoto
1 Jan 1989
TL;DR: The direct cancellation mechanism is proposed that eliminates the need for anti-messages and provides an efficient mechanism for canceling erroneous computations and thereby eliminates many of the overheads associated with conventional, message-based implementations of Time Warp.
Abstract: A variation of the Time Warp parallel discrete event simulation mechanism is presented that is optimized for execution on a shared memory multiprocessor. In particular, the direct cancellation mechanism is proposed that eliminates the need for anti-messages and provides an efficient mechanism for canceling erroneous computations. The mechanism thereby eliminates many of the overheads associated with conventional, message-based implementations of Time Warp. More importantly, this mechanism effects rapid repairs of the parallel computation when an error is discovered. Initial performance measurements of an implementation of the mechanism executing on a BBN Butterfly multiprocessor are presented. These measurements indicate that the mechanism achieves good performance, particularly for many workloads where conservative clock synchronization algorithms perform poorly. Speedups as high as 56.8 using 64 processors were obtained. However, our studies also indicate that state saving overheads represent a significant stumbling block for many parallel simulations using Time Warp.

225 citations

Proceedings Article•10.1145/74850.74855•
The implementation of a coherent memory abstraction on a NUMA multiprocessor: experiences with platinum

[...]

Alan L. Cox1, Robert J. Fowler1•
University of Rochester1
1 Nov 1989
TL;DR: The design and implementation of the PLATINUM memory management system is described, emphasizing the coherent memory, and the cost and performance of a set of application programs running on PLATINUM are measured.
Abstract: PLATINUM is an operating system kernel with a novel memory management system for Non-Uniform Memory Access (NUMA) multiprocessor architectures. This memory management system implements a coherent memory abstraction. Coherent memory is uniformly accessible from all processors in the system. When used by applications coded with appropriate programming styles it appears to be nearly as fast as local physical memory and it reduces memory contention. Coherent memory makes programming NUMA multiprocessors easier for the user while attaining a level of performance comparable with hand-tuned programs. This paper describes the design and implementation of the PLATINUM memory management system, emphasizing the coherent memory. We measure the cost of basic operations implementing the coherent memory. We also measure the performance of a set of application programs running on PLATINUM. Finally, we comment on the interaction between architecture and the coherent memory system. PLATINUM currently runs on the BBN Butterfly Plus Multiprocessor.

144 citations

Book Chapter•10.1016/B978-0-12-236530-0.50021-2•
Models of working memory

[...]

Stuart K. Card
01 May 1989-Computer-Aided Engineering
TL;DR: This chapter describes the models of working memory, a functional part of human memory that accomplishes the temporary holding and manipulation of information during the performance of a range of cognitive tasks such as comprehension, learning, and reasoning.
Abstract: This chapter describes the models of working memory. Working memory refers to a functional part of human memory that accomplishes the temporary holding and manipulation of information during the performance of a range of cognitive tasks such as comprehension, learning, and reasoning. At least three different functions performed by working memory, as expressed by models of cognitive processing, can be described in computational terms. Working memory functions as a place to hold operands or things to be operated on by the operations of cognitive processing, a cache to hold in a rapidly accessible state recently input or used information, and a buffer between processes that happen at incommensurate rates. In addition to its functions, working memory has also been characterized from two other points of view: (1) time and (2) structure. It is also distinguished from very short-term memory, lasting for only a fraction of a second. While a number of partial models of working memory exist, they do not yet embrace all the phenomena related to it in a computational framework.

109 citations

Journal Article•10.1109/38.41466•
Exploiting coherence for multiprocessor ray tracing

[...]

S.A. Green1, D.J. Paddon1•
University of Bristol1
01 Nov 1989-IEEE Computer Graphics and Applications
TL;DR: A form of coherence in the ray-tracing algorithm is identified that can be exploited to develop optimum schemes for data distribution in a multiprocessor system, which gives rise to high processor efficiency for systems with limited distributed memory.
Abstract: The scalability and cost effectiveness of general-purpose distributed-memory multiprocessor systems make them particularly suitable for ray-tracing applications. However, the limited memory available to each processor in such a system requires schemes to distribute the model database among the processors. The authors identify a form of coherence in the ray-tracing algorithm that can be exploited to develop optimum schemes for data distribution in a multiprocessor system. This in turn gives rise to high processor efficiency for systems with limited distributed memory.

101 citations

Patent•
Synchronized parallel processing with shared memory

[...]

Robin Chang
7 Nov 1989
TL;DR: In this paper, a high speed computer that permits the partitioning of a single computer program into smaller concurrent processes running in different parallel processors is presented, where the program execution time is divided into synchronous phases, each of which may require a shared memory to be configured in a distinct way.
Abstract: A high speed computer that permits the partitioning of a single computer program into smaller concurrent processes running in different parallel processors. The program execution time is divided into synchronous phases, each of which may require a shared memory to be configured in a distinct way. At the end of each execution phase, the processors are resynchronized such that the composite system will be in a known state at a known point in time. The computer makes efficient use of hardware such that n processors can solve a problem almost n times as fast as a single processor.

101 citations

Journal Article•10.1137/0910025•
A New Method for Solving Triangular Systems on Distributed Memory Message-Passing Multiprocessors

[...]

Guangye Li1, Thomas F. Coleman1•
Cornell University1
01 Mar 1989-SIAM Journal on Scientific and Statistical Computing
TL;DR: A new efficient parallel triangular solver is described, based on the previous method of Li and Coleman [1986], but is considerably more efficient when $\frac{n}{p}$ is relatively modest, where $p$ is the number of processors and $n$ is the problem dimension.
Abstract: Efficient triangular solvers for use on message passing multiprocessors are required, in several contexts, under the assumption that the matrix is distributed by columns (or rows) in a wrap fashion. In this paper we describe a new efficient parallel triangular solver for this problem. This new algorithm is based on the previous method of Li and Coleman [1986] but is considerably more efficient when $\frac{n}{p}$ is relatively modest, where $p$ is the number of processors and $n$ is the problem dimension. A useful theoretical analysis is provided as well as extensive numerical results obtained on an Intel iPSC with $p \leq 128$.

75 citations
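
The column-wrap distribution mentioned in the entry above can be made concrete with a short sketch. It assumes the usual meaning of "wrap": column j is held by processor j mod p. The sketch runs a plain column-oriented forward substitution sequentially and records which processor would own each column. It is not the new algorithm of Li and Coleman, whose details are not given in the abstract; the names `wrap_owner` and `column_forward_substitution` are hypothetical.

```python
def wrap_owner(j, p):
    """Processor that holds column j under a wrap (cyclic) distribution."""
    return j % p


def column_forward_substitution(L, b, p):
    """Solve L x = b for lower-triangular L, one column at a time.

    Returns the solution and, for each step, the processor that would
    own the active column if the columns were wrapped over p processors.
    """
    n = len(b)
    x = [float(v) for v in b]
    owners = []
    for j in range(n):
        owners.append(wrap_owner(j, p))
        x[j] /= L[j][j]  # x[j] is now final
        for i in range(j + 1, n):
            x[i] -= L[i][j] * x[j]  # update the remaining right-hand side
    return x, owners


if __name__ == "__main__":
    L = [[2, 0, 0],
         [1, 1, 0],
         [4, 2, 2]]
    b = [2, 3, 12]
    print(column_forward_substitution(L, b, p=2))
```

In a message-passing setting the owner of column j would compute `x[j]` and the updates for that column, so consecutive steps fall on different processors; that hand-off is where the communication cost arises when $\frac{n}{p}$ is small.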

Proceedings Article•
Coherence of Distributed Shared Memory: Unifying Synchronization and Data Transfer.

[...]

Umakishore Ramachandran, Mustaque Ahamad, M. Yousef Amin Khalidi
1 Jan 1989

74 citations

Patent•
Computer system with clock shared between processors executing separate instruction streams

[...]

David Robert Baldwin
21 Mar 1989
TL;DR: In this paper, a control processor and a high-level data-transfer processor are clocked by a shared variable-duration clock and the duration of the clock is adjusted on the fly, to accommodate whichever of the two processors needs the longest cycle time on that particular cycle.
Abstract: A multiprocessor system which includes a control processor and a high-level data-transfer processor. Both of these two processors are clocked by a shared variable-duration clock. The duration of the clock is adjusted on the fly, to accommodate whichever of the two processors needs the longest cycle time on that particular cycle. Thus, the control processor 110 and the data transfer processor 120 are enabled to run synchronously, even though they are concurrently running separate streams of instructions.

74 citations

A parallelizing compiler for distributed memory parallel computers

[...]

P.-S. Tseng
3 Jan 1989
TL;DR: This thesis presents a parallelizing compiler which can automate the program mapping, and generate efficient parallel code which achieves 8-fold speedup on the 10-processor array for small matrices of size 180 x 180.
Abstract: Distributed memory parallel computers provide an attractive approach to high speed computing because their performance can be easily scaled up by increasing the number of processor-memory modules. To use these computers, we have to design parallel algorithms and produce parallel programs. In many cases, parallel algorithm design is a mapping of existing algorithms to parallel architectures. In this thesis, we study such a mapping process and present a parallelizing compiler which can: (1) automate the program mapping, and (2) generate efficient parallel code. There are three key components in our program mapping: data decomposition, loop distribution and data relations. Data decomposition maps data structures to the distributed memory system; loop distribution maps the computation to processors; and data relations determine the interprocessor communication. The compiler applies data flow analysis and data dependence analysis to minimize interprocessor communication overhead and parallelize program execution. Based on these ideas, we have implemented the AL parallelizing compiler for the Warp machine. AL is a generic programming language for the prototype implementation. The target machine, Warp, is a programmable linear systolic array of 10 processors. AL has been successfully used to program many applications on Warp. These applications include matrix computations, image processing, finite element analysis, and partial differential equations. The AL compiler is able to generate efficient parallel code. For example, for the LINPACK routines such as LU decomposition, QR decomposition, and singular value decomposition (SVD), the AL compiler generates parallel code which achieves 8-fold speedup on the 10-processor array for small matrices of size 180 x 180. This thesis makes contributions to the research area of parallelizing compilers by introducing a model for mapping programs to distributed memory parallel computers. This thesis also makes contributions to the research area of parallel programming by introducing an approach to programming distributed memory parallel computers.
Proceedings Article•10.1145/72981.73003•
Shared-memory vs. message-passing in an asynchronous distributed environment

[...]

A. Bar-Noy, Danny Dolev
1 Jun 1989
Proceedings Article•10.1145/74925.74944•
Introducing Memory Into The Switch Elements Of Multiprocessor Interconnection Networks

[...]

Haim E. Mizrahi1, Jean-Loup Baer1, Edward D. Lazowska1, John Zahorjan1•
University of Washington1
1 Apr 1989
TL;DR: The results indicate that the MHN organization can have substantial performance benefits and so should be of increasing interest as the enabling technology becomes available.
Abstract: As VLSI technology continues to improve, circuit area is gradually being replaced by pin restrictions as the limiting factor in design. Thus, it is reasonable to anticipate that on-chip memory will become increasingly inexpensive since it is a simple, regular structure that can easily take advantage of higher densities. In this paper we examine one way in which this trend can be exploited to improve the performance of multistage interconnection networks (MINs). In particular, we consider the performance benefits of placing significant memory in each MIN switch. This memory is used in two ways: to store (the unique copies of) data items and to maintain directories. The data storage function allows data to be placed nearer processors that reference it relatively frequently, at the cost of increased distance to other processors. The directory function allows data items to migrate in reaction to changes in program locality. We call our MIN architecture the Memory Hierarchy Network (MHN). In a preliminary investigation of the merits of this design [8] we examined the performance of MHNs under the simplifying assumption that an unlimited amount of memory was available in each switch. We found that despite the longer switch processing times of the MHN, system performance is improved over simpler, conventional schemes based on caching. In this paper we refine the earlier model to account for practical storage limitations. We study ways to reduce the amount of directory storage required by keeping only partial information regarding the current location of data items. The price paid for this reduction in memory requirement is more complicated (and in some circumstances slower) protocols. We obtain comparative performance estimates in an environment containing a single global memory module and a tree-structured MIN. Our results indicate that the MHN organization can have substantial performance benefits and so should be of increasing interest as the enabling technology becomes available.
Journal Article•10.1109/12.24307•
Database operations in a cube-connected multicomputer system

[...]

Chaitan Baru1, Ophir Frieder•
University of Michigan1
01 Jun 1989-IEEE Transactions on Computers
TL;DR: Strategies for performing database operations in a cube-connected multicomputer system with parallel I/O are presented, which account for the non-uniform distribution of data across parallel paths by incorporating data redistribution steps as part of the overall algorithm.
Abstract: Distributed memory architectures, specifically hypercubes, for parallel database processing are treated. The cube interconnects support efficient data combination for the various database operations, and nonuniform data distributions are handled by dynamically redistributing data utilizing these interconnections. Selection and scalar aggregation operations are easily supported. An algorithm for the join operation is discussed in some detail. The cube is compared with another multicomputer database machine, SM3, and the performance of the join operation in these systems is described. The join performance in a cube is comparable to that in SM3 even when the cube is assumed to have a nonuniform data distribution.
Book Chapter•10.1007/978-1-4613-1537-7_9•
The Cache Coherence Protocol of the Data Diffusion Machine

[...]

Seif Haridi1, Erik Hagersten1•
Swedish Institute of Computer Science1
12 Jun 1989
TL;DR: The Data Diffusion Machine (DDM), a scalable shared memory multiprocessor in which the location of a datum in the machine is completely decoupled from its address, provides an automatic duplication and migration of the data to wherever needed.
Abstract: The Data Diffusion Machine (DDM) is a scalable shared memory multiprocessor in which the location of a datum in the machine is completely decoupled from its address. A data access "snooping" protocol provides an automatic duplication and migration of the data to wherever needed. The protocol also handles data coherence and replacement. The hardware organization consists of a hierarchy of buses and data controllers linking an arbitrary number of processors each having a large set-associative memory. Each data controller has a set-associative directory containing status bits for data under its control. The rest of the system appears to one processor like a shared memory system, which makes the DDM a general architecture. The DDM is scalable in that there may be any number of levels in the hierarchy. The logical topmost bus (or any other bus) can be implemented by an unlimited number of physical buses removing an anticipated bottleneck.
Proceedings Article•
The Performance Implications of Spin-Waiting Alternatives for Shared-Memory Multiprocessors.

[...]

Thomas Anderson
1 Jan 1989
Proceedings Article•10.1145/74925.74928•
Multi-level shared caching techniques for scalability in VMP-M/C

[...]

David R. Cheriton1, H. A. Goosen1, P. D. Boyle1•
Stanford University1
1 Apr 1989
TL;DR: The VMP-MC design is described, a distributed parallel multi-computer based on the VMP multiprocessor design that is intended to provide a set of building blocks for configuring machines from one to several thousand processors.
Abstract: The problem of building a scalable shared memory multiprocessor can be reduced to that of building a scalable memory hierarchy, assuming interprocessor communication is handled by the memory system. In this paper, we describe the VMP-MC design, a distributed parallel multi-computer based on the VMP multiprocessor design, that is intended to provide a set of building blocks for configuring machines from one to several thousand processors. VMP-MC uses a memory hierarchy based on shared caches, ranging from on-chip caches to board-level caches connected by busses to, at the bottom, a high-speed fiber optic ring. In addition to describing the building block components of this architecture, we identify the key performance issues associated with the design and provide performance evaluation of these issues using trace-driven simulation and measurements from the VMP. This work was sponsored in part by the Defense Advanced Research Projects Agency under Contract N00014-88-K-0619.
An alternative design for a sparse distributed memory

[...]

Louis A. Jaeckel1•
Research Institute for Advanced Computer Science1
1 Jul 1989
TL;DR: A new design for a Sparse Distributed Memory, called the selected-coordinate design, is described, where there are a large number of memory locations, each of which may be activated by many different addresses (binary vectors) in a very large address space.
Abstract: A new design for a Sparse Distributed Memory, called the selected-coordinate design, is described. As in the original design, there are a large number of memory locations, each of which may be activated by many different addresses (binary vectors) in a very large address space. Each memory location is defined by specifying ten selected coordinates (bit positions in the address vectors) and a set of corresponding assigned values, consisting of one bit for each selected coordinate. A memory location is activated by an address if, for all ten of the location's selected coordinates, the corresponding bits in the address vector match the respective assigned value bits, regardless of the other bits in the address vector. Some comparative memory capacity and signal-to-noise ratio estimates for both the new and original designs are given. A few possible hardware embodiments of the new design are described.
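
The activation rule in the abstract above can be written down directly. The following is a minimal sketch, assuming a location is stored as a mapping from selected bit positions to assigned bits. The names `make_location` and `is_activated`, the 1000-bit address length, and the random choice of coordinates are assumptions for illustration and do not come from the source; only the count of ten selected coordinates does.

```python
import random


def make_location(address_bits, k, rng):
    """Pick k selected coordinates and assign one bit to each.

    The result maps bit position -> assigned value.
    """
    coords = rng.sample(range(address_bits), k)
    return {c: rng.randint(0, 1) for c in coords}


def is_activated(location, address):
    """True if the address matches every assigned bit of the location.

    Bits of the address outside the selected coordinates are ignored.
    """
    return all(address[c] == bit for c, bit in location.items())


if __name__ == "__main__":
    rng = random.Random(0)
    loc = make_location(address_bits=1000, k=10, rng=rng)  # 1000 is assumed
    addr = [rng.randint(0, 1) for _ in range(1000)]
    print(is_activated(loc, addr))
```

Because only the selected coordinates are inspected, many different addresses activate the same location, which matches the abstract's description of each location being activated by many addresses in a very large address space.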
Journal Article•10.1016/S0092-8240(89)80067-9•
Context-dependent associations in linear distributed memories*

[...]

Eduardo Mizraji
01 Jan 1989-Bulletin of Mathematical Biology
TL;DR: A method that allows conditioning of the response of a linear distributed memory to a variable context and its capacity for the conditional extraction of features from a complex perceptual input, its capacity to perform quasi-logical operations and the potential importance of the capacity to establish arbitrary contexts are evaluated.
Book Chapter•10.1007/3540512845_37•
Distributed Implementation of Programmed Graph Reduction

[...]

Rita Loogen1, Herbert Kuchen1, Klaus Indermark1, Werner Damm2•
RWTH Aachen University1, University of Oldenburg2
12 Jun 1989
TL;DR: The realization of programmed graph reduction in PAM, a parallel abstract machine with distributed memory, is described and results of the implementation of PAM on an Occam-Transputer system are given.
Abstract: Programmed graph reduction has been shown to be an efficient implementation technique for lazy functional languages on sequential machines. Considering programmed graph reduction as a generalization of conventional environment-based implementations where the activation records are allocated in a graph instead of on a stack, it becomes very easy to use this technique for the execution of functional programs in a parallel machine with distributed memory. We describe in this paper the realization of programmed graph reduction in PAM, a parallel abstract machine with distributed memory. Results of our implementation of PAM on an Occam-Transputer system are given.
Journal Article•10.1016/S0734-189X(89)80040-4•
An architecture independent programming language for low-level vision

[...]

Leonard G. C. Hamey1, Jon A. Webb1, I-Chen Wu•
Carnegie Mellon University1
01 Nov 1989-Graphical Models / Graphical Models and Image Processing / Computer Vision, Graphics, and Image Processing
TL;DR: A specialized programming language, called Apply, is developed, which reduces the problem of writing the algorithm for this class of programs to the task of writing the function to be applied to a window around a single pixel.
Abstract: Low-level vision is particularly amenable to implementation on parallel architectures, which offer an enormous speedup at this level. To take advantage of this, the algorithm must be adapted to the particular parallel architecture. Having to adapt programs in this way poses a significant barrier to the vision programmer, who must learn and practice a different method of parallelization for each different parallel machine. There is also no possibility of portability for programs written for a particular parallel architecture. We have developed a specialized programming language, called Apply, which reduces the problem of writing the algorithm for this class of programs to the task of writing the function to be applied to a window around a single pixel. Apply provides a method for programming these applications which is easy, consistent, and efficient. Apply is programming model specific (it implements the input partitioning model) but is architecture independent. It is possible to implement versions of Apply which run efficiently on a wide variety of computers. We describe implementations of Apply on Warp, various Warp-like architectures, UNIX, and the Hughes HBA and sketch implementations on bit-serial processor arrays and distributed memory machines.
Proceedings Article•10.1145/76738.76825•
Efficient Aggregation Of Multiple LPs In Distributed Memory Parallel Simulations

[...]

David M. Nicol1, C. C. Michael, P. Inouye•
College of William & Mary1
1 Oct 1989
TL;DR: YAWNS, Yet Another Windowing Network Simulator, is described, for dealing with problems of large-scale parallel simulation involving many interacting logical processes in a distributed memory architecture.
Abstract: The state of research in parallel simulation now demands that we experiment with a multitude of simulation models. It is evident that large-scale simulations involving many interacting logical processes should be a focal point of such experimentation, as large-scale simulations will benefit the most from parallelism. This realization raises a number of issues. Large-scale parallel simulations must aggregate many logical processes onto each machine in a distributed memory architecture. This fact creates internal management problems: how does one synchronize in such a setting? How does one efficiently find and manage the simulation workload? If we are to experiment with multiple models, what underlying functions can we extract to program once, and use many times? This paper describes YAWNS, Yet Another Windowing Network Simulator, for dealing with these problems.
Patent•
Memory accessing switch network

[...]

Philip P. Carvey1, William R. Crowther1, Randall D Rettberg1•
BBN Technologies1
29 Nov 1989
TL;DR: In parallel processing computational apparatus, a switching network employing both routing switch elements and concentrator elements efficiently couples bit serial messages from a multiplicity of processors to a multiplicity of memory modules.
Abstract: In parallel processing computational apparatus, a switching network employing both routing switch elements and concentrator elements efficiently couples bit serial messages from a multiplicity of processors to a multiplicity of memory modules. The apparatus operates in a highly synchronous mode in which all processors issue memory requests only at essentially the same predetermined time within a frame interval encompassing a predetermined substantial number of clock periods. The routing switch elements and concentrator elements incorporate circuitry for comparing the addresses of requests which may be blocked at any element with requests which get through and, if the addresses are the same, returning the memory response to all processors seeking the same memory location.
Proceedings Article•
Coherent Shared Memory on a Distributed Memory Machine.

[...]

Roberto Bisiani, Andreas Nowatzyk, Mosur Ravishankar
1 Jan 1989
Journal Article•10.1016/0167-8191(89)90129-4•
Mapping strategies in message-based multiprocessor systems

[...]

Ottmar Krämer, Heinz Mühlenbein
1 Jan 1989
TL;DR: The term robustness is introduced and explained in detail with two examples, the SUPRENUM and the hypercube architecture, and it is shown that a simple mapping strategy (optimal clustering of the processes) gives almost as good results as the optimal mapping.
Abstract: Machines with distributed memory have the mapping problem—assigning processes to processors. In this paper we define the mapping problem as an optimization problem and discuss the question, how far is an optimum solution from an average or random solution. The term robustness is introduced and explained in detail with two examples, the SUPRENUM and the hypercube architecture. For the SUPRENUM architecture we show that a simple mapping strategy (optimal clustering of the processes) gives almost as good results as the optimal mapping. Optimal mapping is more important for the hypercube architecture. For nonhomogeneous networks adaptive routing seems promising.
Proceedings Article•10.1109/ICDCS.1989.37982•
An analysis of distributed shared memory algorithms

[...]

R.E. Kessler1, Miron Livny1•
University of Wisconsin-Madison1
5 Jun 1989
TL;DR: Results obtained in a study of algorithms to implement a distributed-shared memory in a distributed (loosely coupled) environment are described.
Abstract: Results obtained in a study of algorithms to implement a distributed-shared memory in a distributed (loosely coupled) environment are described. Distributed-shared memory is the implementation of shared memory across multiple nodes in a distributed system. This is accomplished using only the private memories of the nodes by controlling access to the pages of the shared memory and transferring data to and from the private memories when necessary. Alternative algorithms are analyzed to implement distributed-shared memory. The algorithms are analyzed and compared over a wide range of conditions. Application characteristics are identified which can be exploited by the algorithms. The conditions under which the algorithms analyzed perform better or worse than the other alternatives are shown. Results are obtained via simulation using a synthetic reference generator.
Patent•
Microprocessor information exchange with updating of messages by asynchronous processors using assigned and/or available buffers in dual port memory

[...]

Charles Joseph Roslund1, Linda L. Santoline1, Albert William Crew1, Gilbert William Remley1, William D. Ghrist1 •
Westinghouse Electric1
13 Nov 1989
TL;DR: In this article, the authors propose a dual-port shared memory system where one of the processors is a controller for a data link channel carrying n messages, with 2n+1 buffers in free shared memory space with 2 buffers assigned to each message at all times and a common buffer serving as the third buffer for all of the messages.
Abstract: Updated images of messages are passed between asynchronous digital processors using dual port shared memory. In the basic form of the invention, three buffers in shared memory are assigned to each message. Where one of the processors is a controller for a data link channel carrying n messages, 2n+1 buffers are provided in free shared memory space with 2 buffers assigned to each message at all times and a common buffer serving as the third buffer for all of the messages. Where linked buffers in local memory of a controller processor receive message updates from a data highway, two buffers in shared memory are assigned to each message and a linked buffer in the controller local memory serves as the third buffer. The buffers containing the message updates are passed between processors by use of a buffer status array in shared memory. A semaphore lock in the array permits only one processor at a time to assign or release buffers.
Learning in a distributed memory model.

[...]

Bennet B. Murdock
1 Jan 1989
Patent•
Multiprocessor system which only allows alternately accessing to shared memory upon receiving read and write request signals

[...]

Akira Jinzaki1•
Fujitsu1
8 Nov 1989
TL;DR: In this paper, a control circuit for controlling access to the local memory by a processor of the same group and another processor of another group depending on the status of the flags is presented.
Abstract: In a multiprocessor system, each local group includes a processor and its local memory, and the local memory is accessed from a processor either in the same group or in another group through a communication medium. The system according to the present invention is provided with a LOCK flag, a SYNC flag and a control circuit for controlling access to the local memory by a processor of the same group and a processor of the other group depending on the status of the flags. The LOCK flag inhibits reading from the memory by the other processor while the same group processor is writing therein. The SYNC flag permits reading of the memory by the other processor and inhibits the same group processor from writing therein. Read requests as well as write requests to the memory from both processors are input to the control circuit, which controls conveyance of the requests to the memory. Thus, the flag setting/resetting is carried out within a group; therefore, synchronization software, which requires much overhead, and control lines that interconnect the processors are not necessary, resulting in reduced network traffic, that is, high-speed multiprocessor operation.
Expressing Complex Parallel Algorithms in DINO

[...]

Matthew Rosing, Robert B. Schnabel, Robert P. Weaver
1 Mar 1989
TL;DR: The use of DINO is described to program a moderately complex, multiple-phase parallel algorithm, the parallel solution of block-bordered systems of linear equations, and some potential improvements to DINO are pointed to.
Abstract: DINO is a programming language consisting of additions to C for expressing parallel numerical programs on distributed memory multiprocessors. Its goal is to incorporate the high-level features of parallel algorithms, such as the mapping of data and procedures to processes, into the language, and have low-level operations such as interprocess communication and process control result implicitly. This paper describes the use of DINO to program a moderately complex, multiple-phase parallel algorithm, the parallel solution of block-bordered systems of linear equations. This example illustrates the suitability of DINO for such computations, but also points to some potential improvements to DINO that could be made.
...
