TL;DR: The optimal task to processor assignment is found by an algorithm based on results in Markov decision theory, which is completely general and applicable to N-processor systems.
Abstract: In a distributed computing system made up of different types of processors, each processor in the system may have different performance and reliability characteristics. In order to take advantage of this diversity of processing power, a modular distributed program should have its modules assigned in such a way that the applicable system performance index, such as execution time or cost, is optimized. This paper describes an algorithm for making an optimal module to processor assignment for a given performance criterion. We first propose a computational model to characterize distributed programs, consisting of tasks and an operational precedence relationship. This model allows us to describe probabilistic branching as well as concurrent execution in a distributed program. The computational model along with a set of seven program descriptors completely specifies a model for dynamic execution of a program on a distributed system. The optimal task to processor assignment is found by an algorithm based on results in Markov decision theory. The algorithm given in this paper is completely general and applicable to N-processor systems.
TL;DR: A memory system designed for parallel array access, based on the use of a prime number of memories and a powerful combination of indexing hardware and data alignment switches, is described.
Abstract: In this paper we describe a memory system designed for parallel array access. The system is based on the use of a prime number of memories and a powerful combination of indexing hardware and data alignment switches. Particular emphasis is placed on the indexing equations and their implementation.
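The benefit of a prime number of memories can be illustrated with a minimal sketch. It assumes the simplest possible placement rule, in which address a lives in module a mod p; the abstract does not state the actual indexing equations, so the rule and the function names below are hypothetical. Under that assumption, a strided access touches distinct modules whenever the stride is not a multiple of the module count, which a prime count makes the common case.

```python
def modules_hit(base, stride, count, num_modules):
    """Module index touched by each element of a strided array access.

    Assumes (hypothetically) that address a is stored in module a % num_modules.
    """
    return [(base + i * stride) % num_modules for i in range(count)]


def is_conflict_free(base, stride, count, num_modules):
    """True when every element of the access falls in a different module."""
    hit = modules_hit(base, stride, count, num_modules)
    return len(set(hit)) == len(hit)


if __name__ == "__main__":
    # Four elements at stride 2: five (prime) modules versus four modules.
    print(modules_hit(0, 2, 4, 5), is_conflict_free(0, 2, 4, 5))
    print(modules_hit(0, 2, 4, 4), is_conflict_free(0, 2, 4, 4))
```

With four modules the stride-2 access keeps returning to the same two modules, while five modules spread the same access across four different ones.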
TL;DR: A pyramidal data processing system comprising a plurality of levels of processor arrays is described, in which the number of processors in an array increases from the level of lowest resolution to the level of highest resolution.
Abstract: A pyramidal data processing system comprises a plurality of levels of processor arrays, in which the number of processors in an array increases from a level of lowest resolution to a level of highest resolution. Each processor in an array is coupled for data transfer to a neighborhood of processors including laterally and diagonally adjacent processors in the same level, a processor in the level of next lowest resolution, and processors in the level of next greatest resolution. A memory is associated with each processor to store values of data elements. A controller and control memory generate control signals to perform in synchrony data transformations on selected data elements associated with each neighborhood of processors.
TL;DR: A multiprocessor arrangement is described in which the individual program functions of a program process are executed on different processors (101). Data shared by different program functions is stored in shared memory, and the programs are stored in local memory (24) of the individual processors.
Abstract: A multiprocessor arrangement in which the individual program functions of a program process are executed on different processors (101). Data shared by different program functions is stored in shared memory, and the programs are stored in local memory (24) of the individual processors. One processor calls for the execution of a program function by another processor by causing the program address and a pointer to the program function context to be loaded into a work queue of the called processor. Input/output modules (34) are treated as processors. Facilities are provided for the transfer of blocks of data over the interconnection bus system. Virtual addresses are translated to physical addresses in one facility common to all processors.
TL;DR: Markovian models are developed for the performance analysis and comparison of several single bus multiprocessor architectures, introducing simplifying assumptions that allow a compact Markovian system description.
Abstract: Markovian models are developed for the performance analysis and comparison of several single bus multiprocessor architectures. Processors are assumed to cooperate in a message passing fashion, and messages are exchanged through common memory areas. Four architectures are considered in this paper which differ in the location of the common memory modules. Contention for shared resources is modeled and the corresponding efficiency loss is studied. Numerical results are obtained for the processing power of each architecture, introducing simplifying assumptions that allow a compact Markovian system description.
TL;DR: A single-chip multiprocessor interface device is described for interfacing between two processors by connection to their bus systems; the device has a random access memory selectively accessible by the processors under the control of an arbitration latch.
Abstract: A single chip multiprocessor interface device for interfacing between two processors by connection to their bus systems, the device having a random access memory selectively accessible by the processors under the control of an arbitration latch. The arbitration latch has a bistable device the state of which determines which processor has access to the memory. The outputs of the bistable device have threshold devices which have threshold levels higher than the signal outputs of the bistable device when it is in a metastable state, so that there is no possibility that both processors could have access to the memory at the same time. Data and address registers for the two processors are selectively connectible to the random access memory through multiplexers controlled by the arbitration latch. Mode control inputs can set the device into a "stand alone" mode, a "master" mode and a "slave" mode; several devices can be used in parallel for bus systems more than one byte wide with one device the master and the others slaves. Control and status registers for each processor input enable the generation of interrupts when certain conditions are met.
TL;DR: This paper investigates strategies for dynamically reconfiguring shared memory multiprocessor systems that are subject to common memory faults and unpredictable processor deaths and presents a general distributed algorithm which enables the processors in such a system to exchange the local information needed to reach a consensus on system reconfiguration.
Abstract: In this paper, we investigate strategies for dynamically reconfiguring shared memory multiprocessor systems that are subject to common memory faults and unpredictable processor deaths. These strategies aim at determining a communication page, i.e., a page of common memory that can be used by a group of processors for storing crucial common resources such as global locks for synchronization and global data structures for voting algorithms. To ensure system reliability, the reconfiguration strategies must be distributed so that each processor independently arrives at exactly the same choice. This type of reconfiguration strategy is currently used in the STAGE operating system on the PLURIBUS multiprocessor [5]. We analyze the weak points of the PLURIBUS algorithm and examine alternative strategies satisfying optimization criteria such as maximization of the number of processors and the number of common memory pages in the reconfigured system. We also present a general distributed algorithm which enables the processors in such a system to exchange the local information that is needed to reach a consensus on system reconfiguration.
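The requirement that each processor independently arrives at exactly the same choice can be illustrated with a minimal sketch. It assumes every processor already holds the same table of which processors can use which page (the result of the information exchange), and it applies a hypothetical selection rule: prefer the page usable by the most processors, breaking ties by the lowest page number. This is neither the PLURIBUS algorithm nor one of the strategies analyzed in the paper.

```python
def choose_communication_page(reachability):
    """Pick a communication page deterministically from a shared view.

    reachability maps a page number to the set of processor ids able to
    use that page. The rule (an assumption for illustration) prefers the
    page usable by the most processors and breaks ties by the lowest page
    number, so the result does not depend on dictionary ordering.
    """
    if not reachability:
        return None
    return min(reachability, key=lambda page: (-len(reachability[page]), page))


if __name__ == "__main__":
    view = {3: {0, 1, 2}, 1: {0, 1}, 7: {0, 1, 2}}
    # Any processor holding this view computes the same answer.
    print(choose_communication_page(view))
```

Because the rule is a pure function of the shared view, no further coordination is needed once the views agree; the difficult part, which the sketch leaves out, is making the views agree in the presence of faults.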
TL;DR: A priority-based task management scheduling algorithm is then defined which uses the optimal schedule of the formal model as a parameter, and its performance is simulated.
Abstract: A multiprocessor architecture is proposed which is based on the Multics concept of having all on-line information processor-addressable. All memory management is done by an intelligent paged virtual memory system, and each processor deals only with those segments relevant to its single executing program. The processors are chosen to have different implementations of a single system-wide instruction set and the problem is to effectively schedule different categories of programs, called task groups, on the dissimilar processors. Average weighted instruction times for each task group on every processor are defined as task/processor suitability measures, and typical values are given for different groups of programs running on IBM 370 models. Through the use of linear programming techniques, an optimal schedule for any such multiprocessor is then defined for the static case where task group loads and task/processor suitability values are known in advance. A priority-based task management scheduling algorithm is then defined which uses the optimal schedule of the formal model as a parameter, and its performance is simulated.
TL;DR: This dissertation presents designs for logical buses constructed from a hierarchy of physical buses that will allow snooping cache protocols to be used without the electrical loading problems that result from attaching all processors to a single bus.
Abstract: Bus and Cache Memory Organizations for Multiprocessors, by Donald Charles Winsor (Chairman: Trevor Mudge). The single shared bus multiprocessor has been the most commercially successful multiprocessor system design up to this time, largely because it permits the implementation of efficient hardware mechanisms to enforce cache consistency. Electrical loading problems and restricted bandwidth of the shared bus have been the most limiting factors in these systems. This dissertation presents designs for logical buses constructed from a hierarchy of physical buses that will allow snooping cache protocols to be used without the electrical loading problems that result from attaching all processors to a single bus. A new bus bandwidth model is developed that considers the effects of electrical loading of the bus as a function of the number of processors, allowing optimal bus configurations to be determined. Trace driven simulations show that the performance estimates obtained from this bus model agree closely with the performance that can be expected when running a realistic multiprogramming workload in which each processor runs an independent task. The model is also used with a parallel program workload to investigate its accuracy when the processors do not operate independently. This is found to produce large errors in the mean service time estimate, but still gives reasonably accurate estimates for the bus utilization. A new system organization consisting essentially of a crossbar network with a cache memory at each crosspoint is proposed to allow systems with more than one memory bus to be constructed. A two-level cache organization is appropriate for this architecture. A small cache may be placed close to each processor, preferably on the CPU chip, to minimize the effective memory access time. A larger cache built from slower, less expensive memory is then placed at each crosspoint to minimize the bus traffic.
By using a combination of the hierarchical bus implementations and the crosspoint cache architecture, it should be feasible to construct shared memory multiprocessor systems with several hundred processors. © Donald Charles Winsor, 1989. All rights reserved.
TL;DR: A data flow architecture with a paged memory system that holds both data flow programs and data structures is presented; the token labeling mechanism is coupled with the memory management system to provide each token with a unique memory location.
Abstract: During the last ten years, data flow has become an exciting research area and several architectures have been proposed and built. They differ mostly in the way they handle data structures and how they provide mechanisms for token labeling or colouring in order to make data flow graphs reentrant. The paper presents a data flow architecture with a paged memory system to hold both data flow programs and data structures. The token labeling mechanism is coupled with the memory management system in order to provide for each token a unique memory location. The instruction format allows instructions with multiple operands and multiple destinations for each result. Data structures are held in memory while pointers to the structures are circulating as tokens. The proposed architecture is able to execute data flow programs at the level of single instructions or at a higher level.
TL;DR: The authors describe the pattern understanding parallel processing system MACSYM, a dedicated multi-processor system for document image understanding that is an asynchronous common bus system with one master processor, a maximum of sixteen slave processors, and a large shared memory managed by a major-minor ring arbiter net.
Abstract: The authors describe the pattern understanding parallel processing system MACSYM, a dedicated multi-processor system for document image understanding. It is an asynchronous common bus system with one master processor, a maximum of sixteen slave processors, and a large shared memory managed by a major-minor ring arbiter net. Special circuits are implemented for the event-driven sequencing control. The Japanese Newspaper Layout Understanding System Express is being developed on the MACSYM. Newspaper articles are automatically extracted and reconstructed for retrieval in a few seconds.
TL;DR: In this article, the authors propose a priority network to reduce mutual interference between processors during access to the main memory in a multiprocessor system having independently addressable memory modules via a common node.
Abstract: To reduce mutual interference between processors during access to the main memory in a multiprocessor system having independently addressable memory modules via a common node, an access request, selected by a priority network, of a processor having the relatively highest priority is immediately checked with the aid of the occupancy information output by the memory modules to see whether it leads to a free memory module. Only if the result of the check is positive are the request parameters and data transmitted. If the result of the check is negative, the node and particularly its priority network are released for requests from other processors.
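The check-before-transmit idea can be sketched in a few lines. The sketch assumes that a lower number means a higher priority and that releasing the node simply lets the next-highest request be checked in the same step; the function name and the data layout are hypothetical and are not taken from the source.

```python
def arbitrate(requests, busy_modules):
    """Grant at most one pending request whose target module is free.

    requests: list of (priority, processor, module) tuples, where a lower
    priority value wins (an assumption for illustration).
    busy_modules: set of module ids currently reporting themselves occupied.
    Returns the granted (processor, module) pair, or None if every
    requested module is busy.
    """
    for _priority, processor, module in sorted(requests):
        if module not in busy_modules:
            # Only now would request parameters and data be transmitted.
            return processor, module
        # Target module is busy: release the node for the next request.
    return None


if __name__ == "__main__":
    pending = [(0, "P0", 2), (1, "P1", 0)]
    print(arbitrate(pending, busy_modules={2}))
    print(arbitrate(pending, busy_modules=set()))
```

In the first call the highest-priority processor targets a busy module, so the node is not held for it and the lower-priority request proceeds instead.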
TL;DR: The design of an operating system intended for dedicated real-time multiprocessor applications is presented and the problems encountered in designing such a system are discussed, together with the solutions which are adopted.
Abstract: By using multiple processors it is possible to increase the computing power and hence the complexity of the task that can be managed by a microprocessor-based system. To be useful, such a system must consist of (a) hardware, (b) development software, which provides program development support, and (c) operating-system software which executes on the multiprocessor and provides run-time support for applications programs. In this paper, we present the design of an operating system intended for dedicated real-time multiprocessor applications. The problems encountered in designing such a system are discussed, together with the solutions which we have adopted. The use of our system for parallel processing is illustrated by two example applications.
TL;DR: In this paper, a multiprocessing three level memory hierarchy implementation is described which uses a "write" flag and a "share" flag per page of information stored in a level three main memory.
Abstract: A multiprocessing three level memory hierarchy implementation is described which uses a "write" flag and a "share" flag per page of information stored in a level three main memory. These two flag bits are utilized to communicate from main memory (4) at level three to private and shared caches (12, 27; 20, 30; 14; 22) at memory levels one and two how a given page of information is to be used. Essentially, pages which can be both written and shared are moved from main memory to the shared level two cache and then to the shared level one cache, with the processors executing from the shared level one cache. All other pages are moved from main memory to the private level two and level one caches of the requesting processor. Thus, a processor executes either from its private or shared level one cache. This allows several processors to share a level three common main memory without encountering cross interrogation overhead. If uniform status within a page cannot be guaranteed at the main memory interface, the shared cache configuration does not interface with main memory but, in parallel, with the private caches at an appropriate intermediate level.
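The routing rule implied by the two flag bits can be written as a small decision function. This is only a sketch of the rule as stated in the abstract; the function name and the string labels are hypothetical.

```python
def cache_path(write_flag, share_flag):
    """Return the (level two, level one) caches a page is moved through.

    Pages that are both writable and shared go to the shared caches;
    every other page goes to the requesting processor's private caches.
    """
    if write_flag and share_flag:
        return ("shared L2", "shared L1")
    return ("private L2", "private L1")


if __name__ == "__main__":
    for write_flag in (False, True):
        for share_flag in (False, True):
            print(write_flag, share_flag, cache_path(write_flag, share_flag))
```

Only one of the four flag combinations uses the shared caches, which is why read-only shared pages can be duplicated in private caches without any cross interrogation.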
TL;DR: The author describes a technique for attaching processor-memory units to a memory and I/O channel in a multiprocessor system to increase the combined system bandwidth.
Abstract: The author describes a technique for attaching processor-memory units to a memory and I/O channel in a multiprocessor system to increase the combined system bandwidth.
TL;DR: This dissertation investigates strategies for dynamically reconfiguring shared memory multiprocessor systems that are subject to common memory faults and unpredictable processor deaths and deals with fault-masking algorithms as applied to the development of network protocols with an underlying communication medium that may reorder, duplicate or lose messages.
Abstract: Depending upon the philosophy used to implement fault-tolerant systems, one can distinguish two classes of algorithms: reconfiguration algorithms and fault masking algorithms. The precise statement and analysis of the problems and the underlying assumptions associated with these classes of algorithms is the subject of this dissertation.
The first part of the thesis investigates strategies for dynamically reconfiguring shared memory multiprocessor systems that are subject to common memory faults and unpredictable processor deaths. These strategies aim at determining a communication page, i.e., a page of common memory that can be used by a group of processors for storing crucial common resources such as global locks for synchronization and global data structures for voting algorithms. To ensure system reliability, the reconfiguration strategies must be distributed so that each processor independently arrives at exactly the same choice. This type of reconfiguration strategy is currently used in the STAGE operating system on the PLURIBUS multiprocessor {24}. We analyze the weak points of the PLURIBUS algorithm and examine alternative strategies satisfying optimization criteria such as maximization of the number of processors and the number of common memory pages in the reconfigured system. We also present a general distributed algorithm which enables the processors in such a system to exchange the local information that is needed to reach a consensus on system reconfiguration.
In the second part of the thesis, we deal with fault-masking algorithms as applied to the development of network protocols with an underlying communication medium that may reorder, duplicate or lose messages. In chapter (3) we present a simple network, whose communication medium is assumed to be reliable, and develop a strategy for the remote submission and processing of requests. We also show how to formally specify and verify the network behavior. In the final chapter we describe a more complex network model where the communication medium is no longer assumed to be reliable. We then show that despite the reordering, duplication or loss of messages, all requests are eventually processed exactly once at the remote site and that responses are received in the right order at their submission site.
TL;DR: In this paper, two hypotheses concerning the way in which short-term memory interacts with another task in a dual task situation are considered, and it is noted that when two tasks are combined, the activity of controlling and organizing performance on both tasks simultaneously may compete with either task for a resource; this resource may be space in a central mechanism or general processing capacity or it may be some task specific resource.
Abstract: Two hypotheses concerning the way in which short-term memory interacts with another task in a dual task situation are considered. It is noted that when two tasks are combined, the activity of controlling and organizing performance on both tasks simultaneously may compete with either task for a resource; this resource may be space in a central mechanism or general processing capacity or it may be some task-specific resource. If a special relationship exists between short-term memory and control, especially if there is an identity relationship between short-term memory and a central controlling mechanism, then short-term memory performance should show a decrement in a dual task situation. Even if short-term memory does not have any particular identity with a controlling mechanism, but both tasks draw on some common resource or resources, then a tradeoff between the two tasks in allocating resources is possible and could be reflected in performance. The persistent concurrence cost in memory performance in these experiments suggests that short-term memory may have a unique status in the information processing system.
TL;DR: The fully reconfigurable multimicroprocessor, under development at the Los Alamos National Laboratory, is an experimental configuration designed specifically as a research tool for implementing and evaluating parallel-processing algorithms on various multiprocessor architectures.
Abstract: The fully reconfigurable multimicroprocessor is an experimental configuration designed specifically as a research tool for implementing and evaluating parallel-processing algorithms on various multiprocessor architectures. Basically, the system is a shared-memory MIMD (multiple instruction-multiple data stream) machine that supports reconfiguration between processor and memory nodes to permit experimentation on architectures sharing common memory, networks of processors with only local memory, etc. This experimental computer system is currently under development within the Computing Division at the Los Alamos National Laboratory.
TL;DR: It is shown that the number of accesses by T obeys a generalized Kraft inequality, and lower bounds are given for the worst case and average number of accesses.
Abstract: A model in which a transmitter T sends a message to a receiver R via shared random-access memory is analyzed. In the model, the random-access memory consists of L individually addressable cells, each of which may be set to a value from a finite alphabet. A message m is sent by writing values into some of the memory cells so that the memory state is consistent with some codeword for m. The model differs from traditional source coding in several respects. The codeword may specify values for a noncontiguous subset of the memory cells and allow the remaining unspecified cells to be filled in by other users as they wish. Also, the transmitter T may attempt to avoid writing a full codeword into memory by first reading some cells to determine the initial memory state partially. Thus, the cells accessed for transmission and the cells specified by a codeword may be distinct, unlike traditional noiseless source coding where the symbols sent and symbols received are identical. Here we analyze the operational characteristics of the transmitter T. It is shown that the number of accesses by T obeys a generalized Kraft inequality. Lower bounds are given for the worst case and average number of accesses.
TL;DR: It is concluded that this system demonstrates the simplicity of a small-scale multiple-microprocessor computer, its applicability and potential ease of application to biomedical signal processing problems.
TL;DR: The effective bandwidth in a multiprocessor with shared memory with N processors and N memory modules is compared using as interconnection networks the crossbar or the multiple-bus.
Abstract: In this paper we compare the effective bandwidth in a multiprocessor with shared memory using as interconnection networks the crossbar or the multiple-bus. We consider a system with N processors and N memory modules, in which the processor requests to the memory modules are independent and uniformly distributed random variables. We consider two cases: in the first the processor makes another request immediately after a memory service, and in the second there is some internal processing time.
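A rough feel for the comparison can be had from a minimal sketch. It assumes that every processor issues a fresh, independent, uniformly distributed request in each memory cycle and that blocked requests are simply discarded; this is a common simplification and not necessarily the model used in the paper. Under that assumption, the expected crossbar bandwidth is N(1 - (1 - 1/N)^N) requests per cycle, and a multiple-bus network with B buses serves at most B of the distinct modules requested. The function names and the bus-limit rule are assumptions for illustration.

```python
import random


def crossbar_bandwidth(n):
    """Expected number of distinct modules requested per cycle by n processors.

    Assumes independent, uniform requests over n modules with blocked
    requests discarded (a simplifying assumption, not the paper's model).
    """
    return n * (1.0 - (1.0 - 1.0 / n) ** n)


def simulate_bandwidth(n, buses, cycles=20000, seed=0):
    """Monte Carlo estimate of requests served per cycle with a bus limit."""
    rng = random.Random(seed)
    served = 0
    for _ in range(cycles):
        # Each of the n processors picks one of the n modules at random.
        distinct = len({rng.randrange(n) for _ in range(n)})
        # At most one request per module and one module per bus is served.
        served += min(distinct, buses)
    return served / cycles


if __name__ == "__main__":
    n = 8
    print("crossbar (closed form):", crossbar_bandwidth(n))
    for buses in (2, 4, 8):
        print(buses, "buses (simulated):", simulate_bandwidth(n, buses))
```

With as many buses as modules the bus limit never binds, so the simulated value approaches the closed-form crossbar figure; fewer buses cap the bandwidth at the bus count.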
TL;DR: A theoretical model is developed that predicts at least order of k^(1/2) speedup with k processors, and measurements of the algorithm's performance on the Arachne distributed operating system are presented.