TL;DR: In this paper, a plurality of multiprocessor systems is arranged in a high speed network to allow any processor in one system to communicate with another processor in another system, and buffer locations are managed so that they can request an adjacent node to stop transmitting packets if the buffer is becoming full from that direction and request resumption of transmission of packets as the buffer empties.
Abstract: A plurality of multiprocessor systems is arranged in a high speed network to allow any processor in one system to communicate with any processor in another system. The network is configured as a multi-node dual bidirectional ring having a multiprocessor system at each node. Packets of information may be passed around the ring in either of two directions and are temporarily stored in buffer memory locations dedicated to a selected destination processor in a selected direction between each successive transfer between neighboring nodes. The buffer locations are managed so that they can request an adjacent node to stop transmitting packets if the buffer is becoming full from that direction and request resumption of transmission of packets as the buffer empties.
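Illustrative sketch (not taken from the patent): the stop/resume buffer management described above can be modelled with two thresholds on a per-direction buffer. The class name `DirectionBuffer`, the `high_mark` and `low_mark` thresholds, and the `"STOP"`/`"RESUME"` signal strings are all assumptions made for this example; the abstract only says that a buffer asks its neighbor to stop as it fills and to resume as it empties.

```python
class DirectionBuffer:
    """Hypothetical buffer for one (destination, direction) pair."""

    def __init__(self, capacity=8, high_mark=6, low_mark=2):
        self.capacity = capacity
        self.high_mark = high_mark  # assumed fill level that triggers a stop request
        self.low_mark = low_mark    # assumed fill level that triggers a resume request
        self.packets = []
        self.stopped = False        # True while the neighbor has been asked to stop

    def receive(self, packet):
        """Store a packet; return "STOP" if the neighbor should pause, else None."""
        if len(self.packets) >= self.capacity:
            raise OverflowError("buffer full")
        self.packets.append(packet)
        if not self.stopped and len(self.packets) >= self.high_mark:
            self.stopped = True
            return "STOP"
        return None

    def forward(self):
        """Remove the oldest packet; return (packet, signal), signal is "RESUME" or None."""
        packet = self.packets.pop(0)
        signal = None
        if self.stopped and len(self.packets) <= self.low_mark:
            self.stopped = False
            signal = "RESUME"
        return packet, signal
```

Using two separate thresholds rather than one avoids sending a stop and a resume request for every single packet when the buffer hovers around a single fill level.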
TL;DR: A universal algorithm that implements this algorithm in models that forbid simultaneous access to the same memory location, using p processors, O ( d log 2 p ) time units, and O ( s + p ) memory space is presented.
TL;DR: In this article, a parallel processing circuit is described for use as the processor/memory in a highly parallel processor, which consists of an instruction decoder that generates tables of outputs in response to instructions received at the decoder and a plurality of processor/memories each of which comprises a read/write memory and a processor for producing an output depending at least in part on data read from the memory and instruction information received at instruction decoding.
Abstract: A parallel processing circuit is disclosed for use as the processor/memory in a highly parallel processor. The circuit comprises an instruction decoder that generates tables of outputs in response to instructions received at the decoder and a plurality of processor/memories each of which comprises a read/write memory and a processor for producing an output depending at least in part on data read from the memory and instruction information received at the instruction decoder. In addition, the circuit provides means for simultaneously addressing at least one cell in each read/write memory to write data thereto or read data therefrom and means for providing to each processor an output table from the decoder, the particular output table depending on instruction information received at the decoder. Further, the processing circuit comprises means for selecting from the output table a particular output depending on data input to the processor. Advantageously, each processor/memory also comprises a flag controller for controlling the reading of a plurality of flags and means for simultaneously addressing each flag controller to read a flag for input into the processor associated therewith. Preferably, each processor is a bit-serial processor with three inputs, two from the read/write memory and one from the flag controller, and two outputs, one to the read/write memory and one to the flag controller; and the decoder and the plurality of processor/memories are formed on a single, integrated circuit chip.
TL;DR: In this article, a multiprocessor system includes a plurality of processors (10-1 to 10-N) which are respectively connected to a memory device (3) and each of which produces a first control signal when executing a test-and-set instruction and a second control signal after executing a sequence of queuing steps.
Abstract: A multiprocessor system includes a plurality of processors (10-1 to 10-N) which are respectively connected to a memory device (3) and each of which produces a first control signal when executing a test-and-set instruction and a second control signal after executing a sequence of queuing steps. The multiprocessor system further has flip-flop circuits (12-1 to 12-N) each of which is set in response to the first control signal from the corresponding one of the processors (10-1 to 10-N) and which are commonly reset in response to a second control signal from any one of the processors (10-1 to 10-N). The processors (10-1 to 10-N) are prevented from executing the test-and-set instruction while the corresponding one of the flip-flop circuits (12-1 to 12-N) is set.
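Illustrative sketch (not taken from the patent): the flip-flop gating described above can be simulated in software. The class `GatedTestAndSet` and its method names are hypothetical. The abstract does not say how the lock word is cleared, so this sketch assumes that finishing the queuing steps also releases the lock; that assumption is marked in the code.

```python
class GatedTestAndSet:
    """Hypothetical software model of per-processor flip-flops gating test-and-set."""

    def __init__(self, n_processors):
        self.lock = 0                              # shared lock word in memory
        self.flip_flops = [False] * n_processors   # one flip-flop per processor

    def test_and_set(self, pid):
        """Return True if the lock was acquired, False if it was already held,
        or None if the instruction was suppressed because the flip-flop is set."""
        if self.flip_flops[pid]:
            return None                  # gated: no memory access is made
        self.flip_flops[pid] = True      # first control signal sets the flip-flop
        old = self.lock
        self.lock = 1
        return old == 0

    def finish_queuing(self, pid):
        """Second control signal from any processor resets all flip-flops."""
        self.lock = 0                    # ASSUMPTION: queuing steps end by releasing the lock
        self.flip_flops = [False] * len(self.flip_flops)
```

In this model a processor that has already attempted the instruction cannot retry, and so cannot generate further memory traffic, until some processor signals that its queuing steps are complete.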
TL;DR: In this paper, a direct memory access (DMA) controller is connected to each processor and facilitates transfer of bulk data from one memory to the other without the intervention of either or both processors.
Abstract: Apparatus for transferring data from the memory associated with one processor to the memory associated with another processor. A direct memory access (DMA) controller is connected to each processor and facilitates transfer of bulk data from one memory to the other without the intervention of either or both processors.
TL;DR: The computer system for missile guidance comprises five parallel processors interconnected by a global bus, with each processor having its own CPU, program memory, temporary memory, and two critical variable memories as mentioned in this paper.
Abstract: The computer system for missile guidance comprises five parallel processors interconnected by a global bus, with each processor having its own CPU, program memory, temporary memory, and two critical variable memories, interconnected by a local bus. The program memory and critical variable memory are hard MNOS to survive nuclear radiation. Each processor has its own cycle time, synchronized by a master clock. In each processor, the cycle has three phases for intercommunication, task processing, and critical variable storage. Thus the critical variables are stored only after task processing is completed.
TL;DR: This thesis focuses on the development and simulation of suboptimal algorithms and on consideration of special cases for each of the five performance criteria on the problem of task assignment in distributed systems.
Abstract: This thesis addresses the problem of task assignment in distributed systems. A distributed system is defined as any configuration of two or more processors, each with private memory, in which computations utilize the combined resources of the component machines. A distributed process is defined as a set of tasks which together work towards some common goal. Each task spends a portion of its time executing on one of the processors and a portion of its time communicating with other tasks in the distributed process. An assignment of tasks to processors designates one processor for each task to reside on for its lifetime.
We consider five different performance goals (cost functions) and investigate the problem of achieving optimal assignments with respect to each of these functions. In particular, we investigate task assignment to minimize (1) total execution and communication costs, (2) completion time, (3) total execution, communication, and interference costs, (4) total execution and communication costs with bounds on the number of tasks assigned to each processor, and (5) a weighted product of cost functions (1) and (2).
In all cases the problem of finding an optimal assignment for an arbitrary number of processors is found to be NP-complete. This thesis focuses on the development and simulation of suboptimal algorithms and on consideration of special cases for each of the five performance criteria.
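Illustrative sketch (not taken from the thesis): cost function (1), total execution plus communication cost, can be evaluated by exhaustive search on a tiny instance. The function names and the example data are hypothetical, and the sketch assumes that two tasks on the same processor communicate at zero cost. The search visits every one of the `n_processors ** n_tasks` assignments, so it is practical only for very small instances, consistent with the NP-completeness result stated above.

```python
from itertools import product


def total_cost(assignment, exec_cost, comm_cost):
    """Execution cost of each task on its processor plus the communication
    cost of every task pair placed on different processors."""
    cost = sum(exec_cost[task][proc] for task, proc in enumerate(assignment))
    for (a, b), c in comm_cost.items():
        if assignment[a] != assignment[b]:  # ASSUMPTION: co-located tasks talk for free
            cost += c
    return cost


def best_assignment(exec_cost, comm_cost, n_processors):
    """Brute-force search over all assignments of tasks to processors."""
    n_tasks = len(exec_cost)
    return min(
        product(range(n_processors), repeat=n_tasks),
        key=lambda a: total_cost(a, exec_cost, comm_cost),
    )


# Hypothetical instance: exec_cost[task][processor], comm_cost[(task_a, task_b)].
exec_cost = [[1, 5], [4, 2], [3, 3]]
comm_cost = {(0, 1): 4, (1, 2): 1}
best = best_assignment(exec_cost, comm_cost, n_processors=2)
```

For this instance the cheapest choice places all three tasks on processor 0: task 1 would run faster on processor 1, but the communication cost of 4 with task 0 outweighs the saving.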
TL;DR: An address conversion unit for a multiprocessor system including a common memory, and in which at least one processor includes a private memory, with the private memory and common memory having separate and distinct memory spaces is described in this article.
Abstract: An address conversion unit for a multiprocessor system including a common memory, and in which at least one processor includes a private memory, with the private memory and common memory having separate and distinct memory spaces. The conversion unit converts addresses between private addresses that are used within the processor itself and addresses that are used to retrieve contents of locations in common memory.
TL;DR: The multibus interconnection network is an attractive solution for connecting processors and memory modules in a multiprocessor with shared memory that provides a throughput which is intermediate between the single bus and the crossbar, with a corresponding intermediate cost.
Abstract: The multibus interconnection network is an attractive solution for connecting processors and memory modules in a multiprocessor with shared memory. It provides a throughput which is intermediate between the single bus and the crossbar, with a corresponding intermediate cost.
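Illustrative sketch (not taken from the paper): the "intermediate throughput" claim can be seen with a common textbook-style approximation. The function name, the uniform-access assumption, and the simplification that memory modules are requested independently are all assumptions of this sketch, not the paper's model. Each of `p` processors issues a request with probability `r` per cycle to one of `m` modules chosen uniformly, and at most `b` requested modules can be served per cycle.

```python
from math import comb


def multibus_bandwidth(p, m, b, r=1.0):
    """Approximate number of memory requests served per cycle."""
    # Probability that a given module is requested by at least one processor.
    q = 1.0 - (1.0 - r / m) ** p
    # ASSUMPTION: the number of requested modules is Binomial(m, q);
    # at most b of them can be served because there are only b buses.
    return sum(
        min(i, b) * comb(m, i) * q**i * (1.0 - q) ** (m - i)
        for i in range(m + 1)
    )


single_bus = multibus_bandwidth(p=8, m=8, b=1)
four_buses = multibus_bandwidth(p=8, m=8, b=4)
crossbar_like = multibus_bandwidth(p=8, m=8, b=8)
```

Under these assumptions, setting `b` equal to `m` removes the bus limit and gives the crossbar figure, `b = 1` gives the single-bus figure, and any value in between falls between the two.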
TL;DR: An experimental multiprocessor computer was designed and built in order to explore the feasibility of certain internal communication mechanisms, and has shown that communication structures based on distributed global memory and global bus systems can be used efficiently for medium scale systems.
Abstract: An experimental multiprocessor computer was designed and built in order to explore the feasibility of certain internal communication mechanisms. The system consisted of seven processing elements, each containing a part of the global memory connected to a local bus. For each processor the global memory is seen as one single, linearly addressable structure. The processing elements were all connected to a common, global bus, consisting of three separate busses in order to increase the capacity. A bus selection unit was designed, capable of making a unique bus selection for each request, within a fraction of a memory cycle. The experiments have shown that communication structures based on distributed global memory and global bus systems can be used efficiently for medium scale systems.
TL;DR: In this paper, a speech recognizer is described, which includes a number of processors (110, 130, 140, 150, 160) each having a shared memory (406) associated therewith; each processor performs local processing tasks on data stored in the associated shared memory.
Abstract: A speech recognizer is disclosed which includes a number of processors (110, 130, 140, 150, 160) each having a shared memory (406) associated therewith. Each processor performs local processing tasks on data stored in the associated shared memory. The data stored is distributed by direct memory access during and without interfering with local processing of the remaining data stored in the shared memories. A plurality of circuits are connected to a shared data bus (412) for effecting the data transfer across the shared data bus. A remote controller (447) controls transfer of data across a remote bus (450). A shared controller (440) includes synchronization circuitry (1100) for synchronizing shared data bus requests with the timing of the local processor, and priority circuitry (1000) to ensure that the local processor always has access to the shared memory (406) through the shared data bus (412) without waiting. When used in continuous speech recognition, a front end processor (110) is employed for converting digital spectral speech data to frames of parametric data more suitable for further speech processing; at least two template processors (130, 140, 150) are employed to store the recognizable vocabulary as templates and for comparing the frames of parametric data individually with the stored templates; and a master processor (160) is employed to transfer new frames of parametric data to the template processors and to redistribute templates among the template processors for more efficient processing in response to analysis of the results of template comparisons.
TL;DR: A recent entry in the expanding market for continuous-processing systems, the n+1 online transaction-processing system is a fail-safe computer comprised of tightly-coupled general-purpose processors and specialised input/output processors.
Abstract: A recent entry in the expanding market for continuous-processing systems is a fail-safe computer comprised of tightly-coupled general-purpose processors and specialised input/output processors. The processors in Synapse Computer Corp.'s n+1 online transaction-processing system use a proprietary nonwrite-through cache memory and can access reconfigurable, shared main memory over dual 32 Mbyte-per-sec buses. Access protection is achieved by integrating the relational database management system, the transaction processing manager and the synthesis operating system into a set of protection spheres. Synchronisation of the database and transaction processing systems provides automatic application checkpointing and recoverability.
TL;DR: In this article, a system for connecting a plurality of independently operable data processor systems is proposed to obtain a substantial increase in power and flexibility. Because all data transfer activity is controlled by a data transfer controller, there is no need for any communication between the respective independently operable data processor systems, and in one embodiment of the invention, the data transfer controller has its own separate memory to store data temporarily as it becomes available in one memory which will be needed by another data processor system.
Abstract: A system of connecting a plurality of independently operable data processor systems (12, 13, 14) to obtain a substantial increase in power and flexibility. Each of the pluralities of independently operable systems has its own memory (15, 16, 17), and the respective memories are accessible by a data transfer controller (10) connected to each memory by a common bus (18). Since all data transfer activity is controlled by the data transfer controller, there is no need for any communication between respective independently operable data processor systems, and in one embodiment of the invention, the data transfer controller has its own separate memory (11) to store data temporarily as it becomes available in one memory which will be needed by another data processor system.
TL;DR: This paper presents an approximate analysis of a multiprocessor system consisting of P processors, M memory modules, and B buses, which assumes constant memory access times, arbitrary memory access patterns, and bus contention.
Abstract: This paper presents an approximate analysis of a multiprocessor system consisting of P processors, M memory modules, and B buses. The model assumes constant memory access times, arbitrary memory access patterns, and bus contention. The solution technique aggregates all memories into a composite queue and degrades the service rates of this queue so as to include the effect of bus contention. The throughput predictions from this model are very accurate, typically within 1% of predictions made with either simulation or exact analysis.
TL;DR: An approximate model is developed to estimate the processor utilization and the speed-up improvement provided by the caches, and it assumes a two-dimensional organization, previously studied under random and word access.
Abstract: A possible design alternative for improving the performance of a multiprocessor system is to insert a private cache between each processor and the shared memory. The caches act as high-speed buffers by reducing the effective memory access time, and affect the delays caused by memory conflicts. In this paper, we study the effectiveness of caches in a multiprocessor system. The shared memory is pipelined and interleaved to improve the block transfer rate, and it assumes a two-dimensional organization, previously studied under random and word access. An approximate model is developed to estimate the processor utilization and the speed-up improvement provided by the caches.
TL;DR: Because of the memory mapping and the interconnection network, neighborhood operations such as two-dimensional convolution are easily performed, and memory access collisions can be minimized.
Abstract: LIPP (Linkoping Image Parallell Processor) is a multiprocessor system intended mainly for image analysis and image processing, but also for other computing tasks in which large amounts of data are manipulated as matrices, such as weather forecasting or related problems, namely systems of differential equations. The processors within the processor array are of bit-serial type, with the capability of directly processing data with word lengths in the range of 1 bit to 32 bits in one-bit increments without time penalty. Bit-serial operation makes it possible to design surprisingly fast algorithms. A fairly large memory (64 Kbit) is associated with each processor. A processor can instantly reach 8 neighboring memories through an interconnecting network. The processor array, whose size is planned to be 16 by 16, runs in SIMD mode. In this way memory access collisions can be minimized. Image and matrix data are mapped in the memory space so that each memory holds a subimage. We call this mapping distributed processor topology. Because of the memory mapping and interconnection network, neighborhood operations such as two-dimensional convolution are easily performed.
TL;DR: A multiprocessor includes five 8086 microprocessors interconnected with replicated shared memory, which minimizes read interference since each processor simply accesses its own private copy of the shared memory.
Abstract: A multiprocessor includes five 8086 microprocessors interconnected with replicated shared memory. Such a memory structure consists of a set of memories, one for each processor, with identical contents. This minimizes read interference since each processor simply accesses its own private copy of the shared memory. To ensure shared memory integrity, write requests transfer data over the MULTIBUS to all copies in parallel. Overall, replicated shared memory structures provide improved concurrency. An HP 64000 Logic Development System serves as a host computer for program development and a bulk storage device. A power-on and restart monitor in shared PROM provides a run-time debug and method for down-loading the operating system and application programs. The real-time, multi-tasked operating system (called MPX) distributes a sequence of high and low priority tasks, with possible preemption, among the processors. MPX floats from processor to processor while balancing the system load for maximum concurrency and throughput.
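Illustrative sketch (not taken from the paper): the replicated shared memory idea, local reads and broadcast writes, can be modelled in a few lines. The class name `ReplicatedSharedMemory` and the `bus_transfers` counter are hypothetical; the counter merely stands in for the shared bus used by writes.

```python
class ReplicatedSharedMemory:
    """Hypothetical model: one identical memory copy per processor."""

    def __init__(self, n_processors, size):
        self.copies = [[0] * size for _ in range(n_processors)]
        self.bus_transfers = 0  # counts writes that had to use the shared bus

    def read(self, pid, addr):
        """A read touches only the reader's own copy, so reads never contend."""
        return self.copies[pid][addr]

    def write(self, pid, addr, value):
        """A write is broadcast so that every copy stays identical.
        pid identifies the writer; it does not affect which copies are updated."""
        self.bus_transfers += 1
        for copy in self.copies:
            copy[addr] = value


mem = ReplicatedSharedMemory(n_processors=5, size=16)
mem.write(pid=2, addr=3, value=42)
values = [mem.read(pid, 3) for pid in range(5)]
```

The trade-off visible in the sketch is that reads cost no bus traffic at all, while every write occupies the bus once regardless of how many copies exist.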
TL;DR: The SY2130 is a monolithic 1k*8-bit dual-port static RAM that simplifies the design of multiprocessor systems and other circuits in which more than one device needs fast repetitive access to a common memory core.
Abstract: This paper describes a monolithic 1k*8-bit dual-port static RAM, the SY2130, which simplifies the design of multiprocessor systems and other circuits in which more than one device needs fast repetitive access to a common memory core. By providing two asynchronous devices with simultaneous read/write memory access, the device eliminates the external arbitration circuitry common to dual-port designs built using conventional RAMs. As a result, circuits that incorporate the device employ fewer components and typically operate faster than do standard dual-port designs. In addition, the SY2130's fully asynchronous operation allows processors of varying speeds to be coupled without degrading system performance.
TL;DR: The ICL Distributed Array Processor is discussed in detail and applications are assessed, finding that operations on low precision, symbol and Boolean data give even higher performance than floating point work.
Abstract: High processing power can be achieved in a cost-effective manner by having many processing units executing a common instruction stream and embedded in memory in order to provide a wide memory bandwidth. The ICL Distributed Array Processor uses this principle, the present implementation having 4096 elementary processors arranged in a square array. A parallel high-level language is used for expressing operations on arrays and permits effective use of the hardware's capability. Many applications have been implemented and because of the bit organised nature of the processing elements, operations on low precision, symbol and Boolean data give even higher performance than floating point work. The DAP is discussed in detail and applications are assessed.
TL;DR: This dissertation develops a method for organizing a system of processors with shared memory that takes into account dynamic load balancing across the processors, robustness, and reliability, and describes a distributed implementation of a Pascal compiler on a local network of personal computers.
Abstract: Designing a closely-coupled distributed computer system would create an environment that is easily expandable, eliminate the high communication costs of a loosely-coupled system, and provide a great deal of power at a low cost. Since the software and hardware architectures do not currently exist to allow this kind of system to be built, this dissertation will explore a unified view of hardware and software and present one solution.
The first portion of this dissertation develops a method for organizing a system of processors with shared memory that takes into account dynamic load balancing across the processors, robustness, and reliability. Emphasis is given to the design of the network, placement of input/output devices, and caching techniques.
Next, a technique for implementing language processors (compilers and assemblers) that will run on the closely-coupled distributed system is discussed. The basis of this technique is to process in parallel various syntactic structures of the language to be compiled. In this way, the system resembles a data flow computer in which the granularity of the data is very large. In addition, techniques for the production of a compiler that analyzes the data flow of a program written in a conventional language, that produces a data flow graph representing that program, and that partitions the graph for execution on a multiprocessor system are developed.
The final part of this dissertation will describe a distributed implementation of a Pascal compiler on a local network of personal computers. This prototype demonstrates the feasibility of this approach to distributed processing.
TL;DR: The Megaframe is a multiprocessor system composed of separate sets of MC68010 and IAPX-186 processors that handles all application-related tasks including process and memory management and communications with peripheral devices.
Abstract: The Megaframe is a multiprocessor system composed of separate sets of MC68010 and IAPX-186 processors. One set handles all application-related tasks including process and memory management. Another set takes care of all file management, and the third set handles communications with peripheral devices. The high degree of independence enjoyed by each application processor means that adding extra processors gives an almost linear increase in the processing power of the system. The Unix kernel in each application processor provides user programs with an interface compatible with Unix System V.
TL;DR: The MIDAS architecture organizes multiple CPUs into clusters called distributed subsystems, each of which consists of an array of processors controlled by a supervisory CPU.
Abstract: The MIDAS architecture organizes multiple CPUs into clusters called distributed subsystems. Each subsystem consists of an array of processors controlled by a supervisory CPU. The multiprocessor array is composed of commercial CPUs (with floating point hardware) and specialized processing elements. Interprocessor communication within the array may occur either through switched memory modules or common shared memory. The architecture permits multiple processors to be focused on single problems. A distributed subsystem has been constructed and tested. It currently consists of a supervisor CPU; 16 blocks of independently switchable memory; 9 general purpose, VAX-class CPUs; and 2 specialized pipelined processors to handle I/O. Results on a variety of problems indicate that the subsystem performs 8 to 15 times faster than a standard computer with an identical CPU. The difference in performance represents the effect of differing CPU and I/O requirements.
TL;DR: The authors study the performance of a tightly-coupled multiprocessor in which a crossbar is employed to interconnect p processors to m memory modules, and propose three approximation methods, of which the repetitive augmenting method, based on the idea of aggregation, generates the best result.
Abstract: System structure and program behaviour are two major factors that influence the performance of a tightly-coupled multiprocessor. The latter has usually been ignored in most of the previous studies. The authors study the performance of a tightly-coupled multiprocessor in which a crossbar is employed to interconnect p processors to m memory modules. A set of non-uniformly distributed probabilities is also employed to illustrate the program behaviour, but no distinction is made between the processors. An inverse relation between the average request completion time and the effective memory bandwidth is obtained and three approximation methods are proposed. Their solutions are compared with the exact solution. Among them the repetitive augmenting method, which is based on the idea of aggregation, generates the best result. 18 references.
TL;DR: In this paper, the authors propose a scheme for transferring information from computer to computer without computer software overhead, by writing information that has been received by the communication control equipment to the memory of the computer and using the memory units of the respective computers as a shared memory.
Abstract: PURPOSE: To transfer information from computer to computer without the computer software overhead, by writing information which has been received by the communication control equipment to the memory of the computer and using the memory units of the respective computers as a shared memory. CONSTITUTION: A communication control equipment 4 of the computer 1 periodically fetches the block contents of the virtual shared memory segment (not shown in the figure) into a frame memory (not shown in the figure), adds the transmission control header thereto and sends them to the transmission line 5. The frame which has been sent to the transmission line 5 is sent again to the transmission line 5. The frame which has been sent to the transmission line 5 is removed from the transmission route when the frame returns to the transmitting device 4, after making one round. By this arrangement, the information which has been set to the memory of the computer 1 is automatically transferred to all other memory units of the computer 1 within a certain limited time. When viewed from the program of the computer, therefore, the operation can be made as if the shared memory were present, which can be accessed in common from all computers.
TL;DR: Two schemes which allow for the parallel secondary storage devices to preload input data and programs into the primary memories so that system performance can be improved are presented and compared and show that both methods are effective techniques.
Abstract: Parallel processing systems, such as PASM, employ a large number of primary memory modules. A memory system organization using parallel secondary storage devices and double-buffered primary memories has been devised for PASM in order to prevent primary/secondary memory transfers from becoming a bottleneck. To efficiently use the memory system, it is desirable to overlap the operation of the parallel secondary storage devices with computations being performed by the processors. Due to the dynamically reconfigurable architecture of PASM, the processors which will execute a new task will not be selected until they are ready to execute the task. That is, to make effective use of double-buffering, a task must be preloaded prior to the final selection of the processors on which it will execute. Two schemes which allow for the parallel secondary storage devices to preload input data and programs into the primary memories so that system performance can be improved are presented and compared. Results show that both methods are effective techniques. 21 references.
TL;DR: This paper presents a technique, based on processors sharing memory, which enables the construction of flexible multiprocessor systems, which has certain limitations in that it is not dynamically reconfigurable at run time and that certain topologies may not be implemented directly.
TL;DR: A stochastic model of throughput with interference for a system of n processors and m shared resources is developed which provides both individual processor and overall system throughputs.
Abstract: A stochastic model of throughput with interference for a system of n processors and m shared resources is developed which provides both individual processor and overall system throughputs. In addition, the model considers the influence of local processor resources, the bandwidth of shared resources, the priorities of processors, and the processor utilizations of each shared resource. To validate the model, experimental data collected on a three-processor, one shared-resource (memory) computer is compared to model predictions. Accuracy decreases as shared resource utilization and the number of processors increase. For utilization of 60-70 percent, the error is less than 3 percent, and at saturation (100 percent) it increases to 10 percent. Finally, a two-processor, three-shared-resource computer illustrates the shift in throughput as various model parameters change. 8 references.
TL;DR: This paper proposes the expanded high-order correlation matrix associative memory, aiming at reducing the required number of memory elements, and several aspects of the characteristics are discussed, such as correctness of memory and recall, rate of correct recall, utilization efficiency of memory element, resistance to partial breakdown, and error compensation and correction capabilities.
Abstract: Among the problems concerning correlation matrix-type associative memory, the limitation of the number of vectors that can be memorized (i.e., memory capacity) was removed by Ichikawa et al. [3] by introducing the idea of high-order correlation matrix. However, there is a disadvantage in that the required number of memory elements is proportional to the N power of the vector in the N-th order correlation matrix associative memory. One reason for this is that a redundancy exists in the memorized data due to the symmetrical property of the high-order correlation matrix. This paper focuses on this point and proposes the expanded high-order correlation matrix associative memory, aiming at reducing the required number of memory elements. Several aspects of the characteristics of the proposed system are discussed, such as correctness of memory and recall, rate of correct recall, utilization efficiency of memory element, resistance to partial breakdown, and error compensation and correction capabilities. As a result, the following properties are shown:
1. there exists an order for which correct recall is possible independently of the attribute of the vector to be memorized;
2. the rate of recall increases with the increase of the order;
3. the utilization efficiency of the memory element is improved with the increase of the order, but the contribution of the memory element to the recall is independent of the order;
4. the system is resistant to the partial breakdown of memory, which is a desirable property as a distributed memory;
5. the error compensation and correction capability increases with the increase of the nonzero components of the input vector and the increase of the order.
TL;DR: A simulator for a multiprocessor system consisting of p identical processors and m shared memories connected by a crossbar switch is presented and special attention is given to the automatic partitioning of a high level program into parallel tasks.
Abstract: A simulator for a multiprocessor system consisting of p identical processors and m shared memories connected by a crossbar switch is presented. Its main features are a DO-loop unwrapper, the scheduler and the emulator. Special attention is given to the automatic partitioning of a high level program into parallel tasks. This allows the user to concentrate on different processor configurations and experience with many algorithms. The simulator takes into account communication delays and variable execution times for different operations.