TL;DR: The basic principle of systolic architectures is reviewed and it is explained why they should result in cost-effective, highperformance special-purpose systems for a wide range of problems.
Abstract: f High-performance, special-purpose computer systems are typically used to meet specific application requirements or to off-load computations that are especially taxing to general-purpose computers. As hardware cost and size continue to drop and processing requirements become well-understood in areas such as signal and image processing, more special-purpose systems are being constructed. However, since most of these systems are built on an ad hoc basis for specific tasks, methodological work in this area is rare. Because the knowledge gaited from individual experiences is neither accumulated nor properly organized, the same errors are repeated. I/O and computation imbalance is a notable example-often, the fact that I/O interfaces cannot keep up with device speed is discovered only after constructing a high-speed, special-purpose device. We intend to help correct this ad hoc approach by providing a general guideline-specifically, the concept of systolic architecture, a general methodology for mapping high-level computations into hardware structures. In a systolic system, data flows from the computer memcory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart. The system works like an autombbile assembly line where different people work on the same car at different times and many cars are assembled simultaneously. An assembly line is always linear, however, and systolic systems are sometimes two-dimensional. They can be rectangular, triangular, or hexagonal to make use of higher degrees of parallelism. Moreover, to implement a variety of computations, data flow in a systolic system may be at multiple speeds in multiple directions-both inputs and (partial) results flow, whereas only results flow in classical pipelined systems. Generally speaking, a systolic system is easy to implement because of its regularity and easy to reconfigure (to meet various outside constraints) because of its modularity. The systolic architectural concept was developed at Carnegie-Mellon University,'17 and versions of systolic processors are being designed and built by several industrial and governmental organizations.840 This article reviews the basic principle of systolic architectures and explains why they should result in cost-effective, highperformance special-purpose systems for a wide range of problems.
TL;DR: A systolic system is a network of processors which rhythmically compute and pass data through the system, and almost all processors used in the networks are identical, so that a regular flow of data is kept up in the network.
Abstract: : A systolic system is a network of processors which rhythmically compute and pass data through the system. Physiologists use the work 'systole' to refer to the rhythmically recurrent contraction of the heart and arteries which pulses blood through the body. In a systolic computing system, the function of a processor is analogous to that of the heart. Every processor regularly pumps data in and out, each time performing some short computation, so that a regular flow of data is kept up in the network. Many basic matrix computations can be pipelined elegantly and efficiently on systolic networks having an array structure. As an example, hexagonally connected processors can optimally perform matrix multiplication. Surprisingly, a similar systolic array can compute the LU-decomposition of a matrix. These systolic arrays enjoy simple and regular communication paths, and almost all processors used in the networks are identical. As a result, special purpose hardware devices based on systolic arrays can be built inexpensively using the VLSI technology. (Author)
TL;DR: This paper is concerned with the mapping of cyclic loop algorithms into special-purpose VLSI arrays and the mapping procedure is based on the mathematical transformations of index sets and data dependence vectors.
Abstract: This paper is concerned with the mapping of cyclic loop algorithms into special-purpose VLSI arrays. The mapping procedure is based on the mathematical transformations of index sets and data dependence vectors. Necessary and sufficient conditions for the existence of valid transformations are given for algorithms with constant data dependences. Two examples of different algorithms are given to illustrate the proposed mapping procedure; first is the LU decomposition of a matrix which leads to constant data dependence vectors, and secondly is the dynamic programming which leads to dependences which are functions on the index set and are more difficult to be mapped into VLSI arrays.
TL;DR: In this article, a parallel RISC computer system is provided by a multimode dynamic multi-mode parallel processor array with one embodiment illustrating a tightly coupled VLSI embodiment with an architecture which can be extended to more widely placed processing elements through the interconnection network which couples multiple processors capable of MIMD mode processing to one another.
Abstract: A Parallel RISC computer system is provided by a multi-mode dynamic multi-mode parallel processor array with one embodiment illustrating a tightly coupled VLSI embodiment with an architecture which can be extended to more widely placed processing elements through the interconnection network which couples multiple processors capable of MIMD mode processing to one another with broadcast of instructions to selected groups of units controlled by a controlling processor. The coupling of the processing elements logic enables dynamic mode assignment and dynamic mode switching, allowing processors operating in a SIMD mode to make maximum memory and cycle time usage. On and instruction by instruction level basis, modes can be switched from SIMD to MIMD, and even into SISD mode on the controlling processor for inherently sequential computation allowing a programmer or complier to build a program for the computer system which uses the optimal kind of parallelism (SISD, SIMD, MIMD). Furthermore, this execution, particularly in the SIMD mode, can be set up for running applications at the limit of memory cycle time. With the ALLNODE switch and alternatives paths a system can be dynamically achieved in a few cycles for many many processors. Each processing element and memory and has MIMD capability the processor's an instruction register, condition register and program counter provide common resources which are used in MIMD and SIMD. The program counter become a base register in SIMD mode.
TL;DR: Use of the cut theory and ring architectures for arrays with feedback gives effective fault-tolerant and two-level pipelining schemes for most systolic arrays.