TL;DR: Pentti Kanerva's Sparse Distributed Memory presents a mathematically elegant theory of human long-term memory whose realization with neuronlike components resembles the cortex of the cerebellum, and provides an overall perspective on neural systems.
Abstract: From the Publisher:
Motivated by the remarkable fluidity of memory (the way in which items are pulled spontaneously and effortlessly from our memory by vague similarities to what is currently occupying our attention), Sparse Distributed Memory presents a mathematically elegant theory of human long-term memory.
The book, which is self-contained, begins with background material from mathematics, computers, and neurophysiology; this is followed by a step-by-step development of the memory model. The concluding chapter describes an autonomous system that builds from experience an internal model of the world and bases its operation on that internal model. Close attention is paid to the engineering of the memory, including comparisons to ordinary computer memories.
Sparse Distributed Memory provides an overall perspective on neural systems. The model it describes can aid in understanding human memory and learning, and a system based on it sheds light on outstanding problems in philosophy and artificial intelligence. Applications of the memory are expected to be found in the creation of adaptive systems for signal processing, speech, vision, motor control, and (in general) robots. Perhaps the most exciting aspect of the memory, in its implications for research in neural networks, is that its realization with neuronlike components resembles the cortex of the cerebellum.
Pentti Kanerva is a scientist at the Research Institute for Advanced Computer Science at the NASA Ames Research Center and a visiting scholar at the Stanford Center for the Study of Language and Information. A Bradford Book.
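The abstract above does not spell out how the memory works, so the following is only a minimal sketch of the commonly described form of the model: random "hard locations" with binary addresses, activation of every location within a Hamming radius of the cue, and per-bit counters that are summed and thresholded on read. The class name, the parameter values (256 bits, 1000 locations, radius 112), and the seed are illustrative assumptions, not values taken from the book.

```python
import random


class SparseDistributedMemory:
    """Toy sparse distributed memory over fixed-width binary words (illustrative only)."""

    def __init__(self, n_bits=256, n_locations=1000, radius=112, seed=0):
        rng = random.Random(seed)
        self.n_bits = n_bits
        self.radius = radius
        # Hard locations: random addresses, each owning one counter per bit.
        self.addresses = [rng.getrandbits(n_bits) for _ in range(n_locations)]
        self.counters = [[0] * n_bits for _ in range(n_locations)]

    def _active(self, address):
        # A location is activated when its address lies within the Hamming radius of the cue.
        return [i for i, a in enumerate(self.addresses)
                if bin(a ^ address).count("1") <= self.radius]

    def write(self, address, word):
        # Add +1 for each 1 bit and -1 for each 0 bit at every activated location.
        for i in self._active(address):
            row = self.counters[i]
            for b in range(self.n_bits):
                row[b] += 1 if (word >> b) & 1 else -1

    def read(self, address):
        # Sum the counters of all activated locations and threshold each bit at zero.
        sums = [0] * self.n_bits
        for i in self._active(address):
            for b, c in enumerate(self.counters[i]):
                sums[b] += c
        word = 0
        for b, s in enumerate(sums):
            if s > 0:
                word |= 1 << b
        return word
```

Writing a word at its own address and then reading with a cue that differs in a few bits illustrates the "vague similarity" retrieval the abstract describes: the two cues activate overlapping sets of locations, so the stored word can still be reconstructed.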
TL;DR: An efficient message-passing program is derived from a sequential shared-memory program annotated with directions on how elements of shared arrays are distributed to processors; one possible input language for describing distributions is presented.
Abstract: We describe a new approach to programming distributed-memory computers. Rather than having each node in the system explicitly programmed, we derive an efficient message-passing program from a sequential shared-memory program annotated with directions on how elements of shared arrays are distributed to processors. This article describes one possible input language for describing distributions and then details the compilation process and the optimization necessary to generate an efficient program.
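To make the idea of a distribution annotation concrete, here is a small sketch of how an array distribution determines which accesses need messages. It assumes two textbook distributions (block and cyclic) and an "owner computes" rule for the statement `a[i] = b[i + 1]`; the function names and the rule are illustrative assumptions, not the input language or compiler described in the article.

```python
def block_owner(i, n, p):
    """Owner of element i when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceil(n / p)
    return i // block


def cyclic_owner(i, p):
    """Owner of element i when elements are dealt out round-robin to p processors."""
    return i % p


def messages_for_shift(n, owner):
    """Messages needed for `a[i] = b[i + 1]`, i in range(n - 1), under owner-computes.

    The owner of a[i] performs the assignment; b[i + 1] must be sent whenever it
    lives on a different processor. Returns (sender, receiver, index) triples.
    """
    msgs = []
    for i in range(n - 1):
        src, dst = owner(i + 1), owner(i)
        if src != dst:
            msgs.append((src, dst, i + 1))
    return msgs
```

With 16 elements on 4 processors, the block distribution needs a message only at each block boundary, while the cyclic distribution needs one for every element, which is the kind of difference a compiler can exploit once the distribution is known.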
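TL;DR: A multi-node computer system on a common broadcast bus uses programmable partition tables to translate between node physical addresses and system physical addresses, which permits data to be duplicated throughout the distributed system architecture and permits read cycles for shared data to execute at local bus speeds.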
Abstract: A computer system having plural nodes interconnected by a common broadcast bus. Each node has memory and at least one node has a processor. The system has a dynamically configurable memory which may be located within the system address space of a distributed system architecture including memory within each node having a processor and the memory resident within other nodes. The memory in the system address space is addressable by system physical addresses which are isolated from the physical addresses for memory in each node. The node physical addresses are translatable to and from the system physical addresses by partition maps located in partition tables at each node. Memory located anywhere in the distributed system architecture may be partitioned dynamically and accessed on a local basis by programming the partition tables, stored in partitioning RAMs. The use of the partitioning process permits data to be duplicated throughout a distributed system architecture and permits read cycles for shared data to execute at local bus speeds.
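The address translation described above can be sketched as a pair of lookup tables per node. The partition size, the dictionary-based tables, and the `broadcast_write` helper are hypothetical simplifications for illustration; the abstract does not give the hardware format of the partition tables or the bus protocol.

```python
PARTITION_SIZE = 4096  # assumed partition granularity, not stated in the abstract


class Node:
    """A node with local memory and a partition table mapping to system addresses."""

    def __init__(self, name):
        self.name = name
        self.to_system = {}  # node partition number -> system partition number
        self.to_node = {}    # system partition number -> node partition number
        self.memory = {}     # node physical address -> stored value

    def map_partition(self, node_part, system_part):
        self.to_system[node_part] = system_part
        self.to_node[system_part] = node_part

    def node_to_system(self, addr):
        part, offset = divmod(addr, PARTITION_SIZE)
        return self.to_system[part] * PARTITION_SIZE + offset

    def system_to_node(self, addr):
        part, offset = divmod(addr, PARTITION_SIZE)
        node_part = self.to_node.get(part)
        return None if node_part is None else node_part * PARTITION_SIZE + offset

    def read(self, addr):
        # Reads of duplicated data are served from the node's own memory.
        return self.memory[addr]


def broadcast_write(nodes, writer, node_addr, value):
    """Write through the bus: every node that maps the system address updates its copy."""
    system_addr = writer.node_to_system(node_addr)
    for node in nodes:
        local = node.system_to_node(system_addr)
        if local is not None:
            node.memory[local] = value
```

Because both nodes in such a setup hold their own copy of a shared partition, a later read touches only local memory, which is the property the abstract attributes to the partitioning process.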
TL;DR: The architecture of the processor in the Horizon system is described, which will be capable of performing at a rate of several hundred MFLOPS (millions of floating-point operations per second) to achieve an overall system performance target of 100 GFLOPS.
Abstract: Horizon is a scalable shared-memory Multiple Instruction stream - Multiple Data stream (MIMD) computer architecture independently under study at the Supercomputing Research Center (SRC) and Tera Computer Company. It is composed of a few hundred identical scalar processors and a comparable number of memories, sparsely embedded in a three-dimensional nearest-neighbor network. Each processor has a horizontal instruction set that can issue up to three floating point operations per cycle without resorting to vector operations. Processors will each be capable of performing several hundred Million Floating Point Operations Per Second (FLOPS) in order to achieve an overall system performance target of 100 Billion ($10^{11}$) FLOPS. This paper describes the architecture of the processor in the Horizon system. In the fashion of the Denelcor HEP, the processor maintains a variable number of Single Instruction stream - Single Data stream (SISD) processes, which are called instruction streams. Memory latency introduced by the large shared memory is hidden by switching context (instruction stream) each machine cycle. The processor functional units are pipelined to achieve high computational throughput rates; however, pipeline dependencies are hidden from user code. Hardware mechanisms manage the resources to guarantee anonymity and independence of instruction streams.
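The latency-hiding idea (switching instruction stream every cycle) can be illustrated with a toy model. The sketch below assumes that every instruction makes its stream wait a fixed number of cycles before it may issue again and that the processor picks ready streams round-robin; both assumptions, and the function name, are simplifications for illustration and not a description of Horizon's actual hardware.

```python
def utilization(n_streams, latency, n_cycles):
    """Fraction of cycles in which some instruction stream could issue.

    Each issued instruction blocks its stream for `latency` cycles (assumed fixed).
    Ready streams are served in round-robin order, one issue per cycle.
    """
    ready_at = [0] * n_streams  # first cycle at which each stream may issue again
    issued = 0
    nxt = 0                     # round-robin starting point
    for cycle in range(n_cycles):
        for k in range(n_streams):
            s = (nxt + k) % n_streams
            if ready_at[s] <= cycle:
                ready_at[s] = cycle + latency
                issued += 1
                nxt = (s + 1) % n_streams
                break
    return issued / n_cycles
```

In this model the processor stays fully busy once the number of streams is at least the latency in cycles; with fewer streams, utilization falls to roughly `n_streams / latency`.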
TL;DR: Several parallel algorithms are presented for solving triangular systems of linear equations on distributed-memory multiprocessors and new wavefront algorithms are developed for both row-oriented and column-oriented matrix storage.
Abstract: Several parallel algorithms are presented for solving triangular systems of linear equations on distributed-memory multiprocessors. New wavefront algorithms are developed for both row-oriented and column-oriented matrix storage. Performance of the new algorithms and several previously proposed algorithms is analyzed theoretically and illustrated empirically using implementations on commercially available hypercube multiprocessors.
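For reference, the two serial access patterns that row-oriented and column-oriented storage lead to can be sketched as follows for a lower triangular system $Lx = b$. These are the standard serial forward-substitution loops, not the wavefront algorithms of the paper; the function names are illustrative.

```python
def solve_lower_by_rows(L, b):
    """Row-oriented forward substitution: each x[i] is an inner product with row i."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x


def solve_lower_by_columns(L, b):
    """Column-oriented forward substitution: each x[j] updates the rest of the right-hand side."""
    n = len(b)
    r = list(b)  # running right-hand side
    x = [0.0] * n
    for j in range(n):
        x[j] = r[j] / L[j][j]
        for i in range(j + 1, n):
            r[i] -= L[i][j] * x[j]
    return x
```

The row form reads one row of the matrix per unknown, while the column form reads one column per unknown, which is why the choice of storage scheme shapes the communication pattern on a distributed-memory machine.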
TL;DR: This article deals with the problem of factoring a large sparse positive definite matrix on a multiprocessor system where the processors are assumed to have substantial local memory but no globally shared memory.
Abstract: This article deals with the problem of factoring a large sparse positive definite matrix on a multiprocessor system. The processors are assumed to have substantial local memory but no globally shared memory. They communicate among themselves and with a host processor through message passing. Our primary interest is in designing an algorithm which exploits parallelism, rather than in exploiting features of the underlying topology of the hardware. However, part of our study is aimed at determining, for certain sparse matrix problems, whether hardware based on the binary hypercube topology adequately supports the communication requirements for such problems. Numerical results from experiments conducted on a hypercube multiprocessor are included.
TL;DR: It is shown that the kernel, through a combination of locking, shadowed memory, and controlled flushing of non-write-through cache, maintains a consistent main memory state recoverable from any single-point failure.
Abstract: The Sequoia computer is a tightly coupled multiprocessor that avoids most of the fault-tolerance disadvantages of tight coupling by using a fault-tolerant hardware-design approach. An overview is given of how the hardware architecture and operating system (OS) work together to provide a high degree of fault tolerance with good system performance. A description of hardware is followed by a discussion of the multiprocessor synchronization problem. Kernel support for fault recovery and the recovery process itself are examined. It is shown that the kernel, through a combination of locking, shadowed memory, and controlled flushing of non-write-through cache, maintains a consistent main memory state recoverable from any single-point failure. The user shared memory is also discussed.
TL;DR: In this paper, a modular, expandable, topologically-distributed-memory multiprocessor computer comprises a plurality of non-directly communicating slave processors under the control of a synchronizer and a master processor.
Abstract: A modular, expandable, topologically-distributed-memory multiprocessor computer comprises a plurality of non-directly communicating slave processors under the control of a synchronizer and a master processor. Memory space is partitioned into a plurality of memory cells. Dynamic variables may be mapped into the memory cells so that they depend upon processing in nearby partitions. Each slave processor is connected in a topologically well-defined way through a dynamic bi-directional switching system (gateway) to different respective ones of the memory cells. Access by the slave processors to their respective topologically similar memory cells occurs concurrently or in parallel in such a way that no data-flow conflicts occur. The topology of data distribution may be chosen to take advantage of symmetries which occur in broad classes of problems. The system may be tied to a host computer used for data storage and analysis of data not efficiently processed by the multiprocessor computer.
TL;DR: It is concluded that, in the absence of loop-unrolling, $LU$ factorization with partial pivoting is most efficient when pipelining is used to mask the cost of pivoting.
Abstract: In this paper, we consider the effect that the data-storage scheme and pivoting scheme have on the efficiency of $LU$ factorization on a distributed-memory multiprocessor. Our presentation will focus on the hypercube architecture, but most of our results are applicable to distributed-memory architectures in general. We restrict our attention to two commonly used storage schemes (storage by rows and by columns) and investigate partial pivoting both by rows and by columns, yielding four factorization algorithms. Our goal is to determine which of these four algorithms admits the most efficient parallel implementation. We analyze factors such as load distribution, pivoting cost, and potential for pipelining. We conclude that, in the absence of loop-unrolling, $LU$ factorization with partial pivoting is most efficient when pipelining is used to mask the cost of pivoting. The two schemes that can be pipelined are pivoting by interchanging rows when the coefficient matrix is distributed to the processors by columns, and pivoting by interchanging columns when the matrix is distributed to the processors by rows.
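As a point of reference for the pivoting schemes discussed above, here is a serial sketch of $LU$ factorization with partial pivoting by row interchange. It is the textbook serial algorithm with the factors packed into one matrix; the function names are illustrative, and the distributed storage schemes and pipelining analyzed in the paper are not reproduced.

```python
def lu_partial_pivot(A):
    """Return (perm, LU) with P*A = L*U, where row k of P*A is A[perm[k]].

    L (unit lower triangular) is stored below the diagonal of LU, U on and above it.
    """
    n = len(A)
    LU = [[float(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: choose the entry of largest magnitude in column k.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if LU[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            LU[k], LU[p] = LU[p], LU[k]  # interchange rows
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            m = LU[i][k] / LU[k][k]
            LU[i][k] = m                 # store the multiplier in place of the zero
            for j in range(k + 1, n):
                LU[i][j] -= m * LU[k][j]
    return perm, LU


def reconstruct(LU):
    """Multiply the packed factors back together to obtain P*A."""
    n = len(LU)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else LU[i][k]
                total += l_ik * LU[k][j]
            out[i][j] = total
    return out
```

The pivot search scans a column and the interchange swaps rows, so under column storage the search is local to one processor while the swap is spread across all of them; this is one way to see why the pivoting scheme and the storage scheme interact.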
TL;DR: A toroidally-connected distributed-memory parallel computer is described that has rows of processors, each processor having an independent memory, and at least one common I/O channel connected to a single row of processors through buffering mechanisms, one for each processor in that row.
Abstract: A toroidally-connected distributed-memory parallel computer having rows of processors (12), with each processor having an independent memory. The computer includes at least one common I/O channel (26) adapted to be connected to a single row of processors (20) by buffering (24) mechanisms. Each buffering mechanism is associated with one processor of the single row of processors.
TL;DR: This work considers solving triangular systems of linear equations on a distributed-memory multiprocessor which allows for a ring embedding and proposes a parallel algorithm, applicable when the triangular matrix is distributed by column in a wrap fashion.
Abstract: We consider solving triangular systems of linear equations on a distributed-memory multiprocessor which allows for a ring embedding. Specifically, we propose a parallel algorithm, applicable when the triangular matrix is distributed by column in a wrap fashion. Numerical experiments indicate that the new algorithm is very efficient in some circumstances (in particular, when the size of the problem is sufficiently large relative to the number of processors). A theoretical analysis confirms that the total running time varies linearly, with respect to the matrix order, up to a threshold value of the matrix order, after which the dependence is quadratic. Moreover, we show that total message traffic is essentially the minimum possible. Finally, we describe an analogous row-oriented algorithm.
TL;DR: In this paper, a multiprocessing system is presented having a plurality of processing nodes interconnected together by a communication network, each processing node including a processor, responsive to user software running on the system, and an associated memory module, and capable under user control of dynamically partitioning each memory module into a global storage efficiently accessible by a number of processors connected to the network, and local storage efficiently accessible by its associated processor.
Abstract: A multiprocessing system is presented having a plurality of processing nodes interconnected together by a communication network, each processing node including a processor, responsive to user software running on the system, and an associated memory module, and capable under user control of dynamically partitioning each memory module into a global storage efficiently accessible by a number of processors connected to the network, and local storage efficiently accessible by its associated processor.
TL;DR: In this paper, a multiprocessing system and a method for multi-processing is described, in which a pair of processors are connected to a central memory through a plurality of memory reference ports, and each processor is further connected to shared registers which may be directly addressed by either processor at rates commensurate with intra-processor operation.
Abstract: A multiprocessing system and method for multiprocessing is disclosed. A pair of processors are provided, and each is connected to a central memory through a plurality of memory reference ports. The processors are further each connected to a plurality of shared registers which may be directly addressed by either processor at rates commensurate with intra-processor operation. The shared registers include registers for holding scalar and address information and registers for holding information to be used in coordinating the transfer of information through the shared registers. A multiport memory is provided and includes a conflict resolution circuit which senses and prioritizes conflicting references to the central memory. Each CPU is interfaced with the central memory through three ports, with each of the ports handling different ones of several different types of memory references which may be made. At least one I/O port is provided to be shared by the processors in transferring information between the central memory and peripheral storage devices. A vector register design is also disclosed for use in vector processing computers, and provides that each register consist of at least two independently addressable memories, to deliver data to or accept data from a functional unit. The method of multiprocessing permits multitasking in the multiprocessor, in which the shared registers allow independent tasks of different jobs or related tasks of a single job to be run concurrently, and facilitate multithreading of the operating system by permitting multiple critical code regions to be independently synchronized.
TL;DR: This paper describes how Crystal, a language based on familiar mathematical notation and lambda calculus, addresses the issues of programmability and performance for parallel supercomputers and illustrates the power of its approach with benchmarks of compiled parallel code from Crystal source.
Abstract: This paper describes how Crystal, a language based on familiar mathematical notation and lambda calculus, addresses the issues of programmability and performance for parallel supercomputers. Some scientific programmers and theoreticians may ask, “What is new about Crystal?” or “How is it different from existing functional languages?” The answers lie in its model of parallel computation and a theory of parallel program optimization, and we examine this in the text to follow. We illustrate the power of our approach with benchmarks of compiled parallel code from Crystal source. The target machines are hypercube multiprocessors with distributed memory, on which it is considered difficult for functional programs to achieve high efficiency.
TL;DR: In this paper, a method for uniformly balancing the aggregate computational load in, and utilizing a minimal memory by, a network having identical computations to be executed at each connection therein is disclosed.
Abstract: In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing a minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of partitions, and the computational load is subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balance load. Merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which memory is subdivided.
TL;DR: New parallel algorithms and comparative test results are given for solving triangular systems of linear equations on distributed-memory multiprocessors and the new algorithms are shown to provide substantial performance improvements.
Abstract: New parallel algorithms and comparative test results are given for solving triangular systems of linear equations on distributed-memory multiprocessors. These results supplement those given in a previous paper. All of the new algorithms are variations on the cyclic algorithms discussed previously. The new algorithms are shown to provide substantial performance improvements.
TL;DR: Distributed systems are modeled as processes communicating via messages, a model that abstracts the three degrees of distribution (shared memory, local network, and wide area network), and the differences in their message transport costs and reliability are quantified for past, current, and future technologies.
Abstract: Distributed systems can be modeled as processes communicating via messages. This model abstracts the three degrees of distribution: shared memory, local network, and wide area network. Although these three forms of distribution are qualitatively the same, there are huge quantitative differences in their message transport costs and message transport reliability. This paper quantifies these differences for past, current, and future technologies.
TL;DR: A new distributed algorithm is shown to outperform centralized ones and provide unrestricted sharing of read-write memory between tasks running on either strongly coupled or loosely coupled architectures, and any mixture thereof.
Abstract: This report describes the design, implementation and performance evaluation of a virtual shared memory server for the Mach operating system. The server provides unrestricted sharing of read-write memory between tasks running on either strongly coupled or loosely coupled architectures, and any mixture thereof. A number of memory coherency algorithms have been implemented and evaluated, including a new distributed algorithm that is shown to outperform centralized ones. Some of the features of the server include support for machines with multiple page sizes, for heterogeneous shared memory, and for fault tolerance. Extensive performance measures of applications are presented, and the intrinsic costs evaluated.
TL;DR: Analytic and experimental results confirm the viability of distributed shared memory supported in hardware at the memory controller level; for many applications, the impact of distributing the memory resource is under 10% of the undistributed performance.
Abstract: A major limitation of current distributed system technology is that the overhead associated with the normal input/output paradigm of interconnection severely affects the system performance. This research has taken a new perspective on the interconnection problem based on a memory extension paradigm. Evidence to date demonstrates that the processor overhead is greatly reduced and significant additional functionality is gained.
Memnet is a computer architecture in which the local network appears as memory in the physical address space of each processor on the network. Local area networking and distributed system support are two potential applications of this architecture. The Memnet principles of computer/communication interconnection are extendable to wide-area, high-speed, low-latency processor interconnection.
This dissertation includes a survey of interprocess communication schemes and shared memory architectures. A description of the Memnet architecture and implementation is followed by an analysis of the behavior of the Memnet architecture over a wide range of uses. The state machines and schematics of the experimental implementation are included as appendices.
Analytic and experimental results confirm the viability of distributed shared memory supported in hardware at the memory controller level. For many applications, the impact of distributing the memory resource is under 10% of the undistributed performance.
TL;DR: A description is given of the architecture, operating system, and performance of Balance, a shared-memory, tightly coupled multiprocessor system that supports both the 4.2 BSD and System V Unix environments.
Abstract: A description is given of the architecture, operating system, and performance of Balance, a shared-memory, tightly coupled multiprocessor system. Balance can contain two to thirty 32-bit microprocessors with an aggregate performance of up to 21 million instructions per second (MIPS). Each processor has a private cache as well as a small local memory to hold frequently used kernel routines. The system features a high-bandwidth pipelined bus, up to 28 Mbytes of main memory, a diagnostic and console processor, up to four IEEE 796 (Multibus) adapters, an IEEE 802.3 (Ethernet) LAN interface, and an ANSI Small Computer System Interface (SCSI). Dynix, a multiprocessor operating system supporting both the 4.2 BSD and System V Unix environments, manages Balance, providing transparent support for multiprocessing as well as tools and libraries for developing parallel applications. The various subsystems and the Dynix operating system are examined. Applications and performance are discussed.
TL;DR: In this article, the authors propose a multiprocessor computing system in which a plurality of processors are connected to each other through a system bus, each processor comprising a processing unit, a local memory and an interface unit, and in which the processing unit of any processor has access to both its own local memory and the local memory of any other processor (through the interface unit and the system bus), so that information identified by a destination code as global data is written concurrently into all the local memories.
Abstract: Multiprocessor computing system wherein a plurality of processors are connected to each other through a system bus, each processor comprising a processing unit, a local memory and an interface unit, and wherein the processing unit of any processor has access to both its own local memory and the local memory of any other processor (through the interface unit and the system bus) for concurrently writing into all the local memories information identified by a destination code as global data to be concurrently written in all local memories.
TL;DR: A parallel computer is described in which each processor element has a sharable distributed memory and can send an operation request message to another processor element, which accepts the message, temporarily stops any other operation in accordance with its content, and executes the requested operation, such as a recursive defining operation on data held in its memory module.
Abstract: A parallel computer has an operation request function and a plurality of processor elements. Each processor element has a sharable distributed memory for holding data, and is interconnected to a network to permit communication. Each processor element comprises a request sent unit for sending an operation request message for causing another processor element connected to a memory module to execute a recursive defining operation. The memory module stores data to be recursively defined. Each processor element further comprises an operation request execution element for accepting a message from another processor, temporarily stopping any other operation of the processor element in accordance with the content of the message, and executing the requested operation. Registers are also used for executing the operation requested by the other processor in addition to the general purpose registers and floating point registers.
TL;DR: The Psyche project at the University of Rochester aims to develop a high-performance operating system to support a wide variety of models for parallel programming, predicated on the conviction that no one model of process state or style of communication will prove appropriate for all applications, but that shared-memory multiprocessors can and should support all models.
Abstract: The Psyche project at the University of Rochester aims to develop a high-performance operating system to support a wide variety of models for parallel programming. It is predicated on the conviction that no one model of process state or style of communication will prove appropriate for all applications, but that shared-memory multiprocessors (particularly the scalable "NUMA" variety) can and should support all models. Conventional approaches, such as shared memory or message passing, can be regarded as points on a continuum that reflects the degree of sharing between processes. Psyche facilitates dynamic sharing by providing a user interface based on passive data abstractions in a uniform virtual address space. It ensures that users pay for protection only when it is required by permitting lazy evaluation of protection policies implemented with keys and access lists. The data abstractions define conventions for sharing the uniform address space; the tradeoff between protection and performance determines the degree to which those conventions are enforced. In the absence of protection boundaries, access to a shared abstraction can be as efficient as a procedure call or a pointer dereference.
TL;DR: In this article, the process manager assigns processes to processors and satisfies their initial memory requirements through global memory allocations, while each kernel satisfies dynamic allocation requests from uncommitted memory and returns to uncommitted memory both memory that is dynamically requested to be deallocated and memory of terminating processes.
Abstract: In a multiprocessor system (FIG. 1) wherein each adjunct processor has its own, non-shared, memory (22), the non-shared memory of each adjunct processor (11-12) comprises global memory (42) and local memory (41). All global memory of all adjunct processors is managed by a single process manager (30) of a system-wide host processor (10). Each processor's local memory is managed by its operating system kernel (31). Local memory comprises uncommitted memory (45) not allocated to any process and committed memory (46) allocated to processes. The process manager assigns processes to processors and satisfies their initial memory requirements through global memory allocations. Each kernel satisfies processes' dynamic memory allocation requests from uncommitted memory, and deallocates to uncommitted memory both memory that is dynamically requested to be deallocated and memory of terminating processes. Each processor's kernel and the process manager cooperate to transfer memory between global memory and uncommitted memory to keep the amount of uncommitted memory within a predetermined range.
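The cooperation between a kernel and the process manager can be sketched as a small bookkeeping model. The class names, the page-count granularity, the `low`/`high` bounds, and the rebalancing policy (top up to `low`, trim down to `high`) are hypothetical choices for illustration; the abstract states only that uncommitted memory is kept within a predetermined range, and the sketch covers dynamic requests only, not initial process placement.

```python
class ProcessManager:
    """Owns the global memory pool (counted in pages for simplicity)."""

    def __init__(self, global_pages):
        self.global_free = global_pages

    def grant(self, pages):
        # Hand over as many pages as are available, up to the request.
        granted = min(pages, self.global_free)
        self.global_free -= granted
        return granted

    def take_back(self, pages):
        self.global_free += pages


class Kernel:
    """Manages one processor's local memory: committed plus uncommitted pages."""

    def __init__(self, manager, low, high):
        self.manager = manager
        self.low, self.high = low, high  # assumed bounds on uncommitted memory
        self.uncommitted = 0
        self.committed = 0
        self._rebalance()

    def allocate(self, pages):
        # Dynamic requests are satisfied from uncommitted memory.
        shortfall = pages - self.uncommitted
        if shortfall > 0:
            self.uncommitted += self.manager.grant(shortfall)
        if pages > self.uncommitted:
            raise MemoryError("not enough memory for request")
        self.uncommitted -= pages
        self.committed += pages
        self._rebalance()

    def free(self, pages):
        # Freed memory returns to the uncommitted pool.
        if pages > self.committed:
            raise ValueError("freeing more than is committed")
        self.committed -= pages
        self.uncommitted += pages
        self._rebalance()

    def _rebalance(self):
        # Transfer pages to or from global memory to stay within [low, high].
        if self.uncommitted < self.low:
            self.uncommitted += self.manager.grant(self.low - self.uncommitted)
        elif self.uncommitted > self.high:
            surplus = self.uncommitted - self.high
            self.uncommitted -= surplus
            self.manager.take_back(surplus)
```

The design point this illustrates is that most allocation traffic stays inside the kernel; the process manager is contacted only when the uncommitted pool drifts outside its range.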
TL;DR: Dino, a language consisting of high-level modifications to C for numerical programs on distributed-memory multiprocessors, raises interprocess communication and process control to a higher and more natural level than using messages by allowing the user to define a virtual machine onto which data structures can be distributed.
Abstract: Dino is a new language, consisting of high level modifications to C, for writing numerical programs on distributed memory multiprocessors. Our intent is to raise interprocess communication and process control to a higher and more natural level than using messages. We achieve this by allowing the user to define a virtual machine onto which data structures can be distributed. Interprocess communication is implicitly invoked by reading and writing the distributed data. Parallelism is achieved by making concurrent procedure calls. This paper provides a summary of the syntax and semantics of Dino, and illustrates its features through several sample programs. We also briefly discuss a prototype of the language we have developed using C++.
TL;DR: In this article, a crossbar switch is constructed in monolithic integrated circuit form together with respective memory cells controlling each of the component crosspoint switches in the crossbar switch, which reduces the number of bits which must be provided in parallel to the integrated circuit for controlling the crosspoint switches.
Abstract: A crossbar switch is constructed in monolithic integrated circuit form together with respective memory cells controlling each of the component crosspoint switches in the crossbar switch. The memory cells permit control signals for the crosspoint switches to be supplied serially to the monolithic integrated circuit and thus permit those control signals to be supplied in coded form as orthogonal cross addressing for the memory cells. This reduces the number of bits which must be provided in parallel to the integrated circuit for controlling the crosspoint switches. In preferred embodiments of the crossbar switch, provision is made for operation as a corner-turn array for rotating bit matrices and for faster operation as a barrel shifter.
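A software model can make the role of the per-crosspoint memory cells concrete. In the sketch below each cell is addressed by an (output, input) pair, which stands in for the row and column addressing mentioned above, and a barrel shift is obtained by loading a rotated diagonal of cells. The class, the method names, and the rotation direction are illustrative assumptions; the serial loading protocol and the corner-turn mode are not modeled.

```python
class Crossbar:
    """An n-by-n crossbar whose crosspoints are controlled by stored memory cells."""

    def __init__(self, n):
        self.n = n
        # cells[out][inp] is True when input `inp` is connected to output `out`.
        self.cells = [[False] * n for _ in range(n)]

    def clear(self):
        for row in self.cells:
            for inp in range(self.n):
                row[inp] = False

    def set_cell(self, out, inp, on=True):
        # Addressing one cell takes a row address and a column address,
        # rather than one parallel control line per crosspoint.
        self.cells[out][inp] = on

    def route(self, inputs):
        """Return the value seen at each output (None if not driven by exactly one input)."""
        outputs = []
        for out in range(self.n):
            sources = [inputs[inp] for inp in range(self.n) if self.cells[out][inp]]
            outputs.append(sources[0] if len(sources) == 1 else None)
        return outputs


def configure_barrel_shift(xbar, k):
    """Load the cells so that output i receives input (i + k) mod n."""
    xbar.clear()
    for out in range(xbar.n):
        xbar.set_cell(out, (out + k) % xbar.n)
```

For example, loading a shift of 1 into a 4-by-4 crossbar and routing `['a', 'b', 'c', 'd']` yields `['b', 'c', 'd', 'a']`.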
TL;DR: The usual taxonomy of MIMD multiprocessor architectures, which classifies them as shared memory, message passing, or "hybrid" architectures, is shown to be incomplete, and an alternative complete taxonomy is suggested.
Abstract: MIMD multiprocessor architectures have been classified as shared memory, message passing, or "hybrid" architectures. This taxonomy is shown to be incomplete, and an alternative complete taxonomy is suggested. Examples of each class of the taxonomy are discussed, along with general attributes of the classes.