TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.
Abstract: The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.
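A minimal sketch of the reordering idea follows. It is not the paper's scheduler: it models a serial DRAM (no overlap between banks), and the cycle costs, the `window` parameter, and all names are assumptions chosen for illustration. The policy shown picks the oldest pending reference that hits an already open row, and otherwise falls back to the oldest reference.

```python
from collections import namedtuple

# A memory reference addressed by bank, row, and column.
Ref = namedtuple("Ref", "bank row col")

ROW_HIT_COST = 1   # assumed: column access to an already open row
ROW_MISS_COST = 7  # assumed: precharge + activate + column access


def access_cost(open_rows, ref):
    """Return the assumed cost of `ref` and record its row as open."""
    hit = open_rows.get(ref.bank) == ref.row
    open_rows[ref.bank] = ref.row
    return ROW_HIT_COST if hit else ROW_MISS_COST


def in_order_cycles(refs):
    """Total cost when references are served strictly in arrival order."""
    open_rows = {}
    return sum(access_cost(open_rows, r) for r in refs)


def row_hit_first_cycles(refs, window=4):
    """Total cost when the oldest row hit within `window` is served first."""
    open_rows = {}
    pending = list(refs)
    total = 0
    while pending:
        candidates = pending[:window]
        # Index of the oldest candidate that hits an open row, else 0.
        pick = next(
            (i for i, r in enumerate(candidates)
             if open_rows.get(r.bank) == r.row),
            0,
        )
        total += access_cost(open_rows, pending.pop(pick))
    return total
```

For a trace that alternates between rows 0 and 1 of a single bank (four references), the in-order schedule costs 28 assumed cycles, while the reordered schedule costs 16, because each row is opened once and then reused. With `window=1` the reordering degenerates to the in-order schedule.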
TL;DR: A new parallel distributed memory multifrontal approach is described; to handle numerical pivoting efficiently, a parallel asynchronous algorithm with dynamic scheduling of the computing tasks has been developed.
TL;DR: Recently integrated features of MUMPS are reported on and the present performance of the solver on an SGI Origin 2000 and a CRAY T3E is illustrated.
Abstract: MUMPS is a public domain software package for the multifrontal solution of large sparse linear systems on distributed memory computers. The matrices can be symmetric positive definite, general symmetric, or unsymmetric, and possibly rank deficient. MUMPS exploits parallelism coming from the sparsity in the matrix and parallelism available for dense matrices. Additionally, large computational tasks are divided into smaller subtasks to enhance parallelism. MUMPS uses a distributed dynamic scheduling technique that allows numerical pivoting and the migration of computational tasks to lightly loaded processors. Asynchronous communication is used to overlap communication with computation. In this paper, we report on recently integrated features and illustrate the present performance of the solver on an SGI Origin 2000 and a CRAY T3E.
TL;DR: Modern molecular dynamics methods are reviewed and their application to quantum many-body systems and electronic structure calculations is described, and it is shown how modern object-oriented programming paradigms can be employed to implement multilevel parallel algorithms in a large computational package rapidly and efficiently.
TL;DR: This volume covers large-scale parallel and distributed data mining, including a high-performance implementation of the Data Space Transfer Protocol (DSTP), parallel algorithms for mining associations and sequences, and efficient parallel classification using dimensional aggregates.
Abstract: Large-Scale Parallel Data Mining.- Parallel and Distributed Data Mining: An Introduction.- Mining Frameworks.- The Integrated Delivery of Large-Scale Data Mining: The ACSys Data Mining Project.- A High Performance Implementation of the Data Space Transfer Protocol (DSTP).- Active Mining in a Distributed Setting.- Associations and Sequences.- Efficient Parallel Algorithms for Mining Associations.- Parallel Branch-and-Bound Graph Search for Correlated Association Rules.- Parallel Generalized Association Rule Mining on Large Scale PC Cluster.- Parallel Sequence Mining on Shared-Memory Machines.- Classification.- Parallel Predictor Generation.- Efficient Parallel Classification Using Dimensional Aggregates.- Learning Rules from Distributed Data.- Clustering.- Collective, Hierarchical Clustering from Distributed, Heterogeneous Data.- A Data-Clustering Algorithm on Distributed Memory Multiprocessors.
TL;DR: An intelligent memory system, method, and computer program product are described for enabling stand-alone or distributed client-server software applications to operate at maximum speeds on a personal computer and the like.
Abstract: An intelligent memory system, method, and computer program product for enabling stand-alone or distributed client-server software applications to operate at maximum speeds on a personal computer and the like. An intelligent memory (IM) allows the acceleration of computer software processes through process virtual memory, application optimization, multiprocessor control, and system strategies. The IM includes both control logic and memory. The control logic uses an application database and system database to determine a set of modifications to the computer, application, and/or operating system, while the memory stores the application and allows the control logic to implement the set of modifications. A remote performance management system is also described which allows an IM service provider to supply the infrastructure to clients (e.g., e-businesses and the like who run World Wide Web servers) to facilitate and accelerate their content offerings to end user clients (i.e., consumers).
TL;DR: The point-centered domain decomposition algorithm is implemented in the new program Opal p using a standard message passing library, so that it runs on both shared memory and massively parallel distributed memory computers.
TL;DR: The Distributed Data Interface (DDI), which permits storage of large data arrays in the aggregate memory of distributed memory, message passing computer systems, is described, and the good performance of an MP2 program using DDI is demonstrated.
TL;DR: A network switch apparatus (10), components for such an apparatus, and methods of operating such an apparatus are described, in which data flow handling and flexibility are enhanced by the cooperation of a plurality of memory elements and a plurality of interface processors formed on a semiconductor substrate (10).
Abstract: A network switch apparatus (10), components for such an apparatus, and methods of operating such an apparatus in which data flow handling and flexibility is enhanced by the cooperation of a plurality of memory elements and a plurality of interface processors formed on a semiconductor substrate (10). The memory elements and interface processors together form a network processor (10) capable of cooperating with other elements in executing instructions directing the flow of data in a network. Access to the memory elements is controlled in a particular manner and under operative rules which provide controlled multiple accesses of the plurality of memory elements by a plurality of processors.
TL;DR: This paper addresses the problem of distributing model checking of timed automata and shows how in the timed case the search order of the state space is crucial for the effectiveness and scalability of the exploration.
Abstract: In this paper we address the problem of distributing model checking of timed automata. We demonstrate through four real life examples that the combined processing and memory resources of multi-processor computers can be effectively utilized. The approach assumes a distributed memory model and is applied to both a network of workstations and a symmetric multiprocessor machine. However, certain unexpected phenomena have to be taken into account. We show how in the timed case the search order of the state space is crucial for the effectiveness and scalability of the exploration. An effective heuristic to counter the effect of the search order is provided. Some of the results open up for improvements in the single processor case.
TL;DR: In this paper, a single integrated circuit includes first and second data processors operating on different instruction sets independently operating on disjoint programs and data, and a shared data transfer controller and shared memory divided into plural independently accessible memory banks.
Abstract: A single integrated circuit includes first and second data processors operating on different instruction sets independently operating on disjoint programs and data. The single integrated circuit preferably includes an external interface, a shared data transfer controller and shared memory divided into plural independently accessible memory banks. The two data processors are preferably a digital signal processor (DSP) and a reduced instruction set computer (RISC) processor. The DSP and RISC processors are suitably programmed to perform differing aspects of computer image processing.
TL;DR: This paper presents the first system that implements OpenMP on a network of shared-memory multiprocessors, and shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions.
TL;DR: This work presents one-sweep parallel algorithms for the inversion of general and symmetric positive definite matrices that feature simple programming and performance optimization while maintaining the same arithmetic cost and numerical properties of conventional inversion algorithms.
Abstract: We present one-sweep parallel algorithms for the inversion of general and symmetric positive definite matrices. The algorithms feature simple programming and performance optimization while maintaining the same arithmetic cost and numerical properties of conventional inversion algorithms. Our experiments on a Cray T3E-600 and a Beowulf cluster demonstrate high performance of implementations for distributed memory parallel computers.
TL;DR: This article shows that for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method, and reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver.
Abstract: Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (ΨNKS) algorithmic framework is presented as a widely applicable answer. This article shows that for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of ΨNKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. The authors therefore distill several recommendations from their experience and reading of the literature on various algorithmic components of ΨNKS, and they describe a freely available MPI-based portable parallel software implementation of the solver employed here.
TL;DR: A reader-writer lock is proposed that reduces writer and reader overhead by employing lock structures shared among groups of processors that have lower latencies: a shared per-group counter reduces the reader overhead of a single system-wide counter, and a shared per-group flag reduces the writer overhead of a separate flag per processor.
Abstract: A reader-writer lock minimizes writer and reader overhead by employing lock structures that are shared among groups of processors that have lower latencies. In the illustrated multiprocessor system having a non-uniform memory access (NUMA) architecture, each processor node has a lock structure comprised of a shared counter and associated flag for each CPU group. During a read, the counter can be changed only by processors within a CPU group performing a read. This reduces the reader overhead that otherwise would exist if all processors in the system shared a single counter. During a write, the shared flag can be changed by a process running on any processor in the system. The processors in a CPU group are notified of the write through the shared flag. This reduces the writer overhead that otherwise would exist if each processor in the system had a separate flag. The number of CPUs per group can be varied to optimize performance of the lock in different multiprocessor systems.
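The per-group counter and flag described above can be sketched as follows. This is an illustration of the logic only, under stated assumptions: Python threads stand in for CPUs, the caller passes its `group_id` explicitly, and the class and method names are hypothetical. It does not model NUMA latencies and is not the patented implementation.

```python
import threading


class _Group:
    """Lock state shared by one CPU group: a reader counter and a writer flag."""

    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0
        self.writer_flag = False


class GroupedRWLock:
    def __init__(self, num_groups):
        self._groups = [_Group() for _ in range(num_groups)]
        self._writer_mutex = threading.Lock()  # serializes writers

    def acquire_read(self, group_id):
        g = self._groups[group_id]
        with g.cond:
            # Readers touch only their own group's counter and flag.
            while g.writer_flag:
                g.cond.wait()
            g.readers += 1

    def release_read(self, group_id):
        g = self._groups[group_id]
        with g.cond:
            g.readers -= 1
            if g.readers == 0:
                g.cond.notify_all()  # wake a writer waiting for drain

    def acquire_write(self):
        self._writer_mutex.acquire()
        for g in self._groups:
            with g.cond:
                # Setting the flag blocks new readers in this group.
                g.writer_flag = True
                while g.readers > 0:
                    g.cond.wait()

    def release_write(self):
        for g in self._groups:
            with g.cond:
                g.writer_flag = False
                g.cond.notify_all()  # wake readers blocked on the flag
        self._writer_mutex.release()
```

A reader in group 0 would call `lock.acquire_read(0)` and later `lock.release_read(0)`; a writer calls `lock.acquire_write()` and visits every group, so the number of groups trades reader contention against writer work, in line with the abstract's remark that the group size can be tuned.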
TL;DR: Paje, an interactive visualization tool for displaying the execution of parallel applications where a potentially large number of communicating threads of various life-times execute on each node of a distributed memory parallel system, is described.
Abstract: This paper describes Paje, an interactive visualization tool for displaying the execution of parallel applications where a potentially large number of communicating threads of various life-times execute on each node of a distributed memory parallel system. Paje is capable of representing a wide variety of interactions between threads. The main characteristics of Paje, interactivity and scalability, are exemplified by the performance tuning of a molecular dynamics application. In order to be easily extensible, the architecture of the system was based on components which are connected in a data flow graph to produce a given visualization tool. Innovative components were designed, in addition to “classical” components existing in similar visualization systems, to support scalability and interactivity.
TL;DR: A survey of the buffer management methods proposed for shared-memory packet switches is presented; their strengths and weaknesses are examined, and their performance is evaluated using computer simulations.
Abstract: In the shared-memory switch architecture, output links share a single large memory, in which logical FIFO queues are assigned to each link. Although memory sharing can provide a better queuing performance than physically separated buffers, it requires carefully designed buffer management schemes for a fair and robust operation. This article presents a survey of the buffer management methods that have been proposed for shared-memory packet switches. Several buffer management policies are described, and their strengths and weaknesses are examined. The performances of various policies are evaluated using computer simulations. A comparison of the most important schemes is obtained with the help of the simulation results and the results provided in the literature. The survey concludes with a discussion of the possible future research areas related to shared-memory ATM switches.
TL;DR: An enhanced memory algorithmic processor (MAP) architecture for multiprocessor computer systems is described, comprising an assembly that may include field programmable gate arrays (FPGAs) functioning as the memory algorithmic processors.
Abstract: An enhanced memory algorithmic processor (“MAP”) architecture for multiprocessor computer systems comprises an assembly that may comprise, for example, field programmable gate arrays (“FPGAs”) functioning as the memory algorithmic processors. The MAP elements may further include an operand storage, intelligent address generation, on board function libraries, result storage and multiple input/output (“I/O”) ports. The MAP elements are intended to augment, not necessarily replace, the high performance microprocessors in the system and, in a particular embodiment of the present invention, they may be connected through the memory subsystem of the computer system resulting in it being very tightly coupled to the system as well as being globally accessible from any processor in a multiprocessor computer system.
TL;DR: This paper introduces a new interconnection network for large-scale distributed memory multiprocessors called dual-cube, which mitigates the problem of an increasing number of links in the large-scale hypercube network while keeping most of the topological properties of the hypercube network.
Abstract: The binary hypercube, or n-cube, has been widely used as the interconnection network in parallel computers. However, the major drawback of the hypercube is the increase in the number of communication links for each node with the increase in the total number of nodes in the system. This paper introduces a new interconnection network for large-scale distributed memory multiprocessors called dual-cube. This network mitigates the problem of an increasing number of links in the large-scale hypercube network while keeping most of the topological properties of the hypercube network. We investigate the topological properties of the dual-cube, compare them with other hypercube-like networks, and establish the basic routing and broadcasting algorithms for dual-cubes.
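The hypercube drawback mentioned above can be made concrete with a short sketch. It covers only the standard n-cube, in which two nodes are linked when their binary addresses differ in exactly one bit; the function names are hypothetical, and the dual-cube construction itself is not reproduced here.

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in an n-cube: flip each of the n address bits."""
    return [node ^ (1 << bit) for bit in range(n)]


def links_per_node(n):
    """Each node of an n-cube (2**n nodes) has n links."""
    return len(hypercube_neighbors(0, n))


def total_links(n):
    """Every link joins two nodes, so an n-cube has n * 2**(n-1) links."""
    return n * 2 ** (n - 1)
```

Doubling the number of nodes adds one link to every node, so a 1024-node system (n = 10) needs 10 links per node, which is the growth the dual-cube is designed to reduce.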
TL;DR: In this article, the authors describe a software architecture that allows image processing researchers to develop parallel applications in a transparent manner, where all parallelism is completely hidden from the user, and the main component is an extensive library of data parallel low level image operations capable of running on homogeneous distributed memory MIMD-style multicomputers.
Abstract: This paper describes a software architecture that allows image processing researchers to develop parallel applications in a transparent manner. The architecture's main component is an extensive library of data parallel low level image operations capable of running on homogeneous distributed memory MIMD-style multicomputers. Since the library has an application programming interface identical to that of an existing sequential library, all parallelism is completely hidden from the user. The first part of the paper discusses implementation aspects of the parallel library, and shows how sequential as well as parallel operations are implemented on the basis of so-called parallelizable patterns. A library built in this manner is easily maintainable, as extensive code redundancy is avoided. The second part of the paper describes the application of performance models to ensure efficiency of execution on all target platforms. Experiments show that for a realistic application performance predictions are highly accurate. These results indicate that the core of the architecture forms a powerful basis for automatic parallelization and optimization of a wide range of imaging software.
TL;DR: In this article, a control system using microprocessors which communicate through a Local Area Network (private LAN) to control operation of both processors and input and output subsystems (IO system) of a multiprocessor computer system is presented.
Abstract: The invention is a control system using microprocessors which communicate through a Local Area Network (private LAN) to control operation of both processors and input and output subsystems (IO system) of a multiprocessor computer system. The processors each have memory associated therewith, and each processor has an IO system comprising a plurality of busses such as PCI busses, associated therewith. The processors are cabled together in a mesh arrangement so that messages can be transferred between any of the processors and delivered to memory associated with the destination processor, or delivered to an IO system associated with the destination processor, etc. The microprocessors are powered on when power is applied to the chassis of the multiprocessor system, and the microprocessors then control the processors of the multiprocessor system, including applying power to the processors, forming hard partitions containing selected processors, computing routes from a processor to a memory associated with any processor for read and write transactions, computing routes to IO subsystems associated with any processor of the hard partition, forming partition boundaries so that processors in one hard partition cannot read and write to memory or IO systems associated with processors in another hard partition, forming soft partitions of processors, controlling boot-up of operating systems executing on the processors of the multiprocessor computer system, removing power from a failed processor, providing power to a repaired processor, etc.
TL;DR: It is shown that perhaps the major explanation for why optical technologies have nearly been unable to penetrate into computers is that OI's generally do not shorten the memory-access time, which is the most critical issue for today's stored-program machines.
TL;DR: The conventional memory organization of fast Fourier transform (FFT) processors is based on Cohen's (1976) scheme, but the new scheme reduces the hardware complexity of address generation by about 50% while improving the memory access speed.
Abstract: The conventional memory organization of fast Fourier transform (FFT) processors is based on Cohen's (1976) scheme. Compared with this scheme, our scheme reduces the hardware complexity of address generation by about 50% while improving the memory access speed. Much power consumption in memory is saved since only half of the memory is activated during memory access, and the number of coefficient accesses is reduced to a minimum by using a new ordering of FFT butterflies. Therefore, the new scheme is a superior solution to constructing high-performance FFT processors.
TL;DR: This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems and concludes that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery.
Abstract: Checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in this paper and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extends cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is however felt that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery.
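The basic definition of checkpointing given above (saving status information so that processing can resume later) can be illustrated with a single-process sketch. It is deliberately not one of the surveyed distributed algorithms: there is no coordination between processes, the file stands in for stable storage, and the function names, the `interval` parameter, and the simulated failure are assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to a temp file, then rename it over `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n, interval=10, fail_at=None):
    """Sum 0..n-1, checkpointing every `interval` steps.

    If `fail_at` is given, a failure is simulated when that step is reached.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run fails partway, calling `run` again with the same path rolls back to the last saved checkpoint and repeats only the steps after it, rather than restarting from zero.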
TL;DR: In this paper, the authors present a system and method for the high performance execution of locked memory instructions in a system with distributed memory and a restrictive memory model, which includes decoding a locked-memory instruction, obtaining exclusive ownership of a cacheline to be used by a load-lock operation, setting a bit to indicate the loadlock operation's ownership of the cacheline, and activating a snoop checking process.
Abstract: The present invention relates to locked memory instructions, and more specifically to a system and method for the high performance execution of locked memory instructions in a system with distributed memory and a restrictive memory model. In accordance with an embodiment of the present invention, a method for executing locked-memory instructions includes decoding a locked-memory instruction, obtaining exclusive ownership of a cacheline to be used by a load-lock operation, setting a bit to indicate the load-lock operation's ownership of the cacheline, and activating a snoop checking process. The method also includes modifying a load data value and storing the modified load data value. The method further includes determining that the cacheline is still exclusively owned, storing the load data value, determining that the cacheline is unsnooped, merging the modified load data value with the load data value, and releasing the locked-memory instruction to be retired.
TL;DR: A memory-aware compiler approach is described that exploits efficient memory access modes by extracting accurate timing information, allowing the compiler's scheduler to perform global code reordering to better hide the latency of memory operations.
Abstract: Memory delays represent a major bottleneck in embedded systems performance. Newer memory modules exhibiting efficient access modes (e.g., page-, burst-mode) partly alleviate this bottleneck. However, such features cannot be efficiently exploited in processor-based embedded systems without memory-aware compiler support. We describe a memory-aware compiler approach that exploits such efficient memory access modes by extracting accurate timing information, allowing the compiler's scheduler to perform global code reordering to better hide the latency of memory operations. Our memory-aware compiler scheduled several benchmarks on the TI C6201 processor architecture interfaced with a 2-bank synchronous DRAM and generated average improvements of 24% over the best possible schedule using a traditional (memory-transparent) optimizing compiler, demonstrating the utility of our memory-aware compilation approach.
TL;DR: In this article, a support processor handles non-immediate problems and allows resetting of memory locations formerly owned by failed units, and continuous processing by non-failing units is allowable.
Abstract: Poisoning specific memory locations when part of a multiprocessor computer system becomes faulty makes it possible to isolate specific data owned by individual failing units, even in a shared memory area. Non-failing units can also continue processing. A support processor handles non-immediate problems and allows resetting of memory locations formerly owned by failed units.
TL;DR: Disclosed is a fully-interconnected, heterogeneous, multiprocessor data processing system with a plurality of processors, each having unique characteristics including, for example, different processing speeds (frequency) and different cache topologies (sizes, levels, etc.).
Abstract: Disclosed is a fully-interconnected, heterogeneous, multiprocessor data processing system. The data processing system topology has a plurality of processors each having unique characteristics including, for example, different processing speeds (frequency) and different cache topologies (sizes, levels, etc.). Second and third generation heterogeneous processors are connected to a specialized set of pins, connected to the system bus. The processors are interconnected and communicate via an enhanced communication protocol and specialized SMP bus topology that supports the heterogeneous topology and enables newer processors to support full downward compatibility to the previous generation processors. Various processor functions are modified to support operations on either of the processors depending on which processor is assigned which operations. The enhanced communication protocol, operating system, and other processor logic enable the heterogeneous multiprocessor data processing system to operate as a symmetric multiprocessor system.
TL;DR: A set of preconditioners for the Schur complement domain decomposition method is presented; they implement a global coupling mechanism, through coarse-space components, similar to the one proposed by Bramble, Pasciak, and Schatz.
Abstract: The solution of elliptic problems is challenging on parallel distributed memory computers since their Green's functions are global. To address this issue, we present a set of preconditioners for the Schur complement domain decomposition method. They implement a global coupling mechanism, through coarse-space components, similar to the one proposed in [Bramble, Pasciak, and Schatz, Math. Comp., 47 (1986), pp. 103--134]. The definition of the coarse-space components is algebraic; they are defined using the mesh partitioning information and simple interpolation operators. These preconditioners are implemented on distributed memory computers without introducing any new global synchronization in the preconditioned conjugate gradient iteration. The numerical and parallel scalability of those preconditioners are illustrated on two-dimensional model examples that have anisotropy and/or discontinuity phenomena.
TL;DR: This paper describes experiences in developing high-performance code for astrophysical N-body simulations and uses a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures.
Abstract: This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational work-load among processors, these issues pose interesting challenges and tradeoffs for high-performance implementation. Our implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally justified. The code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM-5. The overall performance on instances with 10 million bodies is typically over 48 percent of the peak machine rate, which compares favorably with other approaches.