TL;DR: In this article, a technique for improving performance in a multi-processor system by reducing access latency by correlating processor, node and memory allocation is presented, where a Process/Thread Scheduler is modified such that system mapping and node proximity tables may be referenced to help determine processor assignments for ready-to-run processes/threads.
Abstract: A technique for improving performance in a multi-processor system by reducing access latency by correlating processor, node and memory allocation. Specifically, a Process/Thread Scheduler is modified such that system mapping and node proximity tables may be referenced to help determine processor assignments for ready-to-run processes/threads. Processors are chosen to minimize access latency. Further, the Page Fault Handler is modified such that free memory pages are assigned to a process based partially on the proximity of the memory with respect to the processor requesting memory allocation.
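As a rough illustration of the idea in this abstract, the sketch below chooses a processor and a free page by consulting a system mapping table and a node proximity table. The table contents, the names `PROCESSOR_NODE`, `PROXIMITY`, `pick_processor`, and `pick_free_page`, and the "lowest latency wins" policy are all assumptions made for the example; the abstract does not specify them.

```python
from typing import Dict, List, Optional

# Hypothetical system mapping table: processor id -> node id.
PROCESSOR_NODE: Dict[int, int] = {0: 0, 1: 0, 2: 1, 3: 1}

# Hypothetical node proximity table: PROXIMITY[a][b] is the relative
# latency for a processor on node a to reach memory on node b.
PROXIMITY: List[List[int]] = [
    [10, 20],
    [20, 10],
]


def pick_processor(idle: List[int], memory_node: int) -> Optional[int]:
    """Return the idle processor closest to the node holding the thread's memory."""
    if not idle:
        return None
    # min() keeps the first candidate on ties, so list order breaks ties.
    return min(idle, key=lambda cpu: PROXIMITY[PROCESSOR_NODE[cpu]][memory_node])


def pick_free_page(free_pages: Dict[int, List[int]], cpu: int) -> Optional[int]:
    """Return a free page from the node nearest to the faulting processor."""
    home = PROCESSOR_NODE[cpu]
    # Visit nodes in order of increasing latency from the faulting processor.
    for node in sorted(free_pages, key=lambda n: PROXIMITY[home][n]):
        if free_pages[node]:
            return free_pages[node].pop()
    return None
```

A real scheduler would weigh proximity against load and priority; this sketch isolates only the proximity lookup.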
TL;DR: In this article, a portable communication device may have multiple processors and a memory, and some portions of the memory may only be accessible by one of the processors, while others are accessible by all the processors.
Abstract: Briefly, in accordance with one embodiment of the invention, a portable communication device may have multiple processors and a memory. Portions of the memory may only be accessible by one of the processors.
TL;DR: An optimization algorithm is proposed that targets the reduction of extra off-chip memory accesses caused by inter-processor communication by increasing the application-wide reuse of data that resides in the scratch-pad memories of processors.
Abstract: In this paper, we present a compiler strategy to optimize data accesses in regular array-intensive applications running on embedded multiprocessor environments. Specifically, we propose an optimization algorithm that targets the reduction of extra off-chip memory accesses caused by inter-processor communication. This is achieved by increasing the application-wide reuse of data that resides in the scratch-pad memories of processors. Our experimental results obtained on four array-intensive image processing applications indicate that exploiting inter-processor data sharing can reduce the energy-delay product by as much as 33.8% (and 24.3% on average) on a four-processor embedded system. The results also show that the proposed strategy is robust in the sense that it gives consistently good results over a wide range of several architectural parameters.
TL;DR: This book gives readers a fundamental understanding of parallel processing application and system development and provides them with the level of understanding they need to evaluate and select the products.
Abstract: From the Publisher:
Rapid changes in the field of parallel processing make this book especially important for professionals who are faced daily with new products, and it provides them with the level of understanding they need to evaluate and select those products. It gives readers a fundamental understanding of parallel processing application and system development. Chapter topics include parallel machines and computations, potential for parallel computations, vector algorithms and architectures, MIMD computers and multiprocessors, distributed memory processors, interconnection networks, data dependence and parallelism, implementing synchronization and data sharing, parallel processor performance, temporal behavior of parallel programs, and parallel I/O. For computational scientists, software engineers, computer architects, and computer engineers.
TL;DR: The methods used in MPI‐CHECK 2.0 are presented to detect many situations where actual and potential deadlocks occur when using blocking and non‐blocking point‐to‐point routines as well as when using collective routines.
Abstract: The Message-Passing Interface (MPI) is commonly used to write parallel programs for distributed memory parallel computers. MPI-CHECK is a tool developed to aid in the debugging of MPI programs that are written in free or fixed format Fortran 90 and Fortran 77. This paper presents the methods used in MPI-CHECK 2.0 to detect many situations where actual and potential deadlocks occur when using blocking and non-blocking point-to-point routines as well as when using collective routines. Copyright 2002 John Wiley & Sons, Ltd.
TL;DR: Results indicate that the core of the architecture forms a powerful basis for automatic parallelization and optimization of a wide range of imaging software.
Abstract: This paper describes a software architecture that allows image processing researchers to develop parallel applications in a transparent manner. The architecture's main component is an extensive library of data parallel low level image operations capable of running on homogeneous distributed memory MIMD-style multicomputers. Since the library has an application programming interface identical to that of an existing sequential library, all parallelism is completely hidden from the user. The first part of the paper discusses implementation aspects of the parallel library, and shows how sequential as well as parallel operations are implemented on the basis of so-called parallelizable patterns. A library built in this manner is easily maintainable, as extensive code redundancy is avoided. The second part of the paper describes the application of performance models to ensure efficiency of execution on all target platforms. Experiments show that for a realistic application performance predictions are highly accurate. These results indicate that the core of the architecture forms a powerful basis for automatic parallelization and optimization of a wide range of imaging software.
TL;DR: In this paper, the authors describe a method and system for an interface for integrating reconfigurable processors into a general purpose computing system, which includes a command processor, a command list memory, various registers, a direct memory access engine, a translation look-aside buffer, a dedicated section of common memory, and a dedicated memory.
Abstract: The present invention describes a method and system for an interface for integrating reconfigurable processors into a general purpose computing system. In particular, the system resides in a computer system containing standard instruction processors, as well as reconfigurable processors. The interface includes a command processor, a command list memory, various registers, a direct memory access engine, a translation look-aside buffer, a dedicated section of common memory, and a dedicated memory (Figure 2, 12, 40, 42, 43, 44, 45, 46, 47, 48, 52, 54, 60, 62(1), 62(2), 64). The interface is controlled via commands from a command list that is created during compilation of a user application, or various direct commands.
TL;DR: In this article, a system and method for verifying a memory consistency model for a shared memory multiprocessor computer system generates random instructions to run on the processors, saves the results of the running of the instructions, and analyzes the results to detect a memory subsystem error if the results fall outside of the space of possible outcomes consistent with the memory consistency model.
Abstract: A system and method for verifying a memory consistency model for a shared memory multiprocessor computer system generates random instructions to run on the processors, saves the results of the running of the instructions, and analyzes the results to detect a memory subsystem error if the results fall outside of the space of possible outcomes consistent with the memory consistency model. A precedence relationship of the results is determined by uniquely identifying results of a store location with each result distinct to allow association of a read result value to the instruction that created the read result value. A precedence graph with static, direct and derived edges identifies errors when a cycle is detected that indicates results that are inconsistent with memory consistency model rules.
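The final check described in this abstract, finding a cycle in a precedence graph, can be sketched as a depth-first search. How the static, direct, and derived edges are produced is not shown here; the function name `has_cycle` and the edge representation as `(before, after)` pairs are assumptions made for the example.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

_DONE = object()  # sentinel marking an exhausted successor iterator


def has_cycle(edges: Iterable[Tuple[Hashable, Hashable]]) -> bool:
    """Return True if the directed precedence graph contains a cycle."""
    graph: Dict[Hashable, List[Hashable]] = {}
    for before, after in edges:
        graph.setdefault(before, []).append(after)
        graph.setdefault(after, [])

    UNSEEN, ACTIVE, FINISHED = 0, 1, 2
    state = {node: UNSEEN for node in graph}

    for root in graph:
        if state[root] != UNSEEN:
            continue
        state[root] = ACTIVE
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            nxt = next(successors, _DONE)
            if nxt is _DONE:
                # All successors explored: node is not part of any cycle.
                state[node] = FINISHED
                stack.pop()
            elif state[nxt] == ACTIVE:
                # Back edge to a node on the current path: ordering is contradictory.
                return True
            elif state[nxt] == UNSEEN:
                state[nxt] = ACTIVE
                stack.append((nxt, iter(graph[nxt])))
    return False
```

An acyclic graph means the observed results admit at least one ordering consistent with the edges; a cycle means no such ordering exists.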
TL;DR: An approach to parallelizing the QR algorithm that greatly improves scalability is discussed; a theoretical analysis indicates that the algorithm is ultimately not scalable, but the nonscalability does not become evident until the matrix dimension is enormous.
Abstract: One approach to solving the nonsymmetric eigenvalue problem in parallel is to parallelize the QR algorithm. Not long ago, this was widely considered to be a hopeless task. Recent efforts have led to significant advances, although the methods proposed up to now have suffered from scalability problems. This paper discusses an approach to parallelizing the QR algorithm that greatly improves scalability. A theoretical analysis indicates that the algorithm is ultimately not scalable, but the nonscalability does not become evident until the matrix dimension is enormous. Experiments on the Intel Paragon system, the IBM SP2 supercomputer, the SGI Origin 2000, and the Intel ASCI Option Red supercomputer are reported.
TL;DR: In this article, a multi-processor configuration with configurable signal processing logic is presented, where each processor is provided with a local memory which can be accessed by the local processor as well as by the other processors via the communications paths.
Abstract: A multi-processor arrangement having an interprocessor communication path between each of every possible pair of processors, in addition to I/O paths to and from the arrangement, having signal processing functions configurably embedded in series with the communication paths and/or the I/O paths. Each processor is provided with a local memory which can be accessed by the local processor as well as by the other processors via the communications paths. This allows for efficient data movement from one processor's local memory to another processor's local memory, such as commonly done during signal processing corner turning operations. The configurable signal processing logic may be configured to host one or more signal processing functions to allow data to be processed prior to its deposit into local memory.
TL;DR: It is shown that computational fluid dynamics simulations provide reliable results and yield a detailed and accurate picture of the complex flow phenomena observed in stirred-tank reactors.
TL;DR: This paper presents a new methodology for embedded memory design in the case of application specific multiprocessor system-on-chip, and also gives a generic architecture to produce this memory wrapper.
Abstract: Embedded memory plays a critical role in improving the performance of systems-on-chip (SoC). In this paper, we present a new methodology for embedded memory design in the case of application specific multiprocessor system-on-chip. This approach facilitates the integration of standard memory components. The concept of memory wrapper allows automatic adaptation of physical memory interfaces to a communication network that may have a different number of access ports. We also give a generic architecture to produce this memory wrapper. This approach has successfully been applied on a low-level image processing application.
TL;DR: In this paper, the authors propose a method for handling computation and data synergistically to minimize the overall execution time on distributed memory parallel computers (DMPCs), based on a number of novel techniques, also presented in this article.
Abstract: To exploit parallelism on shared memory parallel computers (SMPCs), it is natural to focus on decomposing the computation (mainly by distributing the iterations of the nested Do-Loops). In contrast, on distributed memory parallel computers (DMPCs), the decomposition of computation and the distribution of data must both be handled---in order to balance the computation load and to minimize the migration of data. We propose and validate experimentally a method for handling computations and data synergistically to minimize the overall execution time on DMPCs. The method is based on a number of novel techniques, also presented in this article. The core idea is to rank the "importance" of data arrays in a program and specify some of them as dominant. The intuition is that the dominant arrays are the ones whose migration would be the most expensive. Using the correspondence between iteration space mapping vectors and distributed dimensions of the dominant data array in each nested Do-loop allows us to design algorithms for determining data and computation decompositions at the same time. Based on data distribution, computation decomposition for each nested Do-loop is determined based on either the "owner computes" rule or the "owner stores" rule with respect to the dominant data array. If all temporal dependence relations across iteration partitions are regular, we use tiling to allow pipelining and the overlapping of computation and communication. However, in order to use tiling on DMPCs, we needed to extend the existing techniques for determining tiling vectors and tile sizes, as they were originally suited for SMPCs only. The overall method is illustrated on programs for the 2D heat equation, for the Gaussian elimination with pivoting, and for the 2D fast Fourier transform on a linear processor array and on a 2D processor grid.
TL;DR: In this article, the authors argue that such a generalization across tasks becomes possible because the models specify basic mechanisms rather than simple equations that describe the data in a specific task and illustrate this by a brief review of the developments since the 1950s, followed by a discussion of the major theoretical frameworks that have been developed over the past 25 years.
Abstract: Over the past 50 years, models for human memory have developed from simple data descriptions for specific tasks to general frameworks that can and have been generalized to most of the major paradigms that are used in memory research. We argue that such a generalization across tasks becomes possible because the models specify basic mechanisms rather than simple equations that describe the data in a specific task. We illustrate this by a brief review of the developments since the 1950s, followed by a discussion of the major theoretical frameworks that have been developed over the past 25 years. The early models such as Estes’ Stimulus Sampling Theory focused on learning, but in the 1960s the emphasis gradually shifted to memory and especially the distinction between short-term and long-term memory. In the 1980s a number of global models were developed that dealt with data from a variety of memory tasks. Although these global memory models have been quite successful there remained some problems, most notably the explanation for the lack of list-strength effects in recognition. Recent developments show that models based on a Bayesian or rational approach (ACT-R, REM) may provide a unified framework for explicit as well as implicit memory.
Keywords:
associative networks;
ACT;
distributed memory models;
global memory models;
mathematical models;
SAM;
TODAM
TL;DR: This paper shows how to modify an existing Real-Time Operating System (RTOS) to support the new proposed System-on-a-Chip Dynamic Memory Management Unit (SoCDMMU), which presents a paradigm shift in the way designers look at on-chip dynamic memory allocation.
Abstract: The aggressive evolution of the semiconductor industry --- smaller process geometries, higher densities, and greater chip complexity --- has provided design engineers the means to create complex high-performance Systems-on-a-Chip (SoC) designs. Such SoC designs typically have more than one processor and huge memory, all on the same chip. Dealing with the global on-chip memory allocation/de-allocation in a dynamic yet deterministic way is an important issue for the upcoming billion transistor multiprocessor SoC designs. To achieve this, we propose a memory management hierarchy we call Two-Level Memory Management. To implement this memory management scheme --- which presents a paradigm shift in the way designers look at on-chip dynamic memory allocation --- we present a System-on-a-Chip Dynamic Memory Management Unit (SoCDMMU) for allocation of the global on-chip memory, which we refer to as Level Two memory management (Level One is the operating system management of memory allocated to a particular on-chip Processing Element). In this way, processing elements (heterogeneous or non-heterogeneous hardware or software) in an SoC can request and be granted portions of the global memory in a fast and deterministic time (for an example of a four processing element SoC, the dynamic memory allocation of the global on-chip memory takes sixteen cycles per allocation/deallocation in the worst case). In this paper, we show how to modify an existing Real-Time Operating System (RTOS) to support the new proposed SoCDMMU. Our example shows that a multiprocessor SoC that utilizes the SoCDMMU has a 440% overall speedup of the application transition time over fully shared memory that does not utilize the SoCDMMU.
TL;DR: A general parallel algebraic multigrid algorithm for finite element discretizations based on domain decomposition ideas which is well suited for distributed memory computers is proposed and results show the high efficiency of the approach.
Abstract: Algebraic multigrid methods are well suited as preconditioners for iterative solvers. We consider linear systems of equations which are sparse and symmetric positive definite and which stem from a finite element discretization of a second order self-adjoint elliptic partial differential equation or a system of them. Since preconditioners based on algebraic multigrid are very efficient, additional speedup can only be achieved by parallelization. In this paper, we propose a general parallel algebraic multigrid algorithm for finite element discretizations based on domain decomposition ideas which is well suited for distributed memory computers. This paper pays special attention to the coarsening strategy which has to be adapted in the parallel case. Moreover, a general framework of data distribution gives rise to a construction scheme for the prolongation operators.
Results of numerical studies on parallel computers with distributed memory are presented which show the high efficiency of the approach.
TL;DR: A new technique to allow the static application of global data transformations, such as partitioning, to reshaped arrays is presented, eliminating the need for expensive temporary copies and hence eliminating any communication and synchronization.
TL;DR: In this paper, a compiler for multiple processor and distributed memory architectures is described, which uses a high-level language to represent a task-level network of behaviors that describes an embedded system.
Abstract: A compiler for multiple processor and distributed memory architectures is described. The compiler uses a high-level language to represent a task-level network of behaviors that describes an embedded system. The compiler maps a plurality of tasks and data onto a multiple processor, distributed memory hardware architecture. The mapping includes describing a task-level network of behaviors, each of the task-level network of behaviors being related through control and data flow. The mapping further includes predicting a schedule of tasks for the task-level network of behaviors and allocating the plurality of tasks and data to at least one of the multiple processors and to at least one of distributed memory, respectively, in response to the predicted schedule of tasks.
TL;DR: This paper introduces library support for multi-process non-blocking synchronization called NOBLE, which provides an inter-process communication interface that allows the user to transparently select the synchronisation method best suited to the current application.
Abstract: Many applications on shared memory multi-processor machines can benefit from the exploitation of parallelism that non-blocking synchronization offers. In this paper, we introduce library support for multi-process non-blocking synchronization called NOBLE. NOBLE provides an inter-process communication interface that allows the user to transparently select the synchronisation method best suited to the current application. The selection can take place even at run-time. The library provides a collection of the most commonly used data types and protocols in a form that allows them to be used by non-experts. We describe the functionality and the implementation of the library functions and illustrate the library programming style with example programs. The examples show that using the library can considerably reduce the runtime on distributed memory machines.
TL;DR: In this paper, a distributed shared memory system may involve implementing several different shared memory regions in each distributed node, and each node may reflect write access requests targeting those regions to one or more other nodes, depending on which shared region is targeted (e.g., requests targeting one region may be reflected to a single other node while requests targeting other regions may be reflected to more than one other node).
Abstract: Various embodiments of systems and methods for implementing shared memory regions in a distributed shared memory system may involve implementing several different shared memory regions in each distributed shared memory node. Each node may reflect write access requests targeting those shared memory regions to one or more other nodes, depending on which shared region is targeted (e.g., requests targeting one region may be reflected to a single other node while requests targeting other regions may be reflected to more than one other node). A node's completion of the requested write access locally may be dependent on the completion of the write access in the other nodes, depending on which shared memory region is targeted.
TL;DR: The implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams.
TL;DR: In this paper, a method is presented for mapping the memory addressing of a multiprocessing system when it is emulated using the virtual memory addressing of another multiprocessing system.
Abstract: A method (and structure) of mapping a memory addressing of a multiprocessing system when it is emulated using a virtual memory addressing of another multiprocessing system includes accessing a local lookaside table (LLT) on a target processor with a target virtual memory address. Whether there is a “miss” in the LLT is determined and, with the miss determined in the LLT, a lock for a global page table is obtained.
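The lookup sequence in this abstract (consult the local lookaside table, detect a miss, then take the global page table lock) can be sketched as follows. The abstract stops at obtaining the lock, so everything after that point is an assumption: the refill of the LLT, the page size, and the names `EmulatedAddressMap` and `translate` are hypothetical.

```python
import threading
from collections import defaultdict
from typing import DefaultDict, Dict


class EmulatedAddressMap:
    """Toy translator from target virtual addresses to host addresses."""

    def __init__(self, global_page_table: Dict[int, int], page_size: int = 4096):
        self.page_size = page_size
        self.global_page_table = global_page_table  # target page -> host page
        self.global_lock = threading.Lock()
        # One local lookaside table (LLT) per emulated target processor.
        self.llt: DefaultDict[int, Dict[int, int]] = defaultdict(dict)

    def translate(self, cpu: int, target_vaddr: int) -> int:
        page, offset = divmod(target_vaddr, self.page_size)
        host_page = self.llt[cpu].get(page)
        if host_page is None:
            # LLT miss: fall back to the shared table under its lock.
            with self.global_lock:
                host_page = self.global_page_table[page]
            # Assumed step: cache the mapping so later accesses avoid the lock.
            self.llt[cpu][page] = host_page
        return host_page * self.page_size + offset
```

The point of the local table is that hits never touch the lock, so contention on the global page table is limited to misses.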
TL;DR: In this article, a method and system for identifying a source of corrupt data in a memory in a multiprocessor computer system is presented, where the corrupt data and its address are identified.
Abstract: A method and system for identifying a source of corrupt data in a memory in a multiprocessor computer system. When a computer program stores corrupt data causing a program failure or a system crash, the corrupt data and its address are identified. The multiprocessor computer system is shut down, and the corrupt data is cleared from the memory. Before fully re-booting the multiprocessor computer system, a processor is selected from the multiprocessor computer system to load and run monitor code designed to monitor the location where the corrupt data was stored. The program that previously stored the corrupt data is restarted, and the selected processor detects any re-storage of the corrupt data in the same memory address. All processors in the computer system are then immediately suspended. The registers of all processors suspected of storing corrupt data are inspected to determine the source of the corrupt data.
TL;DR: Results show that combining prefetching at the L1 and DRAM memory levels provides the most effective prefetching with minimal extra bandwidth, enabling more efficient memory performance for media and stream processing.
Abstract: This paper presents a multi-level memory prefetch hierarchy for media and stream processing applications. Two major bottlenecks in the performance of multimedia and network applications are long memory latencies and limited off-chip processor bandwidth. Aggressive prefetching can be used to mitigate the memory latency problem, but overly aggressive prefetching may overload the limited external processor bandwidth. To accommodate both problems, we propose multi-level memory prefetching. The multi-level organization enables conservative prefetching on-chip and more aggressive prefetching off-chip. The combination provides aggressive prefetching while minimally impacting off-chip bandwidth, enabling more efficient memory performance for media and stream processing. This paper presents preliminary results for multi-level memory prefetching, which show that combining prefetching at the L1 and DRAM memory levels provides the most effective prefetching with minimal extra bandwidth.
TL;DR: The technical choices and the design of a multi-agents software architecture to manage a corporate memory in the form of a corporate semantic web are described and the approach to tackle a distributed memory and distributed queries is presented.
Abstract: We describe the technical choices and the design of a multi-agents software architecture to manage a corporate memory in the form of a corporate semantic web. We then present our approach to tackle a distributed memory and distributed queries.
TL;DR: In this paper, the authors propose a shared memory access arbitration logic (SMAAL) for embedded symmetric multiprocessor (ESMP) architectures, which can arbitrate among central processing units for access.
Abstract: A single chip, embedded symmetric multiprocessor (ESMP) having parallel multiprocessing architecture composed of identical processors includes a single program memory. Program access arbitration logic supplies an instruction to a single requesting central processing unit at a time. Shared memory access arbitration logic can supply data from separate simultaneously accessible memory banks or arbitrate among central processing units for access. The system may simulate an atomic read/modify/write instruction by prohibiting access to the one address by another central processing unit for a predetermined number of memory cycles following a read access to one of a predetermined set of addresses in said shared memory.
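The simulated atomic read/modify/write in this abstract can be pictured as a lock-out window: a read of a guarded address by one CPU keeps other CPUs away from that address for a fixed number of memory cycles. The sketch below is a software model under assumed details; the class name `SharedMemoryArbiter`, the explicit `tick` cycle counter, and the choice to reject (rather than stall) a blocked access are not from the source.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class SharedMemoryArbiter:
    """Toy model of a lock-out window following reads of guarded addresses."""

    def __init__(self, size: int, guarded: Iterable[int], window: int):
        self.mem: List[int] = [0] * size
        self.guarded = set(guarded)   # the predetermined set of addresses
        self.window = window          # the predetermined number of memory cycles
        self.cycle = 0
        self.holds: Dict[int, Tuple[int, int]] = {}  # addr -> (cpu, expiry cycle)

    def tick(self, cycles: int = 1) -> None:
        """Advance the memory cycle counter."""
        self.cycle += cycles

    def _blocked(self, cpu: int, addr: int) -> bool:
        hold = self.holds.get(addr)
        return hold is not None and hold[0] != cpu and self.cycle < hold[1]

    def read(self, cpu: int, addr: int) -> Optional[int]:
        """Return the value, or None if another CPU holds the address."""
        if self._blocked(cpu, addr):
            return None
        if addr in self.guarded:
            # Start (or restart) the lock-out window for this CPU.
            self.holds[addr] = (cpu, self.cycle + self.window)
        return self.mem[addr]

    def write(self, cpu: int, addr: int, value: int) -> bool:
        """Store the value; return False if another CPU holds the address."""
        if self._blocked(cpu, addr):
            return False
        self.mem[addr] = value
        return True
```

If the reading CPU completes its modify and write inside the window, no other CPU can observe or alter the intermediate state, which is what makes the sequence behave atomically.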
TL;DR: A parallel algorithm for Gaussian elimination is presented, both in a shared memory environment using OpenMP and in a distributed memory environment using MPI.
Abstract: In this paper, we present a parallel algorithm for Gaussian elimination, both in a shared memory environment using OpenMP and in a distributed memory environment using MPI. Parallel LU and Gaussian algorithms for linear systems are studied extensively, and the results of examining various load balancing schemes on both platforms are presented. The results show an improvement in many cases over the default implementation.
TL;DR: This work designed and implemented a bisimulation reduction algorithm for distributed memory settings using message passing communication, and shows that the algorithm scales up with the number of workers.
TL;DR: In this paper, a hardware-based memory management technology manages memory access requests to a common memory shared by multiple requesting entities, including prioritizing and arbitrating such requests, and minimizing latency of such requests.
Abstract: The memory management technology described herein controls access to, and monitors the availability of, common memory resources. In particular, this hardware-based, memory-management technology manages memory access requests to a common memory shared by multiple requesting entities. This includes prioritizing and arbitrating such requests. It further includes minimizing latency of such requests. This abstract itself is not intended to limit the scope of this patent. The scope of the present invention is pointed out in the appended claims.