TL;DR: A series of experiments tests the use of short-term memory in a natural hand-eye task where subjects are free to choose their own task parameters; subjects reduce the instantaneous memory required to perform the task by serializing it with eye movements.
Abstract: The very limited capacity of short-term or working memory is one of the most prominent features of human cognition. Most studies have stressed delimiting the upper bounds of this memory in memorization tasks rather than the performance of everyday tasks. We designed a series of experiments to test the use of short-term memory in the course of a natural hand-eye task where subjects have the freedom to choose their own task parameters. In this case subjects choose not to operate at the maximum capacity of short-term memory but instead seek to minimize its use. In particular, reducing the instantaneous memory required to perform the task can be done by serializing the task with eye movements. These eye movements allow subjects to postpone the gathering of task-relevant information until just before it is required. The reluctance to use short-term memory can be explained if such memory is expensive to use with respect to the cost of the serializing strategy.
TL;DR: In this paper, the authors consider the ways in which individual human memory systems are linked into group memory systems, such as directory updating (learning who knows what in the group), information allocation (assigning memory items to group members), and retrieval coordination (planning how to find items in a way that takes advantage of who-knows-what).
Abstract: Several of the design factors that must be considered in linking computers together into networks are also relevant to the ways in which individual human memory systems are linked into group memory systems. These factors include directory updating (learning who knows what in the group), information allocation (assigning memory items to group members), and retrieval coordination (planning how to find items in a way that takes advantage of who knows what). When these processes operate effectively in a group, the group's transactive memory is likely to be effective.
TL;DR: In this article, the authors describe a computer system having a plurality of processors and memory elements, including multiple like processor memory elements within a node and external communication paths for communication external to the node to another like scalable node of the system.
Abstract: A computer system having a plurality of processors and memory including a plurality of scalable nodes having multiple like processor memory elements. Each of the processor memory elements has a plurality of communication paths for communication within a node to other like processor memory elements within the node. Each of the processor memory elements also has a communication path for communication external to the node to another like scalable node of the computer system.
TL;DR: This work has developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance, and shows that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance.
Abstract: Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance.
TL;DR: This paper describes how to combine simple hardware support and sampling techniques to obtain empirical data on memory system behavior without appreciably perturbing system performance.
Abstract: Fueled by higher clock rates and superscalar technologies, growth in processor speed continues to outpace improvement in memory system performance. Reflecting this trend, architects are developing increasingly complex memory hierarchies to mask the speed gap, compiler writers are adding locality enhancing transformations to better utilize complex memory hierarchies, and applications programmers are recoding their algorithms to exploit memory systems. All of these groups need empirical data on memory system behavior to guide their optimizations. This paper describes how to combine simple hardware support and sampling techniques to obtain such data without appreciably perturbing system performance. The idea is implemented in the Mprof prototype that profiles data stall cycles, first level cache misses, and second level misses on the Sun Sparc 10/41.
TL;DR: In this paper, a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor is described, with several individual processors all having communication links to several memories without restriction.
Abstract: There is disclosed a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links and an inter-processor communication link allows the processors to communicate with each other for the purpose of establishing operational modes. A parameter memory, accessible via the crossbar switch, is used in conjunction with the communication link for control purposes. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.
TL;DR: The Paradigm (PARAllelizing compiler for DIstributed-memory, General-purpose Multicomputers) project at the University of Illinois addresses the problem of massively parallel distributed-memory multicomputers by developing automatic methods for the efficient parallelization of sequential programs.
Abstract: To harness the computational power of massively parallel distributed-memory multicomputers, users must write efficient software. This process is laborious because of the absence of a global address space. The programmer must manually distribute computations and data across processors and explicitly manage communication. The Paradigm (PARAllelizing compiler for DIstributed-memory, General-purpose Multicomputers) project at the University of Illinois addresses this problem by developing automatic methods for the efficient parallelization of sequential programs. A unified approach efficiently supports regular and irregular computations using data and functional parallelism.
TL;DR: The authors explored the utility of custom computing machinery for accelerating the development, testing, and prototyping of a diverse set of image processing applications and developed a real time image processing system called VTSplash, based on the Splash-2 general-purpose platform.
Abstract: The authors explore the utility of custom computing machinery for accelerating the development, testing, and prototyping of a diverse set of image processing applications. We chose an experimental custom computing platform called Splash-2 to investigate this approach to prototyping real time image processing designs. Custom computing platforms are emerging as a class of computers that can provide near application specific computational performance. We developed a real time image processing system called VTSplash, based on the Splash-2 general-purpose platform. Splash-2 is an attached processor featuring programmable processing elements (PEs) and communication paths. The Splash-2 system uses arrays of RAM based field programmable gate arrays (FPGAs), crossbar networks, and distributed memory to accomplish the needed flexibility and performance tasks. Such platforms let designers customize specific operations for function and size, and data paths for individual applications.
TL;DR: The results show that the optimizations enabled by relaxed models are extremely effective in hiding virtually the full latency of writes in architectures with blocking reads, with gains as high as 80%.
Abstract: The memory consistency model for a shared-memory multiprocessor specifies the behavior of memory with respect to read and write operations from multiple processors. As such, the memory model influences many aspects of system design, including the design of programming languages, compilers, and the underlying hardware. Relaxed models that impose fewer memory ordering constraints offer the potential for higher performance by allowing hardware and software to overlap and reorder memory operations. However, fewer ordering guarantees can compromise programmability and portability. Many of the previously proposed models either fail to provide reasonable programming semantics or are biased toward programming ease at the cost of sacrificing performance. Furthermore, the lack of consensus on an acceptable model hinders software portability across different systems.
This dissertation focuses on providing a balanced solution that directly addresses the trade-off between programming ease and performance. To address programmability, we propose an alternative method for specifying memory behavior that presents a higher level abstraction to the programmer. We show that with only a few types of information supplied by the programmer, an implementation can exploit the full range of optimizations enabled by previous models. Furthermore, the same information enables automatic and efficient portability across a wide range of implementations.
To expose the optimizations enabled by a model, we have developed a formal framework for specifying the low-level ordering constraints that must be enforced by an implementation. Based on these specifications, we present a wide range of architecture and compiler implementation techniques for efficiently supporting a given model. Finally, we evaluate the performance benefits of exploiting relaxed models based on detailed simulations of realistic parallel applications. Our results show that the optimizations enabled by relaxed models are extremely effective in hiding virtually the full latency of writes in architectures with blocking reads (i.e., processor stalls on reads), with gains as high as 80%. Architectures with non-blocking reads can further exploit relaxed models to hide a substantial fraction of the read latency as well, leading to a larger overall performance benefit. Furthermore, these optimizations complement gains from other latency hiding techniques such as prefetching and multiple contexts.
We believe that the combined benefits in hardware and software will make relaxed models universal in future multiprocessors, as is already evidenced by their adoption in several commercial systems.
TL;DR: This book discusses the design and implementation of scalable shared-memory multiprocessors, covering general concepts, experience with the DASH prototype, and future trends.
Abstract: Foreword Preface Part 1 General Concepts Chapter 1 Multiprocessing and Scalability 1.1 Multiprocessor Architecture 1.1.1 Single versus Multiple Instruction Streams 1.1.2 Message-Passing versus Shared-Memory Architectures 1.2 Cache Coherence 1.2.1 Uniprocessor Caches 1.2.2 Multiprocessor Caches 1.3 Scalability 1.3.1 Scalable Interconnection Networks 1.3.2 Scalable Cache Coherence 1.3.3 Scalable I/O 1.3.4 Summary of Hardware Architecture Scalability 1.3.5 Scalability of Parallel Software 1.4 Scaling and Processor Grain Size 1.5 Chapter conclusions Chapter 2 Shared-Memory Parallel Programs 2.1 Basic Concepts 2.2 Parallel Application Set 2.2.1 MP3D 2.2.2 Water 2.2.3 PTHOR 2.2.4 LocusRoute 2.2.5 Cholesky 2.2.6 Barnes-Hut 2.3 Simulation Environment 2.3.1 Basic Program Characteristics 2.4 Parallel Application Execution Model 2.5 Parallel Execution under a PRAM Memory Model 2.6 Parallel Execution with Shared Data Uncached 2.7 Parallel Execution with Shared Data Cached 2.8 Summary of Results with Different Memory System Models 2.9 Communication Behavior of Parallel Applications 2.10 Communication-to-Computation Ratios 2.11 Invalidation Patterns 2.11.1 Classification of Data Objects 2.11.2 Average Invalidation Characteristics 2.11.3 Basic Invalidation Patterns for Each Application 2.11.4 MP3D 2.11.5 Water 2.11.6 PTHOR 2.11.7 LocusRoute 2.11.8 Cholesky 2.11.9 Barnes-Hut 2.11.10 Summary of Individual Invalidation Distributions 2.11.11 Effect of Problem Size 2.11.12 Effect of Number of Processors 2.11.13 Effect of Finite Caches and Replacement Hints 2.11.14 Effect of Cache Line Size 2.11.15 Invalidation Patterns Summary 2.12 Chapter Conclusions Chapter 3 System Performance Issues 3.1 Memory Latency 3.2 Memory Latency Reduction 3.2.1 Nonuniform Memory Access (NUMA) 3.2.2 Cache-Only Memory Architecture (COMA) 3.2.3 Direct Interconnect Networks 3.2.4 Hierarchical Access 3.2.5 Protocol Optimizations 3.2.6 Latency Reduction Summary 3.3 Latency Hiding 3.3.1 Weak Consistency 
Models 3.3.2 Prefetch 3.3.3 Multiple-Context Processors 3.3.4 Producer-Initiated Communications 3.3.5 Latency Hiding Summary 3.4 Memory Bandwidth 3.4.1 Hot Spots 3.4.2 Synchronization Support 3.5 Chapter Conclusions Chapter 4 System Implementation 4.1 Scalability of System Costs 4.1.1 Directory Storage overhead 4.1.2 Sparse Directories 4.1.3 Hierarchical Directories 4.1.4 Summary of Directory Storage overhead 4.2 Implementation Issues and Design Correctness 4.2.1 Unbounded Number of Requests 4.2.2 Distributed memory Operations 4.2.3 Request Starvation 4.2.4 Error Detection and Fault tolerance 4.2.5 Design Verification 4.3 Chapter Conclusions Chapter 5 Scalable Shared-Memory Systems 5.1 Directory-Based Systems 5.1.1 DASH 5.1.2 Alewife 5.1.3 S3.mp 5.1.4 IEEE Scalable Coherent Interface 5.1.5 Convex Exemplar 5.2 Hierarchical Systems 5.2.1 Encore GigaMax 5.2.2 ParaDiGM 5.2.3 Data Diffusion Machine 5.2.4 Kendall Square Research KSR-1 and KSR-2 5.3 Reflective Memory Systems 5.3.1 Plus 5.3.2 Merlin and Sesame 5.4 Non-Cache Coherent Systems 5.4.1 NYU Ultracomputer 5.4.2 IBM RP3 and BBN TC2000 5.4.3 Cray Research T3D 5.5 Vector Supercomputer Systems 5.5.1 Cray Research Y-MP C90 5.5.2 Tera Computer MTA 5.6 Virtual Shared-Memory Systems 5.6.1 Ivy and Munin/Treadmarks 5.6.2 J-Machine 5.6.3 MIT/Motorola *T and *T-NG 5.7 Chapter Conclusions Part 2 Experience with DASH Chapter 6 DASH Prototype System 6.1 System Organization 6.1.1 Cluster Organization 6.1.2 Directory Logic 6.1.3 Interconnection Network 6.2 Programmer's Model 6.3 Coherence Protocol 6.3.1 Nomenclature 6.3.2 Basic Memory Operations 6.3.3 Prefetch Operations 6.3.4 DMA/Uncached Operations 6.4 Synchronization Protocol 6.4.1 Granting Locks 6.4.2 Fetch&Op Variables 6.4.3 Fence Operations 6.5 Protocol General Exceptions 6.6 Chapter Conclusions Chapter 7 Prototype Hardware Structures 7.1 Base Cluster Hardware 7.1.1 SGI Multiprocessor Bus (MPBUS) 7.1.2 SGI CPU Board 7.1.3 SGI Memory Board 7.1.4 SGI I/O Board 7.2 Directory 
Controller 7.3 Reply Controller 7.4 Pseudo-CPU 7.5 Network and Network Interface 7.6 Performance Monitor 7.7 Logic Overhead of Directory-Based Coherence 7.8 Chapter Conclusions Chapter 8 Prototype Performance Analysis 8.1 Base Memory Performance 8.1.1 Overall Memory System Bandwidth 8.1.2 Other Memory Bandwidth Limits 8.1.3 Processor Issue Bandwidth and Latency 8.1.4 Interprocessor Latency 8.1.5 Summary of Memory System Bandwidth and Latency 8.2 Parallel Application Performance 8.2.1 Application Run-time Environment 8.2.2 Application Speedups 8.2.3 Detailed Case Studies 8.2.4 Application Speedup Summary 8.3 Protocol Effectiveness 8.3.1 Base Protocol Features 8.3.2 Alternative Memory Operations 8.4 Chapter Conclusions Part 3 Future Trends Chapter 9 TeraDASH 9.1 TeraDASH System Organization 9.1.1 TeraDASH Cluster Structure 9.1.2 Intracluster Operations 9.1.3 TeraDASH Mesh Network 9.1.4 TeraDASH Directory Structure 9.2 TeraDASH Coherence Protocol 9.2.1 Required Changes for the Scalable Directory Structure 9.2.2 Enhancements for Increased Protocol Robustness 9.2.3 Enhancements for Increased Performance 9.3 TeraDASH Performance 9.3.1 Access Latencies 9.3.2 Potential Application Speedup 9.4 Chapter Conclusions Chapter 10 Conclusions and Future Directions 10.1 SSMP Design Conclusions 10.2 Current Trends 10.3 Future Trends Appendix Multiprocessor Systems References Index
TL;DR: In this paper, a massively parallel data processing system is described, where each node has at least one processor, a memory for storing data, a processor bus that couples the processor to the memory, and a remote memory access controller coupled to the processor bus.
Abstract: A massively parallel data processing system is disclosed. This data processing system includes a plurality of nodes, with each node having at least one processor, a memory for storing data, a processor bus that couples the processor to the memory, and a remote memory access controller coupled to the processor bus. The remote memory access controller detects and queues processor requests for remote memory, processes and packages the processor requests into request packets, forwards the request packets to the network through a router that corresponds to that node, receives and queues request packets received from the network, recovers the memory request from the request packet, manipulates local memory in accordance with the request, generates an appropriate response packet acceptable to the network and forwards the response packet to the requesting node.
TL;DR: This paper discusses the design of linear algebra libraries for high performance computers, with particular emphasis on the development of scalable algorithms for multiple instruction multiple data (MIMD) distributed memory concurrent computers.
Abstract: This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for multiple instruction multiple data (MIMD) distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 2 and Level 3 basic linear algebra subprograms (BLAS) as computational building blocks, and the use of basic linear algebra communication subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS c...
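The block-partitioning idea stressed in the abstract above can be illustrated with a small sketch. This is not code from EISPACK, LINPACK, LAPACK, or ScaLAPACK; the function name `blocked_matmul` and the `block` parameter are hypothetical, and pure Python lists stand in for the optimized BLAS kernels a real library would call. The point is only the loop structure: each small tile of the operands is reused many times before the computation moves on, which is what reduces traffic between levels of a memory hierarchy.

```python
def blocked_matmul(a, b, block=2):
    """Multiply matrices a (n x m) and b (m x q) tile by tile.

    Illustrative only: a production library would hand each tile
    to a tuned Level 3 BLAS routine instead of Python loops.
    """
    n, m, q = len(a), len(b), len(b[0])
    c = [[0.0] * q for _ in range(n)]
    # The three outer loops walk over tiles of size block x block.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, q, block):
                # The three inner loops stay inside one tile triple,
                # so the same small working set is reused repeatedly.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, q)):
                            c[i][j] += aik * b[k][j]
    return c
```

The result does not depend on `block`; only the order in which memory is touched changes, so the tile size can be tuned to the cache or local memory of a given machine.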
TL;DR: A novel scalable shared memory multiprocessor architecture that features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity.
Abstract: We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at a page granularity, similarly to distributed virtual shared memory (DVSM) systems, leaving simpler hardware to maintain shared memory coherence at a cache line granularity. By reducing the hardware complexity, the machine cost and development time are reduced. We call the resulting hybrid hardware and software multiprocessor architecture Simple COMA. Preliminary results indicate that the performance of Simple COMA is comparable to that of more complex contemporary all-hardware designs.
TL;DR: This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors and shows that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers.
Abstract: Presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.
TL;DR: This work demonstrates a storage scheme for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution that does not waste any storage and shows that the local memory access sequence of any processor for a computation involving the regular section A(l:h:s) is characterized by a finite state machine of at most k states.
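As a rough illustration of the cyclic(k) distribution mentioned above, the sketch below maps a global index to its owning processor and local address, then lists by brute force the local addresses one processor touches for a regular section given by a lower bound, an upper bound, and a stride. It assumes an identity alignment of the array to the template and 0-based indices, and it is not the finite-state-machine method of the paper; the function names are hypothetical.

```python
def owner_and_local(i, k, p):
    """Map global index i to (processor, local address) under cyclic(k) on p processors."""
    block = i // k        # index of the size-k block containing i
    owner = block % p     # blocks are dealt to processors round-robin
    course = block // p   # number of complete rounds before this block
    return owner, course * k + i % k


def local_sequence(lo, hi, stride, k, p, proc):
    """Local addresses on processor proc for the section lo:hi:stride (hi inclusive)."""
    addresses = []
    for i in range(lo, hi + 1, stride):
        owner, local = owner_and_local(i, k, p)
        if owner == proc:
            addresses.append(local)
    return addresses
```

For example, with k = 4 and p = 2, the section 0:20:3 touches global elements 0, 3, 9, and 18 on processor 0. The gaps between successive local addresses are what a table-driven scheme would precompute rather than rediscover by scanning every element as this sketch does.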
TL;DR: Three issues—partitioning, mutual exclusion, and data transfer—crucial to the efficient execution of irregular problems on distributed-memory machines are explored.
Abstract: Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance. This paper explores three issues (partitioning, mutual exclusion, and data transfer) crucial to the efficient execution of irregular problems on distributed-memory machines. Unlike previous work, we studied the same programs running in three alternative systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, Transparent Shared Memory (TSM), and eXtensible Shared Memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems were somewhat (13%) to significantly (991%) faster than TSM.
TL;DR: In this article, a data processing system that adapts flexibly to the parallelism of a program comprises a plurality of processor elements for executing instructions, a main memory shared by the plurality of processor elements, and parallel operation control facilities for enabling the plurality of processor elements to operate in synchronism.
Abstract: A data processing system having flexibility coping with parallelism of a program comprises a plurality of processor elements for executing instructions, a main memory shared by the plurality of processor elements, and a plurality of parallel operation control facilities for enabling the plurality of processor elements to operate in synchronism. The plurality of parallel operation control facilities are provided in correspondence to the plurality of processor elements, respectively. The data processing system further comprises a multiprocessor operation control facility for enabling the plurality of processor elements to operate independently, and a flag for holding a value indicating which of the parallel operation mode or the multiprocessor mode is to be activated. The shared cache memory is implemented in a blank instruction and controlled by a cache controller so that inconsistency of the data stored in the cache memory is eliminated.
TL;DR: S3.mp (Sun's Scalable Shared memory MultiProcessor) is a research project to demonstrate a low overhead, high throughput communication system that is based on cache coherent distributed shared memory (DSM).
TL;DR: In this paper, the authors propose a shared virtual memory network for general purpose interprocessor communication implemented through a distributed shared memory network connecting a plurality of processors, computers, multiprocessors, and electronic and optical devices.
Abstract: The invention relates to general purpose interprocessor communication implemented through a distributed shared memory network connecting a plurality of processors, computers, multiprocessors, and electronic and optical devices. The invention teaches an apparatus for shared memory based data transfer between a multiplicity of asynchronously operating devices (processors, computers, multiprocessors, etc.) each using possibly distinct memory address translation architectures. The invention further teaches shared virtual memory network communication and administration based on a unique network memory address translation architecture. This architecture is compatible with and augments the address translation and cache block replacement mechanisms of existing devices. More particularly, the invention teaches an adapter card having input/output buffers, page tables and control/status registers for insertion into an operating device, or node, whereby all address translation, memory mapping and packet generation can be implemented. The invention teaches that all network activities can be completed with only write and control operations. An interconnecting switch part and bus arrangement facilitates communication among the network adapters.
TL;DR: In this paper, the authors present runtime and compile-time analysis for block structured codes on distributed memory parallel machines in an efficient and machine-independent fashion, which can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile time.
Abstract: In compiling applications for distributed memory machines, runtime analysis is required when data to be communicated cannot be determined at compile-time. One such class of applications requiring runtime analysis is block structured codes. These codes employ multiple structured meshes, which may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present runtime and compile-time analysis for compiling such applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which supports the runtime analysis required. The library is currently implemented on several different systems. We have also developed compiler analysis for determining data access patterns at compile time and inserting calls to the appropriate runtime routines. Our methods can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile-time. To demonstrate the efficacy of our approach, we have implemented our compiler analysis in the Fortran 90D/HPF compiler developed at Syracuse University. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads and the compiler-parallelized codes perform within 20% of the codes parallelized by manually inserting calls to the runtime library.
TL;DR: A parallel array processor for massively parallel applications is formed in low power CMOS with DRAM, incorporating processing elements on a single chip. The architecture merges processor and memory with multiple PMEs (eight 16 bit processors with 32K and I/O) in DRAM, has no memory access delays, and uses all the pins for networking.
Abstract: Parallel array processor for massively parallel applications is formed with low power CMOS with DRAM processing while incorporating processing elements on a single chip. Eight processors on a single chip have their own associated processing element, significant memory, and I/O and are interconnected with a hypercube based, but modified, topology. These nodes are then interconnected, either by a hypercube, modified hypercube, or ring, or ring within ring network topology. Conventional microprocessor MMPs consume pins and time going to memory. The new architecture merges processor and memory with multiple PMEs (eight 16 bit processors with 32K and I/O) in DRAM and has no memory access delays and uses all the pins for networking. Each chip will have eight 16 bit processors, each processor providing 5 MIPs performance. I/O has three internal ports and one external port shared by the plural processors on the chip. The scalable chip PME has internal and external connections for broadcast and asynchronous SIMD, MIMD and SIMIMD (SIMD/MIMD) with dynamic switching of modes. The chip can be used in systems which employ 32, 64 or 128,000 processors, and can be used for lower, intermediate and higher ranges. Local and global memory functions can all be provided by the chips themselves, and system can connect to and support other global memories and DASD. The chip can be used as a microprocessor accelerator, in personal computer applications, as a vision or avionics computer system, or as work-station or supercomputer.
TL;DR: This paper proposes a distributed parallel solution that makes ray-casting volume rendering of unstructured-grid data practical, and is completely distributed, less view-dependent, reasonably scalable, and flexible.
Abstract: As computing technology continues to advance, computational modeling of scientific and engineering problems produces data of increasing complexity: large in size and unstructured in shape. Volume visualization of such data is a challenging problem. This paper proposes a distributed parallel solution that makes ray-casting volume rendering of unstructured-grid data practical. Both the data and the rendering process are distributed among processors. At each processor, ray-casting of local data is performed independent of the other processors. The global image composing processes, which require inter-processor communication, are overlapped with the local ray-casting processes to achieve maximum parallel efficiency. This algorithm differs from previous ones in four ways: it is completely distributed, less view-dependent, reasonably scalable, and flexible. Without using dynamic load balancing, test results on the Intel Paragon using from two to 128 processors show, on average, about 60% parallel efficiency.
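The parallel efficiency figure quoted above can be related to speedup with the standard definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of processors. The sketch below applies those definitions; the function names are hypothetical, and any timings passed in are illustrative rather than measurements from the paper.

```python
def parallel_efficiency(t_serial, t_parallel, processors):
    """Efficiency = (t_serial / t_parallel) / processors."""
    speedup = t_serial / t_parallel
    return speedup / processors


def speedup_from_efficiency(efficiency, processors):
    """Invert the definition: speedup = efficiency * processors."""
    return efficiency * processors
```

Under these definitions, 60% efficiency on 128 processors corresponds to a speedup of about 76.8 over a single processor. The abstract reports an average over runs on two to 128 processors, so this is a reading of the definition at one processor count, not a result stated in the paper.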
TL;DR: A new mathematical representation for regular distributions called PITFALLS is presented and algorithms for redistribution based on this representation are discussed, showing low overheads for the redistribution algorithm as compared to naive runtime methods.
Abstract: Appropriate data distribution has been found to be critical for obtaining good performance on Distributed Memory Multicomputers like the CM-5, Intel Paragon and IBM SP-1. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called PITFALLS and then discuss algorithms for redistribution based on this representation. A significant contribution of this work is the ability to handle arbitrary source and target processor sets while performing redistribution; another is the ability to handle arbitrary dimensionality for the array being redistributed in a scalable manner. The results presented show low overheads for our redistribution algorithm as compared to naive runtime methods.
TL;DR: This report compares the performance of different computer systems for basic message passing using Convex, Cray, IBM, Intel, KSR, Meiko, nCUBE, NEC, SGI and TMC multiprocessors.
Abstract: This report compares the performance of different computer systems for basic message passing. Latency and bandwidth are measured on Convex, Cray, IBM, Intel, KSR, Meiko, nCUBE, NEC, SGI, and TMC multiprocessors. Communication performance is contrasted with the computational power of each system. The comparison includes both shared and distributed memory computers as well as networked workstation clusters.
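Latency and bandwidth measurements of this kind are often summarized with a first-order linear model, in which the time to send n bytes is a fixed latency plus n divided by the bandwidth. The report does not state that it uses this model, so the sketch below is an assumption, and the function names and sample timings are hypothetical. It recovers the two parameters from two timed messages of different sizes.

```python
def fit_latency_bandwidth(n1, t1, n2, t2):
    """Fit t(n) = latency + n / bandwidth through two (bytes, seconds) points."""
    bandwidth = (n2 - n1) / (t2 - t1)   # bytes per second
    latency = t1 - n1 / bandwidth       # seconds for a zero-byte message
    return latency, bandwidth


def transfer_time(n, latency, bandwidth):
    """Predicted time in seconds to send n bytes under the linear model."""
    return latency + n / bandwidth
```

With made-up timings of 100 microseconds for an empty message and 10.1 milliseconds for one megabyte, the fit gives a latency of 100 microseconds and a bandwidth of about 100 MB/s. Real measurements usually use many message sizes and a least-squares fit, since two points cannot expose effects such as protocol switches at particular sizes.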