Top 385 papers published in the topic of Distributed memory in 1992

Showing papers on "Distributed memory" published in 1992

Patent•

Multi-node cluster computer system incorporating an external coherency unit at each node to insure integrity of information stored in a shared, distributed memory

23 Dec 1992

TL;DR: A computer cluster architecture is described that includes a plurality of CPUs at each of a plurality of nodes, where each CPU has the property of coherency and includes a primary cache.

Abstract: A computer cluster architecture including a plurality of CPUs at each of a plurality of nodes. Each CPU has the property of coherency and includes a primary cache. A local bus at each node couples: all the local caches, a local main memory having physical space assignable as shared space and non-shared space, and a local external coherency unit (ECU). An inter-node communication bus couples all the ECUs. Each ECU includes a monitoring section for monitoring the local and inter-node busses and a coherency section for a) responding to a non-shared cache-line request appearing on the local bus by directing the request to the non-shared space of the local memory and b) responding to a shared cache-line request appearing on the local bus by examining its coherence state to further determine if inter-node action is required to service the request and, if such action is required, transmitting a unique identifier and a coherency command to all the other ECUs. Each unit of information present in the shared space of the local memory is assigned, by the local ECU, a coherency state which may be: 1) exclusive (the local copy of the requested information is unique in the cluster); 2) modified (the local copy has been updated by a CPU in the same node); 3) invalid (a local copy either does not exist or is known to be out-of-date); or 4) shared (the local copy is one of a plurality of current copies present in a plurality of nodes).
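The request routing and the four coherency states in this abstract can be illustrated with a minimal Python sketch. The names `State`, `needs_inter_node_action`, and `handle_request` are hypothetical, and the rule for when inter-node action is required is an assumed MESI-like rule; the abstract does not spell out the actual state transitions.

```python
from enum import Enum


class State(Enum):
    EXCLUSIVE = 1  # local copy is unique in the cluster
    MODIFIED = 2   # local copy has been updated by a CPU in the same node
    INVALID = 3    # no local copy, or the copy is known to be out of date
    SHARED = 4     # one of several current copies held by several nodes


def needs_inter_node_action(state, is_write):
    """Assumed rule (not taken from the patent text)."""
    if state is State.INVALID:
        return True   # a current copy must come from another node
    if state is State.SHARED and is_write:
        return True   # copies held by other nodes are affected
    return False


def handle_request(line_id, in_shared_space, states, is_write):
    """Describe how a cache-line request seen on the local bus is serviced."""
    if not in_shared_space:
        return "local non-shared memory"
    state = states.get(line_id, State.INVALID)
    if needs_inter_node_action(state, is_write):
        return f"broadcast id={line_id} with coherency command to other ECUs"
    return "local shared memory"
```

Under these assumptions, a write to a line held in the shared state triggers a broadcast, while a read of the same line is served locally.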

246 citations

Journal Article•10.1109/2.156381•

DDM-a cache-only memory architecture

Erik Hagersten¹, Anders Landin¹, Seif Haridi¹•Institutions (1)

Swedish Institute of Computer Science¹

01 Sep 1992-IEEE Computer

TL;DR: The Data Diffusion Machine (DDM), a cache-only memory architecture (COMA) that relies on a hierarchical network structure, is described, together with its protocol, the prototype project, and simulated performance results.

Abstract: The Data Diffusion Machine (DDM), a cache-only memory architecture (COMA) that relies on a hierarchical network structure, is described. The key ideas behind DDM are introduced by describing a small machine, which could be a COMA on its own or a subsystem of a larger COMA, and its protocol. A large machine with hundreds of processors is also described. The DDM prototype project is discussed, and simulated performance results are presented.

236 citations

Proceedings Article•10.1145/133057.133079•

MemSpy: analyzing memory system bottlenecks in programs

Margaret Martonosi¹, Anoop Gupta¹, Thomas Anderson²•Institutions (2)

Stanford University¹, University of California, Berkeley²

1 Jun 1992

TL;DR: MemSpy is described, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs and introduces the notion of data oriented, in addition to code oriented, performance tuning.

Abstract: To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior: if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task. This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data oriented, in addition to code oriented, performance tuning. Thus, for both source level code objects and data objects, MemSpy provides information such as cache miss rates, causes of cache misses, and in multiprocessors, information on cache invalidations and local versus remote memory misses. MemSpy also introduces a concise matrix presentation to allow programmers to view both code and data oriented statistics at the same time. This paper presents design and implementation issues for MemSpy, and gives a detailed case study using MemSpy to tune a parallel sparse matrix application. It shows how MemSpy helps pinpoint memory system bottlenecks, such as poor spatial locality and interference among data structures, and suggests paths for improvement.

195 citations

Computer simulation of interacting dynamic mechanical systems using distributed memory parallel processors

Michael C. Stanley¹, Ed Colgate•Institutions (1)

Northwestern University¹

1 Dec 1992

169 citations

Patent•

Microprocessor architecture capable of supporting multiple heterogeneous processors

Lenz Delek J¹, Yasuaki Hagiwara², Te-Li Lau², Cheng-Long Tang², Le Trong Nguyen²•Institutions (2)

Samsung¹, Epson²

7 Jul 1992

TL;DR: A memory control unit is described for controlling access, by one or more devices within a processor, to a memory array unit external to the processor via one or more memory ports of the processor.

Abstract: A memory control unit for controlling access, by one or more devices within a processor, to a memory array unit external to the processor via one or more memory ports of the processor. The memory control unit includes a switch network to transfer data between the one or more devices of the processor and the one or more memory ports of the processor. The memory control unit also includes a switch arbitration unit to arbitrate for the switch network, and a port arbitration unit to arbitrate for the one or more memory ports.

143 citations

Patent•

Imaging and graphics processing system

Alexander Thomas¹, Kim Yongmin¹, Huynwook Park¹, Kil-Su Eo¹, Jing-Ming Jong¹•Institutions (1)

University of Washington¹

12 Aug 1992

TL;DR: A unified image and graphics processing system is presented that includes a parallel vector processing unit, a graphics subsystem, a shared memory, and a set of high-speed data buses connecting all of the other components.

Abstract: The present invention provides a unified image and graphics processing system that provides both image and graphics processing at high speeds. The system includes a parallel vector processing unit, a graphics subsystem, a shared memory and a set of high-speed data buses for connecting all of the other components. Generally, the parallel vector processing unit includes a series of vector processors. Each processor includes a vector address generator for efficient generation of memory addresses for regular address sequences. In order to synchronize and control the vector processors' accesses to shared memory, the parallel vector processing unit includes shared memory access logic. The logic is incorporated into each vector processor. The graphics subsystem includes a series of polygon processors in a pipelined configuration. Each processor is connected in the pipeline by a first-in-first-out (FIFO) buffer for passing data results. Additionally, each polygon processor is connected to a local shared memory in which program instructions and data are stored. The graphics subsystem also includes a device addressing mechanism for identifying a destination device using a tagged address. The shared memory, the parallel vector processor and the graphics subsystem also incorporate an abbreviated addressing scheme, which reduces the amount of information required to request sequential addresses from the shared memory.

114 citations

Journal Article•10.1109/71.159038•

Heterogeneous distributed shared memory

Songnian Zhou¹, Michael Stumm¹, Kai Li², David Wortman¹•Institutions (2)

Systems Research Institute¹, Princeton University²

01 Sep 1992-IEEE Transactions on Parallel and Distributed Systems

TL;DR: The design, implementation, and performance of heterogeneous distributed shared memory (HDSM) are studied; a prototype HDSM system that integrates very different types of hosts has been developed, and a number of applications of this system are reported.

Abstract: The design, implementation, and performance of heterogeneous distributed shared memory (HDSM) are studied. A prototype HDSM system that integrates very different types of hosts has been developed, and a number of applications of this system are reported. Experience shows that despite a number of difficulties in data conversion, HDSM is implementable with minimal loss in functional and performance transparency when compared to homogeneous DSM systems.

109 citations

Journal Article•10.1016/0167-739X(92)90040-I•

A methodology for the development and the support of massively parallel programs

Marco Danelutto¹,², Robert Di Meglio¹, Salvatore Orlando¹,², Susanna Pelagatti¹, Marco Vanneschi¹•Institutions (2)

University of Pisa¹, Hewlett-Packard²

01 Jul 1992-Future Generation Computer Systems

TL;DR: This work presents a methodology for easily writing efficient, high-performance, and portable massively parallel programs, based on the definition of a new explicitly parallel programming language, namely P3L, and of a set of compiling tools that automatically adapt the program features to the target architecture hardware.

108 citations

Proceedings Article•10.1145/143369.143372•

Evaluation of compiler optimizations for Fortran D on MIMD distributed memory machines

Seema Hiranandani, Ken Kennedy, Chau-Wen Tseng

1 Aug 1992

TL;DR: This paper introduces and classifies a number of advanced optimizations needed to achieve acceptable performance on MIMD distributed-memory machines; they are analyzed and empirically evaluated for stencil computations.

Abstract: The Fortran D compiler uses data decomposition specifications to automatically translate Fortran programs for execution on MIMD distributed-memory machines. This paper introduces and classifies a number of advanced optimizations needed to achieve acceptable performance; they are analyzed and empirically evaluated for stencil computations. Profitability formulas are derived for each optimization. Results show that exploiting parallelism for pipelined computations, reductions, and scans is vital. Message vectorization, collective communication, and efficient coarse-grain pipelining also significantly affect performance.

107 citations

Automatic data partitioning on distributed memory multicomputers

Manish Gupta¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

1 Jan 1992

TL;DR: This thesis presents a novel constraint-based approach to the problem of automatic data partitioning for numeric programs, implemented as part of a compiler called PARADIGM that accepts Fortran 77 programs and specifies the partitioning scheme to be used for each array in the program.

Abstract: Distributed-memory parallel computers are increasingly being used to provide high levels of performance for scientific applications. Unfortunately, such machines are not very easy to program. A number of research efforts seek to alleviate this problem by developing compilers that take over the task of generating communication. The communication overheads and the extent of parallelism exploited in the resulting target program are determined largely by the manner in which data is partitioned across different processors of the machine. Most of the compilers provide no assistance to the programmer in the crucial task of determining a good data partitioning scheme. This thesis presents a novel approach, the constraint-based approach, to the problem of automatic data partitioning for numeric programs. In this approach, the compiler identifies some desirable requirements on the distribution of various arrays being referenced in each statement, based on performance considerations. These desirable requirements are referred to as constraints. For each constraint, the compiler determines a quality measure that captures its importance with respect to the performance of the program. The quality measure is obtained through static performance estimation, without actually generating the target data-parallel program with explicit communication. Each data distribution decision is taken by combining all the relevant constraints. The compiler attempts to resolve any conflicts between constraints such that the overall execution time of the parallel program is minimized. This approach has been implemented as part of a compiler called PARADIGM, which accepts Fortran 77 programs and specifies the partitioning scheme to be used for each array in the program. We have obtained results on some programs taken from the Linpack and Eispack libraries, and the Perfect Benchmarks. These results are quite promising, and demonstrate the feasibility of automatic data partitioning for a significant class of scientific application programs with regular computations.

103 citations

Proceedings Article•10.1109/SHPCC.1992.232670•

A look at scalable dense linear algebra libraries

Jack Dongarra¹, R.A. van de Geijn, David W. Walker²•Institutions (2)

University of Tennessee¹, University Of Tennessee System²

26 Apr 1992

TL;DR: Discusses the essential design features of a library of scalable software for performing dense linear algebra computations on distributed memory concurrent computers and proposes the square block scattered decomposition as a flexible and general-purpose way of decomposing most, if not all, dense matrix problems.

Abstract: Discusses the essential design features of a library of scalable software for performing dense linear algebra computations on distributed memory concurrent computers. The square block scattered decomposition is proposed as a flexible and general-purpose way of decomposing most, if not all, dense matrix problems. An object-oriented interface to the library permits more portable applications to be written, and is easy to learn and use, since details of the parallel implementation are hidden from the user. Experiments on the Intel Touchstone Delta system with a prototype code that uses the square block scattered decomposition to perform LU factorization are presented and analyzed. It was found that the code was both scalable and efficient, performing at about 14 GFLOPS (double precision) for the largest problem considered.
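The abstract does not define the square block scattered decomposition. The Python sketch below assumes the common block-cyclic reading: the matrix is cut into r x r blocks, and block (I, J) is assigned to process (I mod P, J mod Q) on a P x Q process grid. The function names and parameters are hypothetical and are not the library's interface.

```python
def owner(i, j, r, P, Q):
    """Grid coordinates (p, q) of the process that owns matrix element (i, j)."""
    return ((i // r) % P, (j // r) % Q)


def local_blocks(p, q, n, r, P, Q):
    """Block indices (I, J) of an n x n matrix stored by process (p, q)."""
    nblocks = -(-n // r)  # ceiling division: number of blocks per dimension
    return [(I, J)
            for I in range(p, nblocks, P)
            for J in range(q, nblocks, Q)]
```

With this mapping, neighbouring blocks land on different processes, so the work that remains as a factorization shrinks the active part of the matrix stays spread over the whole grid.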

Proceedings Article•10.1145/143365.143506•

Characterizing the caching and synchronization performance of a multiprocessor operating system

Josep Torrellas¹, Anoop Gupta², John L. Hennessy²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Stanford University²

1 Sep 1992

TL;DR: The cache performance of a commercial System V UNIX running on a four-CPU multiprocessor is characterized, and three major sources of OS misses are revealed: instruction fetches, process migration, and data accesses in block operations.

Abstract: Good cache memory performance is essential to achieving high CPU utilization in shared-memory multiprocessors. While the performance of caches is determined by both application and operating system (OS) references, most research has focused on the cache performance of applications alone. This is partially due to the difficulty of measuring OS activity and, as a result, the cache performance of the OS is largely unknown. In this paper, we characterize the cache performance of a commercial System V UNIX running on a four-CPU multiprocessor. The related issue of the performance impact of the OS synchronization activity is also studied. For our study, we use a hardware monitor that records the cache misses in the machine without perturbing it. We study three multiprocessor workloads: a parallel compile, a multiprogrammed load, and a commercial database. Our results show that OS misses occur frequently enough to stall CPUs for 17-21% of their non-idle time. Further, if we include application misses induced by OS interference in the cache, then the stall time reaches 25%. A detailed analysis reveals three major sources of OS misses: instruction fetches, process migration, and data accesses in block operations. As for synchronization behavior, we find that OS synchronization has low overhead if supported correctly and that OS locks show good locality and low contention.

Patent•

Multiprocessor system with write generate method for updating cache

Arun K. Somani¹, Craig M. Wittenbrink¹, Chung-Ho Chen¹, Robert E. Johnson¹, Kenneth Cooper¹, Robert M. Haralick¹•Institutions (1)

University of Washington¹

27 Apr 1992

TL;DR: A write generate mode is implemented for updating cache by first allocating lines of shared memory as write-before-read areas; for such lines, cache tags are updated directly on cache misses without reading from memory.

Abstract: A plurality of program processors, shared memory, dual port memory, external cache memory and a control processor form a multiprocessor system. A shared memory bus links the program processors, shared memory, dual port memory and external cache memory. Program processor I/O occurs through a pair of serial I/O channels coupled to one port of the dual port memory. A write generate mode is implemented for updating cache by first allocating lines of shared memory as write-before-read areas. For such lines, cache tags are updated directly on cache misses without reading from memory. A hit is forced for such a line, resulting in valid data at the updated part and invalid data at the remaining portion. Thus, part of the line is written to and the rest invalidated. The invalid portions are not read, unless preceded by a write operation. The mode reduces the number of bus cycles by making write misses more efficient.

Proceedings Article•10.1145/143095.143134•

A dynamic scheduling method for irregular parallel programs

Steven Lucco

1 Jul 1992

TL;DR: A fundamental relationship between three quantities that characterize an irregular parallel computation is shown: the total available parallelism, the optimal grain size, and the statistical variance of execution times for individual tasks, which yields a dynamic scheduling algorithm that substantially reduces the overhead of executing irregular parallel operations.

Abstract: This paper develops a methodology for compiling and executing irregular parallel programs. Such programs implement parallel operations whose size and work distribution depend on input data. We show a fundamental relationship between three quantities that characterize an irregular parallel computation: the total available parallelism, the optimal grain size, and the statistical variance of execution times for individual tasks. This relationship yields a dynamic scheduling algorithm that substantially reduces the overhead of executing irregular parallel operations. We incorporated this algorithm into an extended Fortran compiler. The compiler accepts as input a subset of Fortran D which includes blocked and cyclic decompositions and perfect alignment; it outputs Fortran 77 augmented with calls to library routines written in C. For irregular parallel operations, the compiled code gathers information about available parallelism and task execution time variance and uses this information to schedule the operation. On distributed memory architectures, the compiler encodes information about data access patterns for the runtime scheduling system so that it can preserve communication locality. We evaluated these compilation techniques using a set of application programs including climate modeling, circuit simulation, and x-ray tomography, that contain irregular parallel operations. The results demonstrate that, for these applications, the dynamic techniques described here achieve near-optimal efficiency on large numbers of processors. In addition, they perform significantly better, on these problems, than any previously proposed static or dynamic scheduling algorithm.

Book Chapter•10.1016/B978-0-444-88712-2.50007-X•

Vienna Fortran—a Fortran language extension for distributed memory multiprocessors

Barbara Chapman¹, Piyush Mehrotra², Hans P. Zima¹•Institutions (2)

University of Vienna¹, Langley Research Center²

3 Jan 1992

TL;DR: This paper presents the basic features of Vienna Fortran, which gives the user the advantage of a shared memory programming paradigm while explicitly controlling the placement of data, along with a set of examples illustrating the use of these features.

Abstract: Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.

Proceedings Article•10.1145/143369.143377•

Automatic data mapping for distributed-memory parallel computers

Skef Wholey

1 Aug 1992

TL;DR: A system that automatically determines efficient ways of mapping data onto processors is described and evaluated; it is applicable and effective across a variety of architectures.

Abstract: The performance of a program on a distributed-memory parallel computer depends on the algorithms employed, the structure and speed of the machine's communication network, and the ways in which data are distributed to the processors. This paper addresses the last of these concerns, the problem of data mapping. The paper describes and evaluates a system which automatically determines efficient ways of mapping data onto processors. The system is applicable and effective across a variety of architectures. Simulation results for machines with different interconnection schemes, including linear arrays, two-dimensional meshes, and hypercubes, and measured running times for the CM-2 show that good data mapping often improves performance by at least 20% and in some cases by more than a factor of two.

Journal Article•10.1016/0167-8191(92)90011-U•

Reduction to condensed form for the Eigenvalue problem on distributed memory architectures

Jack Dongarra¹,², Robert A. van de Geijn³•Institutions (3)

University of Tennessee¹, Oak Ridge National Laboratory², University of Texas at Austin³

1 Jan 1992

TL;DR: This paper describes a parallel implementation for the reduction of general and symmetric matrices to Hessenberg and tridiagonal form, respectively, based on LAPACK sequential codes and using a panel-wrapped mapping of matrices to nodes.

Abstract: In this paper, we describe a parallel implementation for the reduction of general and symmetric matrices to Hessenberg and tridiagonal form, respectively. The methods are based on LAPACK sequential codes and use a panel-wrapped mapping of matrices to nodes. Results from experiments on the Intel Touchstone Delta are given.

Proceedings Article•10.1109/SUPERC.1992.236653•

Applications and performance analysis of a compile-time optimization approach for list scheduling algorithms on distributed memory multiprocessors (Journal)

Yeh-Ching Chung, S. Ranka

1 Jan 1992

Proceedings Article•10.1145/139669.139674•

A performance study of memory consistency models

Richard N. Zucker¹, Jean-Loup Baer•Institutions (1)

University of Washington¹

1 Apr 1992

TL;DR: It is found that substantial benefits can be accrued by using relaxed models but the magnitudes of the benefits depend on the architecture being modeled, the benchmarks, and how the code is scheduled.

Abstract: Recent advances in technology are such that the speed of processors is increasing faster than memory latency is decreasing. Therefore the relative cost of a cache miss is becoming more important. However, the full cost of a cache miss need not be paid every time in a multiprocessor. The frequency with which the processor must stall on a cache miss can be reduced by using a relaxed model of memory consistency. In this paper, we present the results of instruction-level simulation studies on the relative performance benefits of using different models of memory consistency. Our vehicle of study is a shared-memory multiprocessor with processors and associated write-back caches connected to global memory modules via an Omega network. The benefits of the relaxed models, and their increasing hardware complexity, are assessed with varying cache size, line size, and number of processors. We find that substantial benefits can be accrued by using relaxed models but the magnitudes of the benefits depend on the architecture being modeled, the benchmarks, and how the code is scheduled. We did not find any major difference in levels of improvement among the various relaxed models.

Book Chapter•10.1007/BFB0035175•

Time-lapse snapshots

Cynthia Dwork¹, Maurice Herlihy, Serge Plotkin², Orli Waarts²•Institutions (2)

IBM¹, Stanford University²

1 May 1992

TL;DR: This paper introduces the notion of a weak snapshot scan, a slightly weaker primitive that has a more efficient implementation, and gives two examples of algorithms whose performance can be enhanced while retaining a simple modular structure: bounded concurrent timestamping, and bounded randomized consensus.

Abstract: A snapshot scan algorithm takes an "instantaneous" picture of a region of shared memory that may be updated by concurrent processes. Many complex shared memory algorithms can be greatly simplified by structuring them around the snapshot scan abstraction. Unfortunately, the substantial decrease in conceptual complexity is quite often counterbalanced by an increase in computational complexity. In this paper, we introduce the notion of a weak snapshot scan, a slightly weaker primitive that has a more efficient implementation. We propose the following methodology for using this abstraction: first, design and verify an algorithm using the more powerful snapshot scan, and second, replace the more powerful but less efficient snapshot with the weaker but more efficient snapshot, and show that the weaker abstraction nevertheless suffices to ensure the correctness of the enclosing algorithm. We give two examples of algorithms whose performance can be enhanced while retaining a simple modular structure: bounded concurrent timestamping, and bounded randomized consensus. The resulting timestamping protocol is the fastest known bounded concurrent timestamping protocol. The resulting randomized consensus protocol matches the computational complexity of the best known protocol that uses only bounded values.
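To make the snapshot scan abstraction concrete, here is a minimal Python sketch of the well-known "double collect" technique. It is not the weak snapshot scan introduced in this paper, and it is not wait-free: the scanner simply retries until two consecutive reads of all registers agree. The `Register` class, the single-writer-per-register assumption, and the `max_tries` parameter are illustrative assumptions.

```python
class Register:
    """Shared register: a (value, sequence number) pair replaced as one unit.

    Assumes each register has a single writer, so the read-then-write in
    write() cannot race with another writer.
    """

    def __init__(self, value):
        self.cell = (value, 0)

    def write(self, value):
        _, seq = self.cell
        self.cell = (value, seq + 1)  # a new sequence number marks every write


def collect(registers):
    """Read every register once, one after another."""
    return [r.cell for r in registers]


def snapshot_scan(registers, max_tries=100):
    """Return values from two identical consecutive collects."""
    prev = collect(registers)
    for _ in range(max_tries):
        cur = collect(registers)
        if cur == prev:
            # No register changed between the two collects, so these values
            # were all present at some single instant between them.
            return [value for value, _ in cur]
        prev = cur
    raise RuntimeError("no clean double collect; writers kept interfering")
```

The retry loop is the computational cost the abstract alludes to: under heavy write traffic the scanner may collect many times before it succeeds.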

Standards for message-passing in a distributed memory environment

David W. Walker

1 Aug 1992

TL;DR: The report discusses the main issues raised in the CRPC workshop, and describes proposed desirable features of a message passing standard for distributed memory environments.

Abstract: This report presents a summary of the main ideas presented at the First CRPC Workshop on Standards for Message Passing in a Distributed Memory Environment, held April 29-30, 1992, in Williamsburg, Virginia. This workshop attracted 68 attendees including representatives from major hardware and software vendors, and was the first in a series of workshops sponsored by the Center for Research on Parallel Computation. The aim of this series of workshops is to develop and implement a standard for message passing on distributed memory concurrent computers, thereby making it easier to develop efficient, portable application codes for such machines. The report discusses the main issues raised in the CRPC workshop, and describes proposed desirable features of a message passing standard for distributed memory environments.

Report•10.2172/10142365•

Kendall square multiprocessor: early experiences and performance

Thomas H. Dunigan¹•Institutions (1)

Oak Ridge National Laboratory¹

1 Apr 1992

TL;DR: The basic architecture of the shared-memory multiprocessor is described, and computational and I/O performance is measured for both serial and parallel programs.

Abstract: Initial performance results and early experiences are reported for the Kendall Square Research multiprocessor. The basic architecture of the shared-memory multiprocessor is described, and computational and I/O performance is measured for both serial and parallel programs. Experiences in porting various applications are described.

Book•10.1007/978-1-4615-3604-8•

Scalable Shared Memory Multiprocessors

Michel Dubois, Shreekant Thakkar

1 Jan 1992

TL;DR: The synchronization topic of MIMD combining trees (their motivation, structure, and parameters) is developed and illustrated using fetch-and-add, and the combining window is defined and shown to bound node buffer size.

Abstract: Philip Bitar (Aquarius Project, Computer Science Division, University of California, Berkeley, CA 94720; bitar@berkeley.edu): We develop the synchronization topic of MIMD combining trees (their motivation, their structure, and their parameters), and we illustrate these principles using fetch-and-add. We define the concept of the combining window, an interval of time during which a request is held in a combining node in order to allow it to combine with subsequent incoming requests. We show that the combining window is necessary in order to realize the dual forms of concurrency, execution concurrency and storage concurrency, that a combining tree is designed to achieve. Execution concurrency among the nodes of a combining tree enables the tree to achieve the speed up that it is designed to give. Without sufficient execution concurrency, the tree will not achieve the desired speed up. Storage concurrency among the nodes of a combining tree enables the tree to achieve the buffer storage that is necessary in order to implement the combining of requests. Without sufficient storage concurrency, node buffers will overflow. More specifically, the combining window shows how to bound node buffer size.
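The combining of fetch-and-add requests at a tree node can be sketched in a few lines of Python. This is a single-node illustration under stated assumptions, not the chapter's design: the requests gathered during one combining window are passed as a list, and `Memory` and `combine_and_forward` are hypothetical names.

```python
class Memory:
    """Stand-in for the memory module at the root of the combining tree."""

    def __init__(self):
        self.cells = {}

    def fetch_and_add(self, addr, inc):
        old = self.cells.get(addr, 0)
        self.cells[addr] = old + inc
        return old


def combine_and_forward(memory, addr, increments):
    """Combine the requests held in one window and decombine the reply.

    Only one fetch-and-add, carrying the summed increment, travels toward
    memory. Each original request then receives the value it would have seen
    had the requests been served one after another in list order.
    """
    base = memory.fetch_and_add(addr, sum(increments))
    replies = []
    for inc in increments:
        replies.append(base)
        base += inc
    return replies
```

For example, with the cell holding 10, combining increments 1, 2, and 3 sends a single request for 6 and returns 10, 11, and 13 to the three requesters. The node has to keep the individual increments until the reply arrives, which is the buffer storage the abstract refers to.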

Proceedings Article•10.1109/SHPCC.1992.232688•

Vienna Fortran 90

Siegfried Benkner¹, Barbara Chapman¹, Hans P. Zima¹•Institutions (1)

University of Vienna¹

26 Apr 1992

TL;DR: The paper presents the major features of Vienna Fortran 90 and gives examples of their use and the advantages of the shared memory programming paradigm with mechanisms for explicit user control of those aspects of the program which have the greatest impact on efficiency.

Abstract: Vienna Fortran 90 is a language extension of Fortran 90 which enables the user to write programs for distributed memory multiprocessors using global data references only. Performance of software on such systems is profoundly influenced by the manner in which data is distributed to the processors. Hence, Vienna Fortran 90 provides the user with a wide range of facilities for the mapping of data to processors. It combines the advantages of the shared memory programming paradigm with mechanisms for explicit user control of those aspects of the program which have the greatest impact on efficiency. The paper presents the major features of Vienna Fortran 90 and gives examples of their use.
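To make "mapping of data to processors" concrete, the following sketch shows two common mappings, block and cyclic, as plain index arithmetic. It is not Vienna Fortran 90 syntax, and the function names `block_owner` and `cyclic_owner` are hypothetical; the abstract does not say which distributions the language offers.

```python
def block_owner(i, n, p):
    """Map global index i of an n-element array, split into contiguous
    blocks over p processors, to (processor, local_index)."""
    block = -(-n // p)  # ceiling division: elements per processor
    return i // block, i % block


def cyclic_owner(i, p):
    """Map global index i, dealt round-robin over p processors,
    to (processor, local_index)."""
    return i % p, i // p


# The program refers to global index 7; the mapping decides where it lives.
block_home = block_owner(7, n=10, p=4)
cyclic_home = cyclic_owner(7, p=4)
```

A compiler for such a language would apply this kind of translation to every global reference, which is why the choice of mapping affects how much communication a loop generates.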

Report•10.2172/10176473•

Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessors

E.F. D'Azevedo, C.H. Romine

1 Sep 1992

TL;DR: This paper presents a mathematically equivalent rearrangement of the standard algorithm that reduces the number of communication phases and gives a second derivation of the modified conjugate gradient algorithm in terms of the natural relationship with the underlying Lanczos process.

Abstract: The standard formulation of the conjugate gradient algorithm involves two inner product computations. The results of these two inner products are needed to update the search direction and the computed solution. In a distributed memory parallel environment, the computation and subsequent distribution of these two values requires two separate communication and synchronization phases. In this paper, we present a mathematically equivalent rearrangement of the standard algorithm that reduces the number of communication phases. We give a second derivation of the modified conjugate gradient algorithm in terms of the natural relationship with the underlying Lanczos process. We also present empirical evidence of the stability of this modified algorithm.
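The idea of merging the two inner products into one communication phase can be sketched as follows. This is a well-known single-reduction recurrence often attributed to Chronopoulos and Gear, used here as a stand-in; it is not necessarily the rearrangement derived in this report. The function name `cg_single_reduction` and the dense list-of-lists matrix are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def cg_single_reduction(A, b, tol=1e-12, max_iter=100):
    """Conjugate gradient for symmetric positive definite A in which both
    inner products of an iteration are available at the same point, so a
    distributed code could compute them in one global reduction."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # r = b - A x with x = 0
    w = matvec(A, r)
    gamma, delta = dot(r, r), dot(w, r)   # one reduction phase
    if gamma == 0.0:
        return x
    alpha, beta = gamma / delta, 0.0
    p = [0.0] * n
    s = [0.0] * n
    for _ in range(max_iter):
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        s = [wi + beta * si for wi, si in zip(w, s)]   # s = A p by recurrence
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        w = matvec(A, r)
        gamma_new, delta = dot(r, r), dot(w, r)        # one reduction phase
        if gamma_new ** 0.5 < tol:
            break
        beta = gamma_new / gamma
        # (p, A p) for the next step is recovered from delta without
        # a further inner product.
        alpha = gamma_new / (delta - beta * gamma_new / alpha)
        gamma = gamma_new
    return x


solution = cg_single_reduction([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In exact arithmetic the iterates match standard conjugate gradient; rearrangements of this kind can behave differently in floating point, which is why the abstract reports empirical stability evidence for its own variant.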

Proceedings Article•10.1145/147130.147139•

Parallel volume visualization on a hypercube architecture

Claudio Montani, Raffaele Perego, Roberto Scopigno

1 Dec 1992

TL;DR: A hybrid strategy for ray tracing parallelization is applied, using ray-dataflow within an image partition approach, which allows the flexible and effective management of huge datasets on architectures with limited local memory.

Abstract: A parallel solution to the visualization of high resolution volume data is presented. Based on the ray tracing (RT) visualization technique, the system works on a distributed memory MIMD architecture. A hybrid strategy for ray tracing parallelization is applied, using ray-dataflow within an image partition approach. This strategy allows the flexible and effective management of huge datasets on architectures with limited local memory. The dataset is distributed over the nodes using a slice-partitioning technique. The simple data partition chosen implies a straightforward communications pattern of the visualization processes, and this improves both software design and efficiency while providing deadlock prevention. The partitioning technique used and the network interconnection topology allow for the efficient implementation of a static load balancing technique through pre-rendering of a low resolution image. Details related to the practical issues involved in the parallelization of volumetric RT are discussed, with particular reference to deadlock and termination issues.

Book Chapter•10.1016/B978-0-444-88712-2.50014-7•

Distributed memory compiler methods for irregular problems—data copy reuse and runtime partitioning

Raja Das¹, Ravi Ponnusamy², Joel H. Saltz¹, Dimitri J. Mavriplis¹•Institutions (2)

Langley Research Center¹, Syracuse University²

3 Jan 1992

TL;DR: This paper outlines two methods which it is believed will play an important role in any distributed memory compiler able to handle sparse and unstructured problems and describes a viable mechanism for tracking and reusing copies of off-processor data.

Abstract: Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods.
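The reuse of off-processor copies described above can be sketched as a small cache that fetches only the values it does not already hold and drops copies when the owner's values change. This is an illustrative assumption about how such tracking might look, not the authors' runtime library: the class `GhostCache`, its methods, and the `fetch` callable are all hypothetical.

```python
class GhostCache:
    """Tracks local copies of off-processor array elements."""

    def __init__(self, fetch):
        # fetch: hypothetical callable mapping a list of global indices to
        # their current values; it stands in for one message exchange.
        self._fetch = fetch
        self._copies = {}
        self.messages = 0

    def gather(self, indices):
        """Return values for indices, communicating only for missing ones."""
        missing = sorted({i for i in indices if i not in self._copies})
        if missing:
            self.messages += 1
            self._copies.update(zip(missing, self._fetch(missing)))
        return [self._copies[i] for i in indices]

    def invalidate(self, indices):
        """Drop copies whose off-processor values may have been modified."""
        for i in indices:
            self._copies.pop(i, None)


remote = {i: 10 * i for i in range(8)}          # data owned elsewhere
cache = GhostCache(lambda idx: [remote[i] for i in idx])
first = cache.gather([3, 5])                    # first loop: one exchange
second = cache.gather([5, 3, 3])                # second loop: reuses copies
messages_after_reuse = cache.messages
remote[3] = 31                                  # owner modifies element 3
cache.invalidate([3])
third = cache.gather([3])                       # refetched after the change
```

The saving comes from the second loop, which touches the same off-processor locations and needs no communication as long as those values are known to be unmodified.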

Report•10.21236/ADA455473•

PDS: Direct Search Methods for Unconstrained Optimization on Either Sequential or Parallel Machines

Virginia Torczon

1 Mar 1992

TL;DR: PDS is a collection of Fortran subroutines for solving unconstrained nonlinear optimization problems using direct search methods, written so that it can run on sequential machines, distributed memory machines, and shared memory parallel machines.

Abstract: PDS is a collection of Fortran subroutines for solving unconstrained nonlinear optimization problems using direct search methods. The software is written so that execution on sequential machines is straightforward while execution on Intel distributed memory machines, such as the iPSC/2, the iPSC/860 or the Touchstone Delta, can be accomplished simply by including a few well-defined routines containing calls to Intel-specific Fortran libraries. Those interested in using the methods on other distributed memory machines, even something as basic as a network of workstations or personal computers, need only modify these few subroutines to handle the global communication requirements. Furthermore, since the parallelism is clearly defined at the "doloop" level, it is a simple matter to insert compiler directives that allow for execution on shared memory parallel machines. Included here is an example of such directives, contained in comment statements, for execution on a Sequent Symmetry S81.

Patent•

Multi-processor system having shared memory for storing the communication information used in communicating between processors

Masahiro Kitano¹, Yoshitaka Ohfusa¹, Katsuya Kohda¹, Keiichi Sasaki¹, Hiroyuki Okura¹, Katsumi Takeda¹•Institutions (1)

Hitachi¹

15 Jun 1992

TL;DR: In this article, a method of communication between processors in a multiprocessor system comprises the steps of: storing information for specifying a processor connected to the shared memory for direct access thereto in a predetermined register of the shared memory; feeding a communication instruction for instructing a first processor to communicate with a second processor via the shared memory; and checking, in response to the communication instruction, whether or not the first and second processors are connected to the shared memory to enable direct access.

Abstract: An information processing system comprises: plural processors; a shared memory connected to the plurality of processors for enabling communication between the processors; a unit disposed in the shared memory for storing information for specifying a processor connected thereto; and a unit for checking, when a first processor communicates with a second processor, whether or not the first and second processors are connected to the shared memory for direct access thereto by referring to the information storing means. A method of communication between processors used with a multiprocessor system, comprises the steps of: storing information for specifying a processor connected to the shared memory for direct access thereto in a predetermined register of the shared memory; feeding a communication instruction for instructing a first processor to communicate with a second processor via the shared memory; checking, in response to the communication instruction, whether or not the first and second processors are connected to the shared memory to enable direct access; storing communication information from the first processor in the shared memory, in response to confirmation that the first and second processors are connected to the shared memory; feeding a communication read interruption from the shared memory to the second processor; and reading out, in response to the communication read interruption, the communication information from the shared memory to feed the communication information to the second processor.

Journal Article•10.1137/0613009•

Distributed and shared memory block algorithms for the triangular Sylvester equation with $\operatorname{sep}^{-1}$ estimators

Bo Kågström, Peter Poromaa

01 Jan 1992 - SIAM Journal on Matrix Analysis and Applications

TL;DR: Coarse grain message passing and shared memory algorithms for solving the quasi-triangular Sylvester equation are discussed and estimators based on the Frobenius norm and the 1-norm, respectively are presented.

Abstract: Coarse grain message passing and shared memory algorithms for solving the quasi-triangular Sylvester equation are discussed. The basic algorithm is of block type, i.e., rich in matrix-matrix operations. The focus is on computing reliable estimates of the $\operatorname{sep}^{-1}$ function (a natural condition number for the Sylvester equation and the invariant subspace problem). Estimators based on the Frobenius norm and the 1-norm, respectively, are presented. Accuracy, efficiency, and reliability results are presented. The applicability of the estimators to both the shared memory and distributed memory paradigms is discussed. Some performance results of the parallel block algorithms with condition estimators are also presented. The reliability of both estimators is very good. The Frobenius norm–based estimator is much more efficient in both sequential and parallel settings (on average between four and five times). Further, it is applicable to both the standard and generalized problems.
