Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 1999
Journal Article•10.2514/2.2413•
Constrained Multipoint Aerodynamic Shape Optimization Using an Adjoint Formulation and Parallel Computers

James Reuther1, Antony Jameson, Juan J. Alonso, Mark J. Rimlinger, David Saunders •
Ames Research Center1
01 Jan 1999-Journal of Aircraft
TL;DR: In this article, an aerodynamic shape optimization method that treats the design of complex aircraft configurations subject to high fidelity computational fluid dynamics (CFD), geometric constraints and multiple design points is described.
Abstract: An aerodynamic shape optimization method that treats the design of complex aircraft configurations subject to high fidelity computational fluid dynamics (CFD), geometric constraints and multiple design points is described. The design process will be greatly accelerated through the use of both control theory and distributed memory computer architectures. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods. The resulting problem is implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on a higher order CFD method. In order to facilitate the integration of these high fidelity CFD approaches into future multi-disciplinary optimization (MDO) applications, new methods must be developed which are capable of simultaneously addressing complex geometries, multiple objective functions, and geometric design constraints. In our earlier studies, we coupled the adjoint based design formulations with unconstrained optimization algorithms and showed that the approach was effective for the aerodynamic design of airfoils, wings, wing-bodies, and complex aircraft configurations. In many of the results presented in these earlier works, geometric constraints were satisfied either by a projection into feasible space or by posing the design space parameterization such that it automatically satisfied constraints. Furthermore, with the exception of reference 9 where the second author initially explored the use of multipoint design in conjunction with adjoint formulations, our earlier works have focused on single point design efforts.
Here we demonstrate that the same methodology may be extended to treat complete configuration designs subject to multiple design points and geometric constraints. Examples are presented for both transonic and supersonic configurations ranging from wing alone designs to complex configuration designs involving wing, fuselage, nacelles and pylons.

400 citations
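
The abstract above states that the adjoint equations deliver design gradients at a fraction of the usual cost. A sketch of the standard argument behind that claim follows; the notation is an assumption made for illustration and is not taken from this listing: $I$ is the cost function, $w$ the flow variables, $F$ the shape parameters, $R(w, F) = 0$ the discretized flow equations, and $\psi$ the adjoint variable.

```latex
\begin{align}
\delta I &= \frac{\partial I}{\partial w}^{T}\delta w
          + \frac{\partial I}{\partial F}^{T}\delta F, \\
\delta R &= \frac{\partial R}{\partial w}\,\delta w
          + \frac{\partial R}{\partial F}\,\delta F = 0, \\
\delta I &= \left(\frac{\partial I}{\partial w}^{T}
          - \psi^{T}\frac{\partial R}{\partial w}\right)\delta w
          + \left(\frac{\partial I}{\partial F}^{T}
          - \psi^{T}\frac{\partial R}{\partial F}\right)\delta F, \\
\left(\frac{\partial R}{\partial w}\right)^{T}\psi &= \frac{\partial I}{\partial w}
\quad\Longrightarrow\quad
\delta I = \left(\frac{\partial I}{\partial F}^{T}
          - \psi^{T}\frac{\partial R}{\partial F}\right)\delta F .
\end{align}
```

The third line subtracts $\psi^{T}\delta R$, which is zero, from $\delta I$. Choosing $\psi$ to satisfy the adjoint equation removes the dependence on $\delta w$, so under these assumptions one flow solution plus one adjoint solution gives the gradient with respect to every design variable at once.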

Patent•
Computer system and method for operating multiple operating systems in different partitions of the computer system and for allowing the different partitions to communicate with one another through shared memory

Robert C. Gulick1, Douglas E. Morrissey1, Charles R. Caldarale1, Bruce Alan Vessey1, Craig F. Russ1, Eugene W. Troxell1, Hans C. Mikkelsen1, Sharon M. Mauer1, Maureen P. Connell1, James R. Hunter1 •
Unisys1
17 Dec 1999
TL;DR: In this article, the physical address space of the processors in each partition is mapped to the respective exclusive memory windows assigned to each partition, so that the exclusive windows appear to the operating systems executing on those partitions as if they all start at the same base address.
Abstract: A computer system comprises a plurality of processing modules that can be configured into different partitions within the computer system, and a main memory. Each partition operates under the control of a separate operating system. At least one shared memory window is defined within the main memory to which multiple partitions have shared access, and each partition may also be assigned an exclusive memory window. Program code executing on different partitions enables those partitions to communicate with each other through the shared memory window. Means are also provided for mapping the physical address space of the processors in each partition to the respective exclusive memory windows assigned to each partition, so that the exclusive memory windows assigned to each partition appear to the respective operating systems executing on those partitions as if they all start at the same base address.

258 citations
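
The address mapping described in the patent abstract above can be pictured as a simple relocation. The sketch below is only an illustration under stated assumptions: the `Window` type, the partition names, the window sizes, and the choice of zero as the common base address are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """An exclusive memory window inside main memory (hypothetical layout)."""
    base: int  # where the window starts in main memory
    size: int  # window length in bytes


def to_main_memory(window: Window, local_addr: int) -> int:
    """Relocate an address issued by a partition into its exclusive window.

    Every partition issues addresses starting at the same base (assumed to
    be 0 here), so each operating system sees an identical view.
    """
    if not 0 <= local_addr < window.size:
        raise ValueError("address falls outside the partition's exclusive window")
    return window.base + local_addr


# Two made-up partitions with back-to-back exclusive windows.
windows = {
    "A": Window(base=0x0000_0000, size=0x4000_0000),
    "B": Window(base=0x4000_0000, size=0x2000_0000),
}
```

Both partitions can issue the same local address, for example `0x1000`, and the relocation sends each access to a different region of main memory.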

Patent•
Memory controller for controlling memory accesses across networks in distributed shared memory processing systems

Howard Thomas Olnowich1•
IBM1
10 Sep 1999
TL;DR: In this article, a shared memory parallel processing system interconnected by a multi-stage network combines new system configuration techniques with special-purpose hardware to provide remote memory accesses across the network, while controlling cache coherency efficiently.
Abstract: A shared memory parallel processing system interconnected by a multi-stage network combines new system configuration techniques with special-purpose hardware to provide remote memory accesses across the network, while controlling cache coherency efficiently across the network. The system configuration techniques include a systematic method for partitioning and controlling the memory in relation to local versus remote accesses and changeable versus unchangeable data. Most of the special-purpose hardware is implemented in the memory controller and network adapter, which implements three send FIFOs and three receive FIFOs at each node to segregate and handle efficiently invalidate functions, remote stores, and remote accesses requiring cache coherency. The segregation of these three functions into different send and receive FIFOs greatly facilitates the cache coherency function over the network. In addition, the network itself is tailored to provide the best efficiency for remote accesses.

160 citations

Journal Article•10.1137/S1064827598339141•
Semicoarsening Multigrid on Distributed Memory Machines

Peter Brown1, Robert D. Falgout1, Jim E. Jones1•
Lawrence Livermore National Laboratory1
11 Dec 1999-SIAM Journal on Scientific Computing
TL;DR: This paper presents the results of a scalability study for a three-dimensional semicoarsening multigrid solver on a distributed memory computer, and examines the scalability of the solver theoretically and experimentally.
Abstract: This paper presents the results of a scalability study for a three-dimensional semicoarsening multigrid solver on a distributed memory computer. In particular, we are interested in the scalability of the solver, that is, how the solution time varies as both problem size and number of processors are increased. For an iterative linear solver, scalability involves both algorithmic issues and implementation issues. We examine the scalability of the solver theoretically by constructing a simple parallel model and experimentally by results obtained on an IBM SP. The results are compared with those obtained for other solvers on the same computer.

138 citations
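
The scalability question in the abstract above (how solution time changes when problem size and processor count grow together) is often summarized as a scaled efficiency. The sketch below is a generic illustration, not the parallel model from the paper; the function name and the sample timings are made up.

```python
def scaled_efficiency(times: dict[int, float]) -> dict[int, float]:
    """Scaled (weak-scaling) efficiency from measured solve times.

    `times` maps a processor count P to the solve time observed when the
    problem size grows in proportion to P. The time on one processor is the
    reference, so a perfectly scalable solver scores 1.0 everywhere.
    """
    reference = times[1]
    return {p: reference / t for p, t in sorted(times.items())}


# Placeholder timings in seconds, invented purely to show the calculation.
example = scaled_efficiency({1: 10.0, 8: 12.5, 64: 16.0})
```

An efficiency that falls as P grows can come from the algorithm (more iterations on larger problems) or from the implementation (communication cost), which are the two sources the abstract distinguishes.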

Proceedings Article•10.1145/300979.300980•
Maps: a compiler-managed memory system for raw machines

Rajeev Barua1, Walter Lee1, Saman Amarasinghe1, Anant Agarwal1•
Massachusetts Institute of Technology1
1 May 1999
TL;DR: Maps, a compiler managed memory system for Raw architectures, is implemented based on the SUIF infrastructure and it is demonstrated that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for regular applications and about 5-fold speedup on 16 or more tiles for irregular applications.
Abstract: This paper describes Maps, a compiler managed memory system for Raw architectures. Traditional processors for sequential programs maintain the abstraction of a unified memory by using a single centralized memory system. This implementation leads to the infamous "Von Neumann bottleneck," with machine performance limited by the large memory latency and limited memory bandwidth. A Raw architecture addresses this problem by taking advantage of the rapidly increasing transistor budget to move much of its memory on chip. To remove the bottleneck and complexity associated with centralized memory, Raw distributes the memory with its processing elements. Unified memory semantics are implemented jointly by the hardware and the compiler. The hardware provides a clean compiler interface to its two inter-tile interconnects: a fast, statically schedulable network and a traditional dynamic network. Maps then uses these communication mechanisms to orchestrate the memory accesses for low latency and parallelism while enforcing proper dependence. It optimizes for speed in two ways: by finding accesses that can be scheduled on the static interconnect through static promotion, and by minimizing dependence sequentialization for the remaining accesses. Static promotion is performed using equivalence class unification and modulo unrolling; memory dependences are enforced through explicit synchronization and software serial ordering. We have implemented Maps based on the SUIF infrastructure. This paper demonstrates that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for our regular applications and about 5-fold speedup on 16 or more tiles for our irregular applications. The paper also shows that selective use of dynamic accesses can be a useful complement to the mostly static memory system.

135 citations
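
The Maps abstract above names modulo unrolling as one ingredient of static promotion. The sketch below illustrates the idea only, under an assumption the abstract does not state: that array element `i` is stored on tile `i % n_tiles` (low-order interleaving). The function names are hypothetical.

```python
def home_tile(index: int, n_tiles: int) -> int:
    """Tile that stores element `index` under the assumed interleaving."""
    return index % n_tiles


def tiles_per_unrolled_copy(n_elems: int, n_tiles: int, offset: int = 0) -> list[set[int]]:
    """Tiles touched by each copy of the body after unrolling by n_tiles.

    Models `for i in range(n): use(A[i + offset])` rewritten so that one
    iteration contains n_tiles copies of the body, copy j using
    A[i + j + offset] with i advancing in steps of n_tiles.
    """
    seen: list[set[int]] = [set() for _ in range(n_tiles)]
    for i in range(0, n_elems - offset - n_tiles + 1, n_tiles):
        for j in range(n_tiles):
            seen[j].add(home_tile(i + j + offset, n_tiles))
    return seen
```

Before unrolling, the single reference visits every tile in turn, so its destination is unknown at compile time. After unrolling, each copy of the body always reaches one fixed tile, which is what allows the access to be placed on a statically scheduled network.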

Journal Article•
Towards Self-Tuning Memory Management for Data Servers

Gerhard Weikum1, Arnd Christian König, Achim Kraiss, Markus Sinnwell•
Max Planck Society1
01 Jan 1999-IEEE Data(base) Engineering Bulletin
TL;DR: This paper gives an overview of self-tuning methods for a spectrum of memory management issues, ranging from traditional caching to exploiting distributed memory in a server cluster and speculative prefetching in a Web-based system.
Abstract: Although today’s computers provide huge amounts of main memory, the ever-increasing load of large data servers, imposed by resource-intensive decision-support queries and accesses to multimedia and other complex data, often leads to memory contention and may result in severe performance degradation. Therefore, careful tuning of memory management is crucial for heavy-load data servers. This paper gives an overview of self-tuning methods for a spectrum of memory management issues, ranging from traditional caching to exploiting distributed memory in a server cluster and speculative prefetching in a Web-based system. The common, fundamental elements in these methods include on-line load tracking, near-future access prediction based on stochastic models and the available on-line statistics, and dynamic and automatic adjustment of control parameters in a feedback loop.
1 The Need for Memory Tuning
Although memory is relatively inexpensive and modern computer systems are amply equipped with it, memory contention on heavily loaded data servers is a common cause of performance problems. The reasons are threefold: Servers are operating with a multitude of complex software, ranging from the operating system to database systems, object request brokers, and application services. Much of this software has been written so as to quickly penetrate the market rather than optimizing memory usage and other resource consumption. The distinctive characteristic and key problem of a data server is that it operates in multi-user mode, serving many clients concurrently or in parallel. Therefore, a server needs to divide up its resources among the simultaneously active threads for executing queries, transactions, stored procedures, Web applications, etc. Often, multiple data-intensive decision-support queries compete for memory. The data volumes that need to be managed by a server seem to be growing without limits.
One part of this trend is that multimedia data types such as images, speech, or video have become more popular and are being merged into conventional-data applications (e.g., images or videos for insurance claims). The other
Copyright 1999 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering

124 citations

Journal Article•10.1016/S0168-9274(98)00117-2•
Approximate sparsity patterns for the inverse of a matrix and preconditioning

Thomas Huckle1•
Technische Universität München1
01 Jun 1999-Applied Numerical Mathematics
TL;DR: This paper compares different strategies for choosing a priori an approximate sparsity structure of A^{-1} and exactly determines the submatrices that are used in the SPAI algorithm to compute one new column of the sparse approximate inverse M.

123 citations

Journal Article•10.1145/357783.331677•
The processor-memory bottleneck: problems and solutions

Nihar R. Mahapatra, Balakrishna Venkatrao
01 Apr 1999-ACM Crossroads Student Magazine
TL;DR: This article examines the problem of an increasing Processor-Memory Performance Gap, which now is the primary obstacle to improved computer system performance.
Abstract: The rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM (Dynamic Random Access Memory) speed. So although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one. Hence computer designers are faced with an increasing Processor-Memory Performance Gap [1], which now is the primary obstacle to improved computer system performance. This article examines this problem as well as its various solutions.

117 citations

Journal Article•10.1109/78.747802•
An effective memory addressing scheme for FFT processors

Yutai Ma1•
Linköping University1
01 Mar 1999-IEEE Transactions on Signal Processing
TL;DR: The memory organization of FFT processors is considered and a new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations.
Abstract: The memory organization of FFT processors is considered. The new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations. The advantage of this memory addressing scheme lies in the fact that it reduces the delay of address generation nearly by half compared to existing ones.

103 citations
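
The abstract above does not spell out its addressing scheme, so the sketch below shows a different, well-known idea with the same goal: a parity-based bank assignment for a radix-2, in-place FFT with two memory banks. It is offered as background on conflict-free butterfly access, not as the method of the paper. The function names are hypothetical.

```python
def bank(addr: int) -> int:
    """Bank number = parity of the address bits (classic two-bank scheme)."""
    return bin(addr).count("1") & 1


def butterfly_pairs(n_bits: int, stage: int) -> list[tuple[int, int]]:
    """Operand address pairs for one stage of a radix-2 in-place FFT.

    The two operands of a butterfly in stage `stage` differ only in address
    bit `stage`.
    """
    stride = 1 << stage
    return [(a, a | stride) for a in range(1 << n_bits) if not a & stride]
```

Because the two operands of every butterfly differ in exactly one address bit, their parities differ, so they always sit in different banks and can be read in the same cycle.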

Journal Article•10.1137/S1064827598336951•
A Parallel Divide and Conquer Algorithm for the Symmetric Eigenvalue Problem on Distributed Memory Architectures

Françoise Tisseur, Jack Dongarra
01 May 1999-SIAM Journal on Scientific Computing
TL;DR: This algorithm yields the first scalable, portable, and numerically stable parallel divide and conquer eigensolver; its performance is compared with that of the QR algorithm and of bisection followed by inverse iteration on an IBM SP2 and a cluster of Pentium PIIs.
Abstract: We present a new parallel implementation of a divide and conquer algorithm for computing the spectral decomposition of a symmetric tridiagonal matrix on distributed memory architectures. The implementation we develop differs from other implementations in that we use a two-dimensional block cyclic distribution of the data, we use the Löwner theorem approach to compute orthogonal eigenvectors, and we introduce permutations before the back transformation of each rank-one update in order to make good use of deflation. This algorithm yields the first scalable, portable, and numerically stable parallel divide and conquer eigensolver. Numerical results confirm the effectiveness of our algorithm. We compare performance of the algorithm with that of the QR algorithm and of bisection followed by inverse iteration on an IBM SP2 and a cluster of Pentium PIIs.

101 citations
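
The two-dimensional block cyclic distribution mentioned in the abstract above maps each matrix entry to a process and a local position. The sketch below uses the usual index arithmetic with zero-based indices and the first block placed on process 0; the function names and that starting-process convention are assumptions made for illustration.

```python
def owner_and_local(g: int, block: int, nprocs: int) -> tuple[int, int]:
    """Map a global index to (owning process, local index) in one dimension."""
    block_id = g // block                     # which block the index falls in
    owner = block_id % nprocs                 # blocks are dealt out round-robin
    local = (block_id // nprocs) * block + g % block
    return owner, local


def block_cyclic_2d(i: int, j: int, mb: int, nb: int,
                    prows: int, pcols: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Map entry (i, j) to its process grid coordinates and local indices.

    mb x nb is the block size and prows x pcols the process grid.
    """
    prow, li = owner_and_local(i, mb, prows)
    pcol, lj = owner_and_local(j, nb, pcols)
    return (prow, pcol), (li, lj)
```

For example, with 2 x 2 blocks on a 2 x 2 process grid, entry (5, 2) belongs to process (0, 1) and is stored there at local position (3, 0). Dealing blocks out cyclically in both dimensions keeps the work balanced as the active part of the matrix shrinks.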

Proceedings Article•10.1109/IPPS.1999.760492•
OpenMP for networks of SMPs

Y.C. Hu1, Honghui Lu, Alan L. Cox, Willy Zwaenepoel1•
Rice University1
12 Apr 1999
TL;DR: A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions.
Abstract: In this paper we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared memory system (SDSM). In contrast to previous SDSM systems for SMPs, the modified TreadMarks uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intra-node hardware shared memory. We present performance results for six applications (SPLASH-2 Barnes-Hut and Water; NAS 3D-FFT, SOR, TSP and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions.
Patent•
Computer data packet switching and load balancing system using a general-purpose multiprocessor architecture

Cher-Wen Lin1, Kumar Ramaswamy1, Mizanur Mohammed Rahman1, Randall D. Rettberg1, Robert A. Doolittle1 •
Sun Microsystems1
15 Nov 1999
TL;DR: In this article, a data packet switching and server load balancing device is provided by a general-purpose multiprocessor computer system (10), which comprises a plurality of symmetrical processors coupled together by a common data bus.
Abstract: A data packet switching and server load balancing device is provided by a general-purpose multiprocessor computer system (10). The general-purpose multiprocessor computer system (10) comprises a plurality of symmetrical processors (24_0 ... 24_n) coupled together by a common data bus (12), a main memory (14) shared by the processors (24_0 ... 24_n), and a plurality of network interfaces (17_1 ... 17_m) each adapted to be coupled to respective external networks for receiving and sending data packets via a particular communication protocol, such as Transmission Control Protocol/Internet Protocol (TCP/IP). A first one of the processors (24_0 ... 24_n) is adapted to serve as a control processor and remaining ones of the processors (24_0 ... 24_n) are adapted to serve as data packet switching processors.
Journal Article•10.1137/S1064827597323208•
A Parallel Algorithm for Mesh Smoothing

Lori A. Freitag1, Mark T. Jones2, Paul E. Plassmann2•
Argonne National Laboratory1, Virginia Tech2
01 May 1999-SIAM Journal on Scientific Computing
TL;DR: This article presents a mesh-smoothing algorithm based on nonsmooth optimization techniques, together with a scalable implementation, and proves that the parallel algorithm has a fast runtime bound and executes correctly for a parallel random access machine (PRAM) computational model.
Abstract: Maintaining good mesh quality during the generation and refinement of unstructured meshes in finite-element applications is an important aspect in obtaining accurate discretizations and well-conditioned linear systems. In this article, we present a mesh-smoothing algorithm based on nonsmooth optimization techniques and a scalable implementation of this algorithm. We prove that the parallel algorithm has a provably fast runtime bound and executes correctly for a parallel random access machine (PRAM) computational model. We extend the PRAM algorithm to distributed memory computers and report results for two- and three-dimensional simplicial meshes that demonstrate the efficiency and scalability of this approach for a number of different test cases. We also examine the effect of different architectures on the parallel algorithm and present results for the IBM SP supercomputer and an ATM-connected network of SPARC Ultras.
Journal Article•10.1109/43.739054•
Local memory exploration and optimization in embedded systems

Preeti Ranjan Panda1, Nikil Dutt2, Alexandru Nicolau2•
Synopsys1, University of California, Irvine2
01 Jan 1999-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
TL;DR: This work presents an analytical strategy for exploring the on-chip memory architecture for a given application, based on a memory performance estimation scheme, and demonstrates that its estimations closely follow the actual simulated performance at significantly reduced run times.
Abstract: Embedded processor-based systems allow for the tailoring of the on-chip memory architecture based on application specific requirements. We present an analytical strategy for exploring the on-chip memory architecture for a given application, based on a memory performance estimation scheme. The analytical technique has the important advantage of enabling a fast evaluation of candidate memory architectures in the early stages of system design. Many digital signal-processing applications involve array accesses and loop nests that can benefit from such an exploration. Our experiments demonstrate that our estimations closely follow the actual simulated performance at significantly reduced run times.
Patent•
Scaleable shared-memory multi-processor computer system having repetitive chip structure with efficient busing and coherence controls

Michael Ignatowski1, Thomas J. Heller1, Gottfried Andreas Goldiran1•
IBM1
15 Sep 1999
TL;DR: In this paper, the authors propose a shared memory architecture for a high-performance shared-memory computer system, where each node has a nodal cache, nodal directory and nodal electronic switches to manage all transfers and data coherence among all cells in the same node and in different nodes.
Abstract: A novel structure for a highly-scaleable high-performance shared-memory computer system having simplified manufacturability. The computer system contains a repetition of system cells, in which each cell is comprised of a processor chip and a memory subset (having memory chips such as DRAMs or SRAMs) connected to the processor chip by a local memory bus. A unique type of intra-nodal busing connects each system cell in each node to each other cell in the same node. The memory subsets in the different cells need not have equal sizes, and the different nodes need not have the same number of cells. Each node has a nodal cache, a nodal directory and nodal electronic switches to manage all transfers and data coherence among all cells in the same node and in different nodes. The collection of all memory subsets in the computer system comprises the system shared memory, in which data stored in any memory subset is accessible to the processors on the other processor chips in the system. Each location in the system shared memory has a unique real address, which may be used by any processor in the system. Thus, the same memory addresses may be used in the executable instructions of all processors in the system. The nodal directories automatically manage the coherence of all data being changed in all processor caches in the computer system, regardless of where the data is stored in the shared memory of the system and regardless of which cell in the system contains the processor changing the data to provide data coherence across all nodes in the computer system.
Proceedings Article•10.1145/305138.305206•
Thread fork/join techniques for multi-level parallelism exploitation in NUMA multiprocessors

Xavier Martorell1, Eduard Ayguadé1, Nacho Navarro1, Julita Corbalan1, Marc Gonzalez1, Jesús Labarta1 •
Polytechnic University of Catalonia1
1 May 1999
TL;DR: This paper presents some techniques for efficient thread forking and joining in parallel execution environments, taking into consideration the physical structure of NUMA machines and the support for multi-level parallelization and processor grouping.
Abstract: This paper presents some techniques for efficient thread forking and joining in parallel execution environments, taking into consideration the physical structure of NUMA machines and the support for multi-level parallelization and processor grouping. Two work generation schemes and one join mechanism are designed, implemented, evaluated and compared with the ones used in the IRIX MP library, an efficient implementation which supports a single level of parallelism. Supporting multiple levels of parallelism is a current research goal, both in shared and distributed memory machines. Our proposals include a first work generation scheme (GWD, or global work descriptor) which supports multiple levels of parallelism, but not processor grouping. The second work generation scheme (LWD, or local work descriptor) has been designed to support multiple levels of parallelism and processor grouping. Processor grouping is needed to distribute processors among different parts of the computation and maintain the working set of each processor across different parallel constructs. The mechanisms are evaluated using synthetic benchmarks, two SPEC95fp applications and one NAS application. The performance evaluation concludes that: i) the overhead of the proposed mechanisms is similar to the overhead of the existing ones when exploiting a single level of parallelism, and ii) a remarkable improvement in performance is obtained for applications that have multiple levels of parallelism. The comparison with the traditional single-level parallelism exploitation gives an improvement in the range of 30-65% for these applications.
Journal Article•10.1109/76.752091•
Optimal memory organization for scalable texture codecs in MPEG-4

Gauthier Lafruit1, Lode Nachtergaele1, Jan Bormans1, Marc Engels1, Ivo Bolsens1 •
Katholieke Universiteit Leuven1
01 Mar 1999-IEEE Transactions on Circuits and Systems for Video Technology
TL;DR: A new calculation schedule is proposed that reduces this buffer memory size by up to two orders of magnitude, while still ensuring a number of external (off-chip) memory accesses that is very close to the theoretical minimum.
Abstract: This paper addresses the problem of minimizing memory size and memory accesses in multiresolution texture coding architectures for discrete cosine transform (DCT) and wavelet-based schemes used, for example, in virtual-world walk-throughs or facial animation scenes of an MPEG-4 system. The problem of minimizing the memory cost is important since memory accesses, memory bandwidth limitations, and in general the correct handling of the data flows have become the true critical issues in designing high-speed and low-power video-processing architectures and in efficiently using multimedia processors. For instance, the straightforward implementation of a multiresolution texture codec typically needs an extra memory buffer of the same size as the image to be encoded/decoded. We propose a new calculation schedule that reduces this buffer memory size by up to two orders of magnitude, while still ensuring a number of external (off-chip) memory accesses that is very close to the theoretical minimum. The analysis is generic and is therefore useful for both wavelet and multiresolution DCT codecs.
Journal Article•10.1109/5.747867•
Adaptive protocols for software distributed shared memory

Cristiana Amza1, Alan L. Cox1, Sandhya Dwarkadas2, Li-Jie Jin3, Karthick Rajamani1, Willy Zwaenepoel1 •
Rice University1, University of Rochester2, University of Virginia3
1 Mar 1999
TL;DR: The results indicate that adaptation between single- and multiple-writer and dynamic page aggregation are clearly beneficial and the results for the adaptation between invalidate and update are less compelling, showing at best gains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.
Abstract: We demonstrate the benefits of software shared memory protocols that adapt at run time to the memory access patterns observed in the applications. This adaptation is automatic (no user annotations are required) and does not rely on compiler support or special hardware. We investigate adaptation between single- and multiple-writer protocols, dynamic aggregation of pages into a larger transfer unit, and adaptation between invalidate and update. Our results indicate that adaptation between single- and multiple-writer and dynamic page aggregation are clearly beneficial. The results for the adaptation between invalidate and update are less compelling, showing at best gains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.
Journal Article•10.1016/S1383-7621(98)00019-8•
Compiler support for task scheduling in hierarchical execution models

Thomas Rauber1, Gudula Rünger2•
Martin Luther University of Halle-Wittenberg1, Leipzig University2
01 Jan 1999-Journal of Systems Architecture
TL;DR: For an important subclass of structured method parallelism, this paper presents a scheduling methodology that takes data redistributions between multiprocessor tasks into account and is designed for integration into a parallel compiler tool.
Book Chapter•10.1007/978-3-642-59901-9_6•
Immunological Memory is Associative

Derek J. Smith1•
University of New Mexico1
1 Jan 1999
TL;DR: This paper argues that immunological memory is in the same class of associative memories as Kanerva's Sparse Distributed Memory, Albus's Cerebellar Model Arithmetic Computer, and Marr's Theory of the Cerebellar Cortex.
Abstract: This paper argues that immunological memory is in the same class of associative memories as Kanerva's Sparse Distributed Memory, Albus's Cerebellar Model Arithmetic Computer, and Marr's Theory of the Cerebellar Cortex. This class of memories derives its associative and robust nature from a sparse sampling of a huge input space by recognition units (B and T cells in the immune system) and a distribution of the memory among many independent units (B and T cells in the memory population in the immune system).
Journal Article•10.1109/5.747870•
Shared virtual memory: progress and challenges

Liviu Iftode1, Jaswinder Pal Singh•
Rutgers University1
1 Jan 1999
TL;DR: A survey of the key developments in shared virtual memory research is provided, placing the multitrack flow of ideas and results obtained so far in a comprehensive new framework.
Abstract: Shared virtual memory, a technique for supporting a shared address space in software on parallel systems, has undergone a decade of research, with significant maturing of protocols and communication layers having now been achieved. We provide a survey of the key developments in this research, placing the multitrack flow of ideas and results obtained so far in a comprehensive new framework. Four major research tracks are covered: relaxed consistency models; protocol laziness; architectural support; and application-driven research. Several related avenues are also discussed, such as fine grained software coherence, software protocols across multiprocessor nodes, and performance scalability. We summarize comparative performance results from the literature, discuss their limitations, and identify lessons learned so far, key outstanding questions, and important directions for future research in this area.
Journal Article•10.1109/5.747864•
The MIT Alewife Machine

Anant Agarwal1, Ricardo Bianchini, David Chaiken, Frederic T. Chong, Kirk L. Johnson, David M. Kranz, John Kubiatowicz, Beng-Hong Lim, Kenneth Mackenzie, Donald Yeung •
Massachusetts Institute of Technology1
1 Mar 1999
TL;DR: Results show that integrating message passing with shared memory enables a cost efficient solution to the cache coherence problem and provides a rich set of programming primitives, and that messaging and shared memory operations are both important because each helps the programmer to achieve the best performance for various machine configurations.
Abstract: A variety of models for parallel architectures, such as shared memory, message passing, and data flow, have converged in the recent past to a hybrid architecture form called distributed shared memory (DSM). Alewife, an early prototype of such DSM architectures, uses hybrid software and hardware mechanisms to support coherent shared memory, efficient user level messaging, fine grain synchronization, and latency tolerance. Alewife supports up to 512 processing nodes connected over a scalable and cost effective mesh network at a constant cost per node. Four mechanisms combine to achieve Alewife's goals of scalability and programmability: software extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication. Extensive results from microbenchmarks, together with over a dozen complete applications running on a 32-node prototype, demonstrate that integrating message passing with shared memory enables a cost efficient solution to the cache coherence problem and provides a rich set of programming primitives. Our results further show that messaging and shared memory operations are both important because each helps the programmer to achieve the best performance for various machine configurations.
Patent•
Flexible probe/probe response routing for maintaining coherency

James B. Keller1, Dale E. Gulick1•
Advanced Micro Devices1
26 Aug 1999
TL;DR: A flexible probe command/response routing scheme is proposed for a computer system with multiple processing nodes coupled to separate memories, which may form a distributed memory system. Each probe command identifies the node that receives the probe responses: the source for read transactions and the target for write transactions, so that the target may determine when to commit the write data to memory and may receive any dirty data to be merged with the write data.
Abstract: A computer system may include multiple processing nodes, one or more of which may be coupled to separate memories which may form a distributed memory system. The processing nodes may include caches, and the computer system may maintain coherency between the caches and the distributed memory system. Particularly, the computer system may implement a flexible probe command/response routing scheme. The scheme may employ an indication within the probe command which identifies a receiving node to receive the probe responses. For example, probe commands indicating that the target or the source of the transaction should receive probe responses corresponding to the transaction may be included. Probe commands may specify the source of the transaction as the receiving node for read transactions (such that dirty data is delivered to the source node from the node storing the dirty data). On the other hand, for write transactions (in which data is being updated in memory at the target node of the transaction), the probe commands may specify the target of the transaction as the receiving node. In this manner, the target may determine when to commit the write data to memory and may receive any dirty data to be merged with the write data.
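The routing rule described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names `Transaction` and `probe_response_receiver` and the use of integer node IDs are assumptions made for illustration only.

```python
from enum import Enum


class Transaction(Enum):
    """Transaction kinds named in the abstract (enum itself is hypothetical)."""
    READ = "read"
    WRITE = "write"


def probe_response_receiver(kind, source_node, target_node):
    """Return the node that the probe command designates to receive probe responses.

    Reads: the source receives responses, so dirty data goes straight to it.
    Writes: the target receives responses, so it can merge dirty data with
    the write data and decide when to commit to memory.
    """
    if kind is Transaction.READ:
        return source_node
    if kind is Transaction.WRITE:
        return target_node
    raise ValueError(f"unsupported transaction kind: {kind!r}")
```

For example, with a hypothetical source node 0 and target node 3, a read routes responses to node 0 and a write routes them to node 3.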
Patent•
Secure partitioning of shared memory based multiprocessor system

Hubertus Franke1, Mark E. Giampapa1, Joefon Jann1, Douglas J. Joseph1, Pratap Pattnaik1 •
IBM1
23 Feb 1999
TL;DR: In this article, the authors propose a method and apparatus for sharing memory in a multiprocessor computing system, which provides a number of system buses with each bus being connected to a respective memory controller which controls a corresponding partition of the memory.
Abstract: A method and apparatus for sharing memory in a multiprocessor computing system. More specifically, this invention provides a number of system buses with each bus being connected to a respective memory controller which controls a corresponding partition of the memory. Any one of the processors can use any one of the system buses to send real addresses to the connected memory controller which then converts the real addresses into physical addresses corresponding to the partition of memory that is controlled by the receiving memory controller. The processors can be dynamically assigned to different partitions of the memory via a switching mechanism.
A modular simulation system for the bidomain equations

J.B. Pormann, John A. Board
1 Jan 1999
TL;DR: A modular framework for development of a Simulation System whereby a researcher may mix and match program elements to generate a simulator tailored to their particular problem, and the results presented here show that a modular Simulation System is feasible for a wide range of applications, and that it can obtain very good performance over this range of applications.
Abstract: Cardiac arrhythmias and fibrillation are potentially life threatening diseases that can result from the improper conduction of electrical impulses in the heart. Experimental studies of such cardiac abnormalities are dangerous at best, often requiring the subject to be placed in fibrillation for some time before attempting a large “rescue” shock. Thus, almost all studies are done in animals and not humans. Furthermore, there is some indication that heart size may have considerable implications for fibrillation and other conduction abnormalities. Thus animal models for defibrillation studies must be chosen with great care. As an alternative, researchers are now using computer simulation to study the factors that generate and sustain arrhythmias, hoping to obtain at least preliminary data to guide fewer, more targeted experimental studies. Computer simulations of the Bidomain Equations have become very complex as they have been applied to many problems in cardiac electrophysiology. More complex membrane dynamics, irregular grids, and 3-D data sets are all being investigated. Software engineering principles will need to be applied to manage this continuing growth in complexity. We propose a modular framework for development of a Simulation System whereby a researcher may mix and match program elements to generate a simulator tailored to their particular problem. The modular approach will simplify the generation and maintenance of the different program elements and it will enable the end-researcher to determine the proper mix of complexity versus speed for their particular problem of interest. The contrary approach, one monolithic program which can run all simulations of all complexities, is simply unrealistic. It would impose too great a burden on maintenance and upgradability, and it would be difficult to provide good performance for a wide range of applications.
The modular approach also allows for the incremental inclusion of various complexities in the bidomain model. From a simple 2-D homogeneous, isotropic regular grid, monodomain simulation, we can progress, step by step, to a bidomain simulation with a fully implicit time-integration scheme on irregular, 3-D grids with arbitrary anisotropy and inhomogeneity, with a non-trivial membrane model. Simulations with such a wealth of complexity have not been performed to date. As microprocessors have become cheaper and more powerful, parallel computing has become more widespread. Machines with hundreds of high-performance CPUs connected by fast networks are commonplace and are now capable of surpassing traditional vector-based supercomputers in terms of overall performance. The Simulation System presented here incorporates data-parallelism to allow large scale Bidomain problems to be run on these newest parallel supercomputers. The large amount of distributed memory in such machines can be harnessed to allow extremely large scale simulations to be run. The large number of CPUs provide a tremendous amount of computational power which can be used to run such simulations more quickly. Finally, the results presented here show that a modular Simulation System is feasible for a wide range of applications, and that it can obtain very good performance over this range of applications. The parallel speed-up seen was very good, regularly achieving a factor of 13 speed-up on 16 processors. The results presented here also show that we can simulate bidomain problems using an implicit time-integrator with an irregular, anisotropic and inhomogeneous, grid and a non-trivial membrane model. We are able to run such simulations on parallel computers, thereby harnessing a tremendous amount of memory and computational resources. Such simulations have not been run to date.
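The reported factor of 13 speed-up on 16 processors can be read as a parallel efficiency, using the standard definition of efficiency as speed-up divided by processor count. The short sketch below only restates that arithmetic; the function name is illustrative and does not come from the paper.

```python
def parallel_efficiency(speedup, processors):
    """Parallel efficiency: achieved speed-up divided by the processor count."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Figures quoted in the abstract: speed-up of 13 on 16 processors.
efficiency = parallel_efficiency(13, 16)
print(f"{efficiency:.4f}")  # prints 0.8125, i.e. about 81% efficiency
```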
Proceedings Article•10.1145/300979.301003•
Using network interface support to avoid asynchronous protocol processing in shared virtual memory systems

Angelos Bilas1, Cheng Liao2, Jaswinder Pal Singh2•
University of Toronto1, Princeton University2
1 May 1999
TL;DR: The performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and the value of each of the mechanisms in different applications is shown.
Abstract: The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardware-coherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechanisms are simple and do not require programmability. We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and we show the value of each of the mechanisms in different applications. Application performance improves by about 37% on average for reasonably well performing applications, even on our relatively slow programmable NI, and more for others. We discuss the key remaining bottlenecks at the protocol level and use a firmware performance monitor in the NI to understand the interactions with and the implications for the communication layer.
Patent•
Highly componentized system architecture with a demand-loading namespace and programming model

Alessandro Forin, Johannes Helander, Andrew Raffman
31 Mar 1999
TL;DR: In this paper, the authors present a software executable on a computer having a working memory with demand-loadable components initially stored outside of the working memory, each component having an entry point including a constructor for an object.
Abstract: The invention is embodied in software executable on a computer having a working memory with demand-loadable components initially stored outside of the working memory, each component having an entry point including a constructor for an object. Preferably, the demand-loadable components are initially provided in a memory within the computer or a location external of the computer. A Namespace in the working memory provides access in the working memory to the components as they become needed by applications running in the computer. The Namespace provides the access by managing demand-loading and unloading of the components in the working memory.
Patent•
Automatic teller machine with secure variable storage for internet applications

John Martin Flenley1•
IBM1
29 Dec 1999
TL;DR: In this paper, a memory controller component for an Internet application such as a bank, airline or supermarket web site is disclosed; the component is instantiated from a web page in the application and exposes a number of methods.
Abstract: A memory controller component (10, 30, 40) for an Internet application such as a bank, airline or supermarket web site is disclosed. The component is instantiated from a web page in the application and exposes a number of methods. A connect method checks if a predetermined area of named shared memory has been allocated, and responsive to the named shared memory not being allocated, allocates named shared memory. A setVariable method allocates an area in the shared memory for a variable and stores the variable. A getVariable method locates the variable in shared memory. Corresponding get and set encrypted variable methods are also disclosed. Furthermore, an ATM running an Internet browser enabling a user to swap between web sites at an ATM is disclosed.
Patent•
Host access to shared memory with a high priority mode

Laurent Six1, Armelle Laine1, Daniel Mazzocco1, Gerald Ollivier1•
Texas Instruments1
9 Jun 1999
TL;DR: In this paper, a digital system is provided with a memory that can be shared by two or more data requestors, and two modes of access are provided: shared access memory (SAM) and host only memory (HOM) access modes.
Abstract: A digital system is provided with a memory (42) that can be shared by two or more data requestors (10, 20). Two modes of access are provided. In a shared access memory (SAM) access mode, all of the data requestors can sequentially access the memory. In a host only memory (HOM) access mode, the memory is connected directly to one of the requestors, such as a host processor (10), so that high bandwidth transfers can be performed. The HOM access mode is entered when a priority assigned to the host processor is set to be higher than a priority assigned to any other requestor. Registers for holding the priority assignments can be written by at least one of the requestors.
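The mode-selection rule in this abstract can be sketched as follows. This is only an illustration: the function name `access_mode`, the requestor names, the dictionary standing in for the priority registers, and the convention that a larger number means a higher priority are all assumptions, not details from the patent.

```python
def access_mode(priorities, host="host"):
    """Pick the memory access mode from a table of requestor priorities.

    `priorities` maps requestor name -> priority (larger = higher, assumed).
    Returns "HOM" when the host outranks every other requestor, else "SAM".
    """
    host_priority = priorities[host]
    others = [p for name, p in priorities.items() if name != host]
    # HOM requires the host priority to be strictly higher than all others.
    if all(host_priority > p for p in others):
        return "HOM"
    return "SAM"
```

With hypothetical requestors, `access_mode({"host": 3, "dsp": 1, "dma": 2})` selects HOM, while a tie such as `access_mode({"host": 2, "dsp": 2})` stays in SAM because the host priority is not strictly higher.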
Proceedings Article•10.5555/339492.340063•
Memory bank customization and assignment in behavioral synthesis

Preeti Ranjan Panda1•
Synopsys1
7 Nov 1999
TL;DR: This work addresses the topic of incorporating the application-specific customization of memory bank configuration into behavioral synthesis, and presents a heuristic for determining the access sequence that minimizes page misses in a bank while respecting data dependences.
Abstract: With increasing design complexity and chip area, on-chip memory has become an important component whose integration needs to be addressed during system design. Modern embedded DRAM technology allows for large amounts of on-chip memory space. However, in order to utilize the available memory intelligently, the memory has to be appropriately customized for the specific application. We address the topic of incorporating the application-specific customization of memory bank configuration into behavioral synthesis. The strategy involves a partitioning of behavioral arrays into memory banks based on a cost function that estimates the performance implications. For a given candidate partition, we present a heuristic for determining the access sequence that minimizes page misses in a bank while respecting data dependences. The output of the exploration is a graph displaying the variation of delay and memory area with the bank configuration. Our experiments on several memory-intensive examples confirm that the exploration results can provide critical feedback to the designer about the optimal memory configuration for a given application.