Top 354 papers published on the topic of Distributed memory in 1999

Showing papers on "Distributed memory" published in 1999

Journal Article•10.2514/2.2413•

Constrained Multipoint Aerodynamic Shape Optimization Using an Adjoint Formulation and Parallel Computers

James Reuther¹, Antony Jameson, Juan J. Alonso, Mark J. Rimlinger, David Saunders•Institutions (1)

01 Jan 1999-Journal of Aircraft

TL;DR: In this article, an aerodynamic shape optimization method that treats the design of complex aircraft configurations subject to high fidelity computational fluid dynamics (CFD), geometric constraints and multiple design points is described.

Abstract: An aerodynamic shape optimization method that treats the design of complex aircraft configurations subject to high fidelity computational fluid dynamics (CFD), geometric constraints and multiple design points is described. The design process will be greatly accelerated through the use of both control theory and distributed memory computer architectures. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods. The resulting problem is implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on a higher order CFD method. In order to facilitate the integration of these high fidelity CFD approaches into future multi-disciplinary optimization (MDO) applications, new methods must be developed which are capable of simultaneously addressing complex geometries, multiple objective functions, and geometric design constraints. In our earlier studies, we coupled the adjoint based design formulations with unconstrained optimization algorithms and showed that the approach was effective for the aerodynamic design of airfoils, wings, wing-bodies, and complex aircraft configurations. In many of the results presented in these earlier works, geometric constraints were satisfied either by a projection into feasible space or by posing the design space parameterization such that it automatically satisfied constraints. Furthermore, with the exception of reference 9 where the second author initially explored the use of multipoint design in conjunction with adjoint formulations, our earlier works have focused on single point design efforts. Here we demonstrate that the same methodology may be extended to treat complete configuration designs subject to multiple design points and geometric constraints. Examples are presented for both transonic and supersonic configurations ranging from wing alone designs to complex configuration designs involving wing, fuselage, nacelles and pylons.

400 citations

Patent•

Computer system and method for operating multiple operating systems in different partitions of the computer system and for allowing the different partitions to communicate with one another through shared memory

Robert C. Gulick¹, Douglas E. Morrissey¹, Charles R. Caldarale¹, Bruce Alan Vessey¹, Craig F. Russ¹, Eugene W. Troxell¹, Hans C. Mikkelsen¹, Sharon M. Mauer¹, Maureen P. Connell¹, James R. Hunter¹•Institutions (1)

Unisys¹

17 Dec 1999

TL;DR: In this article, the physical address space of the processors in each partition is mapped to the respective exclusive memory windows assigned to each partition, so that the exclusive windows appear to the operating systems executing on those partitions as if they all start at the same base address.

Abstract: A computer system comprises a plurality of processing modules that can be configured into different partitions within the computer system, and a main memory. Each partition operates under the control of a separate operating system. At least one shared memory window is defined within the main memory to which multiple partitions have shared access, and each partition may also be assigned an exclusive memory window. Program code executing on different partitions enables those partitions to communicate with each other through the shared memory window. Means are also provided for mapping the physical address space of the processors in each partition to the respective exclusive memory windows assigned to each partition, so that the exclusive memory windows assigned to each partition appear to the respective operating systems executing on those partitions as if they all start at the same base address.
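The address mapping described in this abstract can be illustrated with a minimal sketch. It is not taken from the patent: the `Window` type, the `translate` function, the window sizes, and the choice of 0 as the common base address are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A contiguous region of main memory (hypothetical type)."""
    base: int  # start of the window in main memory
    size: int  # length of the window in bytes


def translate(addr: int, exclusive: Window, common_base: int = 0) -> int:
    """Map an address as seen by a partition's OS to a main-memory address.

    Every partition sees its exclusive window starting at `common_base`;
    the mapping relocates that view to the window's real position.
    """
    offset = addr - common_base
    if not 0 <= offset < exclusive.size:
        raise ValueError("address outside the partition's exclusive window")
    return exclusive.base + offset


# Two partitions with disjoint exclusive windows (illustrative values).
windows = {
    "A": Window(base=0x0000_0000, size=0x4000_0000),
    "B": Window(base=0x4000_0000, size=0x4000_0000),
}
```

Both operating systems can use the same address, for example `0x1000`, yet `translate(0x1000, windows["A"])` and `translate(0x1000, windows["B"])` land in different regions of main memory.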

258 citations

Patent•

Memory controller for controlling memory accesses across networks in distributed shared memory processing systems

Howard Thomas Olnowich¹•Institutions (1)

IBM¹

10 Sep 1999

TL;DR: In this article, a shared memory parallel processing system interconnected by a multi-stage network combines new system configuration techniques with special-purpose hardware to provide remote memory accesses across the network, while controlling cache coherency efficiently.

Abstract: A shared memory parallel processing system interconnected by a multi-stage network combines new system configuration techniques with special-purpose hardware to provide remote memory accesses across the network, while controlling cache coherency efficiently across the network. The system configuration techniques include a systematic method for partitioning and controlling the memory in relation to local versus remote accesses and changeable versus unchangeable data. Most of the special-purpose hardware is implemented in the memory controller and network adapter, which implements three send FIFOs and three receive FIFOs at each node to segregate and handle efficiently invalidate functions, remote stores, and remote accesses requiring cache coherency. The segregation of these three functions into different send and receive FIFOs greatly facilitates the cache coherency function over the network. In addition, the network itself is tailored to provide the best efficiency for remote accesses.

160 citations

Journal Article•10.1137/S1064827598339141•

Semicoarsening Multigrid on Distributed Memory Machines

Peter Brown¹, Robert D. Falgout¹, Jim E. Jones¹•Institutions (1)

Lawrence Livermore National Laboratory¹

11 Dec 1999-SIAM Journal on Scientific Computing

TL;DR: This paper presents the results of a scalability study for a three-dimensional semicoarsening multigrid solver on a distributed memory computer, and examines the scalability of the solver theoretically and experimentally.

Abstract: This paper presents the results of a scalability study for a three-dimensional semicoarsening multigrid solver on a distributed memory computer. In particular, we are interested in the scalability of the solver---how the solution time varies as both problem size and number of processors are increased. For an iterative linear solver, scalability involves both algorithmic issues and implementation issues. We examine the scalability of the solver theoretically by constructing a simple parallel model and experimentally by results obtained on an IBM SP. The results are compared with those obtained for other solvers on the same computer.

138 citations

Proceedings Article•10.1145/300979.300980•

Maps: a compiler-managed memory system for raw machines

Rajeev Barua¹, Walter Lee¹, Saman Amarasinghe¹, Anant Agarwal¹•Institutions (1)

Massachusetts Institute of Technology¹

1 May 1999

TL;DR: Maps, a compiler managed memory system for Raw architectures, is implemented based on the SUIF infrastructure, and it is demonstrated that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for regular applications and about 5-fold speedup on 16 or more tiles for irregular applications.

Abstract: This paper describes Maps, a compiler managed memory system for Raw architectures. Traditional processors for sequential programs maintain the abstraction of a unified memory by using a single centralized memory system. This implementation leads to the infamous "Von Neumann bottleneck," with machine performance limited by the large memory latency and limited memory bandwidth. A Raw architecture addresses this problem by taking advantage of the rapidly increasing transistor budget to move much of its memory on chip. To remove the bottleneck and complexity associated with centralized memory, Raw distributes the memory with its processing elements. Unified memory semantics are implemented jointly by the hardware and the compiler. The hardware provides a clean compiler interface to its two inter-tile interconnects: a fast, statically schedulable network and a traditional dynamic network. Maps then uses these communication mechanisms to orchestrate the memory accesses for low latency and parallelism while enforcing proper dependence. It optimizes for speed in two ways: by finding accesses that can be scheduled on the static interconnect through static promotion, and by minimizing dependence sequentialization for the remaining accesses. Static promotion is performed using equivalence class unification and modulo unrolling; memory dependences are enforced through explicit synchronization and software serial ordering. We have implemented Maps based on the SUIF infrastructure. This paper demonstrates that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for our regular applications and about 5-fold speedup on 16 or more tiles for our irregular applications. The paper also shows that selective use of dynamic accesses can be a useful complement to the mostly static memory system.

135 citations

Journal Article•

Towards Self-Tuning Memory Management for Data Servers

Gerhard Weikum¹, Arnd Christian König, Achim Kraiss, Markus Sinnwell•Institutions (1)

Max Planck Society¹

01 Jan 1999-IEEE Data(base) Engineering Bulletin

TL;DR: This paper gives an overview of self-tuning methods for a spectrum of memory management issues, ranging from traditional caching to exploiting distributed memory in a server cluster and speculative prefetching in a Web-based system.

Abstract: Although today’s computers provide huge amounts of main memory, the ever-increasing load of large data servers, imposed by resource-intensive decision-support queries and accesses to multimedia and other complex data, often leads to memory contention and may result in severe performance degradation. Therefore, careful tuning of memory management is crucial for heavy-load data servers. This paper gives an overview of self-tuning methods for a spectrum of memory management issues, ranging from traditional caching to exploiting distributed memory in a server cluster and speculative prefetching in a Web-based system. The common, fundamental elements in these methods include on-line load tracking, near-future access prediction based on stochastic models and the available on-line statistics, and dynamic and automatic adjustment of control parameters in a feedback loop.

1 The Need for Memory Tuning

Although memory is relatively inexpensive and modern computer systems are amply equipped with it, memory contention on heavily loaded data servers is a common cause of performance problems. The reasons are threefold: Servers are operating with a multitude of complex software, ranging from the operating system to database systems, object request brokers, and application services. Much of this software has been written so as to quickly penetrate the market rather than optimizing memory usage and other resource consumption. The distinctive characteristic and key problem of a data server is that it operates in multi-user mode, serving many clients concurrently or in parallel. Therefore, a server needs to divide up its resources among the simultaneously active threads for executing queries, transactions, stored procedures, Web applications, etc. Often, multiple data-intensive decision-support queries compete for memory. The data volumes that need to be managed by a server seem to be growing without limits. One part of this trend is that multimedia data types such as images, speech, or video have become more popular and are being merged into conventional-data applications (e.g., images or videos for insurance claims). The other

Copyright 1999 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering

124 citations

Journal Article•10.1016/S0168-9274(98)00117-2•

Approximate sparsity patterns for the inverse of a matrix and preconditioning

Thomas Huckle¹•Institutions (1)

Technische Universität München¹

01 Jun 1999-Applied Numerical Mathematics

TL;DR: This paper compares different strategies for choosing a priori an approximate sparsity structure of A⁻¹ and exactly determines the submatrices that are used in the SPAI algorithm to compute one new column of the sparse approximate inverse M.

123 citations

Journal Article•10.1145/357783.331677•

The processor-memory bottleneck: problems and solutions

Nihar R. Mahapatra, Balakrishna Venkatrao

01 Apr 1999-ACM Crossroads Student Magazine

TL;DR: This article examines the problem of an increasing Processor - Memory Performance Gap, which now is the primary obstacle to improved computer system performance.

Abstract: The rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM (Dynamic Random Access Memory) speed. So although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one. Hence computer designers are faced with an increasing Processor - Memory Performance Gap [1], which now is the primary obstacle to improved computer system performance. This article examines this problem as well as its various solutions.

117 citations

Journal Article•10.1109/78.747802•

An effective memory addressing scheme for FFT processors

Yutai Ma¹•Institutions (1)

Linköping University¹

01 Mar 1999-IEEE Transactions on Signal Processing

TL;DR: The memory organization of FFT processors is considered and a new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations.

Abstract: The memory organization of FFT processors is considered. The new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations. The advantage of this memory addressing scheme lies in the fact that it reduces the delay of address generation nearly by half compared to existing ones.

103 citations

Journal Article•10.1137/S1064827598336951•

A Parallel Divide and Conquer Algorithm for the Symmetric Eigenvalue Problem on Distributed Memory Architectures

Françoise Tisseur, Jack Dongarra

01 May 1999-SIAM Journal on Scientific Computing

TL;DR: This algorithm yields the first scalable, portable, and numerically stable parallel divide and conquer eigensolver, and its performance is compared with that of the QR algorithm and of bisection followed by inverse iteration on an IBM SP2 and a cluster of Pentium PIIs.

Abstract: We present a new parallel implementation of a divide and conquer algorithm for computing the spectral decomposition of a symmetric tridiagonal matrix on distributed memory architectures. The implementation we develop differs from other implementations in that we use a two-dimensional block cyclic distribution of the data, we use the Löwner theorem approach to compute orthogonal eigenvectors, and we introduce permutations before the back transformation of each rank-one update in order to make good use of deflation. This algorithm yields the first scalable, portable, and numerically stable parallel divide and conquer eigensolver. Numerical results confirm the effectiveness of our algorithm. We compare performance of the algorithm with that of the QR algorithm and of bisection followed by inverse iteration on an IBM SP2 and a cluster of Pentium PIIs.
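The two-dimensional block cyclic distribution named in this abstract is not spelled out there. The sketch below shows the standard rule used by ScaLAPACK-style libraries, assuming the first block sits on process (0, 0). The function names and the block and grid sizes are hypothetical and are not taken from the paper.

```python
def owner(i: int, j: int, mb: int, nb: int, p_rows: int, p_cols: int) -> tuple[int, int]:
    """Return the (row, column) grid coordinates of the process that owns
    global matrix entry (i, j), for mb x nb blocks dealt out cyclically
    over a p_rows x p_cols process grid."""
    return (i // mb) % p_rows, (j // nb) % p_cols


def local_index(g: int, block: int, nprocs: int) -> int:
    """Return the index, within the owning process's local array, of global
    index g along one dimension (block size `block`, `nprocs` processes)."""
    return (g // (block * nprocs)) * block + g % block
```

With 2 x 2 blocks on a 2 x 3 grid, `owner(4, 6, 2, 2, 2, 3)` wraps around to process (0, 0), which is how the cyclic layout spreads every region of the matrix over all processes.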

101 citations

Proceedings Article•10.1109/IPPS.1999.760492•

OpenMP for networks of SMPs

Y.C. Hu¹, Honghui Lu, Alan L. Cox, Willy Zwaenepoel¹•Institutions (1)

Rice University¹

12 Apr 1999

TL;DR: A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions.

Abstract: In this paper we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared memory system (SDSM). In contrast to previous SDSM systems for SMPs, the modified TreadMarks uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intra-node hardware shared memory. We present performance results for six applications (SPLASH-2 Barnes-Hut and Water; NAS 3D-FFT, SOR, TSP and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions.

Patent•

Computer data packet switching and load balancing system using a general-purpose multiprocessor architecture

Cher-Wen Lin¹, Kumar Ramaswamy¹, Mizanur Mohammed Rahman¹, Randall D. Rettberg¹, Robert A. Doolittle¹•Institutions (1)

Sun Microsystems¹

15 Nov 1999

TL;DR: In this article, a data packet switching and server load balancing device is provided by a general-purpose multiprocessor computer system (10), which comprises a plurality of symmetrical processors coupled together by a common data bus.

Abstract: A data packet switching and server load balancing device is provided by a general-purpose multiprocessor computer system (10). The general-purpose multiprocessor computer system (10) comprises a plurality of symmetrical processors (24o ... 24n) coupled together by a common data bus (12), a main memory (14) shared by the processors (24o ... 24n), and a plurality of network interfaces (17i ... 17m) each adapted to be coupled to respective external networks for receiving and sending data packets via a particular communication protocol, such as Transmission Control Protocol/Internet Protocol (TCP/IP). A first one of the processors (24o ... 24n) is adapted to serve as a control processor and remaining ones of the processors (24o ... 24n) are adapted to serve as data packet switching processors.

Journal Article•10.1137/S1064827597323208•

A Parallel Algorithm for Mesh Smoothing

Lori A. Freitag¹, Mark T. Jones², Paul E. Plassmann²•Institutions (2)

Argonne National Laboratory¹, Virginia Tech²

01 May 1999-SIAM Journal on Scientific Computing

TL;DR: This article presents a mesh-smoothing algorithm based on nonsmooth optimization techniques and a scalable implementation of it, and proves that the parallel algorithm has a fast runtime bound and executes correctly for a parallel random access machine (PRAM) computational model.

Abstract: Maintaining good mesh quality during the generation and refinement of unstructured meshes in finite-element applications is an important aspect in obtaining accurate discretizations and well-conditioned linear systems. In this article, we present a mesh-smoothing algorithm based on nonsmooth optimization techniques and a scalable implementation of this algorithm. We prove that the parallel algorithm has a provably fast runtime bound and executes correctly for a parallel random access machine (PRAM) computational model. We extend the PRAM algorithm to distributed memory computers and report results for two- and three-dimensional simplicial meshes that demonstrate the efficiency and scalability of this approach for a number of different test cases. We also examine the effect of different architectures on the parallel algorithm and present results for the IBM SP supercomputer and an ATM-connected network of SPARC Ultras.

Journal Article•10.1109/43.739054•

Local memory exploration and optimization in embedded systems

Preeti Ranjan Panda¹, Nikil Dutt², Alexandru Nicolau²•Institutions (2)

Synopsys¹, University of California, Irvine²

01 Jan 1999-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: This work presents an analytical strategy for exploring the on-chip memory architecture for a given application, based on a memory performance estimation scheme, and demonstrates that its estimations closely follow the actual simulated performance at significantly reduced run times.

Abstract: Embedded processor-based systems allow for the tailoring of the on-chip memory architecture based on application specific requirements. We present an analytical strategy for exploring the on-chip memory architecture for a given application, based on a memory performance estimation scheme. The analytical technique has the important advantage of enabling a fast evaluation of candidate memory architectures in the early stages of system design. Many digital signal-processing applications involve array accesses and loop nests that can benefit from such an exploration. Our experiments demonstrate that our estimations closely follow the actual simulated performance at significantly reduced run times.

Patent•

Scaleable shared-memory multi-processor computer system having repetitive chip structure with efficient busing and coherence controls

Michael Ignatowski¹, Thomas J. Heller¹, Gottfried Andreas Goldiran¹•Institutions (1)

IBM¹

15 Sep 1999

TL;DR: In this paper, the authors propose a shared memory architecture for a high-performance shared-memory computer system, where each node has a nodal cache, nodal directory and nodal electronic switches to manage all transfers and data coherence among all cells in the same node and in different nodes.

Abstract: A novel structure for a highly-scaleable high-performance shared-memory computer system having simplified manufacturability. The computer system contains a repetition of system cells, in which each cell is comprised of a processor chip and a memory subset (having memory chips such as DRAMs or SRAMs) connected to the processor chip by a local memory bus. A unique type of intra-nodal busing connects each system cell in each node to each other cell in the same node. The memory subsets in the different cells need not have equal sizes, and the different nodes need not have the same number of cells. Each node has a nodal cache, a nodal directory and nodal electronic switches to manage all transfers and data coherence among all cells in the same node and in different nodes. The collection of all memory subsets in the computer system comprises the system shared memory, in which data stored in any memory subset is accessible to the processors on the other processor chips in the system. Each location in the system shared memory has a unique real address, which may be used by any processor in the system. Thus, the same memory addresses may be used in the executable instructions of all processors in the system. The nodal directories automatically manage the coherence of all data being changed in all processor caches in the computer system, regardless of where the data is stored in the shared memory of the system and regardless of which cell in the system contains the processor changing the data to provide data coherence across all nodes in the computer system.

Proceedings Article•10.1145/305138.305206•

Thread fork/join techniques for multi-level parallelism exploitation in NUMA multiprocessors

Xavier Martorell¹, Eduard Ayguadé¹, Nacho Navarro¹, Julita Corbalan¹, Marc Gonzalez¹, Jesús Labarta¹•Institutions (1)

Polytechnic University of Catalonia¹

1 May 1999

TL;DR: This paper presents some techniques for efficient thread forking and joining in parallel execution environments, taking into consideration the physical structure of NUMA machines and the support for multi-level parallelization and processor grouping.

Abstract: This paper presents some techniques for efficient thread forking and joining in parallel execution environments, taking into consideration the physical structure of NUMA machines and the support for multi-level parallelization and processor grouping. Two work generation schemes and one join mechanism are designed, implemented, evaluated and compared with the ones used in the IRIX MP library, an efficient implementation which supports a single level of parallelism. Supporting multiple levels of parallelism is a current research goal, both in shared and distributed memory machines. Our proposals include a first work generation scheme (GWD, or global work descriptor) which supports multiple levels of parallelism, but not processor grouping. The second work generation scheme (LWD, or local work descriptor) has been designed to support multiple levels of parallelism and processor grouping. Processor grouping is needed to distribute processors among different parts of the computation and maintain the working set of each processor across different parallel constructs. The mechanisms are evaluated using synthetic benchmarks, two SPEC95fp applications and one NAS application. The performance evaluation concludes that: i) the overhead of the proposed mechanisms is similar to the overhead of the existing ones when exploiting a single level of parallelism, and ii) a remarkable improvement in performance is obtained for applications that have multiple levels of parallelism. The comparison with the traditional single-level parallelism exploitation gives an improvement in the range of 30-65% for these applications.

Journal Article•10.1109/76.752091•

Optimal memory organization for scalable texture codecs in MPEG-4

Gauthier Lafruit¹, Lode Nachtergaele¹, Jan Bormans¹, Marc Engels¹, Ivo Bolsens¹•Institutions (1)

Katholieke Universiteit Leuven¹

01 Mar 1999-IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: A new calculation schedule is proposed that reduces the buffer memory size by up to two orders of magnitude, while still ensuring a number of external (off-chip) memory accesses that is very close to the theoretical minimum.

Abstract: This paper addresses the problem of minimizing memory size and memory accesses in multiresolution texture coding architectures for discrete cosine transform (DCT) and wavelet-based schemes used, for example, in virtual-world walk-throughs or facial animation scenes of an MPEG-4 system. The problem of minimizing the memory cost is important since memory accesses, memory bandwidth limitations, and in general the correct handling of the data flows have become the true critical issues in designing high-speed and low-power video-processing architectures and in efficiently using multimedia processors. For instance, the straightforward implementation of a multiresolution texture codec typically needs an extra memory buffer of the same size as the image to be encoded/decoded. We propose a new calculation schedule that reduces this buffer memory size by up to two orders of magnitude, while still ensuring a number of external (off-chip) memory accesses that is very close to the theoretical minimum. The analysis is generic and is therefore useful for both wavelet and multiresolution DCT codecs.

Journal Article•10.1109/5.747867•

Adaptive protocols for software distributed shared memory

Cristiana Amza¹, Alan L. Cox¹, Sandhya Dwarkadas², Li-Jie Jin³, Karthick Rajamani¹, Willy Zwaenepoel¹•Institutions (3)

Rice University¹, University of Rochester², University of Virginia³

1 Mar 1999

TL;DR: The results indicate that adaptation between single- and multiple-writer and dynamic page aggregation are clearly beneficial and the results for the adaptation between invalidate and update are less compelling, showing at best gains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.

Abstract: We demonstrate the benefits of software shared memory protocols that adapt at run time to the memory access patterns observed in the applications. This adaptation is automatic (no user annotations are required) and does not rely on compiler support or special hardware. We investigate adaptation between single- and multiple-writer protocols, dynamic aggregation of pages into a larger transfer unit, and adaptation between invalidate and update. Our results indicate that adaptation between single- and multiple-writer and dynamic page aggregation are clearly beneficial. The results for the adaptation between invalidate and update are less compelling, showing at best gains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.

Journal Article•10.1016/S1383-7621(98)00019-8•

Compiler support for task scheduling in hierarchical execution models

Thomas Rauber¹, Gudula Rünger²•Institutions (2)

Martin Luther University of Halle-Wittenberg¹, Leipzig University²

01 Jan 1999-Journal of Systems Architecture

TL;DR: For an important subclass of structured method parallelism, this paper presents a scheduling methodology that takes data redistributions between multiprocessor tasks into account and is designed for integration into a parallel compiler tool.

Book Chapter•10.1007/978-3-642-59901-9_6•

Immunological Memory is Associative

Derek J. Smith¹•Institutions (1)

University of New Mexico¹

1 Jan 1999

TL;DR: This paper argues that immunological memory is in the same class of associative memories as Kanerva's Sparse Distributed Memory, Albus's Cerebellar Model Arithmetic Computer, and Marr's Theory of the Cerebellar Cortex.

Abstract: This paper argues that immunological memory is in the same class of associative memories as Kanerva's Sparse Distributed Memory, Albus's Cerebellar Model Arithmetic Computer, and Marr's Theory of the Cerebellar Cortex. This class of memories derives its associative and robust nature from a sparse sampling of a huge input space by recognition units (B and T cells in the immune system) and a distribution of the memory among many independent units (B and T cells in the memory population in the immune system).

Journal Article•10.1109/5.747870•

Shared virtual memory: progress and challenges

Liviu Iftode¹, Jaswinder Pal Singh•Institutions (1)

Rutgers University¹

1 Jan 1999

TL;DR: A survey of the key developments in shared virtual memory research is provided, placing the multitrack flow of ideas and results obtained so far in a comprehensive new framework.

Abstract: Shared virtual memory, a technique for supporting a shared address space in software on parallel systems, has undergone a decade of research, with significant maturing of protocols and communication layers having now been achieved. We provide a survey of the key developments in this research, placing the multitrack flow of ideas and results obtained so far in a comprehensive new framework. Four major research tracks are covered: relaxed consistency models; protocol laziness; architectural support; and application-driven research. Several related avenues are also discussed, such as fine grained software coherence, software protocols across multiprocessor nodes, and performance scalability. We summarize comparative performance results from the literature, discuss their limitations, and identify lessons learned so far, key outstanding questions, and important directions for future research in this area.

Journal Article•10.1109/5.747864•

The MIT Alewife Machine

Anant Agarwal¹, Ricardo Bianchini, David Chaiken, Frederic T. Chong, Kirk L. Johnson, David M. Kranz, John Kubiatowicz, Beng-Hong Lim, Kenneth Mackenzie, Donald Yeung•Institutions (1)

Massachusetts Institute of Technology¹

1 Mar 1999

TL;DR: Results show that integrating message passing with shared memory enables a cost efficient solution to the cache coherence problem and provides a rich set of programming primitives, and that messaging and shared memory operations are both important because each helps the programmer to achieve the best performance for various machine configurations.

Abstract: A variety of models for parallel architectures, such as shared memory, message passing, and data flow, have converged in the recent past to a hybrid architecture form called distributed shared memory (DSM). Alewife, an early prototype of such DSM architectures, uses hybrid software and hardware mechanisms to support coherent shared memory, efficient user level messaging, fine grain synchronization, and latency tolerance. Alewife supports up to 512 processing nodes connected over a scalable and cost effective mesh network at a constant cost per node. Four mechanisms combine to achieve Alewife's goals of scalability and programmability: software extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication. Extensive results from microbenchmarks, together with over a dozen complete applications running on a 32-node prototype, demonstrate that integrating message passing with shared memory enables a cost efficient solution to the cache coherence problem and provides a rich set of programming primitives. Our results further show that messaging and shared memory operations are both important because each helps the programmer to achieve the best performance for various machine configurations.

Patent•

Flexible probe/probe response routing for maintaining coherency

James B. Keller¹, Dale E. Gulick¹•Institutions (1)

Advanced Micro Devices¹

26 Aug 1999

TL;DR: A flexible probe command/response routing scheme is described for a computer system with multiple processing nodes coupled to separate memories that form a distributed memory system. An indication within each probe command identifies the node that receives the probe responses: the source of the transaction for reads, and the target for writes, so that the target may determine when to commit the write data to memory and may receive any dirty data to be merged with the write data.

Abstract: A computer system may include multiple processing nodes, one or more of which may be coupled to separate memories which may form a distributed memory system. The processing nodes may include caches, and the computer system may maintain coherency between the caches and the distributed memory system. Particularly, the computer system may implement a flexible probe command/response routing scheme. The scheme may employ an indication within the probe command which identifies a receiving node to receive the probe responses. For example, probe commands indicating that the target or the source of transaction should receive probe responses corresponding to the transaction may be included. Probe commands may specify the source of the transaction as the receiving node for read transactions (such that dirty data is delivered to the source node from the node storing the dirty data). On the other hand, for write transactions (in which data is being updated in memory at the target node of the transaction), the probe commands may specify the target of the transaction as the receiving node. In this manner, the target may determine when to commit the write data to memory and may receive any dirty data to be merged with the write data.
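
The routing rule in this abstract can be sketched in a few lines. This is only an illustration of the read/write rule as stated above, not the patent's implementation: the names `TransactionKind`, `ProbeCommand`, `build_probe`, and `route_probe_response` and the integer node identifiers are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class TransactionKind(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class ProbeCommand:
    source_node: int     # node that initiated the transaction
    target_node: int     # node whose memory holds the addressed data
    receiving_node: int  # indication carried in the probe: where responses go


def build_probe(kind, source_node, target_node):
    # Reads: responses (and any dirty data) go to the source node.
    # Writes: responses go to the target, which merges dirty data and commits.
    receiving = source_node if kind is TransactionKind.READ else target_node
    return ProbeCommand(source_node, target_node, receiving)


def route_probe_response(probe, responder_node):
    # Returns a (from_node, to_node) pair for the probe response.
    return (responder_node, probe.receiving_node)
```

For example, a write from node 0 to memory at node 2 produces a probe whose responses are routed to node 2, while a read between the same nodes routes responses back to node 0.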

Patent•

Secure partitioning of shared memory based multiprocessor system

Hubertus Franke¹, Mark E. Giampapa¹, Joefon Jann¹, Douglas J. Joseph¹, Pratap Pattnaik¹•Institutions (1)

IBM¹

23 Feb 1999

TL;DR: A method and apparatus for sharing memory in a multiprocessor computing system is described, providing a number of system buses, each connected to a respective memory controller that controls a corresponding partition of the memory.

Abstract: A method and apparatus for sharing memory in a multiprocessor computing system. More specifically, this invention provides a number of system buses with each bus being connected to a respective memory controller which controls a corresponding partition of the memory. Any one of the processors can use any one of the system buses to send real addresses to the connected memory controller which then converts the real addresses into physical addresses corresponding to the partition of memory that is controlled by the receiving memory controller. The processors can be dynamically assigned to different partitions of the memory via a switching mechanism.
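
The address conversion described here can be pictured with a small sketch. The abstract does not say how a controller maps real addresses to physical ones, so the base-plus-offset translation below, the bounds check, and the names `MemoryController` and `to_physical` are assumptions made for illustration only.

```python
class MemoryController:
    """Hypothetical controller that owns one partition of physical memory."""

    def __init__(self, partition_base, partition_size):
        self.partition_base = partition_base
        self.partition_size = partition_size

    def to_physical(self, real_address):
        # Assumed translation: real addresses are offsets into the partition.
        if not 0 <= real_address < self.partition_size:
            raise ValueError("real address falls outside this partition")
        return self.partition_base + real_address


# One controller per system bus; the switching mechanism is modeled as
# simply choosing which bus (and hence which controller) a processor uses.
controllers = {
    "bus0": MemoryController(partition_base=0x0000_0000, partition_size=0x1000_0000),
    "bus1": MemoryController(partition_base=0x1000_0000, partition_size=0x1000_0000),
}
```

Under these assumptions the same real address sent over two different buses lands in two disjoint physical ranges, which is the isolation property the title refers to.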

A modular simulation system for the bidomain equations

J.B. Pormann, John A. Board

1 Jan 1999

TL;DR: A modular framework is proposed for development of a Simulation System in which a researcher may mix and match program elements to generate a simulator tailored to their particular problem; the results show that a modular Simulation System is feasible for a wide range of applications and that it can obtain very good performance over this range of applications.

Abstract: Cardiac arrhythmias and fibrillation are potentially life threatening diseases that can result from the improper conduction of electrical impulses in the heart. Experimental study of such cardiac abnormalities is dangerous at best, often requiring the subject to be placed in fibrillation for some time before attempting a large “rescue” shock. Thus, almost all studies are done in animals and not humans. Furthermore, there is some indication that heart size may have considerable implications for fibrillation and other conduction abnormalities. Thus animal models for defibrillation studies must be chosen with great care. As an alternative, researchers are now using computer simulation to study the factors that generate and sustain arrhythmias, hoping to obtain at least preliminary data to guide fewer, more targeted experimental studies. Computer simulations of the Bidomain Equations have become very complex as they have been applied to many problems in cardiac electrophysiology. More complex membrane dynamics, irregular grids, and 3-D data sets are all being investigated. Software engineering principles will need to be applied to manage this continuing growth in complexity. We propose a modular framework for development of a Simulation System whereby a researcher may mix and match program elements to generate a simulator tailored to their particular problem. The modular approach will simplify the generation and maintenance of the different program elements and it will enable the end-researcher to determine the proper mix of complexity versus speed for their particular problem of interest. The contrary approach, one monolithic program which can run all simulations of all complexities, is simply unrealistic. It would impose too great a burden on maintenance and upgradability, and it would be difficult to provide good performance for a wide range of applications.
The modular approach also allows for the incremental inclusion of various complexities in the bidomain model. From a simple 2-D homogeneous, isotropic regular grid, monodomain simulation, we can progress, step by step, to a bidomain simulation with a fully implicit time-integration scheme on irregular, 3-D grids with arbitrary anisotropy and inhomogeneity, with a non-trivial membrane model. Simulations with such a wealth of complexity have not been performed to date. As microprocessors have become cheaper and more powerful, parallel computing has become more widespread. Machines with hundreds of high-performance CPUs connected by fast networks are commonplace and are now capable of surpassing traditional vector-based supercomputers in terms of overall performance. The Simulation System presented here incorporates data-parallelism to allow large scale Bidomain problems to be run on these newest parallel supercomputers. The large amount of distributed memory in such machines can be harnessed to allow extremely large scale simulations to be run. The large number of CPUs provides a tremendous amount of computational power which can be used to run such simulations more quickly. Finally, the results presented here show that a modular Simulation System is feasible for a wide range of applications, and that it can obtain very good performance over this range of applications. The parallel speed-up seen was very good, regularly achieving a factor of 13 speed-up on 16 processors. The results presented here also show that we can simulate bidomain problems using an implicit time-integrator with an irregular, anisotropic, and inhomogeneous grid and a non-trivial membrane model. We are able to run such simulations on parallel computers, thereby harnessing a tremendous amount of memory and computational resources. Such simulations have not been run to date.
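
For reference, the reported speed-up can be restated as a parallel efficiency using the standard definition (speed-up divided by processor count). The symbols S, p, and E are introduced here for illustration and do not appear in the abstract.

```latex
E = \frac{S}{p} = \frac{13}{16} \approx 0.81
```

That is, a factor of 13 on 16 processors corresponds to roughly 81% parallel efficiency.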

Proceedings Article•10.1145/300979.301003•

Using network interface support to avoid asynchronous protocol processing in shared virtual memory systems

Angelos Bilas¹, Cheng Liao², Jaswinder Pal Singh²•Institutions (2)

University of Toronto¹, Princeton University²

1 May 1999

TL;DR: The performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and the value of each of the mechanisms in different applications is shown.

Abstract: The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardware-coherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechanisms are simple and do not require programmability. We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and we show the value of each of the mechanisms in different applications. Application performance improves by about 37% on average for reasonably well performing applications, even on our relatively slow programmable NI, and more for others. We discuss the key remaining bottlenecks at the protocol level and use a firmware performance monitor in the NI to understand the interactions with and the implications for the communication layer.

Patent•

Highly componentized system architecture with a demand-loading namespace and programming model

Alessandro Forin, Johannes Helander, Andrew Raffman

31 Mar 1999

TL;DR: The invention is embodied in software executable on a computer having a working memory, with demand-loadable components initially stored outside of the working memory, each component having an entry point including a constructor for an object.

Abstract: The invention is embodied in software executable on a computer having a working memory with demand-loadable components initially stored outside of the working memory, each component having an entry point including a constructor for an object. Preferably, the demand-loadable components are initially provided in a memory within the computer or a location external of the computer. A Namespace in the working memory provides access in the working memory to the components as they become needed by applications running in the computer. The Namespace provides the access by managing demand-loading and unloading of the components in the working memory.

Patent•

Automatic teller machine with secure variable storage for internet applications

John Martin Flenley¹•Institutions (1)

IBM¹

29 Dec 1999

TL;DR: A memory controller component for an Internet application such as a bank, airline or supermarket web site is disclosed; the component is instantiated from a web page in the application and exposes a number of methods.

Abstract: A memory controller component (10, 30, 40) for an Internet application such as a bank, airline or supermarket web site is disclosed. The component is instantiated from a web page in the application and exposes a number of methods. A connect method checks if a predetermined area of named shared memory has been allocated, and responsive to the named shared memory not being allocated, allocates named shared memory. A setVariable method allocates an area in the shared memory for a variable and stores the variable. A getVariable method locates the variable in shared memory. Corresponding get and set encrypted variable methods are also disclosed. Furthermore, an ATM running an Internet browser enabling a user to swap between web sites at an ATM is disclosed.
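
The connect, setVariable, and getVariable methods can be sketched as follows. This is a rough model rather than the patented component: a module-level dictionary stands in for operating-system named shared memory, the bump allocator and string-only values are simplifications, and every identifier (`SharedVariableStore`, `_SEGMENTS`, the snake_case method names) is hypothetical. The encrypted variants are omitted.

```python
# Hypothetical in-process stand-in for OS-level named shared memory.
_SEGMENTS = {}


class SharedVariableStore:
    def __init__(self, segment_name, size=4096):
        self.segment_name = segment_name
        self.size = size
        self._segment = None

    def connect(self):
        # Allocate the named segment only if it has not been allocated yet.
        if self.segment_name not in _SEGMENTS:
            _SEGMENTS[self.segment_name] = {
                "data": bytearray(self.size),
                "index": {},      # variable name -> (offset, capacity, length)
                "next_free": 0,   # bump-allocator cursor
            }
        self._segment = _SEGMENTS[self.segment_name]
        return self

    def set_variable(self, name, value):
        seg = self._segment
        encoded = value.encode("utf-8")
        slot = seg["index"].get(name)
        if slot is None or len(encoded) > slot[1]:
            # Allocate a fresh area for a new or grown variable.
            offset, capacity = seg["next_free"], len(encoded)
            if offset + capacity > len(seg["data"]):
                raise MemoryError("shared segment is full")
            seg["next_free"] = offset + capacity
        else:
            offset, capacity = slot[0], slot[1]
        seg["data"][offset:offset + len(encoded)] = encoded
        seg["index"][name] = (offset, capacity, len(encoded))

    def get_variable(self, name):
        # Locate the variable's area in the segment and decode it.
        offset, _, length = self._segment["index"][name]
        return bytes(self._segment["data"][offset:offset + length]).decode("utf-8")
```

In this model, two pages that each create a `SharedVariableStore` with the same segment name and call `connect()` see the same variables, which mirrors how state could survive a user swapping between web sites at the ATM.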

Patent•

Host access to shared memory with a high priority mode

Laurent Six¹, Armelle Laine¹, Daniel Mazzocco¹, Gerald Ollivier¹•Institutions (1)

Texas Instruments¹

9 Jun 1999

TL;DR: A digital system is provided with a memory that can be shared by two or more data requestors, with two modes of access: shared access memory (SAM) and host only memory (HOM).

Abstract: A digital system is provided with a memory (42) that can be shared by two or more data requestors (10, 20). Two modes of access are provided. In a shared access memory (SAM) access mode, all of the data requestors can sequentially access the memory. In a host only memory (HOM) access mode, the memory is connected directly to one of the requestors, such as a host processor (10), so that high bandwidth transfers can be performed. The HOM access mode is entered when a priority assigned to the host processor is set to be higher than a priority assigned to any other requestor. Registers for holding the priority assignments can be written by at least one of the requestors.
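
The mode-selection rule stated in this abstract is simple enough to sketch directly. The function name `access_mode`, the dictionary of priority registers, and the convention that a larger number means a higher priority are assumptions for illustration; the abstract does not specify an encoding.

```python
def access_mode(priorities, host="host"):
    """Return "HOM" when the host outranks every other requestor, else "SAM".

    `priorities` is a hypothetical model of the priority registers:
    requestor name -> priority, where a larger number means higher priority.
    """
    host_priority = priorities[host]
    others = [p for name, p in priorities.items() if name != host]
    return "HOM" if all(host_priority > p for p in others) else "SAM"
```

With this rule, `{"host": 3, "dsp": 1}` selects host only memory access, while a tie such as `{"host": 1, "dsp": 1}` leaves the system in shared access mode.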

Proceedings Article•10.5555/339492.340063•

Memory bank customization and assignment in behavioral synthesis

Preeti Ranjan Panda¹•Institutions (1)

Synopsys¹

7 Nov 1999

TL;DR: This work addresses the topic of incorporating the application-specific customization of memory bank configuration into behavioral synthesis, and presents a heuristic for determining the access sequence that minimizes page misses in a bank while respecting data dependences.

Abstract: With increasing design complexity and chip area, on-chip memory has become an important component whose integration needs to be addressed during system design. Modern embedded DRAM technology allows for large amounts of on-chip memory space. However, in order to utilize the available memory intelligently, the memory has to be appropriately customized for the specific application. We address the topic of incorporating the application-specific customization of memory bank configuration into behavioral synthesis. The strategy involves a partitioning of behavioral arrays into memory banks based on a cost function that estimates the performance implications. For a given candidate partition, we present a heuristic for determining the access sequence that minimizes page misses in a bank while respecting data dependences. The output of the exploration is a graph displaying the variation of delay and memory area with the bank configuration. Our experiments on several memory-intensive examples confirm that the exploration results can provide critical feedback to the designer about the optimal memory configuration for a given application.
