Showing papers on "Distributed memory" published in 2000
Proceedings Article•10.1145/339647.339668•
Memory access scheduling

Scott Rixner1, William J. Dally2, Ujval J. Kapasi2, Peter Mattson2, John D. Owens2 •
Massachusetts Institute of Technology1, Stanford University2
1 May 2000
TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.
Abstract: The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.

1,106 citations
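The reordering idea in the memory access scheduling abstract above can be illustrated with a toy model. The sketch below is not the paper's scheduler: it assumes a hypothetical cost model (`ROW_HIT_COST`, `ROW_MISS_COST`) and a simplified open-row-first policy (`row_hit_first`) that services the oldest pending request hitting an already open row, and otherwise the oldest request.

```python
from collections import namedtuple

# A memory reference addressed by bank, row, and column (the "3-D" structure).
Request = namedtuple("Request", ["bank", "row", "col"])

# Assumed, illustrative costs in cycles; real DRAM timings differ.
ROW_HIT_COST = 1    # access to a column in the currently open row
ROW_MISS_COST = 6   # precharge + activate a different row, then access


def service_cost(order):
    """Return the total cost of servicing requests in the given order."""
    open_rows = {}  # bank -> currently open row
    total = 0
    for req in order:
        if open_rows.get(req.bank) == req.row:
            total += ROW_HIT_COST
        else:
            total += ROW_MISS_COST
            open_rows[req.bank] = req.row
    return total


def row_hit_first(requests):
    """Reorder requests: oldest row hit first, otherwise the oldest request."""
    pending = list(requests)
    open_rows = {}
    order = []
    while pending:
        pick = next(
            (r for r in pending if open_rows.get(r.bank) == r.row),
            pending[0],
        )
        pending.remove(pick)
        open_rows[pick.bank] = pick.row
        order.append(pick)
    return order


# A trace that alternates between two rows of the same bank.
trace = [Request(0, 1, 0), Request(0, 2, 0), Request(0, 1, 1), Request(0, 2, 1)]
in_order_cost = service_cost(trace)
reordered_cost = service_cost(row_hit_first(trace))
```

Under these assumed costs, the in-order trace pays a row miss on every access, while the reordered trace groups accesses to the same row and pays only two misses.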

Journal Article•10.1016/S0045-7825(99)00242-X•
Multifrontal parallel distributed symmetric and unsymmetric solvers

Patrick R. Amestoy, Iain S. Duff, Jean-Yves L'Excellent
14 Apr 2000-Computer Methods in Applied Mechanics and Engineering
TL;DR: A new parallel distributed memory multifrontal approach is described; to handle numerical pivoting efficiently, a parallel asynchronous algorithm with dynamic scheduling of the computing tasks has been developed.

1,029 citations

Book Chapter•10.1007/3-540-70734-4_16•
MUMPS: A General Purpose Distributed Memory Sparse Solver

Patrick R. Amestoy1, Iain S. Duff2, Jean-Yves L'Excellent, Jacko Koster3•
ENSEEIHT1, Rutherford Appleton Laboratory2, University of Bergen3
18 Jun 2000
TL;DR: Recently integrated features of MUMPS are reported on and the present performance of the solver on an SGI Origin 2000 and a CRAY T3E is illustrated.
Abstract: MUMPS is a public domain software package for the multifrontal solution of large sparse linear systems on distributed memory computers. The matrices can be symmetric positive definite, general symmetric, or unsymmetric, and possibly rank deficient. MUMPS exploits parallelism coming from the sparsity in the matrix and parallelism available for dense matrices. Additionally, large computational tasks are divided into smaller subtasks to enhance parallelism. MUMPS uses a distributed dynamic scheduling technique that allows numerical pivoting and the migration of computational tasks to lightly loaded processors. Asynchronous communication is used to overlap communication with computation. In this paper, we report on recently integrated features and illustrate the present performance of the solver on an SGI Origin 2000 and a CRAY T3E.

267 citations

Journal Article•10.1016/S0010-4655(00)00077-1•
Exploiting multiple levels of parallelism in Molecular Dynamics based calculations via modern techniques and software paradigms on distributed memory computers

Mark E. Tuckerman1, D.A. Yarne2, Shane O. Samuelson3, Adam Hughes3, Glenn J. Martyna3 •
Courant Institute of Mathematical Sciences1, University of Pennsylvania2, Indiana University3
09 Jun 2000-Computer Physics Communications
TL;DR: Modern molecular dynamics methods are reviewed and their application to quantum many-body systems and electronic structure calculations is described, and it is shown how modern object-oriented programming paradigms can be employed to implement multilevel parallel algorithms in a large computational package rapidly and efficiently.

207 citations

Book•10.1007/3-540-46502-2•
Large-scale parallel data mining

Mohammed J. Zaki, Ching-Tien Ho
1 Jan 2000
TL;DR: This volume collects work on large-scale parallel and distributed data mining, covering mining frameworks, associations and sequences, classification, and clustering.
Abstract: Table of contents: Large-Scale Parallel Data Mining; Parallel and Distributed Data Mining: An Introduction; Mining Frameworks; The Integrated Delivery of Large-Scale Data Mining: The ACSys Data Mining Project; A High Performance Implementation of the Data Space Transfer Protocol (DSTP); Active Mining in a Distributed Setting; Associations and Sequences; Efficient Parallel Algorithms for Mining Associations; Parallel Branch-and-Bound Graph Search for Correlated Association Rules; Parallel Generalized Association Rule Mining on Large Scale PC Cluster; Parallel Sequence Mining on Shared-Memory Machines; Classification; Parallel Predictor Generation; Efficient Parallel Classification Using Dimensional Aggregates; Learning Rules from Distributed Data; Clustering; Collective, Hierarchical Clustering from Distributed, Heterogeneous Data; A Data-Clustering Algorithm on Distributed Memory Multiprocessors.

187 citations

Patent•
Remote performance management to accelerate distributed processes

Trevor Deosaran, Ram Prabhakar
29 Dec 2000
TL;DR: An intelligent memory system, method, and computer program product are described for enabling stand-alone or distributed client-server software applications to operate at maximum speeds on a personal computer and the like.
Abstract: An intelligent memory system, method, and computer program product for enabling stand-alone or distributed client-server software applications to operate at maximum speeds on a personal computer and the like. An intelligent memory (IM) allows the acceleration of computer software processes through process virtual memory, application optimization, multiprocessor control, and system strategies. The IM includes both control logic and memory. The control logic uses an application database and system database to determine a set of modifications to the computer, application, and/or operating system, while the memory stores the application and allows the control logic to implement the set of modifications. A remote performance management system is also described which allows an IM service provider to supply the infrastructure to clients (e.g., e-businesses and the like who run World Wide Web servers) to facilitate and accelerate their content offerings to end user clients (i.e., consumers).

170 citations

Journal Article•10.1016/S0010-4655(99)00436-1•
Point-centered domain decomposition for parallel molecular dynamics simulation

Reto Koradi, Martin Billeter1, Peter Güntert2•
University of Gothenburg1, ETH Zurich2
01 Feb 2000-Computer Physics Communications
TL;DR: The point-centered domain decomposition algorithm is implemented in the new program Opal p using a standard message passing library, so that it runs on both shared memory and massively parallel distributed memory computers.

160 citations

Journal Article•10.1016/S0010-4655(00)00073-4•
The Distributed Data Interface in GAMESS

Graham D. Fletcher1, Michael W. Schmidt2, Brett M. Bode3, Mark S. Gordon2,3 •
Ames Research Center1, Iowa State University2, Ames Laboratory3
09 Jun 2000-Computer Physics Communications
TL;DR: The Distributed Data Interface (DDI), which permits storage of large data arrays in the aggregate memory of distributed memory, message passing computer systems, is described, and the good performance of an MP2 program using DDI is demonstrated.

152 citations

Patent•
Network processor, memory organization and methods

Brian Mitchell Bass1, Jean Calvignac1, Marco C. Heddes1, Piyush C. Patel1, Juan Guillermo Revilla1, Michael Steven Siegel1, Fabrice Jean Verplanken1 •
IBM1
24 Aug 2000
TL;DR: A network switch apparatus, components for such an apparatus, and methods of operating it are described, in which data flow handling and flexibility are enhanced by the cooperation of a plurality of memory elements and a plurality of interface processors formed on a semiconductor substrate.
Abstract: A network switch apparatus (10), components for such an apparatus, and methods of operating such an apparatus in which data flow handling and flexibility is enhanced by the cooperation of a plurality of memory elements and a plurality of interface processors formed on a semiconductor substrate (10). The memory elements and interface processors together form a network processor (10) capable of cooperating with other elements in executing instructions directing the flow of data in a network. Access to the memory elements is controlled in a particular manner and under operative rules which provide controlled multiple accesses of the plurality of memory elements by a plurality of processors.

125 citations

Book Chapter•10.1007/10722167_19•
Distributing Timed Model Checking - How the Search Order Matters

Gerd Behrmann1, Thomas Hune2, Frits W. Vaandrager3•
Aalborg University1, Aarhus University2, Radboud University Nijmegen3
15 Jul 2000
TL;DR: This paper addresses the problem of distributing model checking of timed automata and shows how in the timed case the search order of the state space is crucial for the effectiveness and scalability of the exploration.
Abstract: In this paper we address the problem of distributing model checking of timed automata. We demonstrate through four real life examples that the combined processing and memory resources of multi-processor computers can be effectively utilized. The approach assumes a distributed memory model and is applied to both a network of workstations and a symmetric multiprocessor machine. However, certain unexpected phenomena have to be taken into account. We show how in the timed case the search order of the state space is crucial for the effectiveness and scalability of the exploration. An effective heuristic to counter the effect of the search order is provided. Some of the results open up for improvements in the single processor case.

115 citations

Patent•
Single integrated circuit embodying a risc processor and a digital signal processor

Robert J. Gove1, Keith Balmer1, Nicholas Ing-Simmons1, Karl M. Guttag1•
Texas Instruments1
3 Mar 2000
TL;DR: A single integrated circuit includes first and second data processors that use different instruction sets and operate independently on disjoint programs and data, together with a shared data transfer controller and shared memory divided into plural independently accessible memory banks.
Abstract: A single integrated circuit includes first and second data processors operating on different instruction sets independently operating on disjoint programs and data. The single integrated circuit preferably includes an external interface, a shared data transfer controller and shared memory divided into plural independently accessible memory banks. The two data processors are preferably a digital signal processor (DSP) and a reduced instruction set computer (RISC) processor. The DSP and RISC processors are suitably programmed to perform differing aspects of computer image processing.

Journal Article•10.1006/JPDC.2000.1658•
OpenMP for Networks of SMPs

Y. Charlie Hu1, Honghui Lu1, Alan L. Cox1, Willy Zwaenepoel1•
Rice University1
01 Dec 2000-Journal of Parallel and Distributed Computing
TL;DR: This paper presents the first system that implements OpenMP on a network of shared-memory multiprocessors, and shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions.

Journal Article•10.1137/S1064827598345679•
A Note On Parallel Matrix Inversion

E. S. Quintana1, Gregorio Quintana1, Xiaobai Sun1, Robert A. van de Geijn•
James I University1
01 May 2000-SIAM Journal on Scientific Computing
TL;DR: This work presents one-sweep parallel algorithms for the inversion of general and symmetric positive definite matrices that feature simple programming and performance optimization while maintaining the same arithmetic cost and numerical properties of conventional inversion algorithms.
Abstract: We present one-sweep parallel algorithms for the inversion of general and symmetric positive definite matrices. The algorithms feature simple programming and performance optimization while maintaining the same arithmetic cost and numerical properties of conventional inversion algorithms. Our experiments on a Cray T3E-600 and a Beowulf cluster demonstrate high performance of implementations for distributed memory parallel computers.

Journal Article•10.1177/109434200001400202•
Globalized Newton-Krylov-Schwarz algorithms and software for parallel implicit CFD

William Gropp1, David E. Keyes2, Lois Curfman McInnes1, Moulay D. Tidriri3•
Argonne National Laboratory1, Old Dominion University2, Iowa State University3
1 May 2000
TL;DR: This article shows that for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method, and reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver.
Abstract: Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (ΨNKS) algorithmic framework is presented as a widely applicable answer. This article shows that for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of ΨNKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. The authors therefore distill several recommendations from their experience and reading of the literature on various algorithmic components of ΨNKS, and they describe a freely available MPI-based portable parallel software implementation of the solver employed here.

Patent•
Reader-writer lock for multiprocessor systems

Paul E. McKenney1, Brent A. Kingsbury1•
IBM1
10 Jan 2000
TL;DR: A reader-writer lock reduces reader and writer overhead by employing lock structures shared among groups of processors that have lower latencies: a shared counter per CPU group limits reader overhead, and a shared flag per CPU group limits writer overhead.
Abstract: A reader-writer lock minimizes writer and reader overhead by employing lock structures that are shared among groups of processors that have lower latencies. In the illustrated multiprocessor system having a non-uniform memory access (NUMA) architecture, each processor node has a lock structure comprised of a shared counter and associated flag for each CPU group. During a read, the counter can be changed only by processors within a CPU group performing a read. This reduces the reader overhead that otherwise would exist if all processors in the system shared a single counter. During a write, the shared flag can be changed by a process running on any processor in the system. The processors in a CPU group are notified of the write through the shared flag. This reduces the writer overhead that otherwise would exist if each processor in the system had a separate flag. The number of CPUs per group can be varied to optimize performance of the lock in different multiprocessor systems.
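The grouped counter-and-flag structure in the reader-writer lock abstract above can be sketched in software. This is a minimal analogy using Python threads, not the patented implementation: the class name `GroupedRWLock`, its methods, and the use of `threading.Condition` are all assumptions made for illustration, and Python threads do not reproduce NUMA latency effects.

```python
import threading


class _Group:
    """Per-group lock structure: a reader counter and a writer flag."""

    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0
        self.writer_flag = False


class GroupedRWLock:
    """Reader-writer lock with one counter and one flag per reader group."""

    def __init__(self, n_groups):
        self._groups = [_Group() for _ in range(n_groups)]
        self._writer_mutex = threading.Lock()  # serializes writers

    def acquire_read(self, group):
        g = self._groups[group]
        with g.cond:
            # Readers touch only their own group's counter and flag.
            while g.writer_flag:
                g.cond.wait()
            g.readers += 1

    def release_read(self, group):
        g = self._groups[group]
        with g.cond:
            g.readers -= 1
            if g.readers == 0:
                g.cond.notify_all()  # wake a writer waiting on this group

    def acquire_write(self):
        self._writer_mutex.acquire()
        for g in self._groups:
            with g.cond:
                # One flag per group notifies every reader in that group.
                g.writer_flag = True
                while g.readers > 0:
                    g.cond.wait()

    def release_write(self):
        for g in self._groups:
            with g.cond:
                g.writer_flag = False
                g.cond.notify_all()  # wake readers blocked on the flag
        self._writer_mutex.release()
```

In this sketch a reader contends only with members of its own group, while a writer visits one flag per group rather than one per thread, which mirrors the trade-off the abstract describes; the group count plays the role of the tunable CPUs-per-group parameter.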

Journal Article•10.1016/S0167-8191(00)00010-7•
Pajé, an interactive visualization tool for tuning multi-threaded parallel applications

J. Chassin de Kergommeaux1, Benhur de Oliveira Stein2, Paul-Emile Bernard3•
Apache Corporation1, Universidade Federal de Santa Maria2, French Institute for Research in Computer Science and Automation3
15 Aug 2000
TL;DR: Pajé, an interactive visualization tool for displaying the execution of parallel applications where a potentially large number of communicating threads of various life-times execute on each node of a distributed memory parallel system, is described.
Abstract: This paper describes Pajé, an interactive visualization tool for displaying the execution of parallel applications where a potentially large number of communicating threads of various life-times execute on each node of a distributed memory parallel system. Pajé is capable of representing a wide variety of interactions between threads. The main characteristics of Pajé, interactivity and scalability, are exemplified by the performance tuning of a molecular dynamics application. In order to be easily extensible, the architecture of the system was based on components which are connected in a data flow graph to produce a given visualization tool. Innovative components were designed, in addition to “classical” components existing in similar visualization systems, to support scalability and interactivity.

Journal Article•10.1109/COMST.2000.5340716•
Buffer management for shared-memory ATM switches

M. Arpaci1, John A. Copeland1•
Georgia Institute of Technology1
01 Jan 2000-IEEE Communications Surveys and Tutorials
TL;DR: This article surveys the buffer management methods proposed for shared-memory packet switches, examines their strengths and weaknesses, and evaluates their performance using computer simulations.
Abstract: In the shared-memory switch architecture, output links share a single large memory, in which logical FIFO queues are assigned to each link. Although memory sharing can provide a better queuing performance than physically separated buffers, it requires carefully designed buffer management schemes for a fair and robust operation. This article presents a survey of the buffer management methods that have been proposed for shared-memory packet switches. Several buffer management policies are described, and their strengths and weaknesses are examined. The performances of various policies are evaluated using computer simulations. A comparison of the most important schemes is obtained with the help of the simulation results and the results provided in the literature. The survey concludes with a discussion of the possible future research areas related to shared-memory ATM switches.
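The shared-memory arrangement in the buffer management abstract above, with one memory pool and a logical FIFO queue per output link, can be sketched as follows. The admission rule shown is a static per-queue threshold, chosen here only as a simple illustration; the abstract does not name the policies the survey covers, and the class `SharedBufferSwitch` and its parameters are hypothetical.

```python
from collections import deque


class SharedBufferSwitch:
    """Output queues that share one memory pool, with a static threshold."""

    def __init__(self, n_ports, capacity, threshold):
        self.queues = [deque() for _ in range(n_ports)]  # logical FIFOs
        self.capacity = capacity    # total cells the shared memory holds
        self.threshold = threshold  # assumed cap on any single queue
        self.used = 0               # cells currently stored

    def enqueue(self, port, cell):
        """Admit a cell unless the memory or the port's queue is full."""
        queue = self.queues[port]
        if self.used >= self.capacity or len(queue) >= self.threshold:
            return False  # cell is dropped
        queue.append(cell)
        self.used += 1
        return True

    def dequeue(self, port):
        """Transmit the oldest cell for a port, or None if it has none."""
        queue = self.queues[port]
        if not queue:
            return None
        self.used -= 1
        return queue.popleft()


switch = SharedBufferSwitch(n_ports=2, capacity=4, threshold=3)
admitted = [switch.enqueue(0, cell) for cell in range(4)]
```

The threshold keeps a single busy port from occupying the whole memory, which is one way to address the fairness concern the abstract raises about unrestricted sharing.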

Patent•
Multiprocessor with each processor element accessing operands in loaded input buffer and forwarding results to FIFO output buffer

Jon M. Huppenthal1, Paul A. Leskar1•
University of Colorado Colorado Springs1
3 May 2000
TL;DR: An enhanced memory algorithmic processor (MAP) architecture for multiprocessor computer systems comprises an assembly that may include field programmable gate arrays (FPGAs) functioning as the memory algorithmic processors.
Abstract: An enhanced memory algorithmic processor (“MAP”) architecture for multiprocessor computer systems comprises an assembly that may comprise, for example, field programmable gate arrays (“FPGAs”) functioning as the memory algorithmic processors. The MAP elements may further include an operand storage, intelligent address generation, on board function libraries, result storage and multiple input/output (“I/O”) ports. The MAP elements are intended to augment, not necessarily replace, the high performance microprocessors in the system and, in a particular embodiment of the present invention, they may be connected through the memory subsystem of the computer system resulting in it being very tightly coupled to the system as well as being globally accessible from any processor in a multiprocessor computer system.

Dual-cubes: a new interconnection network for high-performance computer clusters

Yamin Li1, Shietung Peng1•
Hosei University1
1 Jan 2000
TL;DR: This paper introduces a new interconnection network for large-scale distributed memory multiprocessors called dual-cube, which mitigates the problem of the increasing number of links in large-scale hypercube networks while keeping most of the topological properties of the hypercube network.
Abstract: The binary hypercube, or n-cube, has been widely used as the interconnection network in parallel computers. However, the major drawback of the hypercube is the increase in the number of communication links for each node with the increase in the total number of nodes in the system. This paper introduces a new interconnection network for large-scale distributed memory multiprocessors called dual-cube. This network mitigates the problem of the increasing number of links in the large-scale hypercube network while keeping most of the topological properties of the hypercube network. We investigate the topological properties of the dual-cube, compare them with other hypercube-like networks, and establish the basic routing and broadcasting algorithms for dual-cubes.
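The hypercube drawback mentioned in the abstract above can be made concrete with a short sketch. It covers only the standard binary n-cube, where two nodes are linked when their addresses differ in exactly one bit; it does not implement the dual-cube, whose construction the abstract does not spell out. The function names are hypothetical.

```python
def hypercube_neighbors(node, n):
    """Return the n neighbors of a node in a binary n-cube (flip one bit)."""
    return [node ^ (1 << bit) for bit in range(n)]


def links_per_node(total_nodes):
    """Return the links each node needs in a hypercube of the given size."""
    if total_nodes < 1 or total_nodes & (total_nodes - 1):
        raise ValueError("a binary hypercube has a power-of-two node count")
    return total_nodes.bit_length() - 1  # log2(total_nodes)


# Links per node for growing systems: each doubling adds one link per node.
growth = {size: links_per_node(size) for size in (8, 1024, 65536)}
```

Because the per-node link count equals the base-2 logarithm of the system size, every node needs more ports as the machine grows, which is the cost the dual-cube is designed to reduce.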

A Software Architecture for User Transparent Parallel Image Processing

Frank J. Seinstra1, Dennis C. Koelma1, Jan-Mark Geusebroek1•
University of Amsterdam1
1 Jan 2000
TL;DR: This paper describes a software architecture that lets image processing researchers develop parallel applications transparently: its main component is an extensive library of data parallel low level image operations that runs on homogeneous distributed memory MIMD-style multicomputers and hides all parallelism from the user.
Abstract: This paper describes a software architecture that allows image processing researchers to develop parallel applications in a transparent manner. The architecture's main component is an extensive library of data parallel low level image operations capable of running on homogeneous distributed memory MIMD-style multicomputers. Since the library has an application programming interface identical to that of an existing sequential library, all parallelism is completely hidden from the user. The first part of the paper discusses implementation aspects of the parallel library, and shows how sequential as well as parallel operations are implemented on the basis of so-called parallelizable patterns. A library built in this manner is easily maintainable, as extensive code redundancy is avoided. The second part of the paper describes the application of performance models to ensure efficiency of execution on all target platforms. Experiments show that for a realistic application performance predictions are highly accurate. These results indicate that the core of the architecture forms a powerful basis for automatic parallelization and optimization of a wide range of imaging software.

Patent•
Partition formation using microprocessors in a multiprocessor computer system

David Golden1, Dennis Mazur1, Richard Edward Bracken1•
Hewlett-Packard1
31 Aug 2000
TL;DR: A control system is presented that uses microprocessors communicating over a private Local Area Network (LAN) to control the operation of both the processors and the input and output subsystems (IO systems) of a multiprocessor computer system.
Abstract: The invention is a control system using microprocessors which communicate through a Local Area Network (private LAN) to control operation of both processors and input and output subsystems (IO system) of a multiprocessor computer system. The processors each have memory associated therewith, and each processor has an IO system comprising a plurality of busses such as PCI busses, associated therewith. The processors are cabled together in a mesh arrangement so that messages can be transferred between any of the processors and delivered to memory associated with the destination processor, or delivered to an IO system associated with the destination processor, etc. The microprocessors are powered on when power is applied to the chassis of the multiprocessor system, and the microprocessors then control the processors of the multiprocessor system, including applying power to the processors, forming hard partitions containing selected processors, computing routes from a processor to a memory associated with any processor for read and write transactions, computing routes to IO subsystems associated with any processor of the hard partition, forming partition boundaries so that processors in one hard partition cannot read and write to memory or IO systems associated with processors in another hard partition, forming soft partitions of processors, controlling boot-up of operating systems executing on the processors of the multiprocessor computer system, removing power from a failed processor, providing power to a repaired processor, etc.

Journal Article•10.1364/AO.39.000671•
Architectural approach to the role of optics in monoprocessor and multiprocessor machines.

Jacques H. Collet, D. Litaize1, Jan Van Campenhout2, Chris Jesshope3, Marc P.Y. Desmulliez, Hugo Thienpont4, James R. Goodman5, Ahmed Louri6 •
University of Toulouse1, Ghent University2, Massey University3, Vrije Universiteit Brussel4, University of Wisconsin-Madison5, University of Arizona6
10 Feb 2000-Applied Optics
TL;DR: It is shown that perhaps the major explanation for why optical technologies have nearly been unable to penetrate into computers is that OI's generally do not shorten the memory-access time, which is the most critical issue for today's stored-program machines.
Abstract: The relevance of introducing optical interconnects (OI’s) in monoprocessors and multiprocessors is studied from an architectural point of view. We show that perhaps the major explanation for why optical technologies have nearly been unable to penetrate into computers is that OI’s generally do not shorten the memory-access time, which is the most critical issue for today’s stored-program machines. In monoprocessors the memory-access time is dominated by the electronic latency of the memory itself. Thus implementing OI’s inside the memory hierarchy without changing the memory architecture cannot dramatically improve the global performance. In strongly coupled multiprocessors the node-bypass latency dominates. Therefore the higher the connectivity (possibly with optics), the shorter the path to another node, but the more expensive the network and the more complex the structure of electronic nodes. This relation leaves the choice of the best network open in terms of simplicity and latency reduction. The bottlenecks resulting from and the benefits of implementing OI’s are discussed with respect to symmetric multiprocessors, rings, and distributed shared-memory supercomputers. © 2000 Optical Society of

Journal Article•10.1109/78.824693•
A hardware efficient control of memory addressing for high-performance FFT processors

Y. Ma, Lars Wanhammar1•
Linköping University1
01 Mar 2000-IEEE Transactions on Signal Processing
TL;DR: The conventional memory organization of fast Fourier transform (FFT) processors is based on Cohen's (1976) scheme, but the new scheme reduces the hardware complexity of address generation by about 50% while improving the memory access speed.
Abstract: The conventional memory organization of fast Fourier transform (FFT) processors is based on Cohen's (1976) scheme. Compared with this scheme, our scheme reduces the hardware complexity of address generation by about 50% while improving the memory access speed. Much power consumption in memory is saved since only half of the memory is activated during memory access, and the number of coefficient accesses is reduced to a minimum by using a new ordering of FFT butterflies. Therefore, the new scheme is a superior solution for constructing high-performance FFT processors.

Journal Article•10.1007/BF02703630•
A survey of checkpointing algorithms for parallel and distributed computers

S. Kalaiselvi1, V. Rajaraman2•
Indian Institute of Science1, Jawaharlal Nehru Centre for Advanced Scientific Research2
01 Oct 2000-Sadhana-academy Proceedings in Engineering Sciences
TL;DR: This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems and concludes that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery.
Abstract: Checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in this paper and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extends cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is however felt that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery.
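The definition of a checkpoint in the abstract above can be illustrated with a single-process sketch: state is saved at a designated place in a loop, and a later run resumes from the saved state. This does not show the coordinated algorithms the survey covers (such as those based on Chandy and Lamport); the function names, the pickle file format, and the `stop_after` parameter that simulates a failure are all assumptions for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to stable storage, replacing the old copy atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # a crash leaves either the old or new file


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists."""
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        return default


def run_sum(path, n, stop_after=None):
    """Sum 0..n-1, checkpointing after every step.

    stop_after simulates a failure after that many steps in this run;
    the function then returns None and a later call resumes the work.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    steps = 0
    while state["i"] < n:
        if stop_after is not None and steps == stop_after:
            return None  # simulated failure
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(path, state)  # the designated checkpoint place
        steps += 1
    return state["total"]
```

A run interrupted with `stop_after` leaves a checkpoint file behind, and calling `run_sum` again with the same path continues from the saved index instead of starting over.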

Patent•
System and method for high performance execution of locked memory instructions in a system with distributed memory and a restrictive memory model

Bryan D. Boatright1, Rajesh Patel, Larry Edward Thatcher•
Intel1
29 Dec 2000
TL;DR: A system and method are presented for the high performance execution of locked memory instructions in a system with distributed memory and a restrictive memory model; the method includes decoding a locked-memory instruction, obtaining exclusive ownership of a cacheline to be used by a load-lock operation, setting a bit to indicate the load-lock operation's ownership of the cacheline, and activating a snoop checking process.
Abstract: The present invention relates to locked memory instructions, and more specifically to a system and method for the high performance execution of locked memory instructions in a system with distributed memory and a restrictive memory model. In accordance with an embodiment of the present invention, a method for executing locked-memory instructions includes decoding a locked-memory instruction, obtaining exclusive ownership of a cacheline to be used by a load-lock operation, setting a bit to indicate the load-lock operation's ownership of the cacheline, and activating a snoop checking process. The method also includes modifying a load data value and storing the modified load data value. The method further includes determining that the cacheline is still exclusively owned, storing the load data value, determining that the cacheline is unsnooped, merging the modified load data value with the load data value, and releasing the locked-memory instruction to be retired.
Proceedings Article•10.1145/337292.337428•
Memory aware compilation through accurate timing extraction

[...]

Peter Grun1, Nikil Dutt1, Alexandru Nicolau1•
University of California, Irvine1
1 Jun 2000
TL;DR: A memory-aware compiler approach is described that exploits efficient memory access modes by extracting accurate timing information, allowing the compiler's scheduler to perform global code reordering to better hide the latency of memory operations.
Abstract: Memory delays represent a major bottleneck in embedded systems performance. Newer memory modules exhibiting efficient access modes (e.g., page-, burst-mode) partly alleviate this bottleneck. However, such features cannot be efficiently exploited in processor-based embedded systems without memory-aware compiler support. We describe a memory-aware compiler approach that exploits such efficient memory access modes by extracting accurate timing information, allowing the compiler's scheduler to perform global code reordering to better hide the latency of memory operations. Our memory-aware compiler scheduled several benchmarks on the TI C6201 processor architecture interfaced with a 2-bank synchronous DRAM and generated average improvements of 24% over the best possible schedule using a traditional (memory-transparent) optimizing compiler, demonstrating the utility of our memory-aware compilation approach.
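Why access-mode timing matters to a scheduler can be seen in a toy cost model. This is a sketch under stated assumptions, not the authors' compiler algorithm: the cycle counts are hypothetical, each bank is assumed to keep one row open (page mode), and the accesses are assumed to be independent so that reordering them is legal.

```python
ROW_HIT_CYCLES = 2   # hypothetical cost: the requested row is already open in the bank
ROW_MISS_CYCLES = 8  # hypothetical cost: a different row must be opened first


def access_cycles(accesses):
    """Total cycles for a sequence of (bank, row) accesses, one open row per bank."""
    open_row = {}
    total = 0
    for bank, row in accesses:
        total += ROW_HIT_CYCLES if open_row.get(bank) == row else ROW_MISS_CYCLES
        open_row[bank] = row
    return total


def group_by_row(accesses):
    """Reorder independent accesses so those to the same (bank, row) are adjacent.

    Groups keep the order in which each (bank, row) pair was first seen.
    """
    groups = {}
    for access in accesses:
        groups.setdefault(access, []).append(access)
    return [access for group in groups.values() for access in group]


trace = [(0, 1), (0, 2), (0, 1), (0, 2)]
print(access_cycles(trace))                # 32: every access opens a new row
print(access_cycles(group_by_row(trace)))  # 20: two misses and two page-mode hits
```

A memory-transparent scheduler treats all four accesses as equally expensive; a scheduler with the timing model above can see that the second ordering is cheaper.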
Patent•
System and method for fault handling and recovery in a multi-processing system having hardware resources shared between multiple partitions

[...]

Roger L. Gilbertson1, Mitchell A. Bauman1, Penny L. Svenkeson1, James L. DePenning1, Michael L. Haupt1, Donald R. Kalvestrand1, Daniel S. Tokoly1, Frederick G. Fellenser1, Maria A. Liedman1 •
Unisys1
28 Apr 2000
TL;DR: In this article, a support processor handles non-immediate problems and allows resetting of memory locations formerly owned by failed units, and continuous processing by non-failing units is allowable.
Abstract: Poisoning specific memory locations when a part of a multiprocessor computer system becomes faulty makes it possible to isolate specific data owned by individual failing units, even in a shared memory area. Non-failing units can also continue processing. A support processor handles non-immediate problems and allows resetting of memory locations formerly owned by failed units.
Patent•
Symmetric multiprocessing (SMP) system with fully-interconnected heterogenous microprocessors

[...]

Ravi Kumar Arimilli1, David William Siegel1•
IBM1
28 Dec 2000
TL;DR: Disclosed is a fully-interconnected, heterogeneous, multiprocessor data processing system with a plurality of processors each having unique characteristics including, for example, different processing speeds (frequency) and different cache topologies (sizes, levels, etc.).
Abstract: Disclosed is a fully-interconnected, heterogenous, multiprocessor data processing system. The data processing system topology has a plurality of processors each having unique characteristics including, for example, different processing speeds (frequency) and different cache topologies (sizes, levels, etc.). Second and third generation heterogenous processors are connected to a specialized set of pins, connected to the system bus. The processors are interconnected and communicate via an enhanced communication protocol and specialized SMP bus topology that supports the heterogeneous topology and enables newer processors to support full downward compatibility to the previous generation processors. Various processor functions are modified to support operations on either of the processors depending on which processor is assigned which operations. The enhanced communication protocol, operating system, and other processor logic enable the heterogenous multiprocessor data processing system to operate as a symmetric multiprocessor system.
Journal Article•10.1137/S1064827598340809•
Algebraic Two-Level Preconditioners for the Schur Complement Method

[...]

Luiz Mariano Carvalho, Luc Giraud, P. Le Tallec
01 Jun 2000-SIAM Journal on Scientific Computing
TL;DR: A set of preconditioners for the Schur complement domain decomposition method that implement a global coupling mechanism, through coarse-space components, similar to the one proposed in [Bramble, Pasciak, and Schatz, Math. Comp., 47 (1986), pp. 103-134].
Abstract: The solution of elliptic problems is challenging on parallel distributed memory computers since their Green's functions are global. To address this issue, we present a set of preconditioners for the Schur complement domain decomposition method. They implement a global coupling mechanism, through coarse-space components, similar to the one proposed in [Bramble, Pasciak, and Schatz, Math. Comp., 47 (1986), pp. 103-134]. The definition of the coarse-space components is algebraic; they are defined using the mesh partitioning information and simple interpolation operators. These preconditioners are implemented on distributed memory computers without introducing any new global synchronization in the preconditioned conjugate gradient iteration. The numerical and parallel scalability of those preconditioners are illustrated on two-dimensional model examples that have anisotropy and/or discontinuity phenomena.
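For readers unfamiliar with the Schur complement method named in this abstract, the standard textbook reduction is sketched below. The notation is an assumption (the abstract defines none): subscript $I$ denotes unknowns interior to the subdomains and $\Gamma$ denotes unknowns on the interfaces between them. The paper's coarse-space preconditioners themselves are not reproduced here.

```latex
% Linear system with unknowns ordered as interior (I) then interface (\Gamma)
\begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
\begin{pmatrix} u_I \\ u_\Gamma \end{pmatrix}
=
\begin{pmatrix} f_I \\ f_\Gamma \end{pmatrix}

% Eliminating u_I gives the interface (Schur complement) system
S\,u_\Gamma = f_\Gamma - A_{\Gamma I} A_{II}^{-1} f_I,
\qquad
S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}

% Interior unknowns are then recovered by back-substitution
u_I = A_{II}^{-1}\left( f_I - A_{I\Gamma}\,u_\Gamma \right)
```

In a domain decomposition setting $A_{II}$ is block diagonal, with one block per subdomain, so the interior solves can proceed independently on each processor. The interface system in $S$ is the part that couples subdomains globally, and it is this system that a preconditioned conjugate gradient iteration is applied to.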
Journal Article•10.1109/71.895795•
Experiences with parallel N-body simulation

[...]

Pangfeng Liu1, Sandeep N. Bhatt2•
National Chung Cheng University1, Akamai Technologies2
01 Dec 2000-IEEE Transactions on Parallel and Distributed Systems
TL;DR: This paper describes the authors' experiences developing high-performance code for astrophysical N-body simulations and uses a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures.
Abstract: This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational work-load among processors, these issues pose interesting challenges and tradeoffs for high-performance implementation. Our implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally justified. The code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM-5. The overall performance on instances with 10 million bodies is typically over 48 percent of the peak machine rate, which compares favorably with other approaches.
...
