Top 366 papers published in the topic of Distributed memory in 2000

Showing papers on "Distributed memory" published in 2000

Proceedings Article•10.1145/339647.339668•

Memory access scheduling

Scott Rixner¹, William J. Dally², Ujval J. Kapasi², Peter Mattson², John D. Owens²•Institutions (2)

Massachusetts Institute of Technology¹, Stanford University²

1 May 2000

TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.

Abstract: The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.
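
The reordering idea in this abstract can be illustrated with a toy model. The sketch below is not the paper's scheduler: it assumes made-up cycle costs, no overlap between banks, and a simple "oldest row hit first, otherwise oldest" policy over a small window of pending references. All names (`access_cost`, `in_order`, `row_hit_first`, `window`) are hypothetical.

```python
from collections import deque

# Assumed, illustrative costs in cycles; real DRAM timings differ.
COLUMN, ACTIVATE, PRECHARGE = 1, 3, 3

def access_cost(open_rows, bank, row):
    """Return the cost of one reference and record the row it leaves open."""
    current = open_rows.get(bank)
    if current == row:
        cost = COLUMN                          # row hit: column access only
    elif current is None:
        cost = ACTIVATE + COLUMN               # bank idle: open the row first
    else:
        cost = PRECHARGE + ACTIVATE + COLUMN   # row conflict: close, then open
    open_rows[bank] = row
    return cost

def in_order(refs):
    """Total cost when references are served strictly in arrival order."""
    open_rows = {}
    return sum(access_cost(open_rows, bank, row) for bank, row in refs)

def row_hit_first(refs, window=4):
    """Total cost when the oldest pending row hit is served before older misses."""
    open_rows, total = {}, 0
    waiting, pending = deque(refs), []
    while waiting or pending:
        while waiting and len(pending) < window:
            pending.append(waiting.popleft())
        # Oldest pending reference that hits an open row, else the oldest one.
        choice = next((p for p in pending if open_rows.get(p[0]) == p[1]), pending[0])
        pending.remove(choice)
        total += access_cost(open_rows, *choice)
    return total

# (bank, row) pairs that alternate between two rows of the same bank.
refs = [(0, 1), (0, 2), (0, 1), (0, 2)]
print(in_order(refs), row_hit_first(refs))
```

With these assumed costs the alternating trace is cheaper when reordered, because references to the same row are grouped. A policy this simple can postpone a row miss for as long as hits keep arriving, so a practical scheduler would also need a fairness bound.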

1,106 citations

Journal Article•10.1016/S0045-7825(99)00242-X•

Multifrontal parallel distributed symmetric and unsymmetric solvers

Patrick R. Amestoy, Iain S. Duff, Jean-Yves L'Excellent

14 Apr 2000-Computer Methods in Applied Mechanics and Engineering

TL;DR: In this paper, a new parallel distributed memory multifrontal approach is described. To handle numerical pivoting efficiently, a parallel asynchronous algorithm with dynamic scheduling of the computing tasks has been developed.

1,029 citations

Book Chapter•10.1007/3-540-70734-4_16•

MUMPS: A General Purpose Distributed Memory Sparse Solver

Patrick R. Amestoy¹, Iain S. Duff², Jean-Yves L'Excellent, Jacko Koster³•Institutions (3)

ENSEEIHT¹, Rutherford Appleton Laboratory², University of Bergen³

18 Jun 2000

TL;DR: Recently integrated features of MUMPS are reported on and the present performance of the solver on an SGI Origin 2000 and a CRAY T3E is illustrated.

Abstract: MUMPS is a public domain software package for the multifrontal solution of large sparse linear systems on distributed memory computers. The matrices can be symmetric positive definite, general symmetric, or unsymmetric, and possibly rank deficient. MUMPS exploits parallelism coming from the sparsity in the matrix and parallelism available for dense matrices. Additionally, large computational tasks are divided into smaller subtasks to enhance parallelism. MUMPS uses a distributed dynamic scheduling technique that allows numerical pivoting and the migration of computational tasks to lightly loaded processors. Asynchronous communication is used to overlap communication with computation. In this paper, we report on recently integrated features and illustrate the present performance of the solver on an SGI Origin 2000 and a CRAY T3E.

267 citations

Journal Article•10.1016/S0010-4655(00)00077-1•

Exploiting multiple levels of parallelism in Molecular Dynamics based calculations via modern techniques and software paradigms on distributed memory computers

Mark E. Tuckerman¹, D.A. Yarne², Shane O. Samuelson³, Adam Hughes³, Glenn J. Martyna³•Institutions (3)

Courant Institute of Mathematical Sciences¹, University of Pennsylvania², Indiana University³

09 Jun 2000-Computer Physics Communications

TL;DR: Modern molecular dynamics methods are reviewed and their application to quantum many-body systems and electronic structure calculations is described. It is shown how modern object-oriented programming paradigms can be employed to implement multilevel parallel algorithms in a large computational package rapidly and efficiently.

207 citations

Book•10.1007/3-540-46502-2•

Large-scale parallel data mining

Mohammed J. Zaki, Ching-Tien Ho

1 Jan 2000

TL;DR: This volume collects work on parallel and distributed data mining, organized into mining frameworks, associations and sequences, classification, and clustering.

Abstract: Large-Scale Parallel Data Mining.- Parallel and Distributed Data Mining: An Introduction.- Mining Frameworks.- The Integrated Delivery of Large-Scale Data Mining: The ACSys Data Mining Project.- A High Performance Implementation of the Data Space Transfer Protocol (DSTP).- Active Mining in a Distributed Setting.- Associations and Sequences.- Efficient Parallel Algorithms for Mining Associations.- Parallel Branch-and-Bound Graph Search for Correlated Association Rules.- Parallel Generalized Association Rule Mining on Large Scale PC Cluster.- Parallel Sequence Mining on Shared-Memory Machines.- Classification.- Parallel Predictor Generation.- Efficient Parallel Classification Using Dimensional Aggregates.- Learning Rules from Distributed Data.- Clustering.- Collective, Hierarchical Clustering from Distributed, Heterogeneous Data.- A Data-Clustering Algorithm on Distributed Memory Multiprocessors.

187 citations

Patent•

Remote performance management to accelerate distributed processes

Trevor Deosaran, Ram Prabhakar

29 Dec 2000

TL;DR: An intelligent memory system, method, and computer program product are described for enabling stand-alone or distributed client-server software applications to operate at maximum speeds on a personal computer and the like.

Abstract: An intelligent memory system, method, and computer program product for enabling stand-alone or distributed client-server software applications to operate at maximum speeds on a personal computer and the like. An intelligent memory (IM) allows the acceleration of computer software processes through process virtual memory, application optimization, multiprocessor control, and system strategies. The IM includes both control logic and memory. The control logic uses an application database and system database to determine a set of modifications to the computer, application, and/or operating system, while the memory stores the application and allows the control logic to implement the set of modifications. A remote performance management system is also described which allows an IM service provider to supply the infrastructure to clients (e.g., e-businesses and the like who run World Wide Web servers) to facilitate and accelerate their content offerings to end user clients (i.e., consumers).

170 citations

Journal Article•10.1016/S0010-4655(99)00436-1•

Point-centered domain decomposition for parallel molecular dynamics simulation

Reto Koradi, Martin Billeter¹, Peter Güntert²•Institutions (2)

University of Gothenburg¹, ETH Zurich²

01 Feb 2000-Computer Physics Communications

TL;DR: The point-centered domain decomposition algorithm is implemented in the new program Opal p using a standard message passing library, so that it runs on both shared memory and massively parallel distributed memory computers.

160 citations

Journal Article•10.1016/S0010-4655(00)00073-4•

The Distributed Data Interface in GAMESS

Graham D. Fletcher¹, Michael W. Schmidt², Brett M. Bode³, Mark S. Gordon²,³•Institutions (3)

Ames Research Center¹, Iowa State University², Ames Laboratory³

09 Jun 2000-Computer Physics Communications

TL;DR: The Distributed Data Interface (DDI), which permits storage of large data arrays in the aggregate memory of distributed memory, message passing computer systems, is described, and the good performance of an MP2 program using DDI is demonstrated.

152 citations

Patent•

Network processor, memory organization and methods

Brian Mitchell Bass¹, Jean Calvignac¹, Marco C. Heddes¹, Piyush C. Patel¹, Juan Guillermo Revilla¹, Michael Steven Siegel¹, Fabrice Jean Verplanken¹•Institutions (1)

IBM¹

24 Aug 2000

TL;DR: A network switch apparatus (10), components for such an apparatus, and methods of operating such an apparatus are described, in which data flow handling and flexibility are enhanced by the cooperation of a plurality of memory elements and a plurality of interface processors formed on a semiconductor substrate (10).

Abstract: A network switch apparatus (10), components for such an apparatus, and methods of operating such an apparatus in which data flow handling and flexibility is enhanced by the cooperation of a plurality of memory elements and a plurality of interface processors formed on a semiconductor substrate (10). The memory elements and interface processors together form a network processor (10) capable of cooperating with other elements in executing instructions directing the flow of data in a network. Access to the memory elements is controlled in a particular manner and under operative rules which provide controlled multiple accesses of the plurality of memory elements by a plurality of processors.

125 citations

Book Chapter•10.1007/10722167_19•

Distributing Timed Model Checking - How the Search Order Matters

Gerd Behrmann¹, Thomas Hune², Frits W. Vaandrager³•Institutions (3)

Aalborg University¹, Aarhus University², Radboud University Nijmegen³

15 Jul 2000

TL;DR: This paper addresses the problem of distributing model checking of timed automata and shows how in the timed case the search order of the state space is crucial for the effectiveness and scalability of the exploration.

Abstract: In this paper we address the problem of distributing model checking of timed automata. We demonstrate through four real life examples that the combined processing and memory resources of multi-processor computers can be effectively utilized. The approach assumes a distributed memory model and is applied to both a network of workstations and a symmetric multiprocessor machine. However, certain unexpected phenomena have to be taken into account. We show how in the timed case the search order of the state space is crucial for the effectiveness and scalability of the exploration. An effective heuristic to counter the effect of the search order is provided. Some of the results open up for improvements in the single processor case.

115 citations

Patent•

Single integrated circuit embodying a risc processor and a digital signal processor

Robert J. Gove¹, Keith Balmer¹, Nicholas Ing-Simmons¹, Karl M. Guttag¹•Institutions (1)

Texas Instruments¹

3 Mar 2000

TL;DR: A single integrated circuit includes first and second data processors that use different instruction sets and operate independently on disjoint programs and data, together with a shared data transfer controller and shared memory divided into plural independently accessible memory banks.

Abstract: A single integrated circuit includes first and second data processors operating on different instruction sets independently operating on disjoint programs and data. The single integrated circuit preferably includes an external interface, a shared data transfer controller and shared memory divided into plural independently accessible memory banks. The two data processors are preferably a digital signal processor (DSP) and a reduced instruction set computer (RISC) processor. The DSP and RISC processors are suitably programmed to perform differing aspects of computer image processing.

Journal Article•10.1006/JPDC.2000.1658•

OpenMP for Networks of SMPs

Y. Charlie Hu¹, Honghui Lu¹, Alan L. Cox¹, Willy Zwaenepoel¹•Institutions (1)

Rice University¹

01 Dec 2000-Journal of Parallel and Distributed Computing

TL;DR: This paper presents the first system that implements OpenMP on a network of shared-memory multiprocessors, and shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions.

Journal Article•10.1137/S1064827598345679•

A Note On Parallel Matrix Inversion

E. S. Quintana¹, Gregorio Quintana¹, Xiaobai Sun¹, Robert A. van de Geijn•Institutions (1)

James I University¹

01 May 2000-SIAM Journal on Scientific Computing

TL;DR: This work presents one-sweep parallel algorithms for the inversion of general and symmetric positive definite matrices that feature simple programming and performance optimization while maintaining the same arithmetic cost and numerical properties of conventional inversion algorithms.

Abstract: We present one-sweep parallel algorithms for the inversion of general and symmetric positive definite matrices. The algorithms feature simple programming and performance optimization while maintaining the same arithmetic cost and numerical properties of conventional inversion algorithms. Our experiments on a Cray T3E-600 and a Beowulf cluster demonstrate high performance of implementations for distributed memory parallel computers.

Journal Article•10.1177/109434200001400202•

Globalized Newton-Krylov-Schwarz algorithms and software for parallel implicit CFD

William Gropp¹, David E. Keyes², Lois Curfman McInnes¹, Moulay D. Tidriri³•Institutions (3)

Argonne National Laboratory¹, Old Dominion University², Iowa State University³

1 May 2000

TL;DR: This article shows that for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver.

Abstract: Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (ΨNKS) algorithmic framework is presented as a widely applicable answer. This article shows that for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of ΨNKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. The authors therefore distill several recommendations from their experience and reading of the literature on various algorithmic components of ΨNKS, and they describe a freely available MPI-based portable parallel software implementation of the solver employed here.

Patent•

Reader-writer lock for multiprocessor systems

Paul E. McKenney¹, Brent A. Kingsbury¹•Institutions (1)

IBM¹

10 Jan 2000

TL;DR: A reader-writer lock reduces reader and writer overhead by employing lock structures that are shared among groups of processors that have lower latencies. Sharing one flag per group reduces the writer overhead that would otherwise exist if each processor in the system had a separate flag.

Abstract: A reader-writer lock minimizes writer and reader overhead by employing lock structures that are shared among groups of processors that have lower latencies. In the illustrated multiprocessor system having a non-uniform memory access (NUMA) architecture, each processor node has a lock structure comprised of a shared counter and associated flag for each CPU group. During a read, the counter can be changed only by processors within a CPU group performing a read. This reduces the reader overhead that otherwise would exist if all processors in the system shared a single counter. During a write, the shared flag can be changed by a process running on any processor in the system. The processors in a CPU group are notified of the write through the shared flag. This reduces the writer overhead that otherwise would exist if each processor in the system had a separate flag. The number of CPUs per group can be varied to optimize performance of the lock in different multiprocessor systems.
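
The per-group counter and flag described in this abstract can be sketched as follows. This is only an analogy built on Python threads and condition variables, not the patented implementation, which targets NUMA hardware. The class and method names (`Group`, `GroupedRWLock`, `acquire_read`, and so on) are hypothetical, and the serialization of writers through one mutex is an added assumption.

```python
import threading

class Group:
    """Lock state shared by one CPU group: a reader counter and a writer flag."""
    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0
        self.writer_flag = False

class GroupedRWLock:
    def __init__(self, n_groups):
        self.groups = [Group() for _ in range(n_groups)]
        self.writer_mutex = threading.Lock()  # assumption: one writer at a time

    def acquire_read(self, group_id):
        g = self.groups[group_id]             # a reader touches only its own group
        with g.cond:
            while g.writer_flag:
                g.cond.wait()
            g.readers += 1

    def release_read(self, group_id):
        g = self.groups[group_id]
        with g.cond:
            g.readers -= 1
            if g.readers == 0:
                g.cond.notify_all()           # wake a writer waiting for the drain

    def acquire_write(self):
        self.writer_mutex.acquire()
        for g in self.groups:                 # a writer visits every group
            with g.cond:
                g.writer_flag = True          # block new readers in this group
                while g.readers > 0:
                    g.cond.wait()             # wait for current readers to leave

    def release_write(self):
        for g in self.groups:
            with g.cond:
                g.writer_flag = False
                g.cond.notify_all()
        self.writer_mutex.release()

lock = GroupedRWLock(n_groups=2)
lock.acquire_read(0)
lock.release_read(0)
lock.acquire_write()
lock.release_write()
```

The trade-off the abstract mentions shows up in the loop bounds: readers pay a cost that depends only on their own group, while a writer's cost grows with the number of groups, so the group size is a tuning knob.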

Journal Article•10.1016/S0167-8191(00)00010-7•

Pajé, an interactive visualization tool for tuning multi-threaded parallel applications

J. Chassin de Kergommeaux¹, Benhur de Oliveira Stein², Paul-Emile Bernard³•Institutions (3)

Apache Corporation¹, Universidade Federal de Santa Maria², French Institute for Research in Computer Science and Automation³

15 Aug 2000

TL;DR: Pajé, an interactive visualization tool for displaying the execution of parallel applications where a potentially large number of communicating threads of various life-times execute on each node of a distributed memory parallel system, is described.

Abstract: This paper describes Pajé, an interactive visualization tool for displaying the execution of parallel applications where a potentially large number of communicating threads of various life-times execute on each node of a distributed memory parallel system. Pajé is capable of representing a wide variety of interactions between threads. The main characteristics of Pajé, interactivity and scalability, are exemplified by the performance tuning of a molecular dynamics application. In order to be easily extensible, the architecture of the system was based on components which are connected in a data flow graph to produce a given visualization tool. Innovative components were designed, in addition to “classical” components existing in similar visualization systems, to support scalability and interactivity.

Journal Article•10.1109/COMST.2000.5340716•

Buffer management for shared-memory ATM switches

M. Arpaci¹, John A. Copeland¹•Institutions (1)

Georgia Institute of Technology¹

01 Jan 2000-IEEE Communications Surveys and Tutorials

TL;DR: This article surveys the buffer management methods that have been proposed for shared-memory packet switches, examines their strengths and weaknesses, and evaluates their performance using computer simulations.

Abstract: In the shared-memory switch architecture, output links share a single large memory, in which logical FIFO queues are assigned to each link. Although memory sharing can provide a better queuing performance than physically separated buffers, it requires carefully designed buffer management schemes for a fair and robust operation. This article presents a survey of the buffer management methods that have been proposed for shared-memory packet switches. Several buffer management policies are described, and their strengths and weaknesses are examined. The performances of various policies are evaluated using computer simulations. A comparison of the most important schemes is obtained with the help of the simulation results and the results provided in the literature. The survey concludes with a discussion of the possible future research areas related to shared-memory ATM switches.

Patent•

Multiprocessor with each processor element accessing operands in loaded input buffer and forwarding results to FIFO output buffer

Jon M. Huppenthal¹, Paul A. Leskar¹•Institutions (1)

University of Colorado Colorado Springs¹

3 May 2000

TL;DR: An enhanced memory algorithmic processor (MAP) architecture for multiprocessor computer systems comprises an assembly that may include field programmable gate arrays (FPGAs) functioning as the memory algorithmic processors.

Abstract: An enhanced memory algorithmic processor (“MAP”) architecture for multiprocessor computer systems comprises an assembly that may comprise, for example, field programmable gate arrays (“FPGAs”) functioning as the memory algorithmic processors. The MAP elements may further include an operand storage, intelligent address generation, on board function libraries, result storage and multiple input/output (“I/O”) ports. The MAP elements are intended to augment, not necessarily replace, the high performance microprocessors in the system and, in a particular embodiment of the present invention, they may be connected through the memory subsystem of the computer system resulting in it being very tightly coupled to the system as well as being globally accessible from any processor in a multiprocessor computer system.

Dual-cubes: a new interconnection network for high-performance computer clusters

Yamin Li¹, Shietung Peng¹•Institutions (1)

Hosei University¹

1 Jan 2000

TL;DR: This paper introduces a new interconnection network for large-scale distributed memory multiprocessors called dual-cube, which mitigates the problem of the increasing number of links in the large-scale hypercube network while keeping most of the topological properties of the hypercube network.

Abstract: The binary hypercube, or n-cube, has been widely used as the interconnection network in parallel computers. However, the major drawback of the hypercube is the increase in the number of communication links for each node with the increase in the total number of nodes in the system. This paper introduces a new interconnection network for large-scale distributed memory multiprocessors called dual-cube. This network mitigates the problem of the increasing number of links in the large-scale hypercube network while keeping most of the topological properties of the hypercube network. We investigate the topological properties of the dual-cube, compare them with other hypercube-like networks, and establish the basic routing and broadcasting algorithms for dual-cubes.
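
The hypercube drawback named in this abstract follows from the standard n-cube definition: each node has one link per address bit, so the links per node grow as log2 of the node count. The sketch below shows only that baseline. The dual-cube construction is not specified in this abstract and is not reproduced here. The function names are hypothetical.

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in a binary n-cube: flip each of the n address bits."""
    return [node ^ (1 << bit) for bit in range(n)]

def links_per_node(total_nodes):
    """Links per node in a hypercube with `total_nodes` nodes (a power of two)."""
    n = total_nodes.bit_length() - 1
    if total_nodes < 1 or total_nodes != 1 << n:
        raise ValueError("total_nodes must be a power of two")
    return n

print(hypercube_neighbors(0b000, 3))  # the three neighbors of node 0 in a 3-cube
for size in (8, 1024, 65536):
    print(size, links_per_node(size))
```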

A Software Architecture for User Transparent Parallel Image Processing

Frank J. Seinstra¹, Dennis C. Koelma¹, Jan-Mark Geusebroek¹•Institutions (1)

University of Amsterdam¹

1 Jan 2000

TL;DR: In this article, the authors describe a software architecture that allows image processing researchers to develop parallel applications in a transparent manner, where all parallelism is completely hidden from the user, and the main component is an extensive library of data parallel low level image operations capable of running on homogeneous distributed memory MIMD-style multicomputers.

Abstract: This paper describes a software architecture that allows image processing researchers to develop parallel applications in a transparent manner. The architecture's main component is an extensive library of data parallel low level image operations capable of running on homogeneous distributed memory MIMD-style multicomputers. Since the library has an application programming interface identical to that of an existing sequential library, all parallelism is completely hidden from the user. The first part of the paper discusses implementation aspects of the parallel library, and shows how sequential as well as parallel operations are implemented on the basis of so-called parallelizable patterns. A library built in this manner is easily maintainable, as extensive code redundancy is avoided. The second part of the paper describes the application of performance models to ensure efficiency of execution on all target platforms. Experiments show that for a realistic application performance predictions are highly accurate. These results indicate that the core of the architecture forms a powerful basis for automatic parallelization and optimization of a wide range of imaging software.

Patent•

Partition formation using microprocessors in a multiprocessor computer system

David Golden¹, Dennis Mazur¹, Richard Edward Bracken¹•Institutions (1)

Hewlett-Packard¹

31 Aug 2000

TL;DR: In this article, a control system using microprocessors which communicate through a Local Area Network (private LAN) to control operation of both processors and input and output subsystems (IO system) of a multiprocessor computer system is presented.

Abstract: The invention is a control system using microprocessors which communicate through a Local Area Network (private LAN) to control operation of both processors and input and output subsystems (IO system) of a multiprocessor computer system. The processors each have memory associated therewith, and each processor has an IO system comprising a plurality of busses such as PCI busses, associated therewith. The processors are cabled together in a mesh arrangement so that messages can be transferred between any of the processors and delivered to memory associated with the destination processor, or delivered to an IO system associated with the destination processor, etc. The microprocessors are powered on when power is applied to the chassis of the multiprocessor system, and the microprocessors then control the processors of the multiprocessor system, including applying power to the processors, forming hard partitions containing selected processors, computing routes from a processor to a memory associated with any processor for read and write transactions, computing routes to IO subsystems associated with any processor of the hard partition, forming partition boundaries so that processors in one hard partition cannot read and write to memory or IO systems associated with processors in another hard partition, forming soft partitions of processors, controlling boot-up of operating systems executing on the processors of the multiprocessor computer system, removing power from a failed processor, providing power to a repaired processor, etc.

Journal Article•10.1364/AO.39.000671•

Architectural approach to the role of optics in monoprocessor and multiprocessor machines.

Jacques H. Collet, D. Litaize¹, Jan Van Campenhout², Chris Jesshope³, Marc P.Y. Desmulliez, Hugo Thienpont⁴, James R. Goodman⁵, Ahmed Louri⁶•Institutions (6)

University of Toulouse¹, Ghent University², Massey University³, Vrije Universiteit Brussel⁴, University of Wisconsin-Madison⁵, University of Arizona⁶

10 Feb 2000-Applied Optics

TL;DR: It is shown that perhaps the major explanation for why optical technologies have nearly been unable to penetrate into computers is that OI's generally do not shorten the memory-access time, which is the most critical issue for today's stored-program machines.

Abstract: The relevance of introducing optical interconnects (OI’s) in monoprocessors and multiprocessors is studied from an architectural point of view. We show that perhaps the major explanation for why optical technologies have nearly been unable to penetrate into computers is that OI’s generally do not shorten the memory-access time, which is the most critical issue for today’s stored-program machines. In monoprocessors the memory-access time is dominated by the electronic latency of the memory itself. Thus implementing OI’s inside the memory hierarchy without changing the memory architecture cannot dramatically improve the global performance. In strongly coupled multiprocessors the node-bypass latency dominates. Therefore the higher the connectivity (possibly with optics), the shorter the path to another node, but the more expensive the network and the more complex the structure of electronic nodes. This relation leaves the choice of the best network open in terms of simplicity and latency reduction. The bottlenecks resulting from and the benefits of implementing OI’s are discussed with respect to symmetric multiprocessors, rings, and distributed shared-memory supercomputers. © 2000 Optical Society of

Journal Article•10.1109/78.824693•

A hardware efficient control of memory addressing for high-performance FFT processors

Y. Ma, Lars Wanhammar¹•Institutions (1)

Linköping University¹

01 Mar 2000-IEEE Transactions on Signal Processing

TL;DR: The conventional memory organization of fast Fourier transform (FFT) processors is based on Cohen's (1976) scheme, but the new scheme reduces the hardware complexity of address generation by about 50% while improving the memory access speed.

Abstract: The conventional memory organization of fast Fourier transform (FFT) processors is based on Cohen's (1976) scheme. Compared with this scheme, our scheme reduces the hardware complexity of address generation by about 50% while improving the memory access speed. Much power consumption in memory is saved since only half of the memory is activated during memory access, and the number of coefficient accesses is reduced to a minimum by using a new ordering of FFT butterflies. Therefore, the new scheme is a superior solution for constructing high-performance FFT processors.

Journal Article•10.1007/BF02703630•

A survey of checkpointing algorithms for parallel and distributed computers

S. Kalaiselvi¹, V. Rajaraman²•Institutions (2)

Indian Institute of Science¹, Jawaharlal Nehru Centre for Advanced Scientific Research²

01 Oct 2000-Sadhana-academy Proceedings in Engineering Sciences

TL;DR: This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems and concludes that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery.

Abstract: Checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in this paper and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extends cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is however felt that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery.
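
The definition of a checkpoint at the start of this abstract can be made concrete for a single process. The sketch below is not one of the surveyed algorithms: it involves no message passing and no coordination between processes, and it treats a local file as the stable storage. The function names, the state layout, and the interval of 10 steps are hypothetical.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place atomically."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)

def run(path, n_steps, fail_at=None, interval=10):
    """Sum the step indices 0..n_steps-1, checkpointing every `interval` steps."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]

path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
try:
    run(path, 100, fail_at=57)   # crashes after the checkpoint taken at step 50
except RuntimeError:
    pass
print(run(path, 100))            # resumes from the last checkpoint
```

Work done between the last checkpoint and the crash is repeated on restart, which is the rollback cost that the checkpoint interval trades against the overhead of saving state.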

Patent•

System and method for high performance execution of locked memory instructions in a system with distributed memory and a restrictive memory model

Bryan D. Boatright¹, Rajesh Patel, Larry Edward Thatcher•Institutions (1)

Intel¹

29 Dec 2000

TL;DR: This patent describes a system and method for the high performance execution of locked memory instructions in a system with distributed memory and a restrictive memory model. The method includes decoding a locked-memory instruction, obtaining exclusive ownership of a cacheline to be used by a load-lock operation, setting a bit to indicate the load-lock operation's ownership of the cacheline, and activating a snoop checking process.

Abstract: The present invention relates to locked memory instructions, and more specifically to a system and method for the high performance execution of locked memory instructions in a system with distributed memory and a restrictive memory model. In accordance with an embodiment of the present invention, a method for executing locked-memory instructions includes decoding a locked-memory instruction, obtaining exclusive ownership of a cacheline to be used by a load-lock operation, setting a bit to indicate the load-lock operation's ownership of the cacheline, and activating a snoop checking process. The method also includes modifying a load data value and storing the modified load data value. The method further includes determining that the cacheline is still exclusively owned, storing the load data value, determining that the cacheline is unsnooped, merging the modified load data value with the load data value, and releasing the locked-memory instruction to be retired.

Proceedings Article•10.1145/337292.337428•

Memory aware compilation through accurate timing extraction

Peter Grun¹, Nikil Dutt¹, Alexandru Nicolau¹•Institutions (1)

University of California, Irvine¹

1 Jun 2000

TL;DR: A memory-aware compiler approach is described that exploits efficient memory access modes by extracting accurate timing information, allowing the compiler's scheduler to perform global code reordering to better hide the latency of memory operations.

Abstract: Memory delays represent a major bottleneck in embedded systems performance. Newer memory modules exhibiting efficient access modes (e.g., page-, burst-mode) partly alleviate this bottleneck. However, such features cannot be efficiently exploited in processor-based embedded systems without memory-aware compiler support. We describe a memory-aware compiler approach that exploits such efficient memory access modes by extracting accurate timing information, allowing the compiler's scheduler to perform global code reordering to better hide the latency of memory operations. Our memory-aware compiler scheduled several benchmarks on the TI C6201 processor architecture interfaced with a 2-bank synchronous DRAM and generated average improvements of 24% over the best possible schedule using a traditional (memory-transparent) optimizing compiler, demonstrating the utility of our memory-aware compilation approach.

Patent•

System and method for fault handling and recovery in a multi-processing system having hardware resources shared between multiple partitions

Roger L. Gilbertson¹, Mitchell A. Bauman¹, Penny L. Svenkeson¹, James L. DePenning¹, Michael L. Haupt¹, Donald R. Kalvestrand¹, Daniel S. Tokoly¹, Frederick G. Fellenser¹, Maria A. Liedman¹•Institutions (1)

Unisys¹

28 Apr 2000

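TL;DR: This patent describes poisoning specific memory locations when part of a multiprocessor system becomes faulty, which isolates data owned by failing units while non-failing units continue processing. A support processor handles non-immediate problems and allows resetting of memory locations formerly owned by failed units.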

Abstract: Poisoning specific memory locations when a part of a multiprocessor computer system becomes faulty makes it possible to isolate specific data owned by individual failing units, even in a shared memory area. Non-failing units can also continue processing. A support processor handles non-immediate problems and allows resetting of memory locations formerly owned by failed units.

Patent•

Symmetric multiprocessing (SMP) system with fully-interconnected heterogenous microprocessors

Ravi Kumar Arimilli¹, David William Siegel¹•Institutions (1)

IBM¹

28 Dec 2000

TL;DR: Disclosed is a fully-interconnected, heterogeneous, multiprocessor data processing system with a plurality of processors each having unique characteristics including, for example, different processing speeds (frequency) and different cache topologies (sizes, levels, etc.).

Abstract: Disclosed is a fully-interconnected, heterogeneous, multiprocessor data processing system. The data processing system topology has a plurality of processors each having unique characteristics including, for example, different processing speeds (frequency) and different cache topologies (sizes, levels, etc.). Second and third generation heterogeneous processors are connected to a specialized set of pins, connected to the system bus. The processors are interconnected and communicate via an enhanced communication protocol and specialized SMP bus topology that supports the heterogeneous topology and enables newer processors to support full downward compatibility to the previous generation processors. Various processor functions are modified to support operations on either of the processors depending on which processor is assigned which operations. The enhanced communication protocol, operating system, and other processor logic enable the heterogeneous multiprocessor data processing system to operate as a symmetric multiprocessor system.

Journal Article•10.1137/S1064827598340809•

Algebraic Two-Level Preconditioners for the Schur Complement Method

Luiz Mariano Carvalho, Luc Giraud, P. Le Tallec

01 Jun 2000-SIAM Journal on Scientific Computing

TL;DR: This paper presents a set of preconditioners for the Schur complement domain decomposition method that implement a global coupling mechanism, through coarse-space components, similar to the one proposed by Bramble, Pasciak, and Schatz.

Abstract: The solution of elliptic problems is challenging on parallel distributed memory computers since their Green's functions are global. To address this issue, we present a set of preconditioners for the Schur complement domain decomposition method. They implement a global coupling mechanism, through coarse-space components, similar to the one proposed in [Bramble, Pasciak, and Schatz, Math. Comp., 47 (1986), pp. 103–134]. The definition of the coarse-space components is algebraic; they are defined using the mesh partitioning information and simple interpolation operators. These preconditioners are implemented on distributed memory computers without introducing any new global synchronization in the preconditioned conjugate gradient iteration. The numerical and parallel scalability of those preconditioners is illustrated on two-dimensional model examples that have anisotropy and/or discontinuity phenomena.

Journal Article•10.1109/71.895795•

Experiences with parallel N-body simulation

Pangfeng Liu¹, Sandeep N. Bhatt²•Institutions (2)

National Chung Cheng University¹, Akamai Technologies²

01 Dec 2000-IEEE Transactions on Parallel and Distributed Systems

TL;DR: This paper describes the authors' experiences developing high-performance code for astrophysical N-body simulations and uses a technique for implicitly representing a dynamic global tree across multiple processors, which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures.

Abstract: This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational work-load among processors, these issues pose interesting challenges and tradeoffs for high-performance implementation. Our implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally justified. The code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM-5. The overall performance on instances with 10 million bodies is typically over 48 percent of the peak machine rate, which compares favorably with other approaches.
