Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 2001
Journal Article•10.1137/S0895479899358194•
A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling

Patrick R. Amestoy, Iain S. Duff, Jean-Yves L'Excellent, Jacko Koster
01 Jan 2001-SIAM Journal on Matrix Analysis and Applications
TL;DR: The main features and the tuning of the algorithms for the direct solution of sparse linear systems on distributed memory computers developed in the context of a long term European research project are analyzed and discussed.
Abstract: In this paper, we analyze the main features and discuss the tuning of the algorithms for the direct solution of sparse linear systems on distributed memory computers developed in the context of a long term European research project. The algorithms use a multifrontal approach and are especially designed to cover a large class of problems. The problems can be symmetric positive definite, general symmetric, or unsymmetric matrices, both possibly rank deficient, and they can be provided by the user in several formats. The algorithms achieve high performance by exploiting parallelism coming from the sparsity in the problem and that available for dense matrices. The algorithms use a dynamic distributed task scheduling technique to accommodate numerical pivoting and to allow the migration of computational tasks to lightly loaded processors. Large computational tasks are divided into subtasks to enhance parallelism. Asynchronous communication is used throughout the solution process to efficiently overlap communication with computation. We illustrate our design choices by experimental results obtained on an SGI Origin 2000 and an IBM SP2 for test matrices provided by industrial partners in the PARASOL project.

2,507 citations

Journal Article•10.1127/0941-2948/2001/0010-0363•
PALM - A large-eddy simulation model performing on massively parallel computers

Siegfried Raasch, M. Schröter
15 Oct 2001-Meteorologische Zeitschrift
TL;DR: The performance of the new PArallelized LES Model (PALM) is excellent on SGI/Cray-T3E systems and an almost linear speed-up is achieved up to very large numbers of processors.
Abstract: An existing code of a large-eddy simulation (LES) model for the study of turbulent processes in the atmospheric and oceanic boundary layer has been completely recoded for use on massively parallel systems with distributed memory. Parallelization is achieved by two-dimensional domain decomposition and communication is realized by the message passing interface (MPI). Periodic boundary conditions, which are used in both horizontal directions, helped to minimize the parallelization effort. The performance of the new PArallelized LES Model (PALM) is excellent on SGI/Cray-T3E systems and an almost linear speed-up is achieved up to very large numbers of processors. Parallelization strategy and model performance are discussed, and validation experiments as well as future applications are presented.

345 citations
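
The two-dimensional domain decomposition with periodic horizontal boundaries described in the PALM abstract can be illustrated with a small neighbour calculation. This is a minimal sketch, not PALM's code: the function name, the row-major rank layout, and the `x_minus`/`x_plus`/`y_minus`/`y_plus` labels are assumptions made for illustration.

```python
def periodic_neighbours(rank, px, py):
    """Return the four neighbour ranks of `rank` in a px-by-py process grid.

    Ranks are assumed to be laid out row-major (rank = ix * py + iy).
    Periodic boundaries mean the grid wraps around in both directions.
    """
    if not 0 <= rank < px * py:
        raise ValueError("rank outside the process grid")
    ix, iy = divmod(rank, py)
    return {
        "x_minus": ((ix - 1) % px) * py + iy,
        "x_plus": ((ix + 1) % px) * py + iy,
        "y_minus": ix * py + (iy - 1) % py,
        "y_plus": ix * py + (iy + 1) % py,
    }
```

Because of the wrap-around, every subdomain has exactly four neighbours and no boundary special cases are needed, which is one plausible reading of why periodic conditions reduced the parallelization effort.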

Journal Article•10.1023/A:1008771324652•
Parallelizing the Murϕ Verifier

Ulrich Stern1, David L. Dill1•
Stanford University1
1 Mar 2001
TL;DR: A parallel version of the explicit state enumeration verifier Murϕ for distributed memory multiprocessors and networks of workstations using the message passing paradigm shows close to linear speedups, which are largely insensitive to communication latency and bandwidth.
Abstract: With the use of state and memory reduction techniques in verification by explicit state enumeration, runtime becomes a major limiting factor. We describe a parallel version of the explicit state enumeration verifier Murϕ for distributed memory multiprocessors and networks of workstations using the message passing paradigm. In experiments with three complex cache coherence protocols on an SP2 multiprocessor and on a network of workstations at UC Berkeley, parallel Murϕ shows close to linear speedups, which are largely insensitive to communication latency and bandwidth. There is some slowdown with increasing communication overhead, for which a simple yet relatively accurate approximation formula is given. Techniques to reduce overhead and required bandwidth and to allow heterogeneity and dynamically changing load in the parallel machine are discussed, which we expect will allow good speedups when using conventional networks of workstations.

177 citations

Journal Article•10.1109/12.956091•
A proposal for a heterogeneous cluster ScaLAPACK (dense linear solvers)

Olivier Beaumont1, Vincent Boudet, A. Petitet2, Fabrice Rastello, Yves Robert •
École normale supérieure de Lyon1, University of Tennessee2
01 Oct 2001-IEEE Transactions on Computers
TL;DR: The authors study the load balancing problem for dense linear algebra kernels on heterogeneous networks of workstations and propose a data allocation heuristic that balances the load on heterogeneous platforms with respect to the performance of the processors.
Abstract: The authors study the implementation of dense linear algebra kernels, such as matrix multiplication or linear system solvers, on heterogeneous networks of workstations. The uniform block-cyclic data distribution scheme commonly used for homogeneous collections of processors limits the performance of these linear algebra kernels on heterogeneous grids to the speed of the slowest processor. We present and study more sophisticated data allocation strategies that balance the load on heterogeneous platforms with respect to the performance of the processors. When targeting unidimensional grids, the load-balancing problem can be solved rather easily. When targeting two-dimensional grids, which are the key to scalability and efficiency for numerical kernels, the problem turns out to be surprisingly difficult. We formally state the 2D load-balancing problem and prove its NP-completeness. Next, we introduce a data allocation heuristic, which turns out to be very satisfactory: Its practical usefulness is demonstrated by MPI experiments conducted with a heterogeneous network of workstations.

134 citations
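
The abstract notes that the unidimensional load-balancing problem can be solved rather easily. A minimal sketch of one simple greedy approach follows; it is an illustration under stated assumptions, not necessarily the authors' algorithm. The names `allocate_blocks` and `makespan`, and the model in which processor `i` needs `cycle_times[i]` time units per block, are hypothetical.

```python
def allocate_blocks(n_blocks, cycle_times):
    """Greedily distribute n_blocks over processors of different speeds.

    cycle_times[i] is the assumed time processor i needs for one block.
    Each block goes to the processor that would finish earliest after
    receiving it, which keeps the slowest finish time small.
    """
    counts = [0] * len(cycle_times)
    for _ in range(n_blocks):
        best = min(range(len(counts)),
                   key=lambda k: (counts[k] + 1) * cycle_times[k])
        counts[best] += 1
    return counts


def makespan(counts, cycle_times):
    """Finish time of the slowest processor for a given allocation."""
    return max(c * t for c, t in zip(counts, cycle_times))
```

For seven blocks and cycle times `[1, 2, 4]`, the greedy allocation is `[4, 2, 1]` with a finish time of 4, whereas a near-uniform split such as `[3, 2, 2]` finishes at 8 because the slowest processor dominates.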

Journal Article•10.1145/373574.373576•
The memory gap and the future of high performance memories

Maurice V. Wilkes1•
AT&T1
01 Mar 2001-ACM Sigarch Computer Architecture News
TL;DR: Since 1980, the memory gap has been increasing steadily, and during the last ten years, processors have been improving in speed by 60% per annum, whereas DRAM memory access has been improving at barely 10%.
Abstract: The first main memories to be used on digital computers were constructed using a technology much slower than that used for the logic circuits, and it was taken for granted that there would be a memory gap. Mercury delay line memories spent a lot of their time waiting for the required word to come round and were very slow indeed. CRT (Williams Tube) memories and the core memories that followed them were much better. By the early 1970s semiconductor memories were beginning to appear. This did not result in memory performance catching up fully with processor performance, although in the 1970s it came close. It might have been expected that from that point memories and processors would scale together, but this did not happen. This was because of significant differences in the DRAM semiconductor technology used for memories compared with the technology used for circuits. The memory gap makes itself felt when a cache miss occurs and the missing word must be supplied from main memory. It thus only affects users whose programs do not fit into the L2 cache. As far as a workstation user is concerned, the most noticeable effect of an increased memory gap is to make the observed performance more dependent on the application area than it would otherwise be. Since 1980, the memory gap has been increasing steadily. During the last ten years, processors have been improving in speed by 60% per annum, whereas DRAM memory access has been improving at barely 10%. It may thus be said that, while the memory gap is not at present posing a major problem, the writing is on the wall. On an Alpha 21264 667 MHz workstation (XP1000) in 2000, a cache miss cost about 128 clock cycles. This may be compared with the 8 – 32 clock cycles in the minicomputer and workstations of 1990 [1]. If the memory latency remains unchanged, the number of cycles of processor idle time is doubled with each doubling of speed of the processor. A factor of four will bring us to about 500 clock cycles.

132 citations
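
The closing arithmetic of the abstract can be checked with a few lines. This is only a worked example of the figures quoted above (128 cycles at 667 MHz, a factor of four in clock speed); the function names are hypothetical.

```python
def miss_latency_ns(miss_cycles, clock_mhz):
    """Convert a cache-miss cost in clock cycles to nanoseconds."""
    return miss_cycles / clock_mhz * 1000.0


def miss_cycles_after_speedup(miss_cycles, clock_speedup):
    """Miss cost in cycles when the clock gets faster but the
    memory latency in nanoseconds stays the same."""
    return miss_cycles * clock_speedup


latency = miss_latency_ns(128, 667)         # roughly 192 ns
future = miss_cycles_after_speedup(128, 4)  # 512 cycles, i.e. "about 500"
```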

Patent•
Shared memory apparatus and method for multiprocessor systems

Theodore G. Scardamalia, Lynn Parker West
10 May 2001
TL;DR: A memory alias adapter, coupled to a processor's memory bus, monitors processor memory accesses. Whenever a memory access corresponds to shared memory, rather than memory local to the processor, the adapter constructs a memory request message and transmits the message over a network link to a shared memory unit.
Abstract: A memory alias adapter, coupled to a processor's memory bus, monitors processor memory accesses. Whenever a memory access corresponds to shared memory, rather than memory local to the processor, the adapter constructs a memory request message, and transmits the message over a network link to a shared memory unit. The shared memory unit performs the shared memory access and issues a response message over the network link. The memory alias adapter accepts the response message, and completes the processor's memory access on the memory bus. As a result, it is transparent to the processor whether its memory access is to the local memory or to the shared memory.

126 citations

Patent•
Systems and methods for management of memory

Chaoxin C. Qui1, Mark J. Conrad1, Robert M. Farber1, Scott C. Johnson1•
Conrad Hotels1
2 Nov 2001
TL;DR: Memory management systems and methods are described that may be employed, for example, to provide efficient management of memory for network systems; they may utilize a multi-layer queue management structure to manage buffer/cache memory in an integrated fashion.
Abstract: Memory management systems and methods that may be employed, for example, to provide efficient management of memory for network systems. The disclosed systems and methods may utilize a multi-layer queue management structure to manage buffer/cache memory in an integrated fashion. The disclosed systems and methods may be implemented as part of an information management system, such as a network processing system that is operable to process information communicated via a network environment, and that may include a network processor operable to process network-communicated information and a memory management system operable to reference the information based upon a connection status associated with the content.

121 citations

Proceedings Article•10.1145/502217.502223•
Heterogeneous memory management for embedded systems

Oren Avissar1, Rajeev Barua1, Dave Stewart•
University of Maryland, College Park1
16 Nov 2001
TL;DR: A compiler strategy that automatically partitions the data among the memory units of software-exposed heterogeneous memory is presented, and it is shown that this strategy is optimal among all static partitions for global and stack data, and a good heuristic for heap data.
Abstract: This paper presents a technique for the efficient compiler management of software-exposed heterogeneous memory. In many lower-end embedded chips, often used in micro-controllers and DSP processors, heterogeneous memory units such as scratch-pad SRAM, internal DRAM, external DRAM and ROM are visible directly to the software, without automatic management by a hardware caching mechanism. Instead the memory units are mapped to different portions of the address space. Caches are avoided because of their cost and power consumption, and because they make it difficult to guarantee real-time performance. For this important class of embedded chips, the allocation of data to different memory units to maximize performance is the responsibility of the software. Current practice typically leaves it to the programmer to partition the data among the different memory units. We present a compiler strategy that automatically partitions the data among the memory units. We show that this strategy is optimal among all static partitions for global and stack data, and a good heuristic for heap data. For global and stack data, the scheme is provably equal to or better than any other compiler scheme or set of programmer annotations. Preliminary results show the benefits of optimal allocation: with just 20% of the data in SRAM, the formulation is able to decrease the runtime by 39% on average for our benchmarks vs. allocating all data to slow memory, without any programmer involvement. For some programs, less than 5% of data in SRAM achieves a similar speedup.

118 citations

Patent•
Multiprocessor system implementing virtual memory using a shared memory, and a page replacement method for maintaining paged memory coherence

Thomas M. Deneau1•
GlobalFoundries1
24 Apr 2001
TL;DR: A virtual memory page replacement method is described for use in a multiprocessor computer system; the method is designed to help maintain paged memory coherence within the system.
Abstract: A computer system including a first processor, a second processor in communication with the first processor, a memory coupled to the first and second processors (i.e., a shared memory) and including multiple memory locations, and a storage device coupled to the first processor. The first and second processors implement virtual memory using the memory. The first processor maintains a first set of page tables and a second set of page tables in the memory. The first processor uses the first set of page tables to access the memory locations within the memory. The second processor uses the second set of page tables, maintained by the first processor, to access the memory locations within the memory. A virtual memory page replacement method is described for use in the computer system, wherein the virtual memory page replacement method is designed to help maintain paged memory coherence within the multiprocessor computer system.

97 citations

Patent•
Method and apparatus for implementing atomicity of memory operations in dynamic multi-streaming processors

Stephen Melvin1, Mario Nemirovsky1•
MIPS Technologies1
7 Jun 2001
TL;DR: A lock mechanism for multi-streaming processors is described, comprising one or more storage locations associated with each stream of the processor, each storage location enabled to store a memory address, a lock bit, and a stall bit.
Abstract: A multi-streaming processor has a plurality of streams for streaming one or more instruction threads, a set of functional resources for processing instructions from streams, and a lock mechanism for locking selected memory locations shared by streams of the processor, the hardware-lock mechanism operating to set a lock when an atomic memory sequence is started and to clear a lock when an atomic memory sequence is completed. In preferred embodiments the lock mechanism comprises one or more storage locations associated with each stream of the processor, each storage location enabled to store a memory address, a lock bit, and a stall bit. Methods for practicing the invention using the apparatus are also taught.

93 citations

Patent•
Packet processing in shared memory multi-computer systems

Lance W. Russell1•
Hewlett-Packard1
4 Oct 2001
TL;DR: Systems and methods of processing packets in a shared memory multi-computer environment are described, in which a local shared memory network is provided between local nodes and a global shared memory network is provided between the local nodes and one or more remote nodes.
Abstract: Systems and methods of processing packets in a shared memory multi-computer environment are described. A local shared memory network is provided between local nodes and a global shared memory network is provided between the local nodes and one or more remote nodes. In this way, local nodes may communicate through standard network interfaces while using shared memory as the physical transport medium. In addition, a multi-computer system may be addressed externally and internally as individual nodes over the local shared memory network. A multi-computer system also may be addressed externally and internally as a single node over the global shared memory network.

Patent•
Increasing web page browsing efficiency by periodically physically distributing memory media on which web page data are cached

Mitchell London, Richard Saville
8 Jun 2001
TL;DR: Data for selected web sites likely to be accessed are stored on memory media (compact disks) that are periodically produced and physically distributed to subscribers; a local proxy program can update the cached content while the subscriber's network connection is not otherwise in use.
Abstract: Data for selected web sites likely to be accessed are stored on memory media (compact disks) that are physically transported to subscribers for use when browsing the Internet. New copies of the memory media with the data for the selected web sites are periodically produced and distributed to the subscribers. In addition, if the data cached on the memory media are stored on a local hard drive of the subscriber, the content of the cache can be updated by a local proxy program during the subscriber's connection to the network when the connection is not otherwise in use. The local proxy program is installed on the subscriber's computer from the distributed memory media. A data center provides Intelligent Prefetching and implements a validation service to determine if the cached data items are current. If not, the current data are loaded into the browser program through the Internet connection.

Proceedings Article•10.1145/500001.500006•
An optimal memory allocation for application-specific multiprocessor system-on-chip

Samy Meftali, Ferid Gharsalli, Frederic Rousseau, Ahmed Amine Jerraya
30 Sep 2001
TL;DR: A novel and systematic approach is presented for the design of shared memory architectures for application-specific multiprocessor system-on-chip, based on an integer linear programming model; it allows automatic generation of an architecture-level specification of the application.
Abstract: In this paper, we present a novel and systematic approach for the design of shared memory architectures in the case of application-specific multiprocessor system-on-chip. This paper focuses on a memory allocation step which is based on an integer linear programming model. It yields an optimal distributed shared memory architecture that minimizes the global cost of accessing the shared data in the application, as well as the memory cost. Our approach allows automatic generation of an architecture-level specification of the application. The effectiveness of this approach is illustrated by a packet routing switch example.

Patent•
System and method for a distributed shared memory

Craig Alan Buhlman1, Anthony J. Casorso1•
Storage Technology Corporation1
17 Jul 2001
TL;DR: A distributed shared memory system is described in which each of multiple processors transmits its shared memory write commands to every processor, so that each processor receives each write command and completes it at its own local copy of the shared memory.
Abstract: A system and method for a distributed shared memory. The system includes multiple processors, each processor transmitting write commands issued therefrom concerning a shared memory to each of the processors, such that each processor receives each shared memory write command transmitted. The system also includes multiple local memories, each local memory associated with one of the processors and having a copy of the shared memory, wherein each processor completes each received shared memory write command at its associated local memory such that the copies of the shared memory remain consistent at all times. The method includes transmitting write commands concerning the shared memory to each of the processors, such that each processor receives each shared memory write command transmitted, and completing each received shared memory write command at the associated local memory such that the copies of the shared memory remain consistent at all times.

Journal Article•
Analysis and comparison of two general sparse solvers for distributed memory computers

Patrick R. Amestoy1, Iain S. Duff, Jean-Yves L'Excellent2, Xiaoye S. Li3•
ENSEEIHT1, École normale supérieure de Lyon2, Lawrence Berkeley National Laboratory3
01 Jan 2001-Rutherford Appleton Laboratory Technical Reports
TL;DR: The authors compare two state-of-the-art direct solvers, MUMPS and SuperLU, for large sparse sets of linear equations on large-scale distributed memory computers.
Abstract: This paper provides a comprehensive study and comparison of two state-of-the-art direct solvers for large sparse sets of linear equations on large-scale distributed-memory computers. One is a multifrontal solver called MUMPS, the other is a supernodal solver called SuperLU. We describe the main algorithmic features of the two solvers and compare their performance characteristics with respect to uniprocessor speed, interprocessor communication, and memory requirements. For both solvers, preorderings for numerical stability and sparsity play an important role in achieving high parallel efficiency. We analyse the results with various ordering algorithms. Our performance analysis is based on data obtained from runs on a 512-processor Cray T3E using a set of matrices from real applications. We also use regular 3D grid problems to study the scalability of the two solvers.

Journal Article•10.1016/S0167-8191(01)00080-1•
Parallel algebraic multigrid based on subdomain blocking

Arnold Krechel1, Klaus Stüben1•
Center for Information Technology1
1 Jul 2001
TL;DR: This work considers a classical AMG variant which has turned out to be highly robust and efficient in solving large systems of equations corresponding to elliptic PDEs, discretized by finite differences or finite volumes.
Abstract: The algebraic multigrid (AMG) approach provides a purely algebraic means to tackle the efficient solution of systems of equations posed on large unstructured grids, in 2D and 3D. While sequential AMG has been used for increasingly large problems (with several million unknowns), its application to even larger applications requires a parallel version. Since, in contrast to geometric multigrid, the hierarchy of coarser levels and the related operators develop dynamically during the setup phase of AMG, a direct parallelization is very complicated. Moreover, a “naive” parallelization would, in general, require unpredictable and highly complex communication patterns which seriously limit the achievable scalability, in particular of the costly setup phase. In this paper, we consider a classical AMG variant which has turned out to be highly robust and efficient in solving large systems of equations corresponding to elliptic PDEs, discretized by finite differences or finite volumes. Based on a straightforward partitioning of variables (using one of the available algebraic partitioning tools such as Metis), a parallelization approach is proposed which minimizes the communication without sacrificing convergence in complex situations. Results will be presented for industrial CFD and oil-reservoir simulation applications on distributed memory machines, including PC-clusters.

Journal Article•10.1016/S0167-8191(01)00120-X•
Scalable parallel FFT for spectral simulations on a Beowulf cluster

P. Dmitruk1, Lian-Ping Wang1, W. H. Matthaeus1, R. Zhang2, D. Seckel1 •
University of Delaware1, Exa Corporation2
31 Dec 2001
TL;DR: This work shows that 1/P scaling for execution time at fixed problem size N^3 can be obtained provided that (1) the transpose algorithm is optimized for simultaneous block communication by all processors; and (2) communication is arranged for non-overlapping pairwise communication between processors, thus eliminating blocking when standard fast ethernet interconnects are employed.
Abstract: The implementation and performance of the multidimensional Fast Fourier Transform (FFT) on a distributed memory Beowulf cluster are examined. We focus on the three-dimensional (3D) real transform, an essential computational component of Galerkin and pseudo-spectral codes. The approach studied is a 1D domain decomposition algorithm that relies on a communication-intensive transpose operation involving P processors. Communication is based upon the standard portable message passing interface (MPI). We show that 1/P scaling for execution time at fixed problem size N^3 (i.e., linear speedup) can be obtained provided that (1) the transpose algorithm is optimized for simultaneous block communication by all processors; and (2) communication is arranged for non-overlapping pairwise communication between processors, thus eliminating blocking when standard fast ethernet interconnects are employed. This method provides the basis for implementation of scalable and efficient spectral method computations of hydrodynamic and magneto-hydrodynamic turbulence on Beowulf clusters assembled from standard commodity components. An example is presented using a 3D passive scalar code.
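
Condition (2) in the abstract, non-overlapping pairwise communication, can be illustrated with a small schedule generator. The abstract does not say which schedule the authors use; the XOR pairing below is one common choice and is an assumption, as are the function names and the restriction to a power-of-two processor count. The transpose is simulated in a single process rather than with MPI.

```python
def pairwise_schedule(p):
    """Return p-1 rounds of disjoint (a, b) rank pairs.

    In round s, rank r is paired with r XOR s, so every rank has exactly
    one partner per round. Assumes p is a power of two.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return [[(r, r ^ s) for r in range(p) if r < (r ^ s)]
            for s in range(1, p)]


def transpose_blocks(blocks):
    """Simulate a block transpose: blocks[r][c] is the block rank r holds
    for rank c. Each scheduled pair swaps the blocks meant for each other."""
    p = len(blocks)
    out = [row[:] for row in blocks]
    for round_pairs in pairwise_schedule(p):
        for a, b in round_pairs:
            out[a][b], out[b][a] = out[b][a], out[a][b]
    return out
```

Since each pair of ranks meets exactly once over the `p - 1` rounds, the simulated result is the full block transpose, and no rank is asked to talk to two partners in the same round.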

Posted Content•
Parallel Programming with MatlabMPI

Jeremy Kepner1•
Massachusetts Institute of Technology1
20 Jul 2001-arXiv: Astrophysics
TL;DR: The key technical innovation of MatlabMPI is that it implements the widely used MPI "look and feel" on top of standard Matlab file I/O, resulting in an extremely compact (~100 lines) and "pure" implementation which runs anywhere Matlab runs.
Abstract: MatlabMPI is a Matlab implementation of the Message Passing Interface (MPI) standard and allows any Matlab program to exploit multiple processors. MatlabMPI currently implements the basic six functions that are the core of the MPI point-to-point communications standard. The key technical innovation of MatlabMPI is that it implements the widely used MPI "look and feel" on top of standard Matlab file I/O, resulting in an extremely compact (~100 lines) and "pure" implementation which runs anywhere Matlab runs. The performance has been tested on both shared and distributed memory parallel computers. MatlabMPI can match the bandwidth of C based MPI at large message sizes. A test image filtering application using MatlabMPI achieved a speedup of ~70 on a parallel computer.
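
The idea of layering point-to-point message passing on ordinary file I/O can be sketched in a few lines. The following is a Python illustration of the general technique, not MatlabMPI's code: the file naming scheme, the use of `pickle`, the separate marker file, and the `send`/`recv` signatures are all assumptions.

```python
import os
import pickle
import tempfile
import time


def _paths(comm_dir, src, dst, tag):
    """Hypothetical naming scheme: one buffer file plus one marker file."""
    base = os.path.join(comm_dir, f"p{src}_to_p{dst}_tag{tag}")
    return base + ".buf", base + ".lock"


def send(comm_dir, src, dst, tag, obj):
    """Write the message, then create the marker to signal it is complete."""
    buf, lock = _paths(comm_dir, src, dst, tag)
    with open(buf, "wb") as f:
        pickle.dump(obj, f)
    open(lock, "w").close()


def recv(comm_dir, src, dst, tag, timeout=5.0, poll=0.01):
    """Poll for the marker file, read the message, then clean up."""
    buf, lock = _paths(comm_dir, src, dst, tag)
    deadline = time.monotonic() + timeout
    while not os.path.exists(lock):
        if time.monotonic() > deadline:
            raise TimeoutError("no message arrived")
        time.sleep(poll)
    with open(buf, "rb") as f:
        obj = pickle.load(f)
    os.remove(lock)
    os.remove(buf)
    return obj


# Usage within one process: a shared directory stands in for the
# file system that all ranks can see.
comm_dir = tempfile.mkdtemp()
send(comm_dir, 0, 1, 7, {"a": [1, 2]})
received = recv(comm_dir, 0, 1, 7)
```

Creating the marker only after the buffer file is closed keeps a receiver from reading a half-written message, at the cost of polling latency on small messages.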

Journal Article•10.1108/02644400110365851•
A Fully Distributed Parallel Global Search Algorithm

Layne T. Watson1, Chuck A. Baker•
Virginia Tech1
01 Feb 2001-Engineering Computations
TL;DR: The n‐dimensional direct search algorithm, DIRECT, developed by Jones, Perttunen, and Stuckman, is appropriate for the sort of global design space exploration done in large scale engineering design.
Abstract: The n‐dimensional direct search algorithm, DIRECT, developed by Jones, Perttunen, and Stuckman has attracted recent attention from the multidisciplinary design optimization community. Since DIRECT only requires function values (or ranking) and balances global exploration with local refinement better than n‐dimensional bisection, it is well suited to the noisy function values typical of realistic simulations. While not efficient for high accuracy optimization, DIRECT is appropriate for the sort of global design space exploration done in large scale engineering design. Direct and pattern search schemes have the potential to exploit massive parallelism, but efficient use of massively parallel machines is non‐trivial to achieve. A fully‐distributed control version of DIRECT that is designed for massively parallel (distributed memory) architectures is presented. Parallel results are presented for a multidisciplinary design optimization problem – configuration design of a high speed civil transport.

Patent•
Distributed computing system clustering model providing soft real-time responsiveness and continuous availability

Bruce D. Rostowfske, Thomas H. Buscher, Andrew W. Peck, Peter G. Litskevitch
10 Apr 2001
TL;DR: A distributed computing system comprises a primary server having a primary virtual shared memory and a back-up server having a back-up virtual shared memory; the primary server periodically provides a state table to the back-up server in order to synchronize the two virtual shared memories.
Abstract: A distributed computing system comprises a primary server having a primary virtual shared memory and a back-up server having a back-up virtual shared memory. The primary server periodically provides a state table to the back-up server in order to synchronize the virtual shared memory and the back-up virtual shared memory. A plurality of client computer resources are coupled to the primary server and the back-up server through a network architecture. The client computer resources further comprise plural worker processes each adapted to independently perform an operation on a data object disposed within the primary virtual shared memory without a predetermined assignment between the worker process and the data object. Upon an unavailability of either the primary server or the back-up server, the worker process performs the operation on the corresponding data object disposed within the back-up virtual shared memory. The client computer resources further comprise plural input/output (I/O) ports adapted to receive incoming data packets and transmit outgoing data packets.

Proceedings Article•10.1145/582034.582084•
A Ghost Cell Expansion Method for Reducing Communications in Solving PDE Problems

Chris Ding1, Yun He2•
Lawrence Berkeley National Laboratory1, University of California, Berkeley2
10 Nov 2001
TL;DR: A new approach is proposed which expands the ghost cell layers and thus updates boundaries much less frequently, reducing total message volume and grouping small messages into bigger ones, thereby speeding up communication substantially.
Abstract: In solving Partial Differential Equations, such as the Barotropic equations in ocean models, on Distributed Memory Computers, finite difference methods are commonly used. Most often, processor subdomain boundaries must be updated at each time step. This boundary update process involves many messages of small sizes and therefore a large communication overhead. Here we propose a new approach which expands the ghost cell layers and thus updates boundaries much less frequently, reducing total message volume and grouping small messages into bigger ones. Together with a technique for eliminating diagonal communications, the method speeds up communication substantially, by up to 170%. We explain the method and implementation in detail, and provide systematic timing results and performance analysis on the Cray T3E and IBM SP.
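
The ghost cell expansion idea can be sketched on a toy problem. The example below is an assumption-laden illustration, not the authors' implementation: it uses a 1D periodic three-point stencil, simulates the subdomains in one process, and counts halo refreshes instead of sending messages. With a halo of width `k`, each subdomain takes `k` local steps between refreshes; the region holding valid values shrinks by one cell per step and ends exactly at the interior.

```python
def step(u, a):
    """One explicit three-point stencil update on a periodic 1D grid."""
    n = len(u)
    return [u[i] + a * (u[i - 1] - 2 * u[i] + u[(i + 1) % n])
            for i in range(n)]


def run_reference(u, a, steps):
    """Undecomposed reference solution."""
    for _ in range(steps):
        u = step(u, a)
    return u


def run_wide_halo(u, a, steps, parts, k):
    """Decomposed run with k ghost cells per side, refreshed every k steps."""
    total = len(u)
    n = total // parts
    if n * parts != total or steps % k != 0:
        raise ValueError("sizes must divide evenly")
    refreshes = 0
    for _ in range(steps // k):
        new_u = []
        for p in range(parts):
            lo = p * n
            # Halo refresh: interior plus k ghost cells on each side.
            local = [u[(lo - k + j) % total] for j in range(n + 2 * k)]
            refreshes += 1
            for s in range(k):
                nxt = local[:]
                # Only cells whose neighbours are still valid are updated.
                for i in range(s + 1, len(local) - s - 1):
                    nxt[i] = local[i] + a * (
                        local[i - 1] - 2 * local[i] + local[i + 1])
                local = nxt
            new_u.extend(local[k:k + n])
        u = new_u
    return u, refreshes
```

With 8 steps on 4 subdomains, `k = 1` needs 32 halo refreshes while `k = 4` needs 8, and both reproduce the reference solution. The price is redundant computation in the ghost region, which this sketch does not try to measure.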

Proceedings Article•10.1145/582034.582038•
A Distributed Memory Unstructured Gauss-Seidel Algorithm for Multigrid Smoothers

Mark F. Adams1•
Sandia National Laboratories1
10 Nov 2001
TL;DR: An algorithm for parallelizing Gauss-Seidel for distributed memory computers for use as a multigrid smoother is discussed and its performance with preconditioned conjugate gradients on unstructured linear elasticity problems with up to 76 million degrees of freedom is compared.
Abstract: Gauss-Seidel is a popular multigrid smoother as it is provably optimal on structured grids and exhibits superior performance on unstructured grids. Gauss-Seidel is not used to our knowledge on distributed memory machines as it is not obvious how to parallelize it effectively. We, among others, have found that Krylov solvers preconditioned with Jacobi, block Jacobi or overlapped Schwarz are effective on unstructured problems. Gauss-Seidel does however have some attractive properties, namely: fast convergence, no global communication (i.e., no dot products) and fewer flops per iteration as one can incorporate an initial guess naturally. This paper discusses an algorithm for parallelizing Gauss-Seidel for distributed memory computers for use as a multigrid smoother and compares its performance with preconditioned conjugate gradients on unstructured linear elasticity problems with up to 76 million degrees of freedom.
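
For orientation, a plain serial Gauss-Seidel sweep is sketched below. It is the textbook method on a dense list-of-lists matrix, not the parallel algorithm of the paper; the function names are hypothetical. The in-place update, where each unknown immediately uses the newest values of the earlier ones, is the sequential dependency that makes parallelization non-obvious.

```python
def gauss_seidel_sweep(A, b, x):
    """One in-place Gauss-Seidel sweep for A x = b (A with nonzero diagonal)."""
    n = len(b)
    for i in range(n):
        sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - sigma) / A[i][i]
    return x


def residual_norm(A, b, x):
    """Maximum absolute entry of the residual b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n)))
               for i in range(n))
```

Note that the sweep needs no dot products or other global reductions, which matches the "no global communication" property mentioned in the abstract.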
Journal Article•10.1155/2001/605217•
Cluster-enabled OpenMP: An OpenMP compiler for the SCASH software distributed shared memory system

Mitsuhisa Sato, Hiroshi Harada, Atsushi Hasegawa, Yutaka Ishikawa
01 Aug 2001-Scientific Programming
TL;DR: A "cluster-enabled" OpenMP compiler is presented for SCASH, a page-based software distributed shared memory system that works on a cluster of PCs; a set of directives is added to specify data mapping and a loop scheduling method that schedules iterations onto the threads associated with the data mapping.
Abstract: OpenMP is attracting widespread interest because of its easy-to-use parallel programming model for shared memory multiprocessors. We have implemented a "cluster-enabled" OpenMP compiler for a page-based software distributed shared memory system, SCASH, which works on a cluster of PCs. It allows OpenMP programs to run transparently in a distributed memory environment. The compiler transforms OpenMP programs into parallel programs using SCASH so that shared global variables are allocated at run time in the shared address space of SCASH. A set of directives is added to specify data mapping and a loop scheduling method that schedules iterations onto the threads associated with the data mapping. Our experimental results show that the data mapping may greatly affect the performance of OpenMP programs in the software distributed shared memory system. The performance of some NAS parallel benchmark programs in OpenMP is improved by using our extended directives.
Proceedings Article•10.1145/582034.582090•
Removing the Overhead from Software-Based Shared Memory

Zoran Radovic1, Erik Hagersten1•
Uppsala University1
10 Nov 2001
TL;DR: The implementation presented in this paper, DSZOOM-WF, is a sequentially consistent, fine-grained distributed software-based shared memory that consistently demonstrates performance comparable to hardware-based distributed shared memory implementations.
Abstract: The implementation presented in this paper — DSZOOM-WF — is a sequentially consistent, fine-grained distributed software-based shared memory. It demonstrates a protocol-handling overhead below a microsecond for all the actions involved in a remote load operation, to be compared to the fastest implementation to date of around ten microseconds. The all-software protocol is implemented assuming some basic low-level primitives in the cluster interconnect and an operating system bypass functionality, similar to the emerging InfiniBand standard. All interrupt- and/or poll-based asynchronous protocol processing is completely removed by running the entire coherence protocol in the requesting processor. This not only removes the asynchronous overhead, but also makes use of a processor that otherwise would stall. The technique is applicable to both page-based and fine-grain software-based shared memory. DSZOOM-WF consistently demonstrates performance comparable to hardware-based distributed shared memory implementations.
Patent•
Dialysis machine with symmetric multi-processing (SMP) control system and method of operation

Jamie Begelman
24 Sep 2001
TL;DR: A method and control system computing platform for a dialysis machine that uses a Symmetric Multi-Processing (SMP) architecture, which tightly couples multiple independent processors by sharing memory between them.
Abstract: A method and control system computing platform for a dialysis machine that uses Symmetric Multi-Processing (SMP) architecture. The SMP architecture tightly couples multiple (e.g., 2) independent processors by sharing memory between the processors. A single shared memory is used by both processors in order to facilitate communication between the processors and reduce cost by eliminating the expense of redundant memory. In this way, the two, or in general “N” processors, increase processor throughput by allowing the execution of N processes in parallel without requiring extra memory and without having a single point of failure in the computer. In the event of a bus failure on the circuit card, the computer is reset using distributed hardware watchdogs. The watchdog reset signal is also sent to the hardware components of the dialysis machine in order to place the system in a safe.
Proceedings Article•10.1109/EMPDP.2001.905069•
Off-line real-time fault-tolerant scheduling

Catalin Dima1, Alain Girault1, C. Lavarenne, Yves Sorel•
French Institute for Research in Computer Science and Automation1
1 Feb 2001
TL;DR: This work addresses the problem of off-line fault-tolerant scheduling of an algorithm onto a multiprocessor architecture with distributed memory and provides a generic algorithm which solves this problem.
Abstract: We address the problem of off-line fault-tolerant scheduling of an algorithm onto a multiprocessor architecture with distributed memory and provide a generic algorithm which solves this problem. We take into account two kinds of failures: fail-silent and omission. The basic technique we use is the replication of operations and data communications. We then discuss the principles which govern the execution of schedulings with replication under the state-machine and the primary/backup arbitrations between replicas. We also show how to compute the execution date for each operation and the timeouts which are used for detecting failures. We end with a heuristic which, using this calculus, computes a possibly non-optimal scheduling by finding plain schedulings for each failure pattern and then combining them into a scheduling with replication.
Journal Article•10.1093/BIB/2.2.181•
On the parallelisation of bioinformatics applications

Oswaldo Trelles1•
University of Málaga1
01 May 2001-Briefings in Bioinformatics
TL;DR: This overview outlines computational issues related to parallelism, physical machine models, parallel programming approaches and scheduling strategies for a broad range of computer architectures; in particular, it deals with shared, distributed and shared/distributed memory architectures.
Abstract: This paper surveys the computational strategies followed to parallelise the most used software in the bioinformatics arena. The studied algorithms are computationally expensive and their computational patterns range from regular, such as database-searching applications, to very irregularly structured patterns (phylogenetic trees). Fine- and coarse-grained parallel strategies are discussed for these very diverse sets of applications. This overview outlines computational issues related to parallelism, physical machine models, parallel programming approaches and scheduling strategies for a broad range of computer architectures. In particular, it deals with shared, distributed and shared/distributed memory architectures.
Proceedings Article•10.1145/378239.378483•
Parallelizing DSP nested loops on reconfigurable architectures using data context switching

Kiran Bondalapati
22 Jun 2001
TL;DR: This paper develops techniques for parallelizing nested loop computations from digital signal processing applications onto high performance pipelined configurations and proposes a novel data context switching technique that exploits the embedded distributed memory available in reconfigurable architectures to parallelize such loops.
Abstract: Reconfigurable architectures promise significant performance and flexibility advantages over conventional architectures. Automatic mapping techniques that exploit the features of the hardware are needed to leverage the power of these architectures. In this paper, we develop techniques for parallelizing nested loop computations from digital signal processing (DSP) applications onto high performance pipelined configurations. We propose a novel data context switching technique that exploits the embedded distributed memory available in reconfigurable architectures to parallelize such loops. Our technique is demonstrated on two diverse state-of-the-art reconfigurable architectures, namely, Virtex and the Chameleon Systems Reconfigurable Communications Processor. Our techniques show significant performance improvements on both architectures and also perform better than state-of-the-art DSP and microprocessor architectures.
Journal Article•10.1016/S0307-904X(01)00019-1•
Parallel Schur complement method for large-scale systems on distributed memory computers

S. Kocak1, Hasan U. Akay1•
Indiana University – Purdue University Indianapolis1
01 Oct 2001-Applied Mathematical Modelling
TL;DR: An efficient algorithm for the assemblage and solution of interface equations which facilitates the solution of large-scale systems via the Schur complement method on multiple instruction multiple data (MIMD) distributed memory architectures is presented.
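As background for this entry, the Schur complement method can be summarised in its standard textbook form. The two-subdomain setting and the symbols below ($A_{ii}$ for subdomain interiors, $\Gamma$ for the interface unknowns) are assumptions made for illustration and are not necessarily the notation used in the paper.

```latex
% Block system for two subdomains (interiors 1, 2) coupled through interface \Gamma
\begin{bmatrix}
A_{11} & 0 & A_{1\Gamma} \\
0 & A_{22} & A_{2\Gamma} \\
A_{\Gamma 1} & A_{\Gamma 2} & A_{\Gamma\Gamma}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_\Gamma \end{bmatrix}
=
\begin{bmatrix} f_1 \\ f_2 \\ f_\Gamma \end{bmatrix}

% Eliminating the interiors gives the interface (Schur complement) system
S = A_{\Gamma\Gamma} - \sum_{i=1}^{2} A_{\Gamma i} A_{ii}^{-1} A_{i\Gamma},
\qquad
g = f_\Gamma - \sum_{i=1}^{2} A_{\Gamma i} A_{ii}^{-1} f_i,
\qquad
S\, x_\Gamma = g

% Back-substitution, independent for each subdomain
x_i = A_{ii}^{-1} \left( f_i - A_{i\Gamma}\, x_\Gamma \right), \quad i = 1, 2
```

Each term in the sums involves only one subdomain's data, so the contributions can be formed in parallel, one subdomain per processor. Assembling and solving the interface system $S x_\Gamma = g$ is the coupled step that the TL;DR refers to.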
Proceedings Article•10.1137/1.9781611972719.17•
A Middleware for Developing Parallel Data Mining Applications.

Ruoming Jin1, Gagan Agrawal•
University of Delaware1
1 Jan 2001
TL;DR: This paper presents the design and initial performance evaluation of a middleware for enabling rapid development of parallel data mining applications that can help exploit parallelism on both shared memory and distributed memory configurations, while allowing efficient processing of disk resident data.
Abstract: Data mining is an interdisciplinary field, having applications in diverse areas like bioinformatics, medical informatics, scientific data analysis, financial analysis, consumer profiling, etc. In each of these application domains, the amount of data available for analysis has exploded in recent years, making the scalability of data mining implementations a critical factor. To this end, parallel versions of most of the well-known data mining techniques have been developed in recent years. However, the expertise and effort currently required in implementing, maintaining, and performance tuning a parallel data mining application is a severe impediment to the wide use of parallel computers for scalable data mining. In this paper, we present the design and initial performance evaluation of a middleware for enabling rapid development of parallel data mining applications. This middleware can help exploit parallelism on both shared memory and distributed memory configurations, while allowing efficient processing of disk resident data. Our middleware is based on the observation that parallel versions of several well-known data mining techniques share a relatively similar structure. We have carefully studied parallel versions of apriori association mining [4], Bayesian network for classification [14], k-means clustering [25], k-nearest neighbor classifier [24], and artificial neural networks [24]. In each of these methods, parallelization can be done by dividing the data instances (or records or transactions) among the nodes. The
...
