Showing papers presented at "Parallel Computing in 1999"
Journal Article•10.1016/S0167-8191(99)00076-9•
Visualization in biomedical computing

Richard A. Robb1•
University of Rochester1
1 Dec 1999
TL;DR: High-performance computers and advanced image processing have enabled major progress toward real-time, multisensory fusion of real and virtual data during clinical procedures, and several important applications are likely to be delivered soon that will significantly affect the practice of medicine and biological research.
Abstract: Visualizable objects in biology and medicine extend across a vast range of scale, from individual molecules and cells, to the varieties of tissue and interstitial interfaces, to complete organs, organ systems and body parts, and include functional attributes of these systems, such as biophysical, biomechanical and physiological properties. Medical applications include accurate anatomy and function mapping, enhanced diagnosis, accurate treatment planning and rehearsal, and education/training. However, the greatest potential for revolutionary innovation in the practice of medicine lies in direct, fully immersive, real-time multisensory fusion of real and virtual information data streams into online, real-time visualizations available during an actual clinical procedure. Current high-performance computers and advanced image processing capabilities have facilitated major progress toward realization of this goal. With these advances in hand, there are several important applications possible to be delivered soon that will have a significant impact on the practice of medicine and on biological research.

225 citations

Journal Article•10.1016/S0167-8191(99)00077-0•
Developments and trends in the parallel solution of linear systems

Iain S. Duff1, Henk A. van der Vorst2•
Rutherford Appleton Laboratory1, Utrecht University2
1 Dec 1999
TL;DR: This review paper considers some important developments and trends in algorithm design for the solution of linear systems concentrating on aspects that involve the exploitation of parallelism and considers preconditioning techniques for iterative solvers.
Abstract: In this review paper, we consider some important developments and trends in algorithm design for the solution of linear systems concentrating on aspects that involve the exploitation of parallelism. We briefly discuss the solution of dense linear systems, before studying the solution of sparse equations by direct and iterative methods. We consider preconditioning techniques for iterative solvers and discuss some of the present research issues in this field.

121 citations

Journal Article•10.1016/S0167-8191(99)00020-4•
Parallel multigrid in an adaptive PDE solver based on hashing and space-filling curves

Michael Griebel1, Gerhard Zumbusch1•
University of Bonn1
1 Jul 1999
TL;DR: Hash-table storage techniques are used to set up a parallel adaptive multigrid solver that requires substantially less memory than implementations based on tree-type data structures and is easier to program in the sequential case.
Abstract: Partial differential equations can be solved efficiently by adaptive multigrid methods on a parallel computer. We report on the concept of hash-table storage techniques to set up such a program. The code requires substantially less memory than implementations based on tree-type data structures and is easier to program in the sequential case. The parallelization takes place by a space-filling curve domain decomposition intimately connected to the hash table. The new data structure simplifies the parallelization of the code substantially and introduces a cheap way to solve the load balancing and mapping problem. We report on the main features of the method and give the results of numerical experiments with the new parallel solver on a cluster of 64 Pentium II/400 MHz machines connected by Myrinet in a fat-tree topology.

104 citations
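
The idea in the abstract above (nodes stored in a hash table, with a space-filling curve supplying both the key and the domain decomposition) can be sketched as follows. This is a minimal illustration only, not the authors' code: it assumes a 2D uniform grid and uses a Morton (Z-order) curve as a stand-in for whichever curve the paper uses, and the names `morton_key` and `partition_by_curve` are hypothetical.

```python
def morton_key(i, j, bits=16):
    """Interleave the bits of (i, j) to get a position on a Z-order curve."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b)
        key |= ((j >> b) & 1) << (2 * b + 1)
    return key


def partition_by_curve(nodes, num_procs):
    """Sort node keys along the curve and cut them into near-equal contiguous chunks."""
    keys = sorted(nodes)
    n = len(keys)
    bounds = [(r * n) // num_procs for r in range(num_procs + 1)]
    return [keys[bounds[r]:bounds[r + 1]] for r in range(num_procs)]


# The hash table is a plain dict keyed by curve position; no tree pointers are stored.
grid = {morton_key(i, j): {"coord": (i, j), "value": 0.0}
        for i in range(4) for j in range(4)}

# Each processor receives one contiguous segment of the curve.
parts = partition_by_curve(grid, 4)
```

Because neighbouring keys on the curve tend to be neighbouring points in space, cutting the sorted key list into equal pieces gives each processor a compact subdomain of equal size, which is the sense in which load balancing becomes cheap.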

Journal Article•10.1016/S0167-8191(99)00002-2•
An improved diffusion algorithm for dynamic load balancing

Yifan Hu1, R. J. Blake1•
Daresbury Laboratory1
1 Apr 1999
TL;DR: The performance of the diffusion type algorithms is improved, while retaining the nearest neighbour communication requirement, through the use of Chebyshev polynomials and it is proved that both the diffusion algorithm and the improved diffusion algorithm have an optimal property in terms of the amount of load migrated.
Abstract: Diffusion type algorithms [1,3,11] are some of the most popular algorithms for scheduling in dynamic load balancing. It is known however that this type of algorithm can suffer from slow convergence. In this paper the performance of the diffusion type algorithms is improved, while retaining the nearest neighbour communication requirement, through the use of Chebyshev polynomials. It is also proved that both the diffusion algorithm and the improved diffusion algorithm have an optimal property in terms of the amount of load migrated. Numerical results are given comparing the algorithm with the diffusion algorithm as well as a fast algorithm that requires global communication.

87 citations
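
For readers unfamiliar with the baseline, the plain first-order diffusion scheme that the abstract above sets out to improve can be sketched as below. This is an assumed textbook form, not the paper's algorithm: the Chebyshev acceleration is not implemented, and the function names, the ring topology and the value `alpha=0.25` are illustrative choices.

```python
def diffusion_step(load, neighbours, alpha):
    """One sweep: every node trades alpha * (load difference) with each neighbour."""
    new = list(load)
    for i, nbrs in enumerate(neighbours):
        for j in nbrs:
            new[i] -= alpha * (load[i] - load[j])
    return new


def balance(load, neighbours, alpha=0.25, tol=1e-6, max_iters=10_000):
    """Iterate until every node is within tol of the average load."""
    target = sum(load) / len(load)
    for it in range(max_iters):
        if max(abs(x - target) for x in load) < tol:
            return load, it
        load = diffusion_step(load, neighbours, alpha)
    return load, max_iters


# Four processors in a ring; all of the work starts on processor 0.
ring = [[1, 3], [0, 2], [1, 3], [0, 2]]
final, iters = balance([8.0, 0.0, 0.0, 0.0], ring)
```

Each step uses only nearest-neighbour exchanges and conserves the total load, but the number of sweeps grows with the size and shape of the processor graph, which is the slow convergence the abstract refers to.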

Journal Article•10.1016/S0167-8191(99)00004-6•
Scheduling divisible loads in a three-dimensional mesh of processors

Maciej Drozdowski1, Wlodzimierz Glazek2•
Poznań University of Technology1, University of Gdańsk2
1 Apr 1999
TL;DR: A family of load distribution algorithms is described and closed-form formulae for the optimal load shares allocated to processors in each algorithm are obtained; in large meshes the algorithms attain a speedup limit of 1+p/ρ.
Abstract: We study distributed processing of a divisible load in a three-dimensional mesh of communicating processors. The objective is to find a distribution of the load among processors which guarantees minimal processing time. We describe a family of load distribution algorithms and obtain closed-form formulae for optimal load shares allocated to processors in each algorithm. Our model takes into consideration communication delays involved in moving load shares from one processor to another. In large meshes our algorithms attain a speedup limit of 1+p/ρ, where p is the number of communication ports used simultaneously by each processor in data transfer and ρ is the ratio of processing to communication transfer rate. We also show a matching upper bound on the speedup in this topology.

64 citations

Journal Article•10.1016/S0167-8191(99)00080-0•
Methods for parallel computation of complex flow problems

Tayfun Tezduyar1, Yasuo Osawa1•
Rice University1
1 Dec 1999
TL;DR: An overview of some of the methods developed by the Team for Advanced Flow Simulation and Modeling (TAFSM) to support flow simulation and modeling in a number of “Targeted Challenges” is presented.
Abstract: This paper is an overview of some of the methods developed by the Team for Advanced Flow Simulation and Modeling (T★AFSM) [http://www.mems.rice.edu/TAFSM/] to support flow simulation and modeling in a number of “Targeted Challenges”. The “Targeted Challenges” include unsteady flows with interfaces, fluid–object and fluid–structure interactions, airdrop systems, and air circulation and contaminant dispersion. The methods developed include special numerical stabilization methods for compressible and incompressible flows, methods for moving boundaries and interfaces, advanced mesh management methods, and multi-domain computational methods. We include in this paper a number of numerical examples from the simulation of complex flow problems.

56 citations

Book•10.1007/10704208•
SCI: Scalable Coherent Interface

Hermann Hellwagner
1 Jan 1999
TL;DR: The SCI Physical Layer API requires functionality which is not supported by the hardware to be emulated in software, and implements standard methods required to support the implementation of such functionality.
Abstract (excerpt from Chapter 10, "SCI Physical Layer API", V. Lindenstruth, D. Gustavson): [...] abstraction layer, but that it is also impossible to serve all fields of use at once. For example, MPI is a message passing standard that could still benefit from the low message passing overhead and latency that SCI provides. In the case of MPI, some interface or middle layer software is required to match the IEEE API to the MPI API. However, such middle layer or meta driver has to be implemented only once in order to gain complete hardware independence for any SCI based MPI system if the IEEE API is used as underlying software standard. Given this standard being submitted to IEEE and its scope as a hardware abstraction layer, it did not seem appropriate to try to define, for example, a standard address resolution protocol or an SCI network topology implementing failover functionality within the scope of this standard. It is left to the higher level software to implement those functions. However, the SCI Physical Layer API implements standard methods required to support the implementation of such functionality.

10.2 SCI Physical Layer API Architecture and Features

This chapter gives an overview of the SCI Physical Layer API. For a detailed reference, please refer to the draft standard document itself. Figure 10.1 below shows its overview.

Fig. 10.1. A sketch of the SCI Physical Layer API Architecture. The various data paths (1) through (5) are detailed in the text.

A user level process interfaces to the API in order to perform all procedural functions. Depending on the operating system, the API may consist of user level routines and kernel level routines. For example, in the case of initialization and mapping calls, the user level API routines would call the protected kernel level API routines (3) which would perform any authentication and security checking in order to ensure that no unauthorized physical access is granted. Responses to API procedure calls and any exceptions are returned to the user level API routines (4).

Initialization such as cold start and configuration of the SCI interfaces’ address maps is typically also executed in the API kernel routines (5). This is necessary in order to ensure that no user program can directly access the security relevant address translation tables in the interface. The SCI Physical Layer API requires functionality which is not supported by the hardware to be emulated in software. For example, should the interface not support DMA functionality, the API is required to implement that as programmed I/O. This functionality is expected to reside in the user level part of the API (2). Once a shared memory is set up, the user level process can perform shared memory transactions without involvement of the API as indicated by path (1). This feature allows the smallest overhead and latency.

Two general classes of transactions are supported as indicated in Figure 10.2 below.

Fig. 10.2. A sketch of a synchronous transaction (left) with a potential exception, and a sketch of an asynchronous transaction (right)

Synchronous transactions such as read transactions (Figure 10.2, left) return after valid read data is available. This results in potential stalling of the host processor. If an error occurs, an exception is fired (refer to Section 10.2.1). The second class of transactions are asynchronous transactions, which return as soon as possible without awaiting the completion of the actual SCI transaction. This is illustrated in the right half of Figure 10.2. Upon completion of the asynchronous transaction, a specified call-back procedure is executed. The body of this procedure can be used to implement whatever synchronization method is desired. Examples of asynchronous transactions are posted writes or DMA transactions. In the case of posted writes the call-back procedure would typically act only if an error occurs. In the case of a DMA transaction, the call-back would notify the host about the DMA completion status.

Before an SCI transaction can be executed, a certain amount of setup and control is required. For example, address translation tables may need to be configured appropriately in order to map an SCI address region into the process address space. This is implemented based on windows. A window is a contiguous address region with a defined set of default transaction attributes and configurations. Those default transaction attributes define whether write transactions may be posted, a window is write protected, or write transactions to this given window are executed as broadcast, etc. The appropriate address translation setup of both the operating system and the SCI interface is part of the configuration.

10.2.1 Exception Handling

The SCI Physical Layer API interfaces to external hardware and consequently requires a method for asynchronously handling exceptions, such as link failures or asynchronous conditions such as failing posted writes, DMA completion, etc. Since this standard is required to be operating-system independent, a very simple method is supplied for implementing asynchronous attention handlers. It is based on the definition of a context structure, allowing the complete description of the state of the SCI hardware and software interface. For an asynchronous attention condition, a call-back procedure is executed which allows user supplied implementation of the exception handler. The appropriate procedure is supplied as a procedure pointer by the higher software layers and acts much like an interrupt handler.

Some transaction specific exceptions cannot be traced to the calling process(or). For example, a posted write transaction exception may occur long after the write request is completed. Other posted writes may have been executed in the meantime. In this case it is not possible to determine the appropriate transaction specific call-back procedure. Therefore, a global exception handler is implemented which will be executed in such a case. However, the global attention handler can only identify the type of condition, based on the context structure. Therefore, in order to debug and trace these conditions, posted transparent writes must be disabled.

Asynchronous transactions such as the chained-mode DMA use call-back procedures to provide a tool to synchronize with the host program. This is done by using whatever synchronization method is supported best by the given operating system (signals, events, semaphores etc.). In order to allow implementation of checkpoints, a synchronization transaction is provided, which stalls the calling process until all pending transactions have completed and all related potentially pending transaction handlers were executed, or a specified timeout expired.

52 citations
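
The transaction model described in the excerpt above (synchronous reads, asynchronous posted writes that complete through a call-back, and a synchronization call with a timeout) can be mimicked in a few lines. The sketch below is purely illustrative and is not the IEEE SCI Physical Layer API: `ToyInterface`, `post_write`, `synchronize` and the "negative address fails" rule are all hypothetical, and Python threads stand in for the hardware.

```python
import threading


class ToyInterface:
    """Hypothetical stand-in for an SCI-like interface; not the real API."""

    def __init__(self):
        self._memory = {}
        self._pending = 0
        self._cond = threading.Condition()

    def read(self, addr):
        """Synchronous transaction: returns only once the data is available."""
        with self._cond:
            return self._memory.get(addr)

    def post_write(self, addr, value, on_complete):
        """Asynchronous transaction: returns at once; on_complete(addr, error) runs later."""
        with self._cond:
            self._pending += 1
        threading.Thread(target=self._do_write,
                         args=(addr, value, on_complete)).start()

    def _do_write(self, addr, value, on_complete):
        # Assumption for the demo: negative addresses are treated as failures.
        error = None if addr >= 0 else "bad address"
        with self._cond:
            if error is None:
                self._memory[addr] = value
        on_complete(addr, error)  # the handler runs before the transaction counts as done
        with self._cond:
            self._pending -= 1
            self._cond.notify_all()

    def synchronize(self, timeout):
        """Stall until no transactions are pending, or until timeout seconds pass."""
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout)


errors = []


def handler(addr, error):
    # As in the text, a posted-write call-back typically acts only on errors.
    if error is not None:
        errors.append((addr, error))


sci = ToyInterface()
sci.post_write(0x10, 42, handler)
sci.post_write(-1, 7, handler)
done = sci.synchronize(timeout=5.0)
```

The `synchronize` call plays the role of the checkpoint transaction in the text: it returns `True` once every pending write and its handler have finished, or `False` if the timeout expires first.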

Journal Article•10.1016/S0167-8191(98)00114-8•
Efficient parallel algorithms for molecular dynamics simulations

Ravi Murty1, Daniel I. Okunbor1•
Missouri University of Science and Technology1
1 Mar 1999
TL;DR: Two algorithms based on the force decomposition approach are proposed: Force-Row Interleaving (FRI), which treats rows one at a time, and Force-Stripped Row (FSR), which computes a priori the block of rows that balances the workload sent to each processor.
Abstract: The study of many-particle systems has increased significantly over the past decade, because of the increasing number of useful applications it supports. Numerical experiences have shown that the force calculation contributes 90% of the total simulation time. This is an O(N^2) algorithm, mainly due to pairwise interactions, where N is the number of particles in the system. The interaction decomposition technique proposed by Taylor et al. uses a special mapping scheme and optimal communication to reduce the overall computation time. In this paper, we propose two algorithms based on the force decomposition approach. The first technique, which we call the Force-Row Interleaving (FRI) method, treats rows one at a time, and the other approach, called Force-Stripped Row (FSR), computes a priori the block of rows that balances the workload to be sent to a processor. These two algorithms were tested on a system of 32,000 atoms of liquid argon and implemented on a distributed memory, 16-processor iPSC/860. The FRI and FSR were both comparable to existing parallel techniques with efficiencies of 98.63% and 98.88%, respectively.

46 citations
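
One way to picture the two row-assignment strategies named in the abstract above is sketched here. This is a reading of the abstract only, not the authors' implementation: it assumes the pairwise force matrix is stored as an upper triangle, so that row i holds N-1-i interactions, and the function names are hypothetical.

```python
from itertools import accumulate


def row_work(n):
    """Pair interactions in row i of an upper-triangular force matrix."""
    return [n - 1 - i for i in range(n)]


def interleaved_rows(n, procs):
    """FRI-style assignment: deal rows out one at a time, round-robin."""
    return [list(range(p, n, procs)) for p in range(procs)]


def striped_rows(n, procs):
    """FSR-style assignment: contiguous blocks whose cut points are chosen in advance."""
    prefix = [0] + list(accumulate(row_work(n)))
    cuts = [0]
    for p in range(1, procs):
        target = prefix[-1] * p / procs
        # Pick the cut whose cumulative work is closest to the ideal share.
        k = min(range(cuts[-1], n + 1), key=lambda c: abs(prefix[c] - target))
        cuts.append(k)
    cuts.append(n)
    return [list(range(cuts[p], cuts[p + 1])) for p in range(procs)]


def load(rows, n):
    """Total pair interactions handled by a processor that owns these rows."""
    return sum(n - 1 - i for i in rows)


n, procs = 16, 4
fri = interleaved_rows(n, procs)
fsr = striped_rows(n, procs)
fri_loads = [load(r, n) for r in fri]
fsr_loads = [load(r, n) for r in fsr]
```

Because early rows carry more interactions than late ones, equal-sized contiguous blocks would be badly unbalanced; interleaving spreads long and short rows across processors, while the striped variant keeps blocks contiguous and instead makes the early blocks shorter.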

Journal Article•10.1016/S0167-8191(99)00086-1•
Compilation techniques for parallel systems

Rajiv Gupta1, Santosh Pande2, Kleanthis Psarris3, Vivek Sarkar4•
University of Arizona1, University of Cincinnati2, University of Texas at San Antonio3, IBM4
1 Dec 1999
TL;DR: This paper provides an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution and discusses the relationship between the nature of compiler support and type of processor architecture.
Abstract: Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in the form of a program representation. Next, compilation techniques for scheduling instruction level parallelism (ILP) are discussed along with the relationship between the nature of compiler support and type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally we provide an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed.

46 citations

Journal Article•10.1016/S0167-8191(99)00062-9•
Analysis of fully adaptive wormhole routing in tori

S. Loucif1, Mohamed Ould-Khaoua2, Lewis M. Mackenzie1•
University of Glasgow1, University of Strathclyde2
1 Nov 1999
TL;DR: A queuing model of adaptive routing in the torus is proposed and the validity of the model is demonstrated by comparing analytical results with those obtained through simulations.
Abstract: A model of adaptive routing in the hypercube has recently been proposed (Y. Boura, C.R. Das, T.M. Jacob, A performance model for adaptive routing in hypercubes, in: Proceedings of the International Workshop on Parallel Processing, 1994, pp. 11–16). Modelling adaptive routing in a high-radix k-ary n-cube, e.g., the torus, is more complicated than in the hypercube since a message in the former may cross more than one channel along a given dimension. This paper proposes a queuing model of adaptive routing in the torus. The validity of the model is demonstrated by comparing analytical results with those obtained through simulations.

44 citations

Book Chapter•10.1007/10704826_10•
The Aleph Toolkit: Support for Scalable Distributed Shared Objects

Maurice Herlihy1•
Brown University1
9 Jan 1999
TL;DR: It is not at all clear that the distributed shared object model can be adapted to the needs of modern large-scale distributed applications.
Abstract: The shared object model is an appealing programming abstraction for distributed computing. By hiding the details of the network and data distribution, it allows the programmer to focus on higher-level concerns, and makes the program structure robust in the presence of changes in distribution patterns or environment. Nevertheless, it is not at all clear that the distributed shared object model can be adapted to the needs of modern large-scale distributed applications.
Book Chapter•10.1007/10704826_7•
High Performance Sockets and RPC over Virtual Interface (VI) Architecture

Hemal V. Shah1, Calton Pu2, Rajesh S. Madukkarumukumana1,2•
Intel1, Oregon Health & Science University2
9 Jan 1999
TL;DR: This paper describes how high-level communication paradigms like stream sockets and remote procedure call (RPC) can be efficiently built over user-level networking architectures and transparently improves the network performance of Distributed Component Object Model (DCOM).
Abstract: A standard user-level networking architecture such as the Virtual Interface (VI) Architecture enables distributed applications to perform low overhead communication over System Area Networks (SANs). This paper describes how high-level communication paradigms like stream sockets and remote procedure call (RPC) can be efficiently built over user-level networking architectures. To evaluate performance benefits for standard client-server and multi-threaded environments, our focus is on off-the-shelf sockets and RPC interfaces and commercially available VI Architecture based SANs. The key design techniques developed in this research include credit-based flow control, decentralized user-level protocol processing, caching of pinned communication buffers, and deferred processing of completed send operations. The one-way bandwidth achieved by stream sockets over VI Architecture was 3 to 4 times higher than that achieved by running legacy protocols over the same interconnect. On the same SAN, high-performance stream sockets and RPC over VI Architecture achieve significantly lower latency (by a factor of 2 to 3) than conventional stream sockets and RPC over standard network protocols in a Windows NT 4.0 environment. Furthermore, our high-performance RPC transparently improved the network performance of Distributed Component Object Model (DCOM) by a factor of 2 to 3.
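
Of the design techniques listed in the abstract above, credit-based flow control is the easiest to show in miniature. The sketch below is a generic, single-threaded illustration and does not use the VI Architecture API or reproduce the paper's protocol; the `Sender` and `Receiver` classes and the one-credit-per-buffer rule are assumptions made for the example.

```python
from collections import deque


class Receiver:
    def __init__(self, num_buffers):
        self.free_buffers = num_buffers  # pre-posted receive buffers
        self.inbox = deque()

    def deliver(self, msg):
        assert self.free_buffers > 0, "sender overran the receive buffers"
        self.free_buffers -= 1
        self.inbox.append(msg)

    def consume(self):
        """The application reads one message; the freed buffer becomes one credit."""
        msg = self.inbox.popleft()
        self.free_buffers += 1
        return msg, 1


class Sender:
    def __init__(self, receiver, initial_credits):
        self.receiver = receiver
        self.credits = initial_credits  # one credit per free receive buffer
        self.backlog = deque()

    def send(self, msg):
        self.backlog.append(msg)
        self._drain()

    def add_credits(self, n):
        self.credits += n
        self._drain()

    def _drain(self):
        # Transmit only while credits remain, so the receiver is never overrun.
        while self.credits > 0 and self.backlog:
            self.credits -= 1
            self.receiver.deliver(self.backlog.popleft())


rx = Receiver(num_buffers=2)
tx = Sender(rx, initial_credits=2)
for i in range(5):
    tx.send(i)

received = []
while rx.inbox:
    msg, credit = rx.consume()
    received.append(msg)
    tx.add_credits(credit)
```

The sender queues messages locally once its credits run out and resumes as credits return, so ordering is preserved and no message arrives without a buffer waiting for it.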
Journal Article•10.1016/S0167-8191(99)00067-8•
The marketplace of high-performance computing

Erich Strohmaier1, Jack Dongarra1,2, Hans W. Meuer3, Horst D. Simon4•
University of Tennessee1, Oak Ridge National Laboratory2, University of Mannheim3, Lawrence Berkeley National Laboratory4
1 Dec 1999
TL;DR: The major trends and changes in the High-Performance Computing (HPC) marketplace since the beginning of the journal ‘Parallel Computing’ are analyzed.
Abstract: In this paper we analyze the major trends and changes in the High-Performance Computing (HPC) marketplace since the beginning of the journal ‘Parallel Computing’. The initial success of vector computers in the 1970s was driven by raw performance. The introduction of this type of computer system started the area of ‘Supercomputing’. In the 1980s the availability of standard development environments and of application software packages became more important. Next to performance, these factors determined the success of MP vector systems, especially with industrial customers. MPPs became successful in the early 1990s due to their better price/performance ratios, which were made possible by the attack of the ‘killer-micros’. In the lower and medium market segments the MPPs were replaced by microprocessor-based symmetric multiprocessor (SMP) systems in the middle of the 1990s. Their success formed the basis for the use of new cluster concepts for very high-end systems. In the last few years only the companies which have entered the emerging markets for massively parallel database servers and financial applications have attracted enough business volume to be able to support the hardware development for the numerical high-end computing market as well. Success in the traditional floating-point-intensive engineering applications seems to be no longer sufficient for survival in the market.
Journal Article•10.1016/S0167-8191(99)00021-6•
Efficient eigenvalue and singular value computations on shared memory machines

Bruno Lang1•
RWTH Aachen University1
1 Jul 1999
TL;DR: Two techniques for speeding up eigenvalue and singular value computations on shared memory parallel computers are described and a very simple performance model that allows selecting these parameters automatically is presented.
Abstract: We describe two techniques for speeding up eigenvalue and singular value computations on shared memory parallel computers. Depending on the information that is required, different steps in the overall process can be made more efficient. If only the eigenvalues or singular values are sought then the reduction to condensed form may be done in two or more steps to make best use of optimized level-3 BLAS. If eigenvectors and/or singular vectors are required, too, then their accumulation can be speeded up by another blocking technique. The efficiency of the blocked algorithms depends heavily on the values of certain control parameters. We also present a very simple performance model that allows selecting these parameters automatically.
Journal Article•10.1016/S0167-8191(99)00041-1•
Efficient parallel reduction to bidiagonal form

Benedikt Großer, Bruno Lang1•
RWTH Aachen University1
1 Aug 1999
TL;DR: Numerical experiments on the Intel Paragon and IBM SP/1 distributed memory parallel computers demonstrate that the two-stage reduction approach can be significantly superior if only the singular values are required.
Abstract: Most methods for calculating the SVD (singular value decomposition) first require the matrix to be bidiagonalized. The blocked reduction of a general, dense matrix to bidiagonal form, as implemented in ScaLAPACK, does about one half of the operations with BLAS3. By subdividing the reduction into two stages, dense → banded and banded → bidiagonal, with cubic and quadratic arithmetic costs, respectively, we are able to carry out a much higher portion of the calculations in matrix–matrix multiplications. Thus, higher performance can be expected. This paper presents and compares three parallel techniques for reducing a full matrix to banded form. (The second reduction stage is described in another paper [B. Lang, Parallel Comput. 22 (1996) 1–18].) Numerical experiments on the Intel Paragon and IBM SP/1 distributed memory parallel computers demonstrate that the two-stage reduction approach can be significantly superior if only the singular values are required.
Journal Article•10.1016/S0167-8191(99)00044-7•
Modeling performance of heterogeneous parallel computing systems

Andrea Clematis, Angelo Corana
1 Sep 1999
TL;DR: A simple but quite rigorous analysis of the performance of heterogeneous parallel computing systems, in which each node generally has a different computing power, is presented, and a class of problems is examined for which an efficiency worsening factor related to the degree of heterogeneity can be defined.
Abstract: We analyze and model the performance of heterogeneous parallel computing systems, where in general each node has a different computing power. The main features of our approach are: a simple but quite rigorous analysis; an 'energetic' perspective on performance analysis, using concepts like the useful work carried out by each node, the work lost due to the various sources of overhead, and the local and global efficiencies, both for dedicated and non-dedicated environments. Although we carry out the analysis having workstation networks in mind, in the first part of the paper we try to maintain maximum generality, without introducing any constraint on the kind of interconnection between nodes and communication speed. This general framework can be applied to different specific situations, provided supplementary assumptions are feasible and values of system and application dependent parameters are available. In the second part the focus of analysis narrows to consider systems with the same communication speed between each pair of nodes, as occurs for example with workstations connected by switched networks. We examine in this case a class of problems for which it is possible to define an efficiency worsening factor related to the degree of heterogeneity.
Journal Article•10.1016/S0167-8191(99)00088-5•
Heterogeneous parallel and distributed computing

Vaidy S. Sunderam1, G.A. Geist2•
Emory University1, Oak Ridge National Laboratory2
1 Dec 1999
TL;DR: The evolution of heterogeneous concurrent computing, in the context of the parallel virtual machine (PVM) system, is discussed, which highlights the system level infrastructures that are required, aspects of parallel algorithm development that most affect performance, system capabilities and limitations, and tools and methodologies for effective computing in heterogeneous networked environments.
Abstract: Heterogeneous network-based distributed and parallel computing is gaining increasing acceptance as an alternative or complementary paradigm to multiprocessor-based parallel processing as well as to conventional supercomputing. While algorithmic and programming aspects of heterogeneous concurrent computing are similar to their parallel processing counterparts, system issues, partitioning and scheduling, and performance aspects are significantly different. In this paper, we discuss the evolution of heterogeneous concurrent computing, in the context of the parallel virtual machine (PVM) system, a widely adopted software system for network computing. In particular, we highlight the system level infrastructures that are required, aspects of parallel algorithm development that most affect performance, system capabilities and limitations, and tools and methodologies for effective computing in heterogeneous networked environments. We also present recent developments and experiences in the PVM project, and comment on ongoing and future work.
Book Chapter•10.1007/10704826_2•
Adaptation Models for Network-Aware Distributed Computations

Peter Steenkiste1•
Carnegie Mellon University1
9 Jan 1999
TL;DR: This paper looks at a number of network-aware applications and identifies three adaptation strategies that have proven to be effective; it describes the three adaptation models, compares their features and applicability, and briefly discusses how these models impact the design of middleware that supports network-aware applications.
Abstract: Network-aware applications actively adapt to the level of service they receive from the network. This allows the application to execute well over a diverse set of networks and under a wide range of network conditions. However, network diversity and dynamic network conditions make the development of network-aware applications a difficult task, since the developer has to be an expert in both the application domain and networking. In this paper we look at a number of network-aware applications and identify three adaptation strategies that have proven to be effective. These strategies can be viewed as adaptation models that capture the essential structure of the adaptation process. Similar to the use of programming models in parallel and distributed computing, adaptation models can be used to guide the development of other network-aware applications and they can also form the basis for programming support, e.g. middleware, that supports the development of network-aware applications. In this paper we describe the three adaptation models, compare their features and applicability, and briefly discuss how these models impact the design of middleware that supports network-aware applications.
Proceedings Article•
The Cactus Computational Collaboratory: Enabling Technologies for Relativistic Astrophysics, and a Toolkit for Solving PDEs by Communities in Science and Engineering

Gabrielle Allen1, Tom Goodale, Edward Seidel•
Max Planck Society1
1 Jan 1999
TL;DR: The Cactus project is a system for collaborative research and development for a distributed group of researchers at different institutions around the world, which allows many users, with very different areas of expertise, to work coherently together on distributed computers.
Abstract: We are developing a system for collaborative research and development for a distributed group of researchers at different institutions around the world. In a new paradigm for collaborative computational science, the computer code and supporting infrastructure itself becomes the collaborating instrument, just as an accelerator becomes the collaborating tool for large numbers of distributed researchers in particle physics. The design of this "Collaboratory" allows many users, with very different areas of expertise, to work coherently together on distributed computers around the world. Different supercomputers may be used separately, or for problems exceeding the capacity of any single system, multiple supercomputers may be networked together through high speed gigabit networks. Central to this Collaboratory is a new type of community simulation code, called "Cactus". The scientific driving force behind this project is the simulation of Einstein's equations for studying black holes, gravitational waves, and neutron stars, which has brought together researchers in very different fields from many groups around the world to make advances in the study of relativity and astrophysics. But the system is also being developed to provide scientists and engineers, without expert knowledge of parallel or distributed computing, mesh refinement, and so on, with a simple framework for solving any system of partial differential equations on many parallel computer systems, from traditional supercomputers to networks of workstations.
Journal Article•10.1016/S0167-8191(99)00040-X•
Analysis of nearest neighbor load balancing algorithms for random loads

Peter Sanders1•
Max Planck Society1
1 Aug 1999
TL;DR: It is shown that nearest neighbor load balancing algorithms are also asymptotically very efficient when a random rather than a worst case initial load distribution is considered, and some but not all of the algorithms known to perform better than diffusion in the worst case also perform better for random loads.
Abstract: Nearest neighbor load balancing algorithms, like diffusion, are popular due to their simplicity, flexibility, and robustness. We show that they are also asymptotically very efficient when a random rather than a worst case initial load distribution is considered. We show that diffusion needs $\Theta((\log n)^{2/d})$ balancing time on a $d$-dimensional mesh network with $n^d$ processors. Furthermore, some but not all of the algorithms known to perform better than diffusion in the worst case also perform better for random loads. We also present new results on worst case performance regarding the maximum load deviation.
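To make the diffusion scheme named in this abstract concrete, here is a minimal sketch of synchronous first-order diffusion with a random initial load. It is not taken from the paper: the ring topology (instead of a $d$-dimensional mesh), the exchange fraction `alpha`, the node count, and all function names are assumptions chosen for illustration.

```python
import random


def diffusion_step(load, alpha=0.25):
    """One synchronous diffusion step on a ring (assumed topology).

    Across each edge, a fraction alpha of the load difference flows from
    the heavier node to the lighter one, so the total load is conserved.
    """
    n = len(load)
    return [
        load[i]
        + alpha * (load[(i - 1) % n] - load[i])
        + alpha * (load[(i + 1) % n] - load[i])
        for i in range(n)
    ]


def max_deviation(load):
    """Largest absolute difference between any node's load and the average."""
    avg = sum(load) / len(load)
    return max(abs(x - avg) for x in load)


def balance(load, steps, alpha=0.25):
    """Apply a fixed number of diffusion steps and return the new loads."""
    for _ in range(steps):
        load = diffusion_step(load, alpha)
    return load


# Random initial load distribution on 64 nodes (hypothetical example data).
rng = random.Random(0)
initial = [rng.random() for _ in range(64)]
final = balance(initial, steps=50)
```

Comparing `max_deviation(initial)` with `max_deviation(final)` shows how far the imbalance has been reduced after a given number of steps; the asymptotic bound quoted in the abstract concerns how many such steps are needed.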
Journal Article•10.1016/S0167-8191(99)00014-9•
Acceleration of molecular mechanic simulation by parallelization and fast multipole techniques

Horst Schwichtenberg1, G. Winter1, H. Wallmeier•
Center for Information Technology1
1 May 1999
TL;DR: A code (MEGADYN) is generated for the simulation of MD of large simulation ensembles (up to $10^6$ atoms) on the basis of classical force field methods and a reduction of complexity of the calculation of forces and energy down to $O(N)$ was achieved by employing Greengard's fast multipole method to the Coulomb interaction.
Abstract: Simulations of classical molecular dynamics (MD) systems can be sped up considerably by parallelizing the existing codes for distributed memory machines. In classical MD the CPU time is typically a function of the square of the number of atoms. The size of the molecular system which can be solved is therefore often limited by the CPU available. There are different approaches for reducing computation time. One consists in parallelizing sequential $O(N^2)$ algorithms. The other is replacing the calculation of non-bonding forces by a less complex algorithm which can then be parallelized. We have generated a code (MEGADYN) for the simulation of MD of large simulation ensembles (up to $10^6$ atoms) on the basis of classical force field methods. A reduction of complexity of the calculation of forces and energy down to $O(N)$ was achieved by employing Greengard's fast multipole method (FMM) to the Coulomb interaction. Within the framework of FMM the periodic boundary conditions are realized in a minimum image convention type manner. Thus MEGADYN can be used to simulate NVT as well as NPT ensembles.
Journal Article•10.1016/S0167-8191(98)00101-X•
Approximation of algorithms for scheduling trees with general communication delays

Alix Munier
1 Jan 1999
TL;DR: It is proved that, for a limited number of identical processors m, any list schedule using the clusters structure has a relative performance bounded by 1+(1−1/m)(2−1/(1+ρ)) and that this bound is tight.
Abstract: We consider the problem of scheduling a tree with general communication delays. Jakoby and Reischuk proved that this problem is NP-hard for binary trees and unlimited number of processors. Firstly, we develop a clustering procedure based on the same lower bounds as Papadimitriou and Yannakakis for a related problem. We deduce an approximation algorithm for an unlimited number of processors with relative performance 2−1/(1+ρ), where ρ denotes the maximum ratio between communication delays and duration of tasks. We also prove that, for a limited number of identical processors m, any list schedule using the clusters structure has a relative performance bounded by 1+(1−1/m)(2−1/(1+ρ)) and that this bound is tight.
Book Chapter•10.1007/10704826_14•
Supporting Shared Memory and Message Passing on Clusters of PCs with a SMiLE

Wolfgang Karl1, Markus Leberecht1, Martin Schulz1•
Technische Universität München1
9 Jan 1999
TL;DR: By utilizing the Scalable Coherent Interface (SCI) with its ability to transparently perform remote memory operations, it is possible to support both efficient message passing and transparent shared memory on one single platform.
Abstract: With the rise of fast interconnection technologies and new concepts to utilize them without operating system interaction (like VIA [4]), compute clusters are becoming increasingly commonplace. Most of the interconnection networks focus only on message passing as their prime programming model, neglecting the large code base for shared memory. However, by utilizing the Scalable Coherent Interface (SCI) [19] with its ability to transparently perform remote memory operations, it is possible to support both efficient message passing and transparent shared memory on one single platform. This introduces a previously unknown flexibility into the cluster architecture.
Book Chapter•10.1007/3-540-49164-3_33•
Hardware and Software Aspects for 3-D Wavelet Decomposition on Shared Memory MIMD Computers

Rade Kutil1, Andreas Uhl1•
University of Salzburg1
1 Feb 1999
TL;DR: Hardware and software aspects of parallel 3-D wavelet/subband decomposition on shared memory MIMD computers are discussed, and experiments are conducted on an SGI POWERChallenge GR.
Abstract: In this work we discuss hardware and software aspects of parallel 3-D wavelet/subband decomposition on shared memory MIMD computers. Experiments are conducted on an SGI POWERChallenge GR.
Journal Article•10.1016/S0167-8191(99)00074-5•
Compiling high performance Fortran for distributed-memory architectures

Siegfried Benkner1, Hans P. Zima1•
University of Vienna1
1 Dec 1999
TL;DR: This paper provides an overview of HPF compilation and runtime technology for distributed-memory architectures and deals with a number of topics in some detail, including distribution and alignment processing, the basic compilation scheme, and methods for the optimization of regular computations.
Abstract: High Performance Fortran (HPF) is a data-parallel language that provides a high-level interface for programming scientific applications, while delegating to the compiler the task of generating explicitly parallel message-passing programs. This paper provides an overview of HPF compilation and runtime technology for distributed-memory architectures, and deals with a number of topics in some detail. In particular, we discuss distribution and alignment processing, the basic compilation scheme and methods for the optimization of regular computations. A separate section is devoted to the transformation and optimization of independent loops with irregular data accesses. The paper concludes with a discussion of research issues and outlines potential future development paths of the language.
Book Chapter•10.1007/10704826_5•
Performance Evaluation of the Multimedia Router with MPEG-2 Video Traffic

Blanca Caminero1, Francisco J. Quiles1, José Duato2, Damon S. Love3, Sudhakar Yalamanchili3•
University of Castilla–La Mancha1, Polytechnic University of Valencia2, Georgia Institute of Technology3
9 Jan 1999
TL;DR: It is shown that, with a simple scheduling algorithm, amenable to single-chip implementation, the link bandwidth utilization is quite satisfactory, while still providing acceptable delays to both CBR and VBR traffic.
Abstract: The Multimedia Router (MMR) architecture is aimed at providing QoS to multimedia traffic in a local area environment, while retaining a compact and simple design. In this paper, we show some preliminary performance evaluation results. The workload was composed of a mix of synthetic CBR traffic and semi-synthetic VBR traffic. The latter was obtained from real MPEG-2 video sequences. We show that, with a simple scheduling algorithm, amenable to single-chip implementation, the link bandwidth utilization is quite satisfactory, while still providing acceptable delays to both CBR and VBR traffic.
Journal Article•10.1016/S0167-8191(98)00103-3•
A versatile cost modelling approach for multicomputer task scheduling

Cristina Boeres1, Vinod E. F. Rebello1•
Federal Fluminense University1
1 Jan 1999
TL;DR: A multi-stage scheduling approach (MSA) is proposed which can be customised to classes of parallel systems according to their communication performance characteristics by varying the order in which the rules (which guide the strategy) are applied.
Abstract: In general, scheduling models only consider the message delay or latency as the dominant communication parameter. However, in many of the current generation of parallel systems, latency is negligible compared to the CPU penalties for the communication-related activities that are incurred whenever pairs of dependent tasks on distinct processors need to communicate. This work considers a model where the CPU penalty, which is associated with sending and receiving communication events, is an additional (potentially dominant) communication parameter. A multi-stage scheduling approach (MSA) is proposed which takes both of these types of communication parameters into account. This scheduling approach can be customised to classes of parallel systems according to their communication performance characteristics by varying the order in which the rules (which guide the strategy) are applied.
Journal Article•10.1016/S0167-8191(99)00011-3•
A case study in scalability: an ADI method for the two-dimensional time-dependent Dirac equation

Ulrich Rathe1, Peter Sanders2, Peter L. Knight1•
Imperial College London1, Max Planck Society2
1 May 1999
TL;DR: The dynamics of relativistic atomic wave functions evolving under the influence of intense laser pulses is used as an example of a general class of applications employing the alternating direction implicit method.
Abstract: The dynamics of relativistic atomic wave functions evolving under the influence of intense laser pulses is used as an example of a general class of applications employing the alternating direction implicit method. The method requires the solution of many tridiagonal systems of linear equations. A range of parallel algorithms for this setting are analyzed with respect to their scalability on large parallel machines.
Book Chapter•10.1007/3-540-49164-3_10•
Parallel Quasi-Monte Carlo Integration Using (t, s)-Sequences

Wolfgang Ch. Schmid1, Andreas Uhl1•
University of Salzburg1
1 Feb 1999
TL;DR: This work discusses parallelization techniques for quasi-Monte Carlo integration using (t, s)-sequences and shows that leapfrog parallelization may be very dangerous whereas block-based parallelization turns out to be robust.
Abstract: Currently, the most effective constructions of low-discrepancy point sets and sequences are based on the theory of (t, m, s)-nets and (t, s)-sequences. In this work we discuss parallelization techniques for quasi-Monte Carlo integration using (t, s)-sequences. We show that leapfrog parallelization may be very dangerous whereas block-based parallelization turns out to be robust.
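The two index-partitioning schemes named in this abstract can be sketched as follows. This is an illustration only, not the paper's experiment: it uses the base-2 van der Corput sequence (a simple (0, 1)-sequence) as a stand-in for a general (t, s)-sequence, and the function names, processor count, and block size are all assumptions.

```python
def van_der_corput(i, base=2):
    """Return the i-th point of the van der Corput sequence in the given base."""
    x, f = 0.0, 1.0 / base
    while i > 0:
        i, digit = divmod(i, base)
        x += digit * f
        f /= base
    return x


def leapfrog_indices(rank, nprocs, count):
    """Leapfrog: processor `rank` takes every nprocs-th index, starting at rank."""
    return [rank + k * nprocs for k in range(count)]


def block_indices(rank, block_size):
    """Blocking: processor `rank` takes one contiguous run of indices."""
    start = rank * block_size
    return list(range(start, start + block_size))


# Two processors, eight points each (hypothetical sizes).
leap_points = [van_der_corput(i) for i in leapfrog_indices(0, 2, 8)]
block_points = [van_der_corput(i) for i in block_indices(0, 8)]
```

In this particular setup the leap equals the base, so processor 0 only ever sees even indices and all of its leapfrog points fall in [0, 0.5), while its block of eight consecutive indices covers [0, 1) evenly. That is one simple way a leapfrog substream can lose the uniformity of the full sequence; the paper's own analysis concerns general (t, s)-sequences.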
Book Chapter•10.1007/3-540-49164-3_53•
MPI-parallelized Radiance on SGI CoW and SMP

Roland Koholka, Heinz Mayer, Alois Goller
1 Feb 1999
TL;DR: A parallelization strategy which is suited for scenes with medium to high complexity to decrease calculation time is proposed, and for a set of scenes the obtained speedup indicates the good performance of the chosen load balancing method.
Abstract: For lighting simulations in architecture there is the need for correct illumination calculation of virtual scenes. The Radiance Synthetic Imaging System delivers an excellent solution to that problem. Unfortunately, simulating complex scenes leads to long computation times even for one frame. This paper proposes a parallelization strategy which is suited for scenes with medium to high complexity to decrease calculation time. For a set of scenes the obtained speedup indicates the good performance of the chosen load balancing method. The use of MPI delivers a platform independent solution for clusters of workstations (CoWs) as well as for shared-memory multiprocessors (SMPs).