Top 203 papers presented at Parallel Computing in 2000

Journal Article•10.1016/S0167-8191(00)00048-X•

Graph partitioning models for parallel computing

Bruce Hendrickson¹, Tamara G. Kolda¹•Institutions (1)

1 Nov 2000

TL;DR: This survey of recently proposed alternatives argues that the standard methodology for graph partitioning minimizes the wrong metric and lacks expressibility.

Abstract: Calculations can naturally be described as graphs in which vertices represent computation and edges reflect data dependencies. By partitioning the vertices of a graph, the calculation can be divided among processors of a parallel computer. However, the standard methodology for graph partitioning minimizes the wrong metric and lacks expressibility. We survey several recently proposed alternatives and discuss their relative merits.
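
To make the "wrong metric" point concrete, here is a minimal sketch that contrasts the edge cut with the total communication volume of a partition. The graph, the partition, and the function names are illustrative assumptions and are not taken from the survey. Volume is counted as the number of distinct remote parts to which each vertex must send its value.

```python
def edge_cut(adj, part):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u in adj for v in adj[u] if u < v and part[u] != part[v])

def comm_volume(adj, part):
    """Total number of (vertex, remote part) pairs: a vertex's value is
    sent once to every other part that holds at least one neighbor."""
    total = 0
    for u in adj:
        remote_parts = {part[v] for v in adj[u] if part[v] != part[u]}
        total += len(remote_parts)
    return total

# Hypothetical example: vertex 0 (part 0) has three neighbors in part 1.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
part = {0: 0, 1: 1, 2: 1, 3: 1}

cut = edge_cut(adj, part)      # three cut edges
vol = comm_volume(adj, part)   # vertex 0 sends once; vertices 1-3 send once each
```

In this example, counting one exchange in each direction per cut edge would suggest six transferred values, while only four are needed, because the value of vertex 0 has to reach part 1 just once.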

540 citations

Book Chapter•10.1007/3-540-70734-4_16•

MUMPS: A General Purpose Distributed Memory Sparse Solver

Patrick R. Amestoy¹, Iain S. Duff², Jean-Yves L'Excellent, Jacko Koster³•Institutions (3)

ENSEEIHT¹, Rutherford Appleton Laboratory², University of Bergen³

18 Jun 2000

TL;DR: Recently integrated features of MUMPS are reported on and the present performance of the solver on an SGI Origin 2000 and a CRAY T3E is illustrated.

Abstract: MUMPS is a public domain software package for the multifrontal solution of large sparse linear systems on distributed memory computers. The matrices can be symmetric positive definite, general symmetric, or unsymmetric, and possibly rank deficient. MUMPS exploits parallelism coming from the sparsity in the matrix and parallelism available for dense matrices. Additionally, large computational tasks are divided into smaller subtasks to enhance parallelism. MUMPS uses a distributed dynamic scheduling technique that allows numerical pivoting and the migration of computational tasks to lightly loaded processors. Asynchronous communication is used to overlap communication with computation. In this paper, we report on recently integrated features and illustrate the present performance of the solver on an SGI Origin 2000 and a CRAY T3E.

267 citations

Journal Article•10.1016/S0167-8191(99)00118-0•

Computational solution of capacity planning models under uncertainty

S. A. MirHassani¹, Cormac Lucas¹, Gautam Mitra¹, Enza Messina¹, Chandra A. Poojari¹•Institutions (1)

Brunel University London¹

1 Apr 2000

TL;DR: This paper considers two related modelling approaches and solution techniques for the supply chain network planning problem under uncertainty: scenario analysis of solutions to “wait and see” models, and a two-stage integer stochastic programming representation and solution of the same problem.

Abstract: The traditional supply chain network planning problem is stated as a multi-period resource allocation model involving 0–1 discrete strategic decision variables. The MIP structure of this problem makes it fairly intractable for practical applications, which involve multiple products, factories, warehouses and distribution centres (DCs). The same problem formulated and studied under uncertainty makes it even more intractable. In this paper we consider two related modelling approaches and solution techniques addressing this issue. The first involves scenario analysis of solutions to “wait and see” models and the second involves a two-stage integer stochastic programming (ISP) representation and solution of the same problem. We show how the results from the former can be used in the solution of the latter model. We also give some computational results based on serial and parallel implementations of the algorithms.

166 citations

Journal Article•10.1016/S0167-8191(00)00034-X•

On the complexity of loop fusion

Alain Darte¹•Institutions (1)

École normale supérieure de Lyon¹

1 Aug 2000

TL;DR: The goal of this paper is to study, from a theoretical point of view, several variants of the loop fusion problem -- identifying polynomially solvable cases and NP-complete cases -- and to make the link between these problems and some scheduling problems that arise from completely different areas.

Abstract: Loop fusion is a program transformation that combines several loops into one. It is used in parallelizing compilers mainly for increasing the granularity of loops and for improving data reuse. The goal of this paper is to study, from a theoretical point of view, several variants of the loop fusion problem – identifying polynomially solvable cases and NP-complete cases – and to make the link between these problems and some scheduling problems that arise from completely different areas. We study, among others, the fusion of loops of different types, and the fusion of loops when combined with loop shifting.
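
As a small illustration of the transformation itself (not of the complexity results), the sketch below fuses two loops into one. The function names and the arithmetic are hypothetical. Fusion is legal here because the second loop reads only `b[i]`, the value produced in the same iteration.

```python
def unfused(a):
    n = len(a)
    b = [0] * n
    c = [0] * n
    for i in range(n):      # loop 1: produces b
        b[i] = a[i] * 2
    for i in range(n):      # loop 2: consumes b[i] from the same index
        c[i] = b[i] + 1
    return c

def fused(a):
    n = len(a)
    c = [0] * n
    for i in range(n):      # one loop: the intermediate value is reused at once
        b_i = a[i] * 2
        c[i] = b_i + 1
    return c
```

The fused version makes a single pass and no longer needs the temporary array, which is the data-reuse benefit the abstract mentions. If loop 2 read `b[i + 1]` instead, a direct fusion would read a value that has not been computed yet, so one loop would first have to be shifted by an iteration.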

153 citations

Journal Article•10.1016/S0167-8191(99)00104-0•

Massively parallel computing using commodity components

Ron Brightwell¹, Lee Ann Fisk¹, David S. Greenberg, Tramm Hudson¹, Mike Levenhagen¹, Arthur B. Maccabe², Rolf Riesen¹•Institutions (2)

Sandia National Laboratories¹, University of New Mexico²

1 Feb 2000

TL;DR: The design goals of the cluster and an approach to developing a commodity-based computational resource capable of delivering performance comparable to production-level MPP machines are presented.

Abstract: The Computational Plant (Cplant) project at Sandia National Laboratories is developing a large-scale, massively parallel computing resource from a cluster of commodity computing and networking components. We are combining the benefits of commodity cluster computing with our expertise in designing, developing, using, and maintaining large-scale, massively parallel processing (MPP) machines. In this paper, we present the design goals of the cluster and an approach to developing a commodity-based computational resource capable of delivering performance comparable to production-level MPP machines. We provide a description of the hardware components of a 96-node Phase I prototype machine and discuss the experiences with the prototype that led to the hardware choices for a 400-node Phase II production machine. We give a detailed description of the management and runtime software components of the cluster and offer computational performance data as well as performance measurements of functions that are critical to the management of large systems. © 2000 Elsevier Science B.V. All rights reserved.

99 citations

Journal Article•10.1016/S0167-8191(00)00043-0•

Shape-optimized mesh partitioning and load balancing for parallel adaptive FEM

Ralf Diekmann¹, Robert Preis², F. Schlimbach³, Chris Walshaw³•Institutions (3)

Hilti¹, University of Paderborn², University of Greenwich³

1 Nov 2000

TL;DR: A dynamic distributed load balancing algorithm, designed to maintain good partition aspect ratio, is presented for parallel, adaptive Finite Element simulations that use preconditioned Conjugate Gradient solvers based on domain decomposition.

Abstract: We present a dynamic distributed load balancing algorithm for parallel, adaptive Finite Element simulations in which we use preconditioned Conjugate Gradient solvers based on domain-decomposition. The load balancing is designed to maintain good partition aspect ratio and we show that cut size is not always the appropriate measure in load balancing. Furthermore, we attempt to answer the question why the aspect ratio of partitions plays an important role for certain solvers. We define and rate different kinds of aspect ratio and present a new center-based partitioning method of calculating the initial distribution which implicitly optimizes this measure. During the adaptive simulation, the load balancer calculates a balancing flow using different versions of the diffusion algorithm and a variant of breadth first search. Elements to be migrated are chosen according to a cost function aiming at the optimization of subdomain shapes. Experimental results for Bramble's preconditioner and comparisons to state-of-the-art load balancers show the benefits of the construction.
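
The abstract mentions diffusion-based computation of a balancing flow without giving details. As a rough illustration only, the sketch below implements the textbook first-order diffusion scheme, in which every processor repeatedly exchanges a fraction `alpha` of the load difference with each neighbor. The processor graph, the value of `alpha`, and the function name are assumptions, and this is not the authors' variant.

```python
def diffusion_step(adj, load, alpha):
    """One first-order diffusion sweep over the processor graph:
    load[u] changes by alpha * (load[v] - load[u]) for every neighbor v."""
    new_load = dict(load)
    for u in adj:
        for v in adj[u]:
            new_load[u] += alpha * (load[v] - load[u])
    return new_load

# Hypothetical ring of 4 processors with all work initially on processor 0.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
load = {0: 8.0, 1: 0.0, 2: 0.0, 3: 0.0}

for _ in range(50):
    load = diffusion_step(adj, load, alpha=0.25)
```

Each sweep uses only neighbor information and conserves the total load, so the loads drift toward the average (2.0 per processor here). A flow-based balancer would additionally accumulate the per-edge amounts `alpha * (load[u] - load[v])` to decide how much work to migrate across each edge.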

95 citations

Journal Article•10.1016/S0167-8191(00)00010-7•

Pajé, an interactive visualization tool for tuning multi-threaded parallel applications

J. Chassin de Kergommeaux¹, Benhur de Oliveira Stein², Paul-Emile Bernard³•Institutions (3)

Apache Corporation¹, Universidade Federal de Santa Maria², French Institute for Research in Computer Science and Automation³

15 Aug 2000

TL;DR: Pajé, an interactive visualization tool for displaying the execution of parallel applications where a potentially large number of communicating threads of various life-times execute on each node of a distributed memory parallel system, is described.

Abstract: This paper describes Pajé, an interactive visualization tool for displaying the execution of parallel applications where a potentially large number of communicating threads of various life-times execute on each node of a distributed memory parallel system. Pajé is capable of representing a wide variety of interactions between threads. The main characteristics of Pajé, interactivity and scalability, are exemplified by the performance tuning of a molecular dynamics application. In order to be easily extensible, the architecture of the system was based on components which are connected in a data flow graph to produce a given visualization tool. Innovative components were designed, in addition to “classical” components existing in similar visualization systems, to support scalability and interactivity.

90 citations

Journal Article•10.1016/S0167-8191(00)00049-1•

Quality matching and local improvement for multilevel graph-partitioning

Burkhard Monien¹, Robert Preis¹, Ralph Diekmann²•Institutions (2)

University of Paderborn¹, Hilti²

1 Nov 2000

TL;DR: A new approximation algorithm for maximum weighted matching in general edge-weighted graphs and the Helpful-Set strategy for the local improvement of partitions are presented.

Abstract: Multilevel strategies have proven to be very powerful approaches in order to partition graphs efficiently. Their efficiency is dominated by two parts: the coarsening and the local improvement strategies. Several methods have been developed to solve these problems, but their efficiency has only been proven on an experimental basis. In this paper, we present new and efficient methods for both problems, while satisfying certain quality measurements. For the coarsening part we develop a new approximation algorithm for maximum weighted matching in general edge-weighted graphs. It calculates a matching with an edge weight of at least 1/2 of the edge weight of a maximum weighted matching. Its time complexity is O(|E|), with |E| being the number of edges in the graph. Furthermore, we use the Helpful-Set strategy for the local improvement of partitions. For partitioning graphs with a regular degree of 2k into two parts, it guarantees an upper bound of ((k−1)/2)|V|+1 on the cut size of the partition, with |V| being the number of vertices. These quality methods used for the two parts of the multilevel approach lead to an efficient graph-partitioning concept.
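
To show what a 1/2-approximation for maximum weighted matching looks like, here is a sketch of the classical sort-based greedy heuristic. It is a stand-in and not the paper's algorithm: it gives the same 1/2 guarantee but costs O(|E| log |E|) because of the sort, whereas the abstract reports O(|E|). The edge list and names are hypothetical.

```python
def greedy_matching(edges):
    """edges: iterable of (weight, u, v) tuples.
    Scans edges from heaviest to lightest and keeps an edge whenever
    both endpoints are still unmatched."""
    matched = set()
    result = []
    for w, u, v in sorted(edges, reverse=True):
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            result.append((w, u, v))
    return result

# Hypothetical path a-b-c-d with edge weights 2, 3, 2.
edges = [(2, "a", "b"), (3, "b", "c"), (2, "c", "d")]
matching = greedy_matching(edges)
weight = sum(w for w, _, _ in matching)
```

On this path the greedy choice takes the middle edge (weight 3) and blocks both outer edges, while the optimum takes the two outer edges (weight 4). The result is within the factor 1/2 but not optimal, which is the kind of trade-off a coarsening step accepts in exchange for speed.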

77 citations

Book Chapter•10.1007/10720115_4•

A New Methodology to Compute Deadlock-Free Routing Tables for Irregular Networks

José Carlos Sancho¹, Antonio Robles¹, José Duato¹•Institutions (1)

Polytechnic University of Valencia¹

8 Jan 2000

TL;DR: A new methodology to compute deadlock-free routing tables for NOWs is proposed, based on generating an underlying acyclic connected graph from the network graph and assigning a sequence number to each switch, which is used to remove cyclic dependencies.

Abstract: Networks of workstations (NOWs) are being considered as a cost-effective alternative to parallel computers. Many NOWs are arranged as a switch-based network with irregular topology, which makes routing and deadlock avoidance quite complicated. Current proposals use the up*/down* routing algorithm to remove cyclic dependencies between channels and avoid deadlock. However, routing is considerably restricted and most messages must follow non-minimal paths, increasing latency and wasting resources. In this paper, we propose a new methodology to compute deadlock-free routing tables for NOWs. The methodology tries to minimize the limitations of the current proposals in order to improve network performance. It is based on generating an underlying acyclic connected graph from the network graph and assigning a sequence number to each switch, which is used to remove cyclic dependencies. Evaluation results show that the routing algorithm based on the new methodology increases throughput by a factor of up to 2 in large networks, also reducing latency significantly.

70 citations

Journal Article•10.1016/S0167-8191(00)00012-0•

Solving algebraic Riccati equations on parallel computers using Newton's method with exact line search

Peter Benner¹,², Ralph Byers³, Enrique S. Quintana-Ortí⁴, Gregorio Quintana-Ortí⁴•Institutions (4)

University of Bremen¹, Max Planck Society², University of Kansas³, James I University⁴

1 Sep 2000

TL;DR: This work applies and extends the available theory for Newton's method endowed with exact line search to accelerate convergence and proposes to solve Lyapunov equations using iterative schemes for computing the matrix sign function.

Abstract: We investigate the numerical solution of continuous-time algebraic Riccati equations via Newton's method on serial and parallel computers with distributed memory. We apply and extend the available theory for Newton's method endowed with exact line search to accelerate convergence. We also discuss a new stopping criterion based on recent observations regarding condition and error estimates. In each iteration step of Newton's method a stable Lyapunov equation has to be solved. We propose to solve these Lyapunov equations using iterative schemes for computing the matrix sign function. This approach can be efficiently implemented on parallel computers using ScaLAPACK. Numerical experiments on an IBM SP2 multicomputer report on the accuracy, scalability, and speed-up of the implemented algorithms.

57 citations

Journal Article•10.1016/S0167-8191(00)00054-5•

Simple data management, scheduling and solution strategies for managing the irregularities in parallel adaptive hp finite element simulations

Andras Laszloffy¹, Jingping Long¹, Abani Patra¹•Institutions (1)

University at Buffalo¹

1 Dec 2000

TL;DR: It is shown that adaptive hp finite element methods can be parallelized easily if a good spatially local ordering of all data is used for organizing storage, distribution and access, and if computation is scheduled using an “owner-computes” rule.

Abstract: Adaptive hp finite element methods (FEM), in which both grid size h and local polynomial order p are dynamically altered, generate computations that require dynamic and irregular patterns of data storage, access and computation, making their parallelization very difficult. We show here that such applications can be parallelized easily if we use a good spatially local ordering of all data for organizing storage, distribution and access, and schedule computation using an “owner-computes” rule. This ordering results in a global index space which can be partitioned to distribute the data, locally used in hashing schemes and B-trees for the necessary dynamic memory management, and used in designing efficient solution schemes.

Proceedings Article•10.1142/9781848160170_0036•

Towards a fast parallel sparse matrix-vector multiplication.

Roman Geus, Stefan Röllin

1 May 2000

TL;DR: This paper analyses the performance of the sparse matrix-vector product with symmetric matrices originating from the FEM and describes techniques that lead to a fast implementation and shows how these optimisations can be incorporated into an efficient parallel implementation using message passing.

Abstract: The sparse matrix-vector product is an important computational kernel that runs ineffectively on many computers with super-scalar RISC processors. In this paper we analyse the performance of the sparse matrix-vector product with symmetric matrices originating from the FEM and describe techniques that lead to a fast implementation. It is shown how these optimisations can be incorporated into an efficient parallel implementation using message passing. We conduct numerical experiments on many different machines and show that our optimisations speed up the sparse matrix-vector multiplication substantially.
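
The abstract does not list its optimisations. One common way to exploit symmetry, assumed here purely for illustration, is to store only the lower triangle in compressed sparse row (CSR) form and apply every off-diagonal entry twice. The array names `row_ptr`, `col_idx`, and `vals` follow the usual CSR convention, and the function name is hypothetical.

```python
def sym_spmv(n, row_ptr, col_idx, vals, x):
    """y = A x for a symmetric A of which only the lower triangle
    (including the diagonal) is stored in CSR format."""
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            a = vals[k]
            y[i] += a * x[j]
            if j != i:              # mirrored entry a_ji = a_ij
                y[j] += a * x[i]
    return y

# A = [[4, 1, 0],
#      [1, 3, 2],
#      [0, 2, 5]]   (only the lower triangle is stored)
row_ptr = [0, 1, 3, 5]
col_idx = [0, 0, 1, 1, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
y = sym_spmv(3, row_ptr, col_idx, vals, [1.0, 2.0, 3.0])
```

Storing half the matrix roughly halves the memory traffic for matrix entries, at the price of scattered writes to `y[j]`. In a message-passing setting those writes may target rows owned by another process, which is one reason a parallel version needs extra care.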

Journal Article•10.1016/S0167-8191(00)00035-1•

Scheduling preemptable tasks on parallel processors with limited availability

Jacek Blazewicz¹, Maciej Drozdowski¹, Piotr Formanowicz¹, Wieslaw Kubiak², Günter Schmidt³•Institutions (3)

Poznań University of Technology¹, Memorial University of Newfoundland², Saarland University³

1 Aug 2000

TL;DR: Preemptive task scheduling on m parallel identical processors with the objective of minimizing makespan is generalized to processors that are available in certain time intervals only, and the generalized problem is shown to become NP-hard in the strong sense in the case of trees and identical processors.

Abstract: It is well known that in the majority of cases the problem of preemptive task scheduling on m parallel identical processors with the objective of minimizing makespan can be solved in polynomial time. For example, for tree-like precedence constraints the algorithm of Muntz and Coffman can be applied. In this paper, this problem is generalized to cover the case of parallel processors which are available in certain time intervals only. It will be shown that this problem becomes NP-hard in the strong sense in case of trees and identical processors. If tasks form chains and are processed by identical processors with a staircase pattern of availability, the problem can be solved in low-order polynomial time for Cmax criterion, and a linear programming approach is required for Lmax criterion. Network flow and linear programming approaches will be proposed for independent tasks scheduled on, respectively, uniform and unrelated processors with arbitrary patterns of availability for schedule length and maximum lateness criteria. © 2000 Published by Elsevier Science B.V. All rights reserved.

Book Chapter•10.1007/3-540-70734-4_7•

LAWRA: Linear Algebra with Recursive Algorithms

Bjarne Stig Andersen¹, Fred G. Gustavson², Alexander Karaivanov¹, Minka Marinova¹, Jerzy Waniewski¹, Plamen Y. Yalamov•Institutions (2)

Technical University of Denmark¹, IBM²

18 Jun 2000

TL;DR: The Cholesky factorization algorithm for positive definite matrices, LU factorization for general matrices, and LDLᵀ factorization for symmetric indefinite matrices are formulated using recursion, and performance graphs of the packed Cholesky and LDLᵀ algorithms are presented.

Abstract: Recursion leads to automatic variable blocking for dense linear algebra algorithms. The recursion transforms LAPACK level-2 algorithms into level-3 codes. For this and other reasons recursion usually speeds up the algorithms. Recursion provides a new, easy and very successful way of programming numerical linear algebra algorithms. Several algorithms for matrix factorization have been implemented and tested. Some of these algorithms are already candidates for the LAPACK library. Recursion has also been successfully applied to the BLAS (Basic Linear Algebra Subprograms). The ATLAS system (Automatically Tuned Linear Algebra Software) uses a recursive coding of the BLAS. The Cholesky factorization algorithm for positive definite matrices, LU factorization for general matrices, and LDLᵀ factorization for symmetric indefinite matrices using recursion are formulated in this paper. Performance graphs of our packed Cholesky and LDLᵀ algorithms are presented here.
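
The idea of "automatic variable blocking" can be seen in a small recursive Cholesky sketch. This is an illustrative pure-Python version written for this listing, not LAWRA code; the function name and in-place storage layout are assumptions. The matrix is split in half, the leading block is factored recursively, and the remaining work becomes a triangular solve and a symmetric update on whole blocks, which is where level-3 BLAS calls would go in a real implementation.

```python
import math

def rchol(A, lo=0, hi=None):
    """Recursive Cholesky of the symmetric positive definite block
    A[lo:hi][lo:hi]. The lower triangle is overwritten with L (A = L L^T);
    the strict upper triangle is left untouched and never read."""
    if hi is None:
        hi = len(A)
    n = hi - lo
    if n == 1:
        A[lo][lo] = math.sqrt(A[lo][lo])
        return A
    mid = lo + n // 2
    rchol(A, lo, mid)                         # A11 = L11 L11^T
    for i in range(mid, hi):                  # L21 = A21 L11^{-T}
        for j in range(lo, mid):
            s = A[i][j] - sum(A[i][k] * A[j][k] for k in range(lo, j))
            A[i][j] = s / A[j][j]
    for i in range(mid, hi):                  # A22 <- A22 - L21 L21^T
        for j in range(mid, i + 1):
            A[i][j] -= sum(A[i][k] * A[j][k] for k in range(lo, mid))
    rchol(A, mid, hi)                         # factor the updated A22
    return A

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
rchol(A)
L = [[A[i][j] if j <= i else 0.0 for j in range(3)] for i in range(3)]
```

The block sizes are never chosen explicitly: they fall out of the recursion, halving at each level until single entries are reached.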

Journal Article•10.1016/S0167-8191(00)00057-0•

Parallel algorithms to solve two-stage stochastic linear programs with robustness constraints

Patrizia Beraldi¹, Lucio Grandinetti¹, Roberto Musmanno, Chefi Triki¹•Institutions (1)

University of Calabria¹

1 Dec 2000

TL;DR: A parallel method for solving two-stage stochastic linear programs with restricted recourse is presented, based on a primal-dual path-following interior point algorithm, and exploits fruitfully the dual block-angular structure of the constraint matrix and the special block structures of the matrices involved in the restricted recourse model.

Abstract: In this paper we present a parallel method for solving two-stage stochastic linear programs with restricted recourse. The mathematical model considered here can be used to represent several real-world applications, including financial and production planning problems, for which significant changes in the recourse solutions should be avoided because of their difficulty to be implemented. Our parallel method is based on a primal-dual path-following interior point algorithm, and exploits fruitfully the dual block-angular structure of the constraint matrix and the special block structure of the matrices involved in the restricted recourse model. We describe and discuss both message-passing and shared-memory implementations and we present the numerical results collected on the Origin2000.

Book Chapter•10.1007/10720115_9•

Broadcast/Multicast over Myrinet Using NIC-Assisted Multidestination Messages

Darius Buntinas¹, Dhabaleswar K. Panda¹, José Duato¹, P. Sadayappan¹•Institutions (1)

Ohio State University¹

8 Jan 2000

TL;DR: A design and an implementation of a multi-send primitive to support efficient broadcast/multicast that requires minimal assistance from the NIC and a method for constructing an optimal multicast tree using the new primitive is proposed.

Abstract: Broadcasting and multicasting are common operations in parallel and distributed programs. Some modern Network Interface Cards (NICs) have programmable processors which can be used to provide support for these operations. However, these processors are 5-15 times slower than the host processor. In this paper we propose a design and an implementation of a multi-send primitive to support efficient broadcast/multicast that requires minimal assistance from the NIC. Our scheme is designed with the idea that as much processing as possible should be done by the host processor. This gives us more flexibility with, for example, creating multicast trees which would be optimal for a particular message size, or choosing a multicast tree dynamically based on requirements of bandwidth versus latency for a particular message. We have designed a multi-send primitive and implemented it as an addition to Fast-Messages (FM) 2.1 running over a Myrinet network. The proposed scheme does less processing at the NIC. The impact of adding such a NIC-assisted multicast operation to a run-time system is also very small, less than 500 ns for non-multi-send packets. To fully utilize the benefits of this primitive, we propose a method for constructing an optimal multicast tree using the new primitive. We have evaluated this scheme and obtained a speedup factor of up to 1.85 for multicasting 16K messages with 16 nodes.

Journal Article•10.1016/S0167-8191(00)00045-4•

A comparison of some dynamic load-balancing algorithms for a parallel adaptive flow solver

N. Touheed¹, Paul M. Selwood¹, Peter K. Jimack¹, Martin Berzins¹•Institutions (1)

University of Leeds¹

1 Nov 2000

TL;DR: This paper contrasts the performance of a number of different parallel dynamic load-balancing algorithms when used in conjunction with a particular parallel, adaptive, time-dependent, 3D flow solver.

Abstract: In this paper we contrast the performance of a number of different parallel dynamic load-balancing algorithms when used in conjunction with a particular parallel, adaptive, time-dependent, 3D flow solver. An overview of this solver is given along with a description of the dynamic load-balancing problem that results from its use. Two recently published parallel dynamic load-balancing software tools are then briefly described and a number of recursive parallel dynamic load-balancing techniques are also outlined. The effectiveness of each of these algorithms is then assessed when they are coupled with the parallel adaptive solver and used to tackle a model 3D flow problem.

Journal Article•10.1016/S0167-8191(00)00020-X•

Scalability of correlated electronic structure calculations on parallel computers: A case study of the RI-MP2 method

David E. Bernholdt¹•Institutions (1)

Syracuse University¹

1 Jul 2000

TL;DR: The implementation of the RI-MP2 method using the Global Array parallel programming model is described and its scalability is analyzed, both with problem size and number of processors.

Abstract: The RI-MP2 method arises from the application of the “resolution of the identity” (RI) integral approximation to the second-order many-body perturbation theory (MP2). It provides a lower-cost alternative to the MP2 method, widely used in the computational chemistry community. This paper describes the implementation of the RI-MP2 method using the Global Array parallel programming model and analyzes its scalability, both with problem size and number of processors. Large-scale calculations are dominated by a parallel matrix multiplication, and scale quite well from 16 to 128 processors on an IBM RS/6000 SP system. It is estimated that exact MP2 calculations on the largest system reported here might take as much as 90 times longer than the RI-MP2 timings presented.

Journal Article•10.1016/S0167-8191(00)00036-3•

Rate of change load balancing in distributed and parallel systems

Luis Miguel Campos¹, Isaac D. Scherson¹•Institutions (1)

University of California, Irvine¹

1 Aug 2000

TL;DR: A novel, totally distributed algorithm for load balancing is proposed that is based on local rate of change (RoC) observations rather than on global absolute load numbers and requires no centralized trigger and/or decision makers.

Abstract: Dynamic load balancing (DLB) is an important system function destined to distribute workload among available processors to improve throughput and/or execution times of parallel computer programs either uniform or non-uniform (jobs whose workload varies at run-time in unpredictable ways). Non-uniform computation and communication requirements may bog down a parallel computer if no efficient load distribution is effected. A novel distributed algorithm for load balancing is proposed and is based on local rate of change (RoC) observations rather than on global absolute load numbers. It is a totally distributed algorithm and requires no centralized trigger and/or decision makers. The strategy is discussed and analyzed by means of experimental simulation.

Journal Article•10.1016/S0167-8191(99)00108-8•

Efficient parallel processing on low-cost clusters with GAMMA active ports

Giovanni Chiola¹, Giuseppe Ciaccio¹•Institutions (1)

University of Genoa¹

1 Feb 2000

TL;DR: Despite being implemented at the kernel level in the Linux OS, performance numbers of GAMMA Active Ports are much better than many other LAN-oriented communication layers, including so-called “user-level” ones (e.g. U-Net).

Abstract: The Genoa Active Message MAchine (GAMMA) is an efficient communication layer for 100base-T clusters of Personal Computers under the Linux operating system (OS). It is based on Active Ports, a communication mechanism derived from Active Messages. Active Ports share most of the low-level optimization opportunities with Generic Active Messages while offering a higher-level programming interface not only in the SPMD but also in the MIMD and client/server paradigms. In addition to point-to-point communications, multi-cast, barrier synchronization, scatter, and gather primitives have also been developed based on Active Ports and exploiting shared 100base-T LAN technology in an optimal way. GAMMA Active Ports deliver excellent communication performance at the user level (latency 13 μs, maximum throughput 12.2 MByte/s, half-power point reached with 200 byte long messages), thus enabling cost-effective cluster computing on 100base-T. Despite being implemented at the kernel level in the Linux OS, performance numbers of GAMMA Active Ports are much better than many other LAN-oriented communication layers, including so-called “user-level” ones (e.g. U-Net). Some code porting efforts have already shown that several applications are reasonably easy to develop on top of GAMMA and that they can actually take advantage of the efficient point-to-point as well as collective communication primitives offered by our prototype library implementation. A porting of the MPICH higher-level interface atop GAMMA is currently under way.

Book Chapter•10.1007/3-540-70734-4_44•

Domain Decomposition Solvers for Large Scale Industrial Finite Element Problems

Petter E. Bjørstad¹, Jacko Koster¹, Piotr Krzyżanowski²•Institutions (2)

University of Bergen¹, University of Warsaw²

18 Jun 2000

TL;DR: A domain decomposition code for solving large scale finite element problems in a robust, yet efficient way and its performance on some challenging industrial problems is shown.

Abstract: The European research project PARASOL aimed to design and develop a public domain library of scalable sparse matrix solvers for distributed memory computers. Parallab was a partner in the project and developed a domain decomposition code for solving large scale finite element problems in a robust, yet efficient way. Although the PARASOL project finished in June 1999, Parallab has continued the development of the solver. In this paper, we report on the present status of the solver and show its performance on some challenging industrial problems.

Journal Article•10.1016/S0167-8191(99)00124-6•

Parallel computing of a quasi-Monte Carlo algorithm for valuing derivatives

Jenny X. Li¹, Gary L. Mullen¹•Institutions (1)

Pennsylvania State University¹

1 Apr 2000

TL;DR: From the numerical evidence, it appears that the (t, m, s)-nets generated by a parallel algorithm can be powerful tools for valuing derivatives and other securities.

Abstract: The performance of the standard Monte Carlo method is compared with the performance obtained through the use of (t, m, s)-nets in base b in the approximation of several high dimensional integral problems in valuing derivatives and other securities. The (t, m, s)-nets are generated by a parallel algorithm, where particular considerations are given to scalability of dynamic adaptive routing and load balancing in the design and implementation of the algorithm. From the numerical evidence it appears that such nets can be powerful tools for valuing such securities.
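
For readers unfamiliar with nets, a one-dimensional toy case may help. The first 2^m points of the base-2 van der Corput sequence form a (0, m, 1)-net, and averaging an integrand over them is the simplest quasi-Monte Carlo estimate. The sketch below is serial and one-dimensional, with a made-up integrand; the paper's setting is high-dimensional and parallel, so this only illustrates the principle.

```python
def van_der_corput(k, base=2):
    """k-th point of the van der Corput sequence: the base-b digits of k
    mirrored about the radix point."""
    x, denom = 0.0, 1.0
    while k > 0:
        k, digit = divmod(k, base)
        denom *= base
        x += digit / denom
    return x

def integrate(f, points):
    """Equal-weight quadrature: the average of f over the given points."""
    return sum(f(p) for p in points) / len(points)

N = 2 ** 10
qmc_points = [van_der_corput(k) for k in range(N)]    # a (0, 10, 1)-net in base 2
estimate = integrate(lambda x: x * x, qmc_points)     # exact integral over [0, 1] is 1/3
```

Because the points fill the unit interval evenly rather than randomly, the error for this smooth integrand is well below what plain Monte Carlo with the same number of samples would typically give.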

Journal Article•10.1016/S0167-8191(99)00116-7•

A two-level parallelization strategy for genetic algorithms applied to optimum shape design

[...]

N. Marco¹, Stéphane Lanteri¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

1 Mar 2000

TL;DR: Numerical and performance results are presented for the optimization of two-dimensional airfoils on an SGI Origin 2000, an IBM SP-2, and a Pentium Pro (P6/200 MHz) cluster interconnected through a FastEthernet (100 Mbits/s) switch.

Abstract: This paper presents a two-level strategy for the parallelization of a Genetic Algorithm (GA) coupled to a compressible flow solver designed on unstructured triangular meshes. The parallel implementation is based on MPI and makes use of the process group features of this environment. The resulting algorithm is used for the optimum shape design of aerodynamic configurations. Numerical and performance results are presented for the optimization of two-dimensional airfoils for calculations performed on the following systems: an SGI Origin 2000 and an IBM SP-2 MIMD system, and a Pentium Pro (P6/200 MHz) cluster where the interconnection is realized through a FastEthernet (100 Mbits/s) switch.

Book Chapter•10.1007/3-540-70734-4_9•

High-Performance Library Software for QR Factorization

[...]

Erik Elmroth¹, Fred G. Gustavson²•Institutions (2)

Umeå University¹, IBM²

18 Jun 2000

TL;DR: A new version of RGEQRF and its accompanying SMP parallel counterpart is presented; implemented for a future release of the IBM ESSL library, it is a robust, high-performance piece of library software for QR factorization on uniprocessor and multiprocessor systems.

Abstract: In [5,6], we presented algorithm RGEQR3, a purely recursive formulation of the QR factorization. Using recursion leads us to a natural way to choose the k-way aggregating Householder transform of Schreiber and Van Loan [10]. RGEQR3 is a performance critical subroutine for the main (hybrid recursive) routine RGEQRF for QR factorization of a general m × n matrix. This contribution presents a new version of RGEQRF and its accompanying SMP parallel counterpart, implemented for a future release of the IBM ESSL library. It represents a robust high-performance piece of library software for QR factorization on uniprocessor and multiprocessor systems. The implementation builds on previous results [5,6]. In particular, the new version is optimized in a number of ways to improve the performance; e.g., for small matrices and matrices with a very small number of columns. This is partly done by including mini blocking in the otherwise pure recursive RGEQR3. We describe the salient features of this implementation. Our serial implementation outperforms the corresponding LAPACK routine by 10-65% for square matrices and 10-100% on tall and thin matrices on the IBM POWER2 and POWER3 nodes. The tests covered matrix sizes which varied from very small to very large. The SMP parallel implementation shows close to perfect speedup on a 4-processor PPC604e node.

Journal Article•10.1016/S0167-8191(00)00051-X•

Parallelizing irregular and pointer-based computations automatically: perspectives from logic and constraint programming

[...]

Manuel V. Hermenegildo¹•Institutions (1)

Technical University of Madrid¹

1 Dec 2000

TL;DR: Some of the problems faced by parallelizing compilers for logic and constraint programs are introduced in a tutorial way, and pointers are provided to some of the significant progress made in the area.

Abstract: Irregular computations pose some of the most interesting and challenging problems in automatic parallelization. Irregularity appears in certain kinds of numerical problems and is pervasive in symbolic applications. Such computations often use dynamic data structures, which make heavy use of pointers. This complicates all the steps of a parallelizing compiler, from independence detection to task partitioning and placement. Starting in the mid-1980s, there has been significant progress in the development of parallelizing compilers for logic programming (and more recently, constraint programming) resulting in quite capable parallelizers. The typical applications of these paradigms frequently involve irregular computations, and make heavy use of dynamic data structures with pointers, since logical variables represent in practice a well-behaved form of pointers. This arguably makes the techniques used in these compilers potentially interesting. In this paper, we introduce, in a tutorial way, some of the problems faced by parallelizing compilers for logic and constraint programs and provide pointers to some of the significant progress made in the area. In particular, this work has resulted in a series of achievements in the areas of inter-procedural pointer aliasing analysis for independence detection, cost models and cost analysis, cactus-stack memory management, techniques for managing speculative and irregular computations through task granularity control and dynamic task allocation (such as work-stealing schedulers), etc.

Journal Article•10.1016/S0167-8191(00)00062-4•

Efficient compiler and run-time support for parallel irregular reductions

[...]

Hwansoo Han¹, Chau-Wen Tseng¹•Institutions (1)

University of Maryland, College Park¹

1 Dec 2000

TL;DR: Results show LocalWrite improves performance significantly compared to using replicated buffers, and can match or exceed explicit message-passing gather/scatter for applications with low locality or high adaptivity.

Abstract: Many scientific applications are comprised of irregular reductions on large data sets. In shared-memory parallel programs, these irregular reductions are typically computed in parallel using replicated buffers, then combined using synchronization. We develop LocalWrite, a new technique which partitions irregular reductions so that each processor computes values only for locally assigned data, eliminating the need for buffers or synchronized writes. Computation is replicated if its results are needed on multiple processors. We experimentally evaluate its performance for three irregular codes on a software DSM running on a distributed-memory multiprocessor and two shared-memory multiprocessors while varying connectivity, locality, and adaptivity. Results show LocalWrite improves performance significantly compared to using replicated buffers, and can match or exceed explicit message-passing gather/scatter for applications with low locality or high adaptivity.
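The owner-writes idea described in the abstract can be sketched in a few lines. This is a sequential simulation under stated assumptions, not the authors' compiler or run-time system: the `owner` array, the `contribution` function, and the edge-list data layout are all hypothetical. Each pass of the outer loop stands in for one processor, which writes only to elements it owns and recomputes the contribution of any edge that crosses a partition boundary.

```python
def serial_reduction(num_nodes, edges, contribution):
    """Reference irregular reduction: each edge updates both endpoints."""
    result = [0.0] * num_nodes
    for a, b in edges:
        value = contribution(a, b)
        result[a] += value
        result[b] -= value
    return result


def local_write_reduction(num_nodes, edges, contribution, owner):
    """Owner-writes variant: owner[i] is the processor that owns element i."""
    num_procs = max(owner) + 1
    result = [0.0] * num_nodes
    for proc in range(num_procs):  # stands in for one processor
        for a, b in edges:
            if owner[a] != proc and owner[b] != proc:
                continue  # this edge touches no locally owned data
            # Recomputed on both owners when the edge crosses partitions.
            value = contribution(a, b)
            if owner[a] == proc:
                result[a] += value
            if owner[b] == proc:
                result[b] -= value
    return result


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
    owner = [0, 0, 1, 1]
    weight = lambda a, b: float(a + b + 1)
    print(serial_reduction(4, edges, weight))
    print(local_write_reduction(4, edges, weight, owner))
```

Since no two simulated processors ever write the same element, a real parallel version of this loop would need neither replicated buffers nor synchronized writes; the price is the duplicated work on cross-partition edges.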

Journal Article•10.1016/S0167-8191(00)00027-2•

Contention-free communication scheduling for array redistribution

[...]

Minyi Guo¹, Ikuo Nakata, Yoshiyuki Yamashita¹•Institutions (1)

University of Tsukuba¹

1 Sep 2000

TL;DR: A technique is proposed to schedule the communication in an array redistribution algorithm so that it becomes contention-free; the approach can deal with multi-dimensional “shape changing redistribution”.

Abstract: Array redistribution is often required in programs on distributed memory parallel computers. It is essential to use efficient algorithms for redistribution, otherwise the performance of the programs may degrade considerably. The redistribution overheads consist of two parts: index computation and interprocessor communication. If there is no communication scheduling in a redistribution algorithm, communication contention may occur, which increases the communication waiting time. In order to solve this problem, in this paper, we propose a technique to schedule the communication so that it becomes contention-free. Our approach initially generates a communication table to represent the communication relations among sending nodes and receiving nodes. According to the communication table, we then generate another table, called the communication scheduling table. Each column of the communication scheduling table is a permutation of receiving node numbers in each communication step. Thus the communications in our redistribution algorithm are contention-free. Our approach can deal with multi-dimensional “shape changing redistribution”.
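The property that makes a scheduling table contention-free can be shown with a minimal sketch. It assumes the simplest case, in which every node sends to every other node, and uses the classic cyclic-shift schedule rather than the table-generation procedure of the paper, which the abstract does not spell out. The function names are hypothetical. Rows are sending nodes and columns are communication steps, so each column must be a permutation of the receiving node numbers.

```python
def cyclic_schedule(num_nodes):
    """schedule[sender][step] = receiver, using a cyclic shift per step."""
    return [
        [(sender + shift) % num_nodes for shift in range(1, num_nodes)]
        for sender in range(num_nodes)
    ]


def is_contention_free(schedule):
    """True if every column (one step) is a permutation of the receivers."""
    num_nodes = len(schedule)
    for column in zip(*schedule):
        if sorted(column) != list(range(num_nodes)):
            return False  # some receiver is targeted twice in this step
    return True


if __name__ == "__main__":
    table = cyclic_schedule(4)
    for sender, row in enumerate(table):
        print(sender, row)
    print(is_contention_free(table))
```

With four nodes, step 1 pairs each sender `i` with receiver `(i + 1) % 4`, so no receiver gets two messages in the same step. A schedule in which two senders target the same receiver in one step fails the check.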

Journal Article•10.1016/S0167-8191(00)00019-3•

Parallelization of multi-reference coupled-cluster method

[...]

Piotr Piecuch¹, J. I. Landman•Institutions (1)

Michigan State University¹

1 Jul 2000

TL;DR: A parallel computer program performing orthogonally spin-adapted two-reference state-universal coupled-cluster calculations with singly and doubly excited clusters (SUCCSD) is described and a significant speedup of the SUCCSD calculations on shared-memory parallel systems having 16–32 CPUs is demonstrated.

Abstract: A parallel computer program performing orthogonally spin-adapted two-reference state-universal coupled-cluster calculations with singly and doubly excited clusters (SUCCSD) is described. It is demonstrated that a significant speedup of the SUCCSD calculations on shared-memory parallel systems having 16–32 CPUs can be achieved with relatively few changes in a serial code, if the OpenMP directives are used. The parallel efficiency of the SUCCSD calculations improves with the basis set. The results of parallel SUCCSD calculations for the low lying electronic states of simple quasi-degenerate molecular systems are reported.

Journal Article•10.1016/S0167-8191(00)00037-5•

Real-time sonar beamforming on high-performance distributed computers

[...]

Alan D. George¹, Jeff S. Markwell¹, Ryan B. Fogarty¹•Institutions (1)

University of Florida¹

1 Sep 2000

TL;DR: This paper explores parallel algorithms for conventional beamforming (CBF) designed for an in-array processing system and the parallel algorithms presented offer scaled speedup and provide the basis for adaptations in advanced beamforming algorithms.

Abstract: Rapid advancements in acoustical beamforming techniques for array signal processing are producing algorithms with increased levels of computational complexity. Concomitantly, autonomous arrays capable of performing most or all of the processing in situ have become a focus for mission-critical applications. To address these changes, future sonar systems will take advantage of parallel in-array processing by coupling transducer nodes with low-power processing devices to achieve higher performance and fault tolerance at lower cost. This paper explores parallel algorithms for conventional beamforming (CBF) designed for an in-array processing system. The parallel algorithms presented offer scaled speedup and provide the basis for adaptations in advanced beamforming algorithms.

Journal Article•10.1016/S0167-8191(00)00031-4•

List scheduling of general task graphs under LogP

[...]

Tomasz Kalinowski¹, Iskander Kort, Denis Trystram•Institutions (1)

Polish Academy of Sciences¹

1 Aug 2000

TL;DR: The applicability of the list scheduling approach to the LogP model is assessed: two adaptations of the earliest task first (ETF) heuristic are presented, an upper bound on list schedules under LogP is established, and an experimental study covers different graph classes and model instances.

Abstract: List scheduling is the most frequently used scheduling technique. In this context worst case analysis as well as many experimental studies were performed for various computational models. However, many new models have been proposed during the last decade with the aim to provide a realistic but still simple and general model of parallel computation. LogP is one of the most popular models so far suggested. It takes into account the time a computation processor spends to manage a communication. Many experimental studies on current parallel architectures have shown that such a parameter cannot be neglected. The aim of this paper is to assess the applicability of the list scheduling approach to the LogP model. More precisely, we present two adaptations of the earliest task first (ETF) heuristic. Then, we establish an upper bound on list schedules under LogP. Finally, we present an extensive experimental study for different graph classes and model instances.
