Top 256 papers presented at Parallel Computing in 2011

Proceedings Article•

NAS Parallel Benchmarks.

1 Jan 2011

TL;DR: The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks, originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers.

Abstract: Title: The NAS Parallel Benchmarks. Author: David H. Bailey (Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA, dhbailey@lbl.gov; supported in part by the Director, Office of Computational and Technology Research, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy, under contract number DE-AC02-05CH11231). Acronyms: NAS, NPB.

Definition: The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym "NAS" originally stood for the Numerical Aeronautical Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, although the acronym remains "NAS." The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, Leo Dagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga.

Discussion: The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was in computational aerophysics, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage over vector supercomputers, and, if so, which of the parallel offerings would be most useful in real-world scientific computation.

588 citations

Reference Book•10.1007/978-0-387-09766-4•

Universality in VLSI Computation

Gianfranco Bilardi¹, Geppino Pucci¹•Institutions (1)

University of Padua¹

1 Jan 2011

544 citations

Journal Article•10.1016/J.PARCO.2011.05.004•

A Hybrid MPI-OpenMP Scheme for Scalable Parallel Pseudospectral Computations for Fluid Turbulence

Pablo D. Mininni¹,², Duane Rosenberg¹, Raghu Reddy³, Annick Pouquet¹•Institutions (3)

National Center for Atmospheric Research¹, Facultad de Ciencias Exactas y Naturales², Pittsburgh Supercomputing Center³

1 Jun 2011

TL;DR: It is shown that the hybrid scheme achieves good scalability up to ∼20,000 compute cores with a maximum efficiency of 89%, and a mean of 79%.

Abstract: A hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP for shared memory parallelism is presented. The work is motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence on emerging petascale, high core-count, massively parallel processing systems. The hybrid implementation derives from and augments a well-tested scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new picture for the domain decomposition of the pseudospectral grids, which is helpful in understanding, among other things, the 3D transpose of the global data that is necessary for the parallel fast Fourier transforms that are the central component of the numerical discretizations. Details of the hybrid implementation are provided, and performance tests illustrate the utility of the method. It is shown that the hybrid scheme achieves good scalability up to ∼20,000 compute cores with a maximum efficiency of 89%, and a mean of 79%. Data are presented that help guide the choice of the optimal number of MPI tasks and OpenMP threads in order to maximize code performance on two different platforms.

253 citations

Journal Article•10.1016/J.PARCO.2011.05.002•

Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations

T. Auckenthaler¹, Volker Blum², Hans-Joachim Bungartz¹, Thomas Huckle¹, R. Johanni³, Lukas Krämer⁴, Bruno Lang⁴, Hermann Lederer³, Paul R. Willems⁴•Institutions (4)

Technische Universität München¹, Fritz Haber Institute of the Max Planck Society², Max Planck Society³, University of Wuppertal⁴

1 Dec 2011

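TL;DR: This paper discusses variants of the tridiagonal-to-banded back transformation that improve the parallel efficiency for large numbers of processors as well as the per-processor utilization, and modifies the divide-and-conquer algorithm so that it can compute a subset of the eigenpairs at reduced cost.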

Abstract: The computation of selected eigenvalues and eigenvectors of a symmetric (Hermitian) matrix is an important subtask in many contexts, for example in electronic structure calculations. If a significant portion of the eigensystem is required then typically direct eigensolvers are used. The central three steps are: reduce the matrix to tridiagonal form, compute the eigenpairs of the tridiagonal matrix, and transform the eigenvectors back. To better utilize memory hierarchies, the reduction may be effected in two stages: full to banded, and banded to tridiagonal. Then the back transformation of the eigenvectors also involves two stages. For large problems, the eigensystem calculations can be the computational bottleneck, in particular with large numbers of processors. In this paper we discuss variants of the tridiagonal-to-banded back transformation, improving the parallel efficiency for large numbers of processors as well as the per-processor utilization. We also modify the divide-and-conquer algorithm for symmetric tridiagonal matrices such that it can compute a subset of the eigenpairs at reduced cost. The effectiveness of our modifications is demonstrated with numerical experiments.

221 citations

Journal Article•10.1016/J.PARCO.2011.02.002•

High performance computing using MPI and OpenMP on multi-core parallel systems

Haoqiang Jin¹, Dennis C. Jespersen¹, Piyush Mehrotra¹, Rupak Biswas¹, Lei Huang², Barbara Chapman²•Institutions (2)

Ames Research Center¹, University of Houston²

1 Sep 2011

TL;DR: This paper presents the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using a hybrid MPI and OpenMP approach on several multi-core based systems, and presents new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.

Abstract: The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.

195 citations

Book Chapter•10.1007/978-0-387-09766-4_93•

PaToH (Partitioning tool for hypergraphs)

Ümit V. Çatalyürek¹, Cevdet Aykanat²•Institutions (2)

Ohio State University¹, Bilkent University²

1 Jan 2011

173 citations

Journal Article•10.1016/J.PARCO.2011.02.007•

Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster

Wang Xian¹, Aoki Takayuki¹•Institutions (1)

Tokyo Institute of Technology¹

1 Sep 2011

TL;DR: A simulation using the D3Q19 model of the lattice Boltzmann method was executed successfully on a multi-node GPU cluster using CUDA programming and the MPI library, with an overlapping technique introduced to hide communication behind computation.

Abstract: GPGPU has drawn much attention for accelerating non-graphics applications. A simulation using the D3Q19 model of the lattice Boltzmann method was executed successfully on a multi-node GPU cluster using CUDA programming and the MPI library. The GPU code runs on the multi-node GPU cluster TSUBAME of the Tokyo Institute of Technology, which is equipped with a total of 680 NVIDIA Tesla GPUs. For multi-GPU computation, a domain partitioning method is used to distribute the computational load across multiple GPUs, and GPU-to-GPU data transfer becomes a severe overhead for the total performance. The parallel results of 1D, 2D and 3D domain partitionings were compared and analyzed. With a 384x384x384 mesh and 96 GPUs, the performance with 3D partitioning is about 3-4 times higher than with 1D partitioning. The performance curve deviates from the ideal line because of the long communication time between GPUs. To hide the communication time, we introduced an overlapping technique between computation and communication, in which the data transfer and the computation are done in two streams simultaneously. Using 8-96 GPUs, the performance increases by a factor of about 1.1-1.3 in the overlapping mode. As a benchmark problem, a large-scale computation of flow around a sphere at Re=13,000 was carried out successfully using a 2000x1000x1000 mesh and 100 GPUs. For such a computation with 2 giga lattice nodes, 6.0 h were needed to process 100,000 time steps. Under this condition, the computational time (2.79 h) and the data communication time (3.06 h) are almost the same.
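
The figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the `runtime_hours` model (communication either fully serialized with computation or perfectly hidden behind it) is an idealized assumption added for illustration, not a model taken from the paper.

```python
# Figures quoted in the abstract.
NX, NY, NZ = 2000, 1000, 1000   # mesh size
TIME_STEPS = 100_000
WALL_HOURS = 6.0                # reported total run time
COMPUTE_HOURS = 2.79            # reported computational time
COMM_HOURS = 3.06               # reported data communication time


def lattice_updates_per_second(nodes, steps, hours):
    """Aggregate lattice-node updates per second over the whole run."""
    return nodes * steps / (hours * 3600.0)


def runtime_hours(compute, comm, overlap):
    """Idealized run-time model (an assumption for illustration).

    Without overlap the two phases run back to back; with perfect overlap
    the shorter phase is completely hidden behind the longer one.
    """
    return max(compute, comm) if overlap else compute + comm


nodes = NX * NY * NZ
glups = lattice_updates_per_second(nodes, TIME_STEPS, WALL_HOURS) / 1e9
serial_sum = runtime_hours(COMPUTE_HOURS, COMM_HOURS, overlap=False)
ideal_overlap = runtime_hours(COMPUTE_HOURS, COMM_HOURS, overlap=True)

if __name__ == "__main__":
    print(f"lattice nodes: {nodes:.1e}")
    print(f"throughput: {glups:.2f} giga lattice updates per second")
    print(f"compute + communication: {serial_sum:.2f} h")
    print(f"lower bound with perfect overlap: {ideal_overlap:.2f} h")
```

The two quoted components sum to 5.85 h, close to the reported 6.0 h total. Because they are nearly equal, perfect overlap would at best roughly halve the run time in this idealized model; the gains of 1.1-1.3 reported in the abstract were measured on 8-96 GPUs and are not directly comparable.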

153 citations

Proceedings Article•

Open Trace Format 2: The Next Generation of Scalable Trace Formats and Support Libraries.

Dominic Eschweiler¹, Michael Wagner², Markus Geimer¹, Andreas Knüpfer², Wolfgang E. Nagel², Felix Wolf¹•Institutions (2)

Forschungszentrum Jülich¹, Dresden University of Technology²

1 Jan 2011

TL;DR: This paper introduces the Open Trace Format Version 2 (OTF2), a major re-design based on the experiences of its predecessor formats, the Open Trace Format (Version 1) and the EPILOG trace format, with a new file encoding, a close integration in the trace recording system Score-P, and a number of improvements for performance and scalability.

Abstract: A well designed event trace data format is the basis of all trace-based analysis methods. In this paper, we introduce the Open Trace Format Version 2 (OTF2). It is a major re-design based on the experiences of its predecessor formats, the Open Trace Format (Version 1) and the EPILOG trace format. It comes with a new file encoding, a close integration in the trace recording system Score-P, and a number of improvements for performance and scalability. Besides the actual format, it consists of a read/write support library with a powerful API plus supportive tools, which are distributed as Open Source software. Furthermore, OTF2 will serve as a joint data source for the analysis tools Scalasca and Vampir in the near future.

117 citations

Proceedings Article•

BLAS (Basic Linear Algebra Subprograms).

Robert A. van de Geijn¹, Kazushige Goto¹•Institutions (1)

University of Texas at Austin¹

1 Jan 2011

114 citations

Journal Article•10.1016/J.PARCO.2010.10.004•

Scheduling of tasks in the parareal algorithm

Eric Aubanel¹•Institutions (1)

University of New Brunswick¹

1 Mar 2011

TL;DR: It is suggested that the parareal algorithm is a promising approach to solving long time evolution problems, particularly when the goal is simulation of longer times using more processors, and exhibits characteristics that make it particularly suitable for execution on heterogeneous computational grids.

Abstract: Parallelization of partial differential equations (PDEs) by time decomposition has attracted much interest, mainly due to its potential to enable very long time simulations beyond what is possible using spatial domain decomposition. However, there has only been limited performance analysis of the parareal algorithm in the literature, ignoring the efficient scheduling of tasks. This paper presents a detailed study of the scheduling of tasks in the parareal algorithm that achieves significantly better efficiency than the usual algorithm. Two algorithms are proposed, one which uses a manager-worker paradigm with overlap of sequential and parallel phases, and a second that is completely distributed. Experiments were conducted with the 2D heat equation. It was found that the rate of convergence decreases as the number of processors increases, in the case of strong scaling (fixed time interval). However, for weak scaling results the rate of convergence was unaffected by the number of processors. The results of this paper suggest that the parareal algorithm is a promising approach to solving long time evolution problems, particularly when the goal is simulation of longer times using more processors. It also exhibits characteristics that make it particularly suitable for execution on heterogeneous computational grids, such as low communication overhead and easy accommodation of different processor and network speeds.
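
For readers unfamiliar with the parareal iteration that these scheduling algorithms build on, a minimal sequential sketch follows. It is not the paper's scheduler and does not use the 2D heat equation: the scalar test problem dy/dt = LAMBDA * y, the forward Euler propagators, and all names are assumptions chosen to keep the example short. The list comprehension marked as the parallel phase is the part that would be distributed across processors.

```python
LAMBDA = -1.0  # hypothetical test problem: dy/dt = LAMBDA * y


def euler(y, dt, steps):
    """Advance y across an interval of length dt using forward Euler steps."""
    h = dt / steps
    for _ in range(steps):
        y = y + h * LAMBDA * y
    return y


def coarse(y, dt):
    """Cheap propagator G: one Euler step per time slice."""
    return euler(y, dt, 1)


def fine(y, dt):
    """Expensive propagator F: 100 Euler steps per time slice."""
    return euler(y, dt, 100)


def parareal(y0, t_end, n_slices, n_iters):
    """Return the values at the slice boundaries after n_iters iterations."""
    dt = t_end / n_slices
    # Initial guess: a sequential sweep with the coarse propagator.
    u = [y0]
    for n in range(n_slices):
        u.append(coarse(u[n], dt))
    for _ in range(n_iters):
        # Parallel phase: each slice can be propagated independently.
        f_old = [fine(u[n], dt) for n in range(n_slices)]
        g_old = [coarse(u[n], dt) for n in range(n_slices)]
        # Sequential phase: coarse sweep corrected by the fine results.
        new = [y0]
        for n in range(n_slices):
            new.append(coarse(new[n], dt) + f_old[n] - g_old[n])
        u = new
    return u


def serial_fine(y0, t_end, n_slices):
    """Reference: the fine propagator applied slice after slice."""
    dt = t_end / n_slices
    u = [y0]
    for n in range(n_slices):
        u.append(fine(u[n], dt))
    return u
```

After as many iterations as there are time slices, the parareal values agree with the serial fine solution up to rounding, so any speedup depends on converging in far fewer iterations. That is why the overlap of the sequential and parallel phases matters for efficiency.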

92 citations

Proceedings Article•

HPC Challenge Benchmark.

Jack Dongarra¹, Piotr Luszczek¹•Institutions (1)

University of Tennessee¹

1 Jan 2011

Journal Article•10.1016/J.PARCO.2011.08.001•

A scalable memory efficient multigrid solver for micro-finite element analyses based on CT images

Cyril Flaig¹, Peter Arbenz¹•Institutions (1)

ETH Zurich¹

1 Dec 2011

TL;DR: An improved solver is presented that has a significantly smaller memory footprint compared to the currently used solvers and is fully parallel, and shows almost perfect scalability up to 8000 cores of a Cray XT-5 supercomputer.

Abstract: Micro-structural finite element (μFE) analysis based on high-resolution computed tomography represents the current gold standard to predict bone stiffness and strength. Recent progress in solver technology makes possible simulations on large supercomputers that involve billions of degrees of freedom. In this paper we present an improved solver that has a significantly smaller memory footprint compared to the currently used solvers. This new approach fully exploits the information that is contained in the underlying CT image itself. It allows all steps of the underlying multigrid-preconditioned conjugate gradient algorithm to be executed in matrix-free form. The reduced memory footprint allows bigger bone models to be solved on given hardware. It is an important step toward the clinical usage of μFE simulations. The new solver is fully parallel. We show almost perfect scalability up to 8000 cores of a Cray XT-5 supercomputer.

Proceedings Article•

METIS and ParMETIS.

George Karypis¹•Institutions (1)

University of Minnesota¹

1 Jan 2011

Journal Article•10.1016/J.PARCO.2011.03.005•

A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU-CPU clusters

Christian Feichtinger¹, Johannes Habich¹, Harald Köstler¹, Georg Hager¹, Ulrich Rüde¹, Gerhard Wellein¹•Institutions (1)

University of Erlangen-Nuremberg¹

1 Sep 2011

TL;DR: It is demonstrated that a large fraction of the kernel performance can be sustained for weak scaling on InfiniBand clusters, leading to excellent parallel efficiency; in strong scaling scenarios, however, a cost analysis must determine the best course of action for a particular simulation task and hardware configuration.

Abstract: Sustaining a large fraction of single GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. We address this issue in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. Our multi-GPU implementation uses a block-structured MPI parallelization and is suitable for load balancing and heterogeneous computations on CPUs and GPUs. The overhead required for multi-GPU simulations is discussed in detail. It is demonstrated that a large fraction of the kernel performance can be sustained for weak scaling on InfiniBand clusters, leading to excellent parallel efficiency. However, in strong scaling scenarios, using multiple GPUs is much less efficient than running CPU-only simulations on IBM BG/P and x86-based clusters. Hence, a cost analysis must determine the best course of action for a particular simulation task and hardware configuration. Finally, we present weak scaling results of heterogeneous simulations conducted on CPUs and GPUs simultaneously, using clusters equipped with varying node configurations.

Journal Article•10.1016/J.PARCO.2011.05.003•

A common parallel computing framework for modeling hydrological processes of river basins

Hao Wang¹, Xudong Fu¹, Guangqian Wang¹, Tiejian Li¹, Jie Gao¹•Institutions (1)

Tsinghua University¹

1 Jun 2011

TL;DR: A common parallel computing framework based on the message passing interface (MPI) protocol for modeling hydrological processes of river basins is presented, and a practical and dynamic spatial domain decomposition method, based on the binary-tree structure of the drainage network, is proposed.

Abstract: Restricted computing power has become one of the primary factors obstructing advancement in basin simulations for the majority of hydrological models. Parallel computing is one of the most readily available approaches to solving this problem. Using binary-tree theory, we present in this study a common parallel computing framework based on the message passing interface (MPI) protocol for modeling hydrological processes of river basins. A practical and dynamic spatial domain decomposition method, based on the binary-tree structure of the drainage network, is proposed. This framework is computationally efficient and is independent of the type of physical models chosen. The framework is tested in the Chabagou river basin of China, where two years of runoff processes of the entire basin were simulated. Results demonstrate that the system may provide efficient computing performance. However, primarily because of the constraint imposed by the binary-tree structure of the drainage network, this study finds that unlimited enhancement of computing efficiency is impossible to realize.

Journal Article•10.1016/J.PARCO.2010.12.002•

A mixed-precision algorithm for the solution of Lyapunov equations on hybrid CPU-GPU platforms

Peter Benner¹, Pablo Ezzatti², Daniel Kressner³, Enrique S. Quintana-Ortí⁴, Alfredo Remón⁴•Institutions (4)

Max Planck Society¹, University of the Republic², ETH Zurich³, James I University⁴

1 Aug 2011

TL;DR: A hybrid Lyapunov solver based on the matrix sign function is described, in which the intensive parts of the computation are accelerated using a graphics processor (GPU) while the remaining operations are executed on a general-purpose multi-core processor (CPU).

Abstract: We describe a hybrid Lyapunov solver based on the matrix sign function, where the intensive parts of the computation are accelerated using a graphics processor (GPU) while executing the remaining operations on a general-purpose multi-core processor (CPU). The initial stage of the iteration operates in single-precision arithmetic, returning a low-rank factor of an approximate solution. As the main computation in this stage consists of explicit matrix inversions, we propose a hybrid implementation of Gauss-Jordan elimination using look-ahead to overlap computations on GPU and CPU. To improve the approximate solution, we introduce an iterative refinement procedure that cheaply recovers full double-precision accuracy. In contrast to earlier approaches to iterative refinement for Lyapunov equations, this approach retains the low-rank factorization structure of the approximate solution. The combination of the two stages results in a mixed-precision algorithm that exploits the capabilities of both general-purpose CPUs and many-core GPUs and overlaps critical computations. Numerical experiments using real-world data and a platform equipped with two Intel Xeon QuadCore processors and an Nvidia Tesla C1060 show a significant efficiency gain of the hybrid method compared to a classical CPU implementation.

Journal Article•10.1016/J.PARCO.2010.12.004•

Dynamic scheduling of a batch of parallel task jobs on heterogeneous clusters

Jorge G. Barbosa¹, Belmiro Moreira¹•Institutions (1)

University of Porto¹

1 Aug 2011

TL;DR: An adaptation of the Heterogeneous Earliest-Finish-Time (HEFT) algorithm is proposed, called here P-HEFT, to handle parallel tasks in heterogeneous clusters with good efficiency without compromising the makespan.

Abstract: This paper addresses the problem of minimizing the schedule length (makespan) of a batch of jobs with different arrival times. A job is described by a directed acyclic graph (DAG) of parallel tasks. The paper proposes a dynamic scheduling method that adapts the schedule when new jobs are submitted and that may change the processors assigned to a job during its execution. The scheduling method is divided into a scheduling strategy and a scheduling algorithm. We also propose an adaptation of the Heterogeneous Earliest-Finish-Time (HEFT) algorithm, called here P-HEFT, to handle parallel tasks in heterogeneous clusters with good efficiency without compromising the makespan. The results of a comparison of this algorithm with another DAG scheduler, using a simulation of several machine configurations and job types, show that P-HEFT gives a shorter makespan for a single DAG but scores worse for multiple DAGs. Finally, the results of the dynamic scheduling of a batch of jobs using the proposed scheduler method showed significant improvements for more heavily loaded machines when compared to the alternative resource reservation approach.
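
As background, the standard HEFT algorithm that P-HEFT adapts prioritizes tasks by their upward rank: a task's average computation cost plus the most expensive path (communication plus rank) through its successors. The sketch below shows only that ranking step on a made-up four-task DAG; the task names, costs, and helper names are hypothetical, and the parallel-task handling that distinguishes P-HEFT is not described in the abstract and is not reproduced here.

```python
from functools import lru_cache

# Hypothetical DAG: task -> list of (successor, average communication cost).
SUCCESSORS = {
    "A": [("B", 2.0), ("C", 1.0)],
    "B": [("D", 3.0)],
    "C": [("D", 1.0)],
    "D": [],
}

# Hypothetical average computation cost of each task over all processors.
AVG_COST = {"A": 4.0, "B": 6.0, "C": 3.0, "D": 5.0}


@lru_cache(maxsize=None)
def upward_rank(task):
    """Average cost of the task plus the longest remaining path to an exit task."""
    tail = max(
        (comm + upward_rank(succ) for succ, comm in SUCCESSORS[task]),
        default=0.0,
    )
    return AVG_COST[task] + tail


def heft_priority_order():
    """Tasks sorted by decreasing upward rank, the order HEFT schedules them in."""
    return sorted(SUCCESSORS, key=upward_rank, reverse=True)


if __name__ == "__main__":
    for task in heft_priority_order():
        print(task, upward_rank(task))
```

In full HEFT, each task taken in this order is then placed on the processor that gives it the earliest finish time. Ordering by decreasing upward rank guarantees that a task is always considered after its predecessors.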

Journal Article•10.1016/J.PARCO.2011.02.001•

Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

Kamesh Madduri¹, Eun-Jin Im², Khaled Z. Ibrahim¹, Samuel Williams¹, Stephane Ethier³, Leonid Oliker¹•Institutions (3)

Lawrence Berkeley National Laboratory¹, Kookmin University², Princeton Plasma Physics Laboratory³

1 Sep 2011

TL;DR: This work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization, and achieves significant performance improvements on these complex PIC kernels, despite the inherent challenges of data dependency and locality.

Abstract: The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3-4.7x on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.

Journal Article•10.1016/J.PARCO.2011.08.004•

Two-dimensional cache-oblivious sparse matrix-vector multiplication

A. N. Yzelman¹, Rob H. Bisseling¹•Institutions (1)

Utrecht University¹

1 Dec 2011

TL;DR: This paper extends the one-dimensional method for cache-oblivious SpMV multiplication to two dimensions, while still allowing only row and column permutations on the sparse input matrix; the largest gain obtained is over a factor of 3 in SpMV speed compared to the natural matrix ordering.

Abstract: In earlier work, we presented a one-dimensional cache-oblivious sparse matrix-vector (SpMV) multiplication scheme which has its roots in one-dimensional sparse matrix partitioning. Partitioning is often used in distributed-memory parallel computing for the SpMV multiplication, an important kernel in many applications. A logical extension is to move towards using a two-dimensional partitioning. In this paper, we present our research in this direction, extending the one-dimensional method for cache-oblivious SpMV multiplication to two dimensions, while still allowing only row and column permutations on the sparse input matrix. This extension requires a generalisation of the compressed row storage data structure to a block-based data structure, for which several variants are investigated. Experiments performed on three different architectures show further improvements of the two-dimensional method compared to the one-dimensional method, especially in those cases where the one-dimensional method already provided significant gains. The largest gain obtained by our new reordering is over a factor of 3 in SpMV speed, compared to the natural matrix ordering.
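
The compressed row storage (CRS) data structure that this abstract generalises can be illustrated with a plain SpMV kernel. The sketch below is the textbook CRS layout only; it is not the block-based structure or the reordering proposed in the paper, and the function names and the small example matrix are illustrative assumptions.

```python
def to_crs(dense):
    """Convert a dense row-major matrix (list of lists) to CRS arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)    # nonzero value
                col_idx.append(j)   # its column index
        row_ptr.append(len(values)) # one past the end of this row's nonzeros
    return values, col_idx, row_ptr


def spmv(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A stored in CRS form."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Accesses to x follow the column pattern of row i; this
            # irregular access is what reordering tries to make cache-friendly.
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y


if __name__ == "__main__":
    A = [[4, 0, 0],
         [0, 0, 2],
         [1, 3, 0]]
    vals, cols, ptr = to_crs(A)
    print(vals, cols, ptr)                  # [4, 2, 1, 3] [0, 2, 0, 1] [0, 1, 2, 4]
    print(spmv(vals, cols, ptr, [1, 2, 3])) # [4.0, 6.0, 7.0]
```

Row and column permutations of the matrix change only the order in which `values` and `col_idx` are laid out and visited, which is why the kernel itself can stay cache-oblivious.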

Journal Article•10.1016/J.PARCO.2011.09.002•

A CPU-GPU hybrid approach for the unsymmetric multifrontal method

Chenhan D. Yu¹, Weichung Wang¹, Dan'l Pierce²•Institutions (2)

National Taiwan University¹, MSC Software²

1 Dec 2011

TL;DR: Numerical results show that the CPU-GPU hybrid approach can accelerate the unsymmetric multifrontal solver, especially for computationally expensive problems.

Abstract: The multifrontal method is an efficient direct method for solving large-scale sparse and unsymmetric linear systems. The method transforms a large sparse matrix factorization process into a sequence of factorizations involving smaller dense frontal matrices. Some of these dense operations can be accelerated by using a graphics processing unit (GPU). We analyze the unsymmetric multifrontal method from both an algorithmic and an implementational perspective to see how a GPU, in particular the NVIDIA Tesla C2070, can be used to accelerate the computations. Our main accelerating strategies include (i) performing BLAS on both CPU and GPU, (ii) improving the communication efficiency between the CPU and GPU by using page-locked memory, zero-copy memory, and asynchronous memory copy, and (iii) a modified algorithm that reuses the memory between different GPU tasks and sets thresholds to determine whether certain tasks should be performed on the GPU. The proposed acceleration strategies are implemented by modifying UMFPACK, which is an unsymmetric multifrontal linear system solver. Numerical results show that the CPU-GPU hybrid approach can accelerate the unsymmetric multifrontal solver, especially for computationally expensive problems.

Journal Article•10.1016/J.PARCO.2011.03.001•

Parallel c-means algorithm for image segmentation on a reconfigurable mesh computer

Omar Bouattane, Bouchaib Cherradi, Mohamed Youssfi, Mohamed O. Bensalah

1 Apr 2011

TL;DR: A parallel algorithm for data classification, and its application for Magnetic Resonance Images (MRI) segmentation is proposed, and the studied classification method is the well-known c-means method.

Abstract: In this paper, we propose a parallel algorithm for data classification, and its application for Magnetic Resonance Images (MRI) segmentation. The studied classification method is the well-known c-means method. The use of the parallel architecture in the classification domain is introduced in order to reduce the complexity of the corresponding algorithms, so that they can be considered as a pre-processing procedure. The proposed algorithm is designed to be implemented on a parallel machine, the reconfigurable mesh computer (RMC). The image of size m×n to be processed must be stored on an RMC of the same size, one pixel per processing element (PE).
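
A sequential sketch of the c-means classification step may help make the abstract concrete. It assumes the hard (crisp) variant applied to scalar gray levels, which the abstract does not state explicitly, and it does not model the reconfigurable mesh; the function name, pixel values, and initial centers are all hypothetical.

```python
def c_means(pixels, centers, max_iters=100):
    """Hard c-means on scalar gray levels; returns (centers, labels)."""
    centers = list(centers)
    labels = []
    for _ in range(max_iters):
        # Assignment step: each pixel joins the class with the nearest center.
        labels = [
            min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in pixels
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # centers are stable, so the classification has converged
        centers = new_centers
    return centers, labels


if __name__ == "__main__":
    gray_levels = [10, 12, 11, 200, 205, 198, 90, 95]
    final_centers, final_labels = c_means(gray_levels, centers=[0, 128, 255])
    print(final_centers)  # [11.0, 92.5, 201.0]
    print(final_labels)   # [0, 0, 0, 2, 2, 2, 1, 1]
```

On a mesh with one pixel per processing element, the assignment step is naturally local to each element, while the update step needs sums over all elements of a class, which is where the communication capabilities of the mesh come into play.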

Journal Article•10.1016/J.PARCO.2011.05.006•

Tuning collective communication for Partitioned Global Address Space programming models

Rajesh Nishtala¹, Yili Zheng², Paul Hargrove², Katherine Yelick²•Institutions (2)

University of California, Berkeley¹, Lawrence Berkeley National Laboratory²

1 Sep 2011

TL;DR: An implementation framework for PGAS collectives is presented as part of the GASNet communication layer, which supports shared memory, distributed memory and hybrids and supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned.

Abstract: Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads work together to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and implementation. In particular, PGAS collectives have semantic issues that are different than in send-receive style message passing programs, and different implementation approaches that take advantage of the one-sided communication style in these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory and hybrids. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and demonstrate that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with InfiniBand interconnect.

Journal Article•10.1016/J.PARCO.2010.11.002•

Exploiting thread-level parallelism in the iterative solution of sparse linear systems

José Ignacio Aliaga¹, Matthias Bollhöfer², Alberto F. Martín¹, Enrique S. Quintana-Ortí¹•Institutions (2)

James I University¹, Braunschweig University of Technology²

1 Mar 2011

TL;DR: This work exploits the parallelism exposed by the task tree corresponding to the nested dissection hierarchy (task parallelism), employs dynamic scheduling of tasks to processors to improve load balance, and formulates all stages of the parallel PCG method conformal with the computation of the preconditioner to increase data reuse.

Abstract: We investigate the efficient iterative solution of large-scale sparse linear systems on shared-memory multiprocessors. Our parallel approach is based on a multilevel ILU preconditioner which preserves the mathematical semantics of the sequential method in ILUPACK. We exploit the parallelism exposed by the task tree corresponding to the nested dissection hierarchy (task parallelism), employ dynamic scheduling of tasks to processors to improve load balance, and formulate all stages of the parallel PCG method conformal with the computation of the preconditioner to increase data reuse. Results on a CC-NUMA platform with 16 processors reveal the parallel efficiency of this solution.

Journal Article•10.1016/J.PARCO.2011.03.006•

A hybrid message passing/shared memory parallelization of the adaptive integral method for multi-core clusters

Fangzhou Wei¹, Ali E. Yilmaz¹•Institutions (1)

University of Texas at Austin¹

1 Jun 2011

TL;DR: Timing and speedup results show that the hybrid MPI/OpenMP parallelization of AIM exhibits better strong scalability (fixed problem size speedup) than its pure MPI parallelization when multiple cores are used on each processor.

Abstract: A hybrid message passing and shared memory parallelization technique is presented for improving the scalability of the adaptive integral method (AIM), an FFT based algorithm, on clusters of identical multi-core processors. The proposed hybrid MPI/OpenMP parallelization scheme is based on a nested one-dimensional (1-D) slab decomposition of the 3-D auxiliary regular grid and the associated AIM calculations: If there are M processors and T cores per processor, the scheme (i) divides the regular grid into M slabs and MT sub-slabs, (ii) assigns each slab/sub-slab and the associated operations to one of the processors/cores, and (iii) uses MPI for inter-processor data communication and OpenMP for intra-processor data exchange. The MPI/OpenMP parallel AIM is used to accelerate the solution of the combined-field integral equation pertinent to the analysis of time-harmonic electromagnetic scattering from perfectly conducting surfaces. The scalability of the scheme is investigated theoretically and verified on a state-of-the-art multi-core cluster for benchmark scattering problems. Timing and speedup results on up to 1024 quad-core processors show that the hybrid MPI/OpenMP parallelization of AIM exhibits better strong scalability (fixed problem size speedup) than pure MPI parallelization of it when multiple cores are used on each processor.
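The nested 1-D slab decomposition described in this abstract can be illustrated with a minimal Python sketch. It only computes which grid planes each core would own; it is not the authors' code, and the names `split_range` and `nested_slabs`, the half-open plane ranges, and the near-equal splitting rule are all assumptions made for illustration.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, near-equal half-open ranges."""
    length = stop - start
    return [
        (start + (length * p) // parts, start + (length * (p + 1)) // parts)
        for p in range(parts)
    ]


def nested_slabs(grid_planes, num_processors, cores_per_processor):
    """Map (processor, core) to the range of grid planes it owns.

    Level 1: the grid is cut into M slabs, one per processor (the MPI level).
    Level 2: each slab is cut into T sub-slabs, one per core (the OpenMP level).
    """
    layout = {}
    slabs = split_range(0, grid_planes, num_processors)
    for proc, (lo, hi) in enumerate(slabs):
        for core, sub_slab in enumerate(split_range(lo, hi, cores_per_processor)):
            layout[(proc, core)] = sub_slab
    return layout


# Hypothetical example: 16 grid planes, M = 2 processors, T = 4 cores each.
layout = nested_slabs(16, 2, 4)
```

With M = 2 and T = 4 this yields MT = 8 sub-slabs of two planes each; in a real hybrid code the outer index would correspond to an MPI rank and the inner index to an OpenMP thread.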

Proceedings Article•

SWARM: A Parallel Programming Framework for Multicore Processors.

David A. Bader¹, Guojing Cong²•Institutions (2)

Georgia Institute of Technology¹, IBM²

1 Jan 2011

TL;DR: This paper introduces SWARM (software and algorithms for running on multi-core), a portable open-source parallel library of basic primitives that fully exploit multicore processors and proposes a computational model for these systems.

Abstract: Due to fundamental physical limitations and power constraints, we are witnessing a radical change in commodity microprocessor architectures to multicore designs. Continued performance on multicore processors now requires the exploitation of concurrency at the algorithmic level. In this paper, we identify key issues in algorithm design for multicore processors and propose a computational model for these systems. We introduce SWARM (software and algorithms for running on multi-core), a portable open-source parallel library of basic primitives that fully exploit multicore processors. Using this framework, we have implemented efficient parallel algorithms for important primitive operations such as prefix-sums, pointer-jumping, symmetry breaking, and list ranking; for combinatorial problems such as sorting and selection; for parallel graph theoretic algorithms such as spanning tree, minimum spanning tree, graph decomposition, and tree contraction; and for computational genomics applications such as maximum parsimony. The main contributions of this paper are the design of the SWARM multicore framework, the presentation of a multicore algorithmic model, and validation results for this model. SWARM is freely available as open-source from http://multicore-swarm.sourceforge.net/.
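As an illustration of one primitive named in this abstract, the sketch below shows a common three-phase blocked prefix-sums scheme in Python. It does not use or reproduce the SWARM API (SWARM is a separate library); the function name `blocked_prefix_sums` and the block-splitting rule are assumptions, and CPython threads are used only to show the structure, not to obtain a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def blocked_prefix_sums(values, num_workers=4):
    """Inclusive prefix sums computed block by block in three phases."""
    n = len(values)
    if n == 0:
        return []
    num_workers = max(1, min(num_workers, n))  # guarantees non-empty blocks
    bounds = [(n * w) // num_workers for w in range(num_workers + 1)]
    blocks = [values[bounds[w]:bounds[w + 1]] for w in range(num_workers)]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Phase 1: each worker scans its own block independently.
        local = list(pool.map(lambda blk: list(accumulate(blk)), blocks))
        # Phase 2: a short sequential exclusive scan over the block totals.
        totals = [blk[-1] for blk in local]
        offsets = [0] + list(accumulate(totals))[:-1]
        # Phase 3: each worker shifts its block by the sum of all earlier blocks.
        shifted = list(
            pool.map(lambda pair: [x + pair[1] for x in pair[0]], zip(local, offsets))
        )
    return [x for blk in shifted for x in blk]


result = blocked_prefix_sums([1, 2, 3, 4, 5, 6, 7], num_workers=3)
```

Only phase 2 is sequential, and its length equals the number of workers rather than the input size, which is why this pattern suits multicore processors.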

Journal Article•10.1016/J.PARCO.2011.05.001•

Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures

Lars Karlsson¹, Bo Kågström¹•Institutions (1)

Umeå University¹

1 Dec 2011

TL;DR: It is shown that a two-stage approach consisting of an intermediate reduction to block Hessenberg form speeds up the reduction by avoiding matrix-vector multiplications.

Abstract: We consider parallel reduction of a real matrix to Hessenberg form using orthogonal transformations. Standard Hessenberg reduction algorithms reduce the columns of the matrix from left to right in either a blocked or unblocked fashion. However, the standard blocked variant performs 20% of the computations in terms of matrix-vector multiplications. We show that a two-stage approach consisting of an intermediate reduction to block Hessenberg form speeds up the reduction by avoiding matrix-vector multiplications. We describe and evaluate a new high-performance implementation of the two-stage approach that attains significant speedups over the one-stage approach. The key components are a dynamically scheduled implementation of Stage 1 and a blocked, adaptively load-balanced implementation of Stage 2.

Proceedings Article•

Intel® Threading Building Blocks (TBB).

Arch D. Robison¹•Institutions (1)

Intel¹

1 Jan 2011

Journal Article•10.1016/J.PARCO.2011.05.007•

A parallel block LU decomposition method for distributed finite element matrices

Daniel Maurer¹, Christian Wieners¹•Institutions (1)

Karlsruhe Institute of Technology¹

1 Dec 2011

TL;DR: This work presents a new parallel direct linear solver for matrices resulting from finite element problems that follows the nested dissection approach, where the resulting Schur complements are also distributed in parallel.

Abstract: In this work we present a new parallel direct linear solver for matrices resulting from finite element problems. The algorithm follows the nested dissection approach, where the resulting Schur complements are also distributed in parallel. The sparsity structure of the finite element matrices is used to pre-compute an efficient block structure for the LU factors. We demonstrate the performance and the parallel scaling behavior by several test examples.
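The role of Schur complements in a nested dissection solver can be shown on a toy problem. The following Python sketch eliminates two independent subdomains of a 5-unknown 1-D Laplacian and solves the remaining separator system exactly. It is a generic single-level block elimination under assumed names (`solve`, `matmul`, `matsub`, `eliminate_subdomain`), not the distributed algorithm or block structure of the paper.

```python
from fractions import Fraction


def solve(A, B):
    """Solve A X = B exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    aug = [[Fraction(v) for v in A[i]] + [Fraction(v) for v in B[i]] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matsub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def eliminate_subdomain(A_ii, A_is, A_si, b_i):
    """Return one subdomain's contributions to the separator matrix and right-hand side."""
    return matmul(A_si, solve(A_ii, A_is)), matmul(A_si, solve(A_ii, b_i))


# 1-D Laplacian tridiag(-1, 2, -1) on 5 unknowns; unknown 2 is the separator,
# unknowns {0, 1} and {3, 4} are the two subdomains (an assumed toy problem).
A_11 = A_22 = [[2, -1], [-1, 2]]
A_1s, A_s1 = [[0], [-1]], [[0, -1]]
A_2s, A_s2 = [[-1], [0]], [[-1, 0]]
A_ss = [[2]]
b_1 = b_2 = [[1], [1]]
b_s = [[1]]

# The two eliminations do not depend on each other, so they could run in parallel.
S, g = A_ss, b_s
for A_ii, A_is, A_si, b_i in [(A_11, A_1s, A_s1, b_1), (A_22, A_2s, A_s2, b_2)]:
    dS, dg = eliminate_subdomain(A_ii, A_is, A_si, b_i)
    S, g = matsub(S, dS), matsub(g, dg)

x_s = solve(S, g)                                  # separator unknown
x_1 = solve(A_11, matsub(b_1, matmul(A_1s, x_s)))  # back-substitute into subdomain 1
x_2 = solve(A_22, matsub(b_2, matmul(A_2s, x_s)))  # back-substitute into subdomain 2
```

Here `S` is the Schur complement of the separator. In a multi-level nested dissection the same elimination is applied recursively, and the abstract notes that the resulting Schur complements are themselves distributed across processes.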

Journal Article•10.1016/J.PARCO.2010.11.001•

High-performance message-passing over generic Ethernet hardware with Open-MX

Brice Goglin¹•Institutions (1)

L'Abri¹

1 Feb 2011

TL;DR: This paper details how Open-MX copes with the inherent limitations of the Ethernet hardware to satisfy the requirements of message-passing by applying an innovative copy offload model, and achieves better performance than TCP implementations, especially on 10 gigabit/s hardware.

Abstract: In the last decade, cluster computing has become the most popular high-performance computing architecture. Although numerous technological innovations have been proposed to improve the interconnection of nodes, many clusters still rely on commodity Ethernet hardware to implement message-passing within parallel applications. We present Open-MX, an open-source message-passing stack over generic Ethernet. It offers the same abilities as the specialized Myrinet Express stack, without requiring dedicated support from the networking hardware. Open-MX works transparently in the most popular MPI implementations through its MX interface compatibility. It also enables interoperability between hosts running the specialized MX stack and generic Ethernet hosts. We detail how Open-MX copes with the inherent limitations of the Ethernet hardware to satisfy the requirements of message-passing by applying an innovative copy offload model. Combined with a careful tuning of the fabric and of the MX wire protocol, Open-MX achieves better performance than TCP implementations, especially on 10 gigabit/s hardware.

Journal Article•10.1016/J.PARCO.2011.04.002•

Parallelization of Nullspace Algorithm for the computation of metabolic pathways

Dimitrije Jevremovic¹, Cong T. Trinh¹,², Friedrich Srienc³, Carlos P. Sosa¹, Daniel Boley•Institutions (3)

University of Minnesota¹, University of Tennessee², Biotechnology Institute³

1 Jun 2011

TL;DR: This work develops a distributed memory parallelization of the Nullspace Algorithm to efficiently compute the elementary modes of a large metabolic network, and gives an implementation in C++ that uses MPI library functions for the parallel communication.

Abstract: Elementary mode analysis is a useful metabolic pathway analysis tool in understanding and analyzing cellular metabolism, since elementary modes can represent metabolic pathways with unique and minimal sets of enzyme-catalyzed reactions of a metabolic network under steady state conditions. However, computation of the elementary modes of a genome-scale metabolic network with 100-1000 reactions is very expensive and sometimes not feasible with the commonly used serial Nullspace Algorithm. In this work, we develop a distributed memory parallelization of the Nullspace Algorithm to efficiently handle the computation of the elementary modes of a large metabolic network. We give an implementation in C++ that uses MPI library functions for the parallel communication. Our proposed algorithm is accompanied by an analysis of the complexity and identification of major bottlenecks during computation of all possible pathways of a large metabolic network. The algorithm includes methods to achieve load balancing among the compute nodes and specific communication patterns to reduce the communication overhead and improve efficiency.
