Top 126 papers presented at Parallel Computing in 2012

Journal Article•10.1016/J.PARCO.2011.08.005•

Parallel reactive molecular dynamics: Numerical methods and algorithmic techniques

Hasan Metin Aktulga¹, Joseph C. Fogarty², Sagar A. Pandit², Ananth Grama¹•Institutions (2)

Purdue University¹, University of South Florida²

1 Apr 2012

TL;DR: PuReMD is presented, which extends current spatio-temporal simulation capability for reactive atomistic systems by over an order of magnitude and incorporates efficient dynamic data structures, algorithmic optimizations, and effective solvers to deliver low per-time-step simulation time, with a small memory footprint.

Abstract: Molecular dynamics modeling has provided a powerful tool for simulating and understanding diverse systems - ranging from materials processes to biophysical phenomena. Parallel formulations of these methods have been shown to be among the most scalable scientific computing applications. Many instances of this class of methods rely on a static bond structure for molecules, rendering them infeasible for reactive systems. Recent work on reactive force fields has resulted in the development of ReaxFF, a novel bond order potential that bridges quantum-scale and classical MD approaches by explicitly modeling bond activity (reactions) and charge equilibration. These aspects of ReaxFF pose significant challenges from a computational standpoint, both in sequential and parallel contexts. Evolving bond structure requires efficient dynamic data structures. Minimizing electrostatic energy through charge equilibration requires the solution of a large sparse linear system with a shielded electrostatic kernel at each sub-femtosecond long time-step. In this context, reaching spatio-temporal scales of tens of nanometers and nanoseconds, where phenomena of interest can be observed, poses significant challenges. In this paper, we present the design and implementation details of the Purdue Reactive Molecular Dynamics code, PuReMD. PuReMD has been demonstrated to be highly efficient (in terms of processor performance) and scalable. It extends current spatio-temporal simulation capability for reactive atomistic systems by over an order of magnitude. It incorporates efficient dynamic data structures, algorithmic optimizations, and effective solvers to deliver low per-time-step simulation time, with a small memory footprint. PuReMD is comprehensively validated for performance and accuracy on up to 3375 cores on a commodity cluster (Hera at LLNL-OCF). Potential performance bottlenecks to scalability beyond our experiments have also been analyzed. 
PuReMD is available in the public domain and has been used to model diverse systems, including strain relaxation in Si-Ge nanobars, water-silica surface interaction, and oxidative stress in lipid bilayers (bio-membranes).

955 citations

Journal Article•10.1016/J.PARCO.2011.09.001•

PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation

Andreas Klöckner¹, Nicolas Pinto², Yunsup Lee³, Bryan Catanzaro³, Paul Ivanov³, Ahmed Fasih⁴•Institutions (4)

Courant Institute of Mathematical Sciences¹, McGovern Institute for Brain Research², University of California, Berkeley³, Ohio State University⁴

1 Mar 2012

TL;DR: This article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems.

Abstract: High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless, it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.
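
The idea of run-time code generation can be illustrated without a GPU. The following is a minimal CPU-only sketch in plain Python, not the PyCUDA or PyOpenCL API: the hypothetical helper `make_poly` builds the source of a function specialized to a fixed set of polynomial coefficients, compiles it at run time, and returns the resulting callable. In the toolkits described above, the generated string would be GPU kernel source handed to the GPU compiler instead.

```python
def make_poly(coeffs):
    """Generate and compile a function evaluating a fixed polynomial.

    coeffs lists the coefficients from the highest degree down.
    The name and interface are illustrative assumptions.
    """
    # Build a fully unrolled Horner expression as source text.
    expr = repr(float(coeffs[0]))
    for c in coeffs[1:]:
        expr = f"({expr}) * x + {float(c)!r}"
    src = f"def poly(x):\n    return {expr}\n"

    # Compile the generated source at run time and extract the function.
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["poly"], src


# Specialize for 2*x**2 + 1 and evaluate it.
poly, source = make_poly([2, 0, 1])
value = poly(3.0)
```

Because the coefficients are baked into the generated source, the compiled function contains no loop and no coefficient lookups, which is the kind of specialization RTCG makes cheap to express.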

676 citations

Proceedings Article•10.1109/INPAR.2012.6339595•

Auto-tuning a high-level language targeted to GPU codes

Scott Grauer-Gray¹, Lifan Xu¹, Robert Searles¹, Sudhee Ayalasomayajula¹, John Cavazos¹•Institutions (1)

University of Delaware¹

13 May 2012

TL;DR: This work performs auto-tuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and shows results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision.

Abstract: Determining the best set of optimizations to apply to a kernel to be executed on the graphics processing unit (GPU) is a challenging problem. There are large sets of possible optimization configurations that can be applied, and many applications have multiple kernels. Each kernel may require a specific configuration to achieve the best performance, and moving an application to new hardware often requires a new optimization configuration for each kernel. In this work, we apply optimizations to GPU code using HMPP, a high-level directive-based language and source-to-source compiler that can generate CUDA / OpenCL code. However, programming with high-level languages may mean a loss of performance compared to using low-level languages. Our work shows that it is possible to improve the performance of a high-level language by using auto-tuning. We perform auto-tuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and show results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision. The results show that our auto-tuned HMPP-generated implementations are significantly faster than the default HMPP implementation and can meet or exceed the performance of manually coded CUDA / OpenCL implementations.
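
A rough sketch of the auto-tuning loop described above, assuming an exhaustive search and a caller-supplied cost function. The names `autotune`, `space`, and `cost` are hypothetical and unrelated to HMPP; in a real setting `cost` would generate, compile, and time a kernel variant, whereas here it is a synthetic stand-in.

```python
import itertools


def autotune(space, cost):
    """Try every configuration in `space` and return the cheapest one.

    space: dict mapping a parameter name to its candidate values.
    cost:  callable taking a configuration dict and returning a number
           (for example a measured run time).
    """
    names = list(space)
    best_cfg, best_cost = None, float("inf")
    for values in itertools.product(*(space[n] for n in names)):
        cfg = dict(zip(names, values))
        c = cost(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost


# Synthetic cost model standing in for a timed kernel run (assumption).
def fake_cost(cfg):
    return abs(cfg["tile"] - 32) + abs(cfg["unroll"] - 4)


search_space = {"tile": [8, 16, 32, 64], "unroll": [1, 2, 4, 8]}
best, best_time = autotune(search_space, fake_cost)
```

Exhaustive search is only practical for small spaces; larger spaces usually call for sampling or a search heuristic.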

510 citations

Journal Article•10.1016/J.PARCO.2011.10.003•

DAGuE: A generic distributed DAG engine for High Performance Computing

George Bosilca¹, Aurelien Bouteiller¹, Anthony Danalis¹, Thomas Herault¹, Pierre Lemarinier², Jack Dongarra³•Institutions (3)

University of Tennessee¹, University of Rennes², Oak Ridge National Laboratory³

1 Jan 2012

TL;DR: DAGuE is presented, a generic framework for architecture aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures and uses a dynamic, fully-distributed scheduler based on cache awareness, data-locality and task priority.

Abstract: The frenetic development of the current architectures places a strain on the current state-of-the-art programming environments. Harnessing the full potential of such architectures is a tremendous task for the whole scientific computing community. We present DAGuE, a generic framework for architecture aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. Applications we consider can be expressed as a Directed Acyclic Graph of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size independent format that can be queried on-demand to discover data dependencies, in a totally distributed fashion. DAGuE assigns computation threads to the cores, overlaps communications and computations and uses a dynamic, fully-distributed scheduler based on cache awareness, data-locality and task priority. We demonstrate the efficiency of our approach, using several micro-benchmarks to analyze the performance of different components of the framework, and a linear algebra factorization as a use case.
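
To make the notion of priority-driven DAG execution concrete, here is a small serial sketch. It is not DAGuE's scheduler: it ignores distribution, cache awareness, and data locality, and the function name `run_dag` and its arguments are illustrative assumptions. It only shows how a ready queue ordered by task priority releases tasks once their dependencies complete.

```python
import heapq


def run_dag(tasks, deps, priority):
    """Return an execution order that respects dependencies.

    tasks:    list of task names.
    deps:     dict mapping a task to the set of tasks it depends on.
    priority: dict mapping a task to a number (higher runs first).
    """
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    successors = {t: [] for t in tasks}
    for t, prerequisites in deps.items():
        for p in prerequisites:
            successors[p].append(t)

    # heapq is a min-heap, so priorities are negated.
    ready = [(-priority[t], t) for t in tasks if remaining[t] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, t = heapq.heappop(ready)
        order.append(t)  # "execute" the task
        for s in successors[t]:
            remaining[s] -= 1
            if remaining[s] == 0:
                heapq.heappush(ready, (-priority[s], s))

    if len(order) != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return order


order = run_dag(
    ["a", "b", "c", "d"],
    {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}},
    {"a": 0, "b": 1, "c": 2, "d": 0},
)
```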

291 citations

Proceedings Article•10.1109/INPAR.2012.6339596•

A study of Persistent Threads style GPU programming for GPGPU workloads

Kshitij Gupta¹, Jeff A. Stuart¹, John D. Owens¹•Institutions (1)

University of California, Davis¹

13 May 2012

TL;DR: Through micro-kernel benchmarks, it is shown the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases.

Abstract: In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and identify four primary use cases it could be useful in addressing—CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.

264 citations

Proceedings Article•10.1109/INPAR.2012.6339601•

ispc: A SPMD compiler for high-performance CPU programming

Matt Pharr¹, William R. Mark¹•Institutions (1)

Intel¹

13 May 2012

TL;DR: A compiler, the Intel® SPMD Program Compiler (ispc), is developed that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units.

Abstract: SIMD parallelism has become an increasingly important mechanism for delivering performance in modern CPUs, due to its power efficiency and relatively low cost in die area compared to other forms of parallelism. Unfortunately, languages and compilers for CPUs have not kept up with the hardware's capabilities. Existing CPU parallel programming models focus primarily on multi-core parallelism, neglecting the substantial computational capabilities that are available in CPU SIMD vector units. GPU-oriented languages like OpenCL support SIMD but lack capabilities needed to achieve maximum efficiency on CPUs and suffer from GPU-driven constraints that impair ease of use on CPUs. We have developed a compiler, the Intel® SPMD Program Compiler (ispc), that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units. ispc draws from GPU programming languages, which have shown that for many applications the easiest way to program SIMD units is to use a single-program, multiple-data (SPMD) model, with each instance of the program mapped to one SIMD lane. We discuss language features that make ispc easy to adopt and use productively with existing software systems and show that ispc delivers up to 35× speedups on a 4-core system and up to 240× speedups on a 40-core system for complex workloads (compared to serial C++ code).

252 citations

Proceedings Article•10.1109/INPAR.2012.6339594•

OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures

Gihan R. Mudalige¹, Michael B. Giles¹, Istvan Z. Reguly², Carlo Bertolli³, Paul H. J. Kelly³•Institutions (3)

University of Oxford¹, Pázmány Péter Catholic University², Imperial College London³

13 May 2012

TL;DR: It is demonstrated that an application written once at a high-level using the OP2 API can be easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.

Abstract: OP2 is an “active” library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 library, and investigate its capabilities in achieving performance portability, near-optimal performance, and scaling on modern multi-core and many-core processor based systems. A key feature of this work is OP2's recent extension facilitating the development and execution of applications on a distributed memory cluster of GPUs. We discuss the main design issues in parallelizing unstructured mesh based applications on heterogeneous platforms. These include handling data dependencies in accessing indirectly referenced data, the impact of unstructured mesh data layouts (array of structs vs. struct of arrays) and design considerations in generating code for execution on a cluster of GPUs. A representative CFD application written using the OP2 framework is utilized to provide a contrasting benchmarking and performance analysis study on a range of multi-core/many-core systems. These include multi-core CPUs from Intel (Westmere and Sandy Bridge) and AMD (Magny-Cours), GPUs from NVIDIA (GTX560Ti, Tesla C2070), a distributed memory CPU cluster (Cray XE6) and a distributed memory GPU cluster (Tesla C2050 GPUs with InfiniBand). OP2's design choices are explored with quantitative insights into their contributions to performance. We demonstrate that an application written once at a high-level using the OP2 API can be easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.

115 citations

Journal Article•10.1016/J.PARCO.2012.07.001•

Graph coloring algorithms for multi-core and massively multithreaded architectures

Ümit V. Çatalyürek¹, John Feo², Assefaw H. Gebremedhin³, Mahantesh Halappanavar², Alex Pothen³•Institutions (3)

Ohio State University¹, Pacific Northwest National Laboratory², Purdue University³

1 Oct 2012

TL;DR: In this paper, the authors explore the interplay between architectures and algorithm design in the context of shared-memory platforms and a specific graph problem of central importance in scientific and high-performance computing, distance-1 graph coloring.

Abstract: We explore the interplay between architectures and algorithm design in the context of shared-memory platforms and a specific graph problem of central importance in scientific and high-performance computing, distance-1 graph coloring. We introduce two different kinds of multithreaded heuristic algorithms for the stated, NP-hard, problem. The first algorithm relies on speculation and iteration, and is suitable for any shared-memory system. The second algorithm uses dataflow principles, and is targeted at the non-conventional, massively multithreaded Cray XMT system. We study the performance of the algorithms on the Cray XMT and two multi-core systems, Sun Niagara 2 and Intel Nehalem. Together, the three systems represent a spectrum of multithreading capabilities and memory structure. As testbed, we use synthetically generated large-scale graphs carefully chosen to cover a wide range of input types. The results show that the algorithms have scalable runtime performance and use nearly the same number of colors as the underlying serial algorithm, which in turn is effective in practice. The study provides insight into the design of high performance algorithms for irregular problems on many-core architectures.
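
The speculation-and-iteration scheme mentioned above can be sketched serially. This is an illustrative simulation under stated assumptions, not the authors' implementation: each round colors all pending vertices against a snapshot of the colors (mimicking threads that do not see each other's choices), then detects conflicts and re-queues the higher-numbered endpoint of each conflicting edge. The function name `speculative_coloring` and the adjacency-dict input are assumptions.

```python
def speculative_coloring(adj):
    """Distance-1 coloring by rounds of tentative coloring and repair.

    adj: dict mapping each integer vertex to a list of its neighbours
         (undirected graph, no self loops).
    Returns (color dict, number of rounds).
    """
    color = {v: None for v in adj}
    pending = sorted(adj)
    rounds = 0
    while pending:
        rounds += 1
        # Snapshot stands in for the stale view concurrent threads have.
        snapshot = dict(color)

        # Phase 1: every pending vertex picks the smallest color not
        # used by its neighbours in the snapshot.
        for v in pending:
            forbidden = {snapshot[u] for u in adj[v]}
            c = 0
            while c in forbidden:
                c += 1
            color[v] = c

        # Phase 2: detect conflicts created in this round; the vertex
        # with the larger id is colored again in the next round.
        in_round = set(pending)
        pending = [
            v for v in pending
            if any(u in in_round and u < v and color[u] == color[v]
                   for u in adj[v])
        ]
    return color, rounds


graph = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2], 4: [0]}
coloring, n_rounds = speculative_coloring(graph)
```

The lowest-numbered pending vertex is never re-queued, so each round finalizes at least one vertex and the loop terminates.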

81 citations

Proceedings Article•10.1109/INPAR.2012.6339609•

VOCL: An optimized environment for transparent virtualization of graphics processing units

Shucai Xiao¹, Pavan Balaji², Qian Zhu³, Rajeev Thakur², Susan Coghlan², Heshan Lin¹, Gaojin Wen⁴, Jue Hong⁴, Wu-chun Feng¹•Institutions (4)

Virginia Tech¹, Argonne National Laboratory², Accenture³, Chinese Academy of Sciences⁴

13 May 2012

TL;DR: A virtual OpenCL (VOCL) framework is proposed to support the transparent utilization of local or remote GPUs, which exposes physical GPUs as decoupled virtual resources that can be transparently managed independent of the application execution.

Abstract: Graphics processing units (GPUs) have been widely used for general-purpose computation acceleration. However, current programming models such as CUDA and OpenCL can support GPUs only on the local computing node, where the application execution is tightly coupled to the physical GPU hardware. In this work, we propose a virtual OpenCL (VOCL) framework to support the transparent utilization of local or remote GPUs. This framework, based on the OpenCL programming model, exposes physical GPUs as decoupled virtual resources that can be transparently managed independent of the application execution. The proposed framework requires no source code modifications. We also propose various strategies for reducing the overhead caused by data communication and kernel launching and demonstrate about 85% of the data write bandwidth and 90% of the data read bandwidth compared to data write and read, respectively, in a native nonvirtualized environment. We evaluate the performance of VOCL using four real-world applications with various computation and memory access intensities and demonstrate that compute-intensive applications can execute with negligible overhead in the VOCL environment.

80 citations

Proceedings Article•10.1109/INPAR.2012.6339604•

ScatterAlloc: Massively parallel dynamic memory allocation for the GPU

Markus Steinberger¹, Michael Kenzel¹, Bernhard Kainz¹, Dieter Schmalstieg¹•Institutions (1)

Graz University of Technology¹

13 May 2012

TL;DR: This paper analyzes the special requirements of a dynamic memory allocator that is designed for massively parallel architectures such as Graphics Processing Units (GPUs) and presents the thorough design of ScatterAlloc, which can efficiently deal with hundreds of requests in parallel.

Abstract: In this paper, we analyze the special requirements of a dynamic memory allocator that is designed for massively parallel architectures such as Graphics Processing Units (GPUs). We show that traditional strategies, which work well on CPUs, are not well suited for use on GPUs and present the thorough design of ScatterAlloc, which can efficiently deal with hundreds of requests in parallel. Our allocator greatly reduces collisions and congestion by scattering memory requests based on hashing. We analyze ScatterAlloc in terms of allocation speed, data access time and fragmentation, and compare it to current state-of-the-art allocators, including the one provided with the NVIDIA CUDA toolkit. Our results show that ScatterAlloc clearly outperforms these other approaches, yielding speed-ups between 10 and 100.

71 citations

Journal Article•10.1016/J.PARCO.2012.08.001•

Parallel job scheduling for power constrained HPC systems

M. Etinski¹, Julita Corbalan¹, Jesús Labarta¹, Mateo Valero¹•Institutions (1)

Polytechnic University of Catalonia¹

1 Dec 2012

TL;DR: MaxJobPerf, a new parallel job scheduling policy based on integer linear programming, is compared against other power-budgeting policies for different power budgets and clearly outperforms them at the parallel job scheduling level.

Abstract: Power has become the primary constraint in high performance computing. Traditionally, parallel job scheduling policies have been designed to improve certain job performance metrics when scheduling parallel workloads on a system with a given number of processors. The available number of processors is no longer the only limitation in parallel job scheduling. The recent increase in processor power consumption has resulted in a new limitation: the available power. These constraints naturally lead to an optimization problem. We propose MaxJobPerf, a new parallel job scheduling policy based on integer linear programming. Dynamic Voltage Frequency Scaling (DVFS) is a widely used technique that, by running applications at reduced CPU frequency/voltage, trades increased execution time for power reduction. The optimization problem determines which jobs should run and at which frequency. In this paper, we compare the MaxJobPerf policy against other power budgeting policies for different power budgets. It clearly outperforms the other power-budgeting approaches at the parallel job scheduling level. Furthermore, we give a detailed analysis of the policy parameters including a discussion on how to manage job reservations to avoid job starvation.

Journal Article•10.1016/J.PARCO.2012.07.002•

Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs

Mickeal Verschoor¹, Andrei C. Jalba¹•Institutions (1)

Eindhoven University of Technology¹

1 Oct 2012

TL;DR: It is shown that reordering matrix blocks substantially improves the performance of the SpMV operation, especially when small blocks are used, so that the method outperforms existing state-of-the-art approaches, in most cases.

Abstract: The Conjugate Gradient (CG) method is a widely-used iterative method for solving linear systems described by a (sparse) matrix. The method requires a large number of Sparse-Matrix Vector (SpMV) multiplications, vector reductions and other vector operations to be performed. We present a number of mappings for the SpMV operation on modern programmable GPUs using the Block Compressed Sparse Row (BCSR) format. Further, we show that reordering matrix blocks substantially improves the performance of the SpMV operation, especially when small blocks are used, so that our method outperforms existing state-of-the-art approaches, in most cases. Finally, a thorough analysis of the performance of both SpMV and CG methods is performed, which allows us to model and estimate the expected maximum performance for a given (unseen) problem.
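
For readers unfamiliar with the method, the textbook CG iteration can be sketched as follows. This is a plain serial Python version assuming a symmetric positive definite matrix, not the paper's multi-GPU BCSR implementation. The matrix is passed as a `matvec` callback (an assumed interface) so that the one matrix-vector product per iteration, alongside the dot-product reductions and vector updates, is easy to spot.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    matvec: callable returning A @ v for a list v.
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0
    p = list(r)            # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)                      # the SpMV step
        alpha = rs_old / dot(p, Ap)         # vector reduction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)                  # vector reduction
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small dense example used only for illustration.
A = [[4.0, 1.0], [1.0, 3.0]]
solution = conjugate_gradient(
    lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A],
    [1.0, 2.0],
)
```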

Proceedings Article•10.1109/INPAR.2012.6339602•

Efficient sparse matrix-vector multiplication on cache-based GPUs

Istvan Reguly¹, Michael B. Giles²•Institutions (2)

Pázmány Péter Catholic University¹, University of Oxford²

13 May 2012

TL;DR: This paper discusses efficient implementations of sparse matrix-vector multiplication on NVIDIA's Fermi architecture, the first to introduce conventional L1 caches to GPUs, and focuses on the compressed sparse row (CSR) format for developing general purpose code.

Abstract: Sparse matrix-vector multiplication is an integral part of many scientific algorithms. Several studies have shown that it is a bandwidth-limited operation on current hardware. On cache-based architectures the main factors that influence performance are spatial locality in accessing the matrix, and temporal locality in re-using the elements of the vector.
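
The two locality effects named above are visible in a reference CSR kernel. The sketch below is a serial Python version for illustration only, with assumed array names `row_ptr`, `col_idx`, and `values`: the matrix arrays are read sequentially (spatial locality), while the vector `x` is accessed indirectly through `col_idx`, so its reuse depends on the sparsity pattern (temporal locality).

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A x for a matrix stored in compressed sparse row form.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i within
    col_idx (column indices) and values (nonzero entries).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # values and col_idx are streamed; x is gathered indirectly.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# The 3x3 matrix [[10, 0, 2], [0, 3, 0], [1, 0, 4]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [10.0, 2.0, 3.0, 1.0, 4.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```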

Book Chapter•10.1007/978-3-642-30397-5_10•

A GPU algorithm for greedy graph matching

Bas O. Fagginger Auer¹, Rob H. Bisseling¹•Institutions (1)

Utrecht University¹

1 Jan 2012

TL;DR: A fine-grained shared-memory parallel algorithm for maximal greedy matching is introduced, together with an implementation on the GPU, which is faster (speedups up to 6.8 for random matching and 5.6 for weighted matching) than the serial CPU algorithms.

Abstract: Greedy graph matching provides us with a fast way to coarsen a graph during graph partitioning. Direct algorithms on the CPU which perform such greedy matchings are simple and fast, but offer few handholds for parallelisation. To remedy this, we introduce a fine-grained shared-memory parallel algorithm for maximal greedy matching, together with an implementation on the GPU, which is faster (speedups up to 6.8 for random matching and 5.6 for weighted matching) than the serial CPU algorithms and produces matchings of similar (random matching) or better (weighted matching) quality.
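
For context, a serial greedy weighted matching of the kind the abstract calls simple and fast can be sketched as below. This is the CPU-style baseline under the assumption of an edge list input, not the fine-grained GPU algorithm introduced in the chapter, and the function name `greedy_matching` is hypothetical.

```python
def greedy_matching(edges):
    """Return a maximal matching chosen greedily by descending weight.

    edges: iterable of (u, v, weight) tuples for an undirected graph.
    """
    matched = set()
    matching = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        # Take the edge only if both endpoints are still free.
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v, w))
    return matching


path = [(0, 1, 1.0), (1, 2, 5.0), (2, 3, 1.0), (3, 4, 4.0)]
result = greedy_matching(path)
```

The loop is inherently sequential because each decision depends on the set of already matched vertices, which is the obstacle to parallelisation that the abstract refers to.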

Journal Article•10.1016/J.PARCO.2012.03.001•

Compressed sensing and Cholesky decomposition on FPGAs and GPUs

Depeng Yang¹, Gregory D. Peterson¹, Husheng Li¹•Institutions (1)

University of Tennessee¹

1 Aug 2012

TL;DR: Results show that the proposed Cholesky decomposition implementations on FPGAs and GPUs are much faster than LAPACK and MAGMA for small matrices, and that the FPGA implementation of CS signal reconstruction achieves around a 15x speedup.

Abstract: Compressed sensing (CS) is a revolutionary signal acquisition theory, enabling signal acquisition at a rate that is below the Nyquist sampling rate. However, CS signal reconstruction algorithms are computationally expensive. One of the key computation steps in CS algorithms is to iteratively compute a Cholesky decomposition. Modern application acceleration devices, such as FPGAs and GPUs, can accelerate Cholesky decomposition and CS signal reconstruction computation. This paper presents high performance parallel Cholesky decomposition algorithms for GPU and FPGA implementation. For GPUs, an optimized Cholesky decomposition algorithm is developed with high parallelism, reduced data copying, and improved memory access. For FPGAs, a dedicated pipelined hardware architecture for Cholesky decomposition is designed. Only one pipelined triangular linear equation solver is needed for solving Cholesky decomposition and Cholesky decomposition-based linear equation systems. Moreover, CS signal reconstruction algorithms are accelerated on GPUs and FPGAs for fast signal recovery based on our iterative Cholesky decomposition. Results show that the proposed Cholesky decomposition implementations on FPGAs and GPUs are much faster than LAPACK and MAGMA for small matrices. For accelerating CS signal reconstruction algorithms, our FPGA implementation can achieve around 15x speedup and our GPU implementation can achieve about a 38x speedup compared with the CPU using LAPACK and the hybrid CPU/GPU system with MAGMA.
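
As a reminder of what is being accelerated, the standard Cholesky factorization A = L Lᵀ can be written in a few lines. The sketch below is the textbook serial algorithm in plain Python, assuming a symmetric positive definite input, and does not reflect the paper's GPU or FPGA designs.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L * L^T.

    a: symmetric positive definite matrix as a list of rows.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Inner product of the already computed parts of rows i and j.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


factor = cholesky([[4.0, 12.0, -16.0],
                   [12.0, 37.0, -43.0],
                   [-16.0, -43.0, 98.0]])
```

Each entry depends on entries to its left and above, which is why parallel versions restructure the computation column by column or block by block.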

Journal Article•10.1016/J.PARCO.2011.10.008•

Using explicit platform descriptions to support programming of heterogeneous many-core systems

Martin Sandrieser¹, Siegfried Benkner¹, Sabri Pllana¹•Institutions (1)

University of Vienna¹

1 Jan 2012

TL;DR: A platform description language (PDL) is proposed that captures key architectural patterns of commonly used heterogeneous computing systems, together with a prototype source-to-source compilation framework that utilizes PDL descriptors to transform sequential task-based programs with source code annotations into a form that is convenient for execution on heterogeneous many-core systems.

Abstract: Heterogeneous many-core systems constitute a viable approach for coping with power constraints in modern computer architectures and can now be found across the whole computing landscape ranging from mobile devices, to desktop systems and servers, all the way to high-end supercomputers and large-scale data centers. While these systems promise to offer superior performance-power ratios, programming heterogeneous many-core architectures efficiently has been shown to be notoriously difficult. Programmers typically are forced to take into account a plethora of low-level architectural details and usually have to resort to a combination of different programming models within a single application. In this paper we propose a platform description language (PDL) that captures key architectural patterns of commonly used heterogeneous computing systems. PDL architecture descriptions support both programmers and toolchains by providing platform-specific information in a well-defined and explicit manner. We have developed a prototype source-to-source compilation framework that utilizes PDL descriptors to transform sequential task-based programs with source code annotations into a form that is convenient for execution on heterogeneous many-core systems. Our framework relies on a component-based approach that accommodates different implementation variants of tasks, customized for different parts of a heterogeneous platform, and utilizes an advanced runtime system for exploiting parallelism through dynamic task scheduling. We show various usage scenarios of our PDL and demonstrate the effectiveness of our framework for a commonly used scientific kernel and a financial application on different configurations of a state-of-the-art CPU/GPU system.

Journal Article•10.1016/J.PARCO.2012.05.005•

OpenMP parallelism for fluid and fluid-particulate systems

Amit Amritkar¹, Danesh K. Tafti¹, Rui Liu², Rick Kufrin², Barbara Chapman³•Institutions (3)

Virginia Tech¹, University of Illinois at Urbana–Champaign², University of Houston³

1 Sep 2012

TL;DR: It is shown through weak and strong scaling studies that OpenMP performance can be made to match that of MPI on the SGI Altix systems for up to 256 cores.

Abstract: In order to exploit the flexibility of OpenMP in parallelizing large scale multi-physics applications where different modes of parallelism are needed for efficient computation, it is first necessary to be able to scale OpenMP codes as well as MPI on large core counts. In this research we have implemented fine grained OpenMP parallelism for a large CFD code GenIDLEST and investigated the performance from 1 to 256 cores using a variety of performance optimization and measurement tools. It is shown through weak and strong scaling studies that OpenMP performance can be made to match that of MPI on the SGI Altix systems for up to 256 cores. Data placement and locality were established to be key components in obtaining good scalability with OpenMP. It is also shown that a hybrid implementation on a dual core system gives the same performance as standalone MPI or OpenMP. Finally, it is shown that in irregular multi-physics applications which do not adhere solely to the SPMD (Single Process, Multiple Data) mode of computation, as encountered in tightly coupled fluid-particulate systems, the flexibility of OpenMP can have a big performance advantage over MPI.

Book Chapter•10.1007/978-3-642-36803-5_14•

Use of direct solvers in TFETI massively parallel implementation

Vaclav Hapla¹, David Horák¹, Michal Merta¹•Institutions (1)

University of Ostrava¹

10 Jun 2012

TL;DR: The direct solvers available in PETSc on the Cray XE6 machine HECToR (PETSc, MUMPS, SuperLU) are compared regarding their performance in the two most time consuming actions in TFETI: the pseudoinverse application and the coarse problem solution.

Abstract: The FETI methods blend iterative and direct solvers. The dual problem is solved iteratively using, e.g., the CG method; in each iteration, the auxiliary problems related to the application of an unassembled system matrix (subdomain problems' solutions and projector application in the dual operator) are solved directly. The paper compares the direct solvers available in PETSc on the Cray XE6 machine HECToR (PETSc, MUMPS, SuperLU) regarding their performance in the two most time-consuming actions in TFETI: the pseudoinverse application and the coarse problem solution. For the numerical experiments, our novel TFETI implementation in the FLLOP (FETI Light Layer on top of PETSc) library was used.

Book Chapter•10.1007/978-3-642-36803-5_1•

Computational physics on graphics processing units

Ari Harju¹, Topi Siro¹, Filippo Federici Canova², Samuli Hakala¹, Teemu Rantalaiho³•Institutions (3)

Aalto University¹, Tampere University of Technology², Helsinki Institute of Physics³

10 Jun 2012

TL;DR: In this paper, the authors discuss advances made in the field of computational physics, focusing on classical molecular dynamics and quantum simulations for electronic structure calculations using the density functional theory, wave function techniques and quantum field theory.

Abstract: The use of graphics processing units for scientific computations is an emerging strategy that can significantly speed up various algorithms. In this review, we discuss advances made in the field of computational physics, focusing on classical molecular dynamics and quantum simulations for electronic structure calculations using the density functional theory, wave function techniques and quantum field theory.

Journal Article•10.1016/J.PARCO.2012.02.001•

Cluster-based optimized parallel video transcoding

Gerassimos Barlas¹•Institutions (1)

American University of Sharjah¹

1 Apr 2012

TL;DR: An analytical approach to the optimization of a large collection of parallel transcoding techniques based on temporal partitioning is pursued, and closed-form solutions to the partitioning/scheduling problem are derived for the most important of these methods under CBR input media conditions.

Abstract: Video transcoding is a popular technique for delivering video content of varying quality and size to diverse audiences. In this paper, an analytical approach to the optimization of a large collection of parallel transcoding techniques based on temporal partitioning is pursued. The key elements in the design of such techniques are identified, allowing them to be enumerated and classified. Closed-form solutions to the partitioning/scheduling problem (and optimum operation sequencing where necessary) are derived for the most important of these methods, under CBR input media conditions. Subsequently, appropriate heuristics allow the solution of the partitioning problem under VBR input media conditions. The paper concludes with an extensive battery of tests for the most significant strategies on several feature-length video streams. The tests not only reveal how one of the proposed strategies, namely NPWF_VBR, strikes a nice balance between efficiency and distortion minimization on heterogeneous platforms, but also allow us to derive guidelines for transcoding solution deployment.

Book Chapter•10.1007/978-3-642-36803-5_13•

Distributed evolutionary computing system based on web browsers with JavaScript

Jerzy Duda¹, Wojciech Dlubacz¹•Institutions (1)

AGH University of Science and Technology¹

10 Jun 2012

TL;DR: The paper presents a distributed computing system that is based on evolutionary algorithms and utilizes a web browser on the client side, and shows that the system scales quite smoothly, taking additional advantage of a local search algorithm executed by some clients.

Abstract: The paper presents a distributed computing system that is based on evolutionary algorithms and utilizes a web browser on the client side. The evolutionary algorithm is coded in JavaScript and embedded in a web page sent to the client. The code is optimized with regard to memory usage and communication efficiency between the server and the clients. The server side is also based on JavaScript, as a node.js server was used. The proposed system has been tested on the permutation flowshop scheduling problem, one of the most popular optimization benchmarks for heuristics studied in the literature. The results show that the system scales quite smoothly, taking additional advantage of a local search algorithm executed by some clients.
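
For readers unfamiliar with the benchmark, the objective usually minimized in permutation flowshop scheduling is the makespan: every job visits the machines in the same order, and every machine processes the jobs in the same sequence. The sketch below is a minimal Python illustration of that standard recurrence, not the authors' JavaScript code; the function name `makespan` and the sample processing times are assumptions made for the example.

```python
def makespan(order, proc_times):
    """Completion time of the last job on the last machine.

    order      -- job indices in processing sequence (a permutation)
    proc_times -- proc_times[job][machine] = processing time (hypothetical data)
    """
    machines = len(proc_times[0])
    # finish[m] = time at which machine m completes its most recent job
    finish = [0] * machines
    for job in order:
        for m in range(machines):
            # The job can start on machine m once that machine is free and
            # the job itself has left machine m - 1.
            ready = finish[m - 1] if m > 0 else 0
            finish[m] = max(finish[m], ready) + proc_times[job][m]
    return finish[-1]


# Hypothetical instance: 2 jobs, 2 machines.
times = [[3, 2], [1, 4]]
print(makespan([0, 1], times))  # 9
print(makespan([1, 0], times))  # 7
```

An evolutionary algorithm for this benchmark would typically treat `order` as the individual and use the makespan as the fitness to minimize.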

Journal Article•10.1016/J.PARCO.2012.03.005•

Improving performance of adaptive component-based dataflow middleware

Timothy D. R. Hartley¹, Erik Saule¹, Ümit V. Çatalyürek¹•Institutions (1)

Ohio State University¹

1 Jun 2012

TL;DR: This work presents a novel framework, with which developers can easily create high-performance dataflow applications, without the tedious tuning process, and shows that this approach achieves good performance for a wide range of applications, with a much-reduced development cost.

Abstract: Making the best use of modern computational resources for distributed applications requires expert knowledge of low-level programming tools, or a productive high-level and high-performance programming framework. Unfortunately, even state-of-the-art high-level frameworks still require the developer to conduct a tedious manual tuning step to find the work partitioning which gives the best application execution performance. Here, we present a novel framework, with which developers can easily create high-performance dataflow applications, without the tedious tuning process. We compare the performance of our approach to that of three distributed programming frameworks which differ significantly in their programming paradigm, their support for multi-core CPUs and accelerators, and their load-balancing approach. These three frameworks are DataCutter, a component-based dataflow framework, KAAPI, a framework using asynchronous function calls, and MR-MPI, a MapReduce implementation. By highly optimizing the implementations of three applications on the four frameworks and comparing the execution time performance of the runtime engines, we show their strengths and weaknesses. We show that our approach achieves good performance for a wide range of applications, with a much-reduced development cost.

Proceedings Article•10.1109/INPAR.2012.6339610•

GPU accelerated nonlinear optimization in radio interferometric calibration

Sarod Yatawatta¹, S. Kazemi², Saleem Zaroubi²•Institutions (2)

ASTRON¹, Kapteyn Astronomical Institute²

13 May 2012

TL;DR: The GPU-based acceleration of two well-known nonlinear optimization routines, Levenberg-Marquardt (LM) and Limited Memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS), in radio interferometric calibration is presented.

Abstract: We present the GPU-based acceleration of two well-known nonlinear optimization routines, Levenberg-Marquardt (LM) and Limited Memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS), in radio interferometric calibration. Radio interferometric calibration is a heavily compute-intensive operation where the same nonlinear optimization problem has to be solved over many time intervals, with different data. We achieve a speedup of about 3 times compared with conventional multi-core CPU-based optimization by using GPU-accelerated linear algebra routines (CULAtools, CUBLAS). We present details of our GPU-accelerated optimization algorithms as well as timing comparisons with non-GPU-based multi-core CPU routines.

Journal Article•10.1016/J.PARCO.2012.03.004•

Parallelizing SOR for GPGPUs using alternate loop tiling

Peng Di¹, Hui Wu¹, Jingling Xue¹, Feng Wang², Canqun Yang²•Institutions (2)

University of New South Wales¹, National University of Defense Technology²

1 Jun 2012

TL;DR: This paper presents a new parallel SOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs and outperforms red-black SOR by striking a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism.

Abstract: Gauss-Seidel and SOR, which are widely used smoothers in multigrid methods, are difficult to parallelize, particularly on GPGPUs, due to the existence of DOACROSS data dependences. In this paper, we present a new parallel SOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained non-conventionally, by starting from a K-layer SOR method and then parallelizing it by applying a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a loop tiling technique called alternate tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by striking a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain experts, compiler optimizations and performance tuning in maximizing the performance of PDE-like DOACROSS loops on GPGPUs.
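
As a point of reference for the baseline named in this abstract, red-black SOR colors the grid like a checkerboard, so that every point of one color depends only on points of the other color and all points of one color can be updated in parallel. The following is a minimal sequential Python sketch for Laplace's equation on a small 2D grid. It illustrates the red-black baseline only, not the K-layer SOR or the alternate tiling scheme of the paper; the function name, grid size, and relaxation factor are assumptions chosen for the example.

```python
def red_black_sor(u, omega=1.5, sweeps=200):
    """Relax the interior of grid u in place; boundary values stay fixed."""
    rows, cols = len(u), len(u[0])
    for _ in range(sweeps):
        for color in (0, 1):  # 0 = "red" points, 1 = "black" points
            # Within one color, no point reads another point of the same
            # color, so this double loop could run fully in parallel.
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 != color:
                        continue
                    average = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1])
                    # Over-relaxed Gauss-Seidel update.
                    u[i][j] += omega * (average - u[i][j])
    return u


# Hypothetical test problem: boundary values u = i, whose exact solution
# of Laplace's equation is the linear function u[i][j] = i.
n = 6
grid = [[float(i) if i in (0, n - 1) or j in (0, n - 1) else 0.0
         for j in range(n)] for i in range(n)]
red_black_sor(grid)
max_error = max(abs(grid[i][j] - i) for i in range(n) for j in range(n))
print(max_error < 1e-9)
```

The two-color ordering changes the update order relative to lexicographic SOR, which is the kind of dependence-relaxing trade-off the abstract discusses.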

Proceedings Article•10.1109/INPAR.2012.6339597•

Policy-based tuning for performance portability and library co-optimization

Duane Merrill¹, Michael Garland¹, Andrew S. Grimshaw²•Institutions (2)

Nvidia¹, University of Virginia²

13 May 2012

TL;DR: This paper presents a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand, and enables flexible granularity coarsening.

Abstract: Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with “universally reasonable” performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.

Book Chapter•10.1007/978-3-642-36803-5_31•

Solution of multi-objective competitive facility location problems using parallel NSGA-II on large scale computing systems

Algirdas Lančinskas¹, Julius Žilinskas¹•Institutions (1)

Vilnius University¹

10 Jun 2012

TL;DR: Several strategies to parallelize the algorithm utilizing both the distributed and shared memory parallel programming models are presented, along with results of an experimental investigation carried out by solving the competitive facility location problem using up to 2048 processing units.

Abstract: The multi-objective firm expansion problem on a competitive facility location model and an evolutionary algorithm suitable for solving multi-objective optimization problems are reviewed in the paper. Several strategies to parallelize the algorithm utilizing both the distributed and shared memory parallel programming models are presented. Results of an experimental investigation carried out by solving the competitive facility location problem using up to 2048 processing units are presented and discussed.

Book Chapter•10.1007/978-3-642-36803-5_25•

Vectorized higher order finite difference kernels

Gerhard Zumbusch¹•Institutions (1)

University of Jena¹

10 Jun 2012

TL;DR: The combination of vectorization and an interleaved data layout, spatial and temporal loop tiling algorithms, loop unrolling, and parameter tuning leads to efficient computational kernels in one to three spatial dimensions, truncation errors of order two to twelve, and isotropic and compact anisotropic stencils.

Abstract: Several highly optimized implementations of Finite Difference schemes are discussed. The combination of vectorization and an interleaved data layout, spatial and temporal loop tiling algorithms, loop unrolling, and parameter tuning leads to efficient computational kernels in one to three spatial dimensions, truncation errors of order two to twelve, and isotropic and compact anisotropic stencils. The kernels are implemented on and tuned for several processor architectures, such as recent Intel Sandy Bridge, Ivy Bridge, and AMD Bulldozer CPU cores, all with AVX vector instructions, as well as Nvidia Kepler and Fermi and AMD Southern and Northern Islands GPU architectures, and some older architectures for comparison. The kernels are based either on a cache-aware spatial loop or on time-slicing to compute several time steps at once. Furthermore, vector components can either be independent, grouped in short vectors of SSE, AVX or GPU warp size, or grouped in larger virtual vectors with explicit synchronization. The optimal choice of the algorithm and its parameters depends both on the Finite Difference stencil and on the processor architecture.

Journal Article•10.1016/J.PARCO.2012.05.001•

Elastic computing: A portable optimization framework for hybrid computers

John Wernsing¹, Greg Stitt¹•Institutions (1)

University of Florida¹

1 Aug 2012

TL;DR: This paper introduces elastic computing, an optimization framework where application designers invoke specialized elastic functions that contain a knowledge-base of implementation alternatives and parallelization strategies, enabling dynamic and transparent optimization for different resources and run-time parameters.

Abstract: Due to power limitations and escalating cooling costs, high-performance computing systems can no longer rely solely on faster clock frequencies and numerous microprocessor nodes to meet increasing performance demands. As an alternative approach, high-performance systems are increasingly integrating multi-core processors and heterogeneous accelerators such as GPUs and FPGAs. However, usage of such hybrid systems has been limited largely to device experts due to significantly increased application design complexity. To enable more transparent usage of hybrid systems, we introduce elastic computing, which is an optimization framework where application designers invoke specialized elastic functions that contain a knowledge-base of implementation alternatives and parallelization strategies. For each elastic function, a collection of optimization tools analyzes numerous possible implementations, which enables dynamic and transparent optimization for different resources and run-time parameters. In this paper, we present the enabling technologies of elastic computing, and evaluate those technologies on four different hybrid systems, including the Novo-G FPGA supercomputer. The results include detailed case studies of using elastic computing for time-domain convolution and sum of absolute difference image retrieval, which achieved speedups up to 206x.

Proceedings Article•10.1109/INPAR.2012.6339611•

Parallel speculative encryption of multiple AES contexts on GPUs

Wagner M. Nunan Zola¹, Luis C. E. Bona¹•Institutions (1)

Federal University of Paraná¹

13 May 2012

TL;DR: This work presents a high-performance heterogeneous parallel method for encryption using GPUs that executes most of the encryption processes on the GPU and partially on the CPU.

Abstract: This work presents a high-performance heterogeneous parallel method for encryption using GPUs. Our heterogeneous design executes most of the encryption processes on the GPU and partially on the CPU. Aside from the AES 16-byte block size, our parallel AES CTR algorithm divides work into small logical data blocks.
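
CTR mode lends itself to this kind of work division because the keystream for each 16-byte block depends only on the key and that block's counter value, so chunks of blocks can be processed independently and in any order. The Python sketch below illustrates that property only. It is not the authors' GPU method, and because the Python standard library has no AES, `toy_block_encrypt` is a hash-based stand-in that is neither AES nor secure; all function names, the key, and the chunk size are assumptions made for the example.

```python
import hashlib

BLOCK = 16  # AES block size in bytes


def toy_block_encrypt(key: bytes, counter: int) -> bytes:
    """Stand-in for encrypting one counter block; NOT AES and not secure."""
    return hashlib.sha256(key + counter.to_bytes(BLOCK, "big")).digest()[:BLOCK]


def ctr_xor(key: bytes, data: bytes, first_counter: int = 0) -> bytes:
    """Encrypt or decrypt data in CTR mode, starting at the given counter."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK):
        keystream = toy_block_encrypt(key, first_counter + offset // BLOCK)
        out.extend(b ^ k for b, k in zip(data[offset:offset + BLOCK], keystream))
    return bytes(out)


def ctr_xor_chunked(key: bytes, data: bytes, blocks_per_chunk: int = 4) -> bytes:
    """Process the data as independent logical chunks of several blocks."""
    chunk_bytes = blocks_per_chunk * BLOCK
    starts = range(0, len(data), chunk_bytes)
    pieces = {}
    # Chunks are handled in reverse order on purpose: each one needs only
    # its own starting counter, so the order (or the worker) does not matter.
    for start in reversed(starts):
        chunk = data[start:start + chunk_bytes]
        pieces[start] = ctr_xor(key, chunk, start // BLOCK)
    return b"".join(pieces[s] for s in starts)


key = b"hypothetical-key"
message = b"logical data blocks can be processed independently " * 5
assert ctr_xor_chunked(key, message) == ctr_xor(key, message)
assert ctr_xor(key, ctr_xor(key, message)) == message  # decryption = encryption
```

In a parallel setting each chunk would be handed to a separate thread or GPU thread block; the loop here only demonstrates that the result is independent of processing order.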

Journal Article•10.1016/J.PARCO.2011.11.001•

Load balancing in homogeneous pipeline based applications

Andreu Moreno, Eduardo César¹, A. Guevara¹, Joan Sorribes¹, Tomàs Margalef¹•Institutions (1)

Autonomous University of Barcelona¹

1 Mar 2012

TL;DR: This work shows a new strategy for dynamically improving the performance of pipeline applications; the key idea is to free computational resources by gathering the pipeline's fastest stages and then to use these resources to replicate the slowest stages.

Abstract: We propose to use knowledge about a parallel application's structure that was acquired with the use of a skeleton-based development strategy to dynamically improve its performance. Parallel/distributed programming provides the possibility of solving highly demanding computational problems. However, this type of application requires support tools in all phases of the development cycle because the implementation is extremely difficult, especially for non-expert programmers. This work shows a new strategy for dynamically improving the performance of pipeline applications. We call this approach Dynamic Pipeline Mapping (DPM), and the key idea is to free computational resources by gathering the pipeline's fastest stages and then to use these resources to replicate the slowest stages. We present two versions of this strategy, both with complexity O(N log N) in the number of pipe stages, and we compare them to an optimal mapping algorithm and to the Binary Search Closest (BSC) algorithm [1]. Our results show that the DPM leads to significant performance improvements, increasing the application throughput by up to 40% on average.
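
The gather-then-replicate idea can be illustrated with a toy greedy heuristic. The Python sketch below is not the DPM algorithm from the paper and makes no claim about its complexity. It assumes one processor per stage initially, stateless stages (so a stage with r replicas takes t / r per item on average), and negligible communication cost; the function name and stage times are hypothetical.

```python
def gather_and_replicate(stage_times):
    """Return (groups, replicas, period) for a toy gather/replicate mapping.

    stage_times -- per-item processing time of each pipeline stage, in order
    period      -- time per item of the slowest group, i.e. 1 / throughput
    """
    groups = list(stage_times)
    bottleneck = max(groups)

    # Gather: merge the cheapest adjacent pair as long as the merged group is
    # no slower than the original bottleneck; each merge frees one processor.
    while len(groups) > 1:
        k = min(range(len(groups) - 1), key=lambda i: groups[i] + groups[i + 1])
        if groups[k] + groups[k + 1] > bottleneck:
            break
        groups[k:k + 2] = [groups[k] + groups[k + 1]]

    # Replicate: give each freed processor to the currently slowest group.
    replicas = [1] * len(groups)
    for _ in range(len(stage_times) - len(groups)):
        slowest = max(range(len(groups)), key=lambda i: groups[i] / replicas[i])
        replicas[slowest] += 1

    period = max(t / r for t, r in zip(groups, replicas))
    return groups, replicas, period


# Hypothetical 5-stage pipeline with one slow stage in the middle.
groups, replicas, period = gather_and_replicate([1, 1, 8, 1, 1])
print(groups, replicas)  # [2, 8, 2] [1, 3, 1]
```

In this example the period drops from 8 time units per item to 8/3 on the same five processors, because the two processors freed by gathering the fast stages are reassigned to the slow stage.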
