Scispace (Formerly Typeset)
Showing papers presented at "Parallel Computing in 2012"
Journal Article•10.1016/J.PARCO.2011.08.005•
Parallel reactive molecular dynamics: Numerical methods and algorithmic techniques

Hasan Metin Aktulga1, Joseph C. Fogarty2, Sagar A. Pandit2, Ananth Grama1•
Purdue University1, University of South Florida2
1 Apr 2012
TL;DR: PuReMD is presented, which extends current spatio-temporal simulation capability for reactive atomistic systems by over an order of magnitude and incorporates efficient dynamic data structures, algorithmic optimizations, and effective solvers to deliver low per-time-step simulation time, with a small memory footprint.
Abstract: Molecular dynamics modeling has provided a powerful tool for simulating and understanding diverse systems - ranging from materials processes to biophysical phenomena. Parallel formulations of these methods have been shown to be among the most scalable scientific computing applications. Many instances of this class of methods rely on a static bond structure for molecules, rendering them infeasible for reactive systems. Recent work on reactive force fields has resulted in the development of ReaxFF, a novel bond order potential that bridges quantum-scale and classical MD approaches by explicitly modeling bond activity (reactions) and charge equilibration. These aspects of ReaxFF pose significant challenges from a computational standpoint, both in sequential and parallel contexts. Evolving bond structure requires efficient dynamic data structures. Minimizing electrostatic energy through charge equilibration requires the solution of a large sparse linear system with a shielded electrostatic kernel at each sub-femtosecond long time-step. In this context, reaching spatio-temporal scales of tens of nanometers and nanoseconds, where phenomena of interest can be observed, poses significant challenges. In this paper, we present the design and implementation details of the Purdue Reactive Molecular Dynamics code, PuReMD. PuReMD has been demonstrated to be highly efficient (in terms of processor performance) and scalable. It extends current spatio-temporal simulation capability for reactive atomistic systems by over an order of magnitude. It incorporates efficient dynamic data structures, algorithmic optimizations, and effective solvers to deliver low per-time-step simulation time, with a small memory footprint. PuReMD is comprehensively validated for performance and accuracy on up to 3375 cores on a commodity cluster (Hera at LLNL-OCF). Potential performance bottlenecks to scalability beyond our experiments have also been analyzed. 
PuReMD is available in the public domain and has been used to model diverse systems, including strain relaxation in Si-Ge nanobars, water-silica surface interaction, and oxidative stress in lipid bilayers (bio-membranes).

955 citations

Journal Article•10.1016/J.PARCO.2011.09.001•
PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation

Andreas Klöckner1, Nicolas Pinto2, Yunsup Lee3, Bryan Catanzaro3, Paul Ivanov3, Ahmed Fasih4 •
Courant Institute of Mathematical Sciences1, McGovern Institute for Brain Research2, University of California, Berkeley3, Ohio State University4
1 Mar 2012
TL;DR: This article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems.
Abstract: High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless, it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.

676 citations
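
The run-time code generation idea in this abstract can be illustrated without a GPU. The sketch below is a CPU-only analogy in plain Python and does not use the PyCUDA or PyOpenCL APIs: source text is generated at run time, specialized to a parameter known only then (a hypothetical unroll factor), and compiled with the built-in `compile` and `exec`. The names `make_axpy` and `unroll` are assumptions made for illustration.

```python
def make_axpy(unroll):
    """Generate and compile an axpy routine unrolled `unroll` times.

    Hypothetical helper: a CPU-side analogy of run-time code generation,
    not the PyCUDA/PyOpenCL API. `unroll` must be at least 1.
    """
    # Emit one statement per unrolled lane.
    body = "\n".join(
        f"        y[i + {k}] = a * x[i + {k}] + y[i + {k}]" for k in range(unroll)
    )
    source = (
        "def axpy(a, x, y):\n"
        "    n = len(x)\n"
        f"    main = n - n % {unroll}\n"
        f"    for i in range(0, main, {unroll}):\n"
        f"{body}\n"
        "    for i in range(main, n):  # remainder loop\n"
        "        y[i] = a * x[i] + y[i]\n"
        "    return y\n"
    )
    namespace = {}
    # Compile the generated text and pull the resulting function out.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["axpy"], source


axpy4, generated_source = make_axpy(4)
result = axpy4(2.0, [1.0, 2.0, 3.0, 4.0, 5.0], [10.0] * 5)
```

In a GPU setting the generated text would be kernel source handed to the device compiler; here the host interpreter plays that role so the example stays runnable anywhere.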

Proceedings Article•10.1109/INPAR.2012.6339595•
Auto-tuning a high-level language targeted to GPU codes

Scott Grauer-Gray1, Lifan Xu1, Robert Searles1, Sudhee Ayalasomayajula1, John Cavazos1 •
University of Delaware1
13 May 2012
TL;DR: This work performs auto-tuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and shows results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision.
Abstract: Determining the best set of optimizations to apply to a kernel to be executed on the graphics processing unit (GPU) is a challenging problem. There are large sets of possible optimization configurations that can be applied, and many applications have multiple kernels. Each kernel may require a specific configuration to achieve the best performance, and moving an application to new hardware often requires a new optimization configuration for each kernel. In this work, we apply optimizations to GPU code using HMPP, a high-level directive-based language and source-to-source compiler that can generate CUDA / OpenCL code. However, programming with high-level languages may mean a loss of performance compared to using low-level languages. Our work shows that it is possible to improve the performance of a high-level language by using auto-tuning. We perform auto-tuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and show results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision. The results show that our auto-tuned HMPP-generated implementations are significantly faster than the default HMPP implementation and can meet or exceed the performance of manually coded CUDA / OpenCL implementations.

510 citations

Journal Article•10.1016/J.PARCO.2011.10.003•
DAGuE: A generic distributed DAG engine for High Performance Computing

George Bosilca1, Aurelien Bouteiller1, Anthony Danalis1, Thomas Herault1, Pierre Lemarinier2, Jack Dongarra3 •
University of Tennessee1, University of Rennes2, Oak Ridge National Laboratory3
1 Jan 2012
TL;DR: DAGuE is presented, a generic framework for architecture aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures and uses a dynamic, fully-distributed scheduler based on cache awareness, data-locality and task priority.
Abstract: The frenetic development of the current architectures places a strain on the current state-of-the-art programming environments. Harnessing the full potential of such architectures is a tremendous task for the whole scientific computing community. We present DAGuE, a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. Applications we consider can be expressed as a Directed Acyclic Graph of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size independent format that can be queried on-demand to discover data dependencies, in a totally distributed fashion. DAGuE assigns computation threads to the cores, overlaps communications and computations and uses a dynamic, fully-distributed scheduler based on cache awareness, data-locality and task priority. We demonstrate the efficiency of our approach, using several micro-benchmarks to analyze the performance of different components of the framework, and a linear algebra factorization as a use case.

291 citations
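
As a rough illustration of executing tasks from a dependency DAG, the sketch below uses Python's standard `graphlib.TopologicalSorter` to release tasks once their inputs are available. It is a sequential toy, not DAGuE's scheduler or its compact DAG representation; the task names (loosely modeled on a tiled factorization) and the helper `run_dag` are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
deps = {
    "potrf": set(),
    "trsm": {"potrf"},
    "syrk": {"trsm"},
    "gemm": {"trsm"},
    "potrf2": {"syrk", "gemm"},
}


def run_dag(deps, run_task):
    """Run every task after all of its predecessors; return the execution order."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    order = []
    while sorter.is_active():
        # All tasks whose dependencies are satisfied; independent of each other.
        ready = sorted(sorter.get_ready())
        for task in ready:  # a real runtime would dispatch these to cores
            run_task(task)
            order.append(task)
            sorter.done(task)  # may unlock successor tasks
    return order


executed = run_dag(deps, lambda task: None)
```

Tasks returned together by `get_ready()` have no mutual dependencies, which is the point where a parallel runtime could apply locality or priority heuristics.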

Proceedings Article•10.1109/INPAR.2012.6339596•
A study of Persistent Threads style GPU programming for GPGPU workloads

Kshitij Gupta1, Jeff A. Stuart1, John D. Owens1•
University of California, Davis1
13 May 2012
TL;DR: Through micro-kernel benchmarks, it is shown the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases.
Abstract: In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and identify four primary use cases it could be useful in addressing—CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.

264 citations

Proceedings Article•10.1109/INPAR.2012.6339601•
ispc: A SPMD compiler for high-performance CPU programming

Matt Pharr1, William R. Mark1•
Intel1
13 May 2012
TL;DR: A compiler, the Intel® SPMD Program Compiler (ispc), is developed that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units.
Abstract: SIMD parallelism has become an increasingly important mechanism for delivering performance in modern CPUs, due to its power efficiency and relatively low cost in die area compared to other forms of parallelism. Unfortunately, languages and compilers for CPUs have not kept up with the hardware's capabilities. Existing CPU parallel programming models focus primarily on multi-core parallelism, neglecting the substantial computational capabilities that are available in CPU SIMD vector units. GPU-oriented languages like OpenCL support SIMD but lack capabilities needed to achieve maximum efficiency on CPUs and suffer from GPU-driven constraints that impair ease of use on CPUs. We have developed a compiler, the Intel® SPMD Program Compiler (ispc), that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units. ispc draws from GPU programming languages, which have shown that for many applications the easiest way to program SIMD units is to use a single-program, multiple-data (SPMD) model, with each instance of the program mapped to one SIMD lane. We discuss language features that make ispc easy to adopt and use productively with existing software systems and show that ispc delivers up to 35x speedups on a 4-core system and up to 240x speedups on a 40-core system for complex workloads (compared to serial C++ code).

252 citations

Proceedings Article•10.1109/INPAR.2012.6339594•
OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures

Gihan R. Mudalige1, Michael B. Giles1, Istvan Z. Reguly2, Carlo Bertolli3, Paul H. J. Kelly3 •
University of Oxford1, Pázmány Péter Catholic University2, Imperial College London3
13 May 2012
TL;DR: It is demonstrated that an application written once at a high level using the OP2 API can be easily ported across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
Abstract: OP2 is an “active” library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 library, and investigate its capabilities in achieving performance portability, near-optimal performance, and scaling on modern multi-core and many-core processor based systems. A key feature of this work is OP2's recent extension facilitating the development and execution of applications on a distributed memory cluster of GPUs. We discuss the main design issues in parallelizing unstructured mesh based applications on heterogeneous platforms. These include handling data dependencies in accessing indirectly referenced data, the impact of unstructured mesh data layouts (array of structs vs. struct of arrays) and design considerations in generating code for execution on a cluster of GPUs. A representative CFD application written using the OP2 framework is utilized to provide a contrasting benchmarking and performance analysis study on a range of multi-core/many-core systems. These include multi-core CPUs from Intel (Westmere and Sandy Bridge) and AMD (Magny-Cours), GPUs from NVIDIA (GTX560Ti, Tesla C2070), a distributed memory CPU cluster (Cray XE6) and a distributed memory GPU cluster (Tesla C2050 GPUs with InfiniBand). OP2's design choices are explored with quantitative insights into their contributions to performance. We demonstrate that an application written once at a high level using the OP2 API can be easily ported across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.

115 citations

Journal Article•10.1016/J.PARCO.2012.07.001•
Graph coloring algorithms for multi-core and massively multithreaded architectures

Ümit V. Çatalyürek1, John Feo2, Assefaw H. Gebremedhin3, Mahantesh Halappanavar2, Alex Pothen3 •
Ohio State University1, Pacific Northwest National Laboratory2, Purdue University3
1 Oct 2012
TL;DR: In this paper, the authors explore the interplay between architectures and algorithm design in the context of shared-memory platforms and a specific graph problem of central importance in scientific and high-performance computing, distance-1 graph coloring.
Abstract: We explore the interplay between architectures and algorithm design in the context of shared-memory platforms and a specific graph problem of central importance in scientific and high-performance computing, distance-1 graph coloring. We introduce two different kinds of multithreaded heuristic algorithms for the stated, NP-hard, problem. The first algorithm relies on speculation and iteration, and is suitable for any shared-memory system. The second algorithm uses dataflow principles, and is targeted at the non-conventional, massively multithreaded Cray XMT system. We study the performance of the algorithms on the Cray XMT and two multi-core systems, Sun Niagara 2 and Intel Nehalem. Together, the three systems represent a spectrum of multithreading capabilities and memory structure. As a testbed, we use synthetically generated large-scale graphs carefully chosen to cover a wide range of input types. The results show that the algorithms have scalable runtime performance and use nearly the same number of colors as the underlying serial algorithm, which in turn is effective in practice. The study provides insight into the design of high performance algorithms for irregular problems on many-core architectures.

81 citations
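
The speculation-and-iteration idea for distance-1 coloring can be sketched in a few lines. The code below is a sequential simulation under stated assumptions, not the authors' multithreaded implementation: each round colors all pending vertices against a stale snapshot of colors (standing in for concurrent threads that cannot see each other's choices), then detects conflicts and re-queues the higher-numbered endpoint of each conflicting edge. The function name and the tie-breaking rule are assumptions.

```python
def speculative_coloring(adj):
    """Distance-1 coloring by speculate / detect-conflicts / retry rounds.

    adj: dict mapping an integer vertex to a list of its neighbors
    (undirected graph, so each edge appears in both adjacency lists).
    """
    color = {v: -1 for v in adj}
    work = list(adj)
    while work:
        # Speculative phase: every pending vertex picks the smallest color
        # not seen in a snapshot taken before the round started.
        snapshot = dict(color)
        for v in work:
            forbidden = {snapshot[u] for u in adj[v]}
            c = 0
            while c in forbidden:
                c += 1
            color[v] = c
        # Conflict detection: if two neighbors chose the same color,
        # the endpoint with the larger id is recolored in the next round.
        work = [
            v for v in work
            if any(color[u] == color[v] and u < v for u in adj[v])
        ]
    return color


graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
coloring = speculative_coloring(graph)
```

The loop terminates because the lowest-numbered pending vertex in any round is never re-queued, so the work list shrinks every round.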

Proceedings Article•10.1109/INPAR.2012.6339609•
VOCL: An optimized environment for transparent virtualization of graphics processing units

Shucai Xiao1, Pavan Balaji2, Qian Zhu3, Rajeev Thakur2, Susan Coghlan2, Heshan Lin1, Gaojin Wen4, Jue Hong4, Wu-chun Feng1 •
Virginia Tech1, Argonne National Laboratory2, Accenture3, Chinese Academy of Sciences4
13 May 2012
TL;DR: A virtual OpenCL (VOCL) framework is proposed to support the transparent utilization of local or remote GPUs, which exposes physical GPUs as decoupled virtual resources that can be transparently managed independent of the application execution.
Abstract: Graphics processing units (GPUs) have been widely used for general-purpose computation acceleration. However, current programming models such as CUDA and OpenCL can support GPUs only on the local computing node, where the application execution is tightly coupled to the physical GPU hardware. In this work, we propose a virtual OpenCL (VOCL) framework to support the transparent utilization of local or remote GPUs. This framework, based on the OpenCL programming model, exposes physical GPUs as decoupled virtual resources that can be transparently managed independent of the application execution. The proposed framework requires no source code modifications. We also propose various strategies for reducing the overhead caused by data communication and kernel launching and demonstrate about 85% of the data write bandwidth and 90% of the data read bandwidth compared to data write and read, respectively, in a native nonvirtualized environment. We evaluate the performance of VOCL using four real-world applications with various computation and memory access intensities and demonstrate that compute-intensive applications can execute with negligible overhead in the VOCL environment.

80 citations

Proceedings Article•10.1109/INPAR.2012.6339604•
ScatterAlloc: Massively parallel dynamic memory allocation for the GPU

Markus Steinberger1, Michael Kenzel1, Bernhard Kainz1, Dieter Schmalstieg1•
Graz University of Technology1
13 May 2012
TL;DR: This paper analyzes the special requirements of a dynamic memory allocator that is designed for massively parallel architectures such as Graphics Processing Units (GPUs) and presents the thorough design of ScatterAlloc, which can efficiently deal with hundreds of requests in parallel.
Abstract: In this paper, we analyze the special requirements of a dynamic memory allocator that is designed for massively parallel architectures such as Graphics Processing Units (GPUs). We show that traditional strategies, which work well on CPUs, are not well suited for use on GPUs and present the thorough design of ScatterAlloc, which can efficiently deal with hundreds of requests in parallel. Our allocator greatly reduces collisions and congestion by scattering memory requests based on hashing. We analyze ScatterAlloc in terms of allocation speed, data access time and fragmentation, and compare it to current state-of-the-art allocators, including the one provided with the NVIDIA CUDA toolkit. Our results show that ScatterAlloc clearly outperforms these other approaches, yielding speed-ups between 10 and 100.

71 citations

Journal Article•10.1016/J.PARCO.2012.08.001•
Parallel job scheduling for power constrained HPC systems

M. Etinski1, Julita Corbalan1, Jesús Labarta1, Mateo Valero1•
Polytechnic University of Catalonia1
1 Dec 2012
TL;DR: The proposed MaxJobPerf policy, a new parallel job scheduling policy based on integer linear programming, is compared against other power budgeting policies for different power budgets and clearly outperforms them at the parallel job scheduling level.
Abstract: Power has become the primary constraint in high performance computing. Traditionally, parallel job scheduling policies have been designed to improve certain job performance metrics when scheduling parallel workloads on a system with a given number of processors. The number of available processors is no longer the only limitation in parallel job scheduling. The recent increase in processor power consumption has resulted in a new limitation: the available power. The given constraints naturally lead to an optimization problem. We propose MaxJobPerf, a new parallel job scheduling policy based on integer linear programming. Dynamic Voltage Frequency Scaling (DVFS) is a widely used technique that runs applications at reduced CPU frequency/voltage, trading increased execution time for power reduction. The optimization problem determines which jobs should run and at which frequency. In this paper, we compare the MaxJobPerf policy against other power budgeting policies for different power budgets. It clearly outperforms the other power-budgeting approaches at the parallel job scheduling level. Furthermore, we give a detailed analysis of the policy parameters including a discussion on how to manage job reservations to avoid job starvation.
Journal Article•10.1016/J.PARCO.2012.07.002•
Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs

Mickeal Verschoor1, Andrei C. Jalba1•
Eindhoven University of Technology1
1 Oct 2012
TL;DR: It is shown that reordering matrix blocks substantially improves the performance of the SpMV operation, especially when small blocks are used, so that the method outperforms existing state-of-the-art approaches, in most cases.
Abstract: The Conjugate Gradient (CG) method is a widely-used iterative method for solving linear systems described by a (sparse) matrix. The method requires a large number of Sparse-Matrix Vector (SpMV) multiplications, vector reductions and other vector operations to be performed. We present a number of mappings for the SpMV operation on modern programmable GPUs using the Block Compressed Sparse Row (BCSR) format. Further, we show that reordering matrix blocks substantially improves the performance of the SpMV operation, especially when small blocks are used, so that our method outperforms existing state-of-the-art approaches, in most cases. Finally, a thorough analysis of the performance of both SpMV and CG methods is performed, which allows us to model and estimate the expected maximum performance for a given (unseen) problem.
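
To make the operation mix concrete, here is a minimal textbook CG sketch in plain Python (CPU only, no BCSR, no GPU mapping). It shows that each iteration consists of one matrix-vector product, two dot-product reductions, and a few vector updates. The function name, the `matvec` callback, and the tolerance defaults are assumptions made for illustration, not the paper's implementation.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given matvec(v) = A v."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x = 0 initially
    p = list(r)              # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)                                    # the SpMV step
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))  # reduction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)                 # reduction
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
solution = conjugate_gradient(
    lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A],
    [1.0, 2.0],
)
```

For this 2x2 system the exact solution is x = (1/11, 7/11); the dense `matvec` here stands in for whatever sparse format an actual solver would use.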
Proceedings Article•10.1109/INPAR.2012.6339602•
Efficient sparse matrix-vector multiplication on cache-based GPUs

Istvan Reguly1, Michael B. Giles2•
Pázmány Péter Catholic University1, University of Oxford2
13 May 2012
TL;DR: This paper discusses efficient implementations of sparse matrix-vector multiplication on NVIDIA's Fermi architecture, the first to introduce conventional L1 caches to GPUs, and focuses on the compressed sparse row (CSR) format for developing general purpose code.
Abstract: Sparse matrix-vector multiplication is an integral part of many scientific algorithms. Several studies have shown that it is a bandwidth-limited operation on current hardware. On cache-based architectures the main factors that influence performance are spatial locality in accessing the matrix, and temporal locality in re-using the elements of the vector.
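
For readers unfamiliar with the compressed sparse row (CSR) layout mentioned in the TL;DR, the following is a minimal sequential sketch in plain Python, not the paper's GPU kernels. The array names `row_ptr`, `col_idx`, and `values` follow common convention and are assumptions here. The inner loop reads the matrix arrays contiguously (spatial locality) while indexing the vector `x` irregularly, which is where reuse of vector elements (temporal locality) matters.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row occupy values[row_ptr[row] : row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y[row] = acc
    return y


# A = [[10, 0, 2],
#      [ 0, 3, 0],
#      [ 1, 0, 4]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [10.0, 2.0, 3.0, 1.0, 4.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```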
Book Chapter•10.1007/978-3-642-30397-5_10•
A GPU algorithm for greedy graph matching

Bas O. Fagginger Auer1, Rob H. Bisseling1•
Utrecht University1
1 Jan 2012
TL;DR: A fine-grained shared-memory parallel algorithm for maximal greedy matching is introduced, together with an implementation on the GPU, which is faster (speedups up to 6.8 for random matching and 5.6 for weighted matching) than the serial CPU algorithms.
Abstract: Greedy graph matching provides us with a fast way to coarsen a graph during graph partitioning. Direct algorithms on the CPU which perform such greedy matchings are simple and fast, but offer few handholds for parallelisation. To remedy this, we introduce a fine-grained shared-memory parallel algorithm for maximal greedy matching, together with an implementation on the GPU, which is faster (speedups up to 6.8 for random matching and 5.6 for weighted matching) than the serial CPU algorithms and produces matchings of similar (random matching) or better (weighted matching) quality.
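
As a point of reference for the serial CPU baseline the abstract mentions, a weighted greedy maximal matching can be sketched as below. This is a generic textbook version under stated assumptions (edges visited in order of decreasing weight), not the authors' CPU or GPU code; the function name and edge format are hypothetical.

```python
def greedy_matching(edges):
    """Greedy maximal matching.

    edges: iterable of (weight, u, v) tuples; heavier edges are tried first.
    Returns a dict mapping each matched vertex to its partner.
    """
    partner = {}
    for weight, u, v in sorted(edges, reverse=True):
        # Take the edge only if both endpoints are still unmatched.
        if u != v and u not in partner and v not in partner:
            partner[u] = v
            partner[v] = u
    return partner


edges = [(5, "a", "b"), (4, "b", "c"), (3, "c", "d"), (1, "d", "a")]
matching = greedy_matching(edges)
```

Each decision depends on all earlier ones through `partner`, which is the sequential dependency that makes this simple loop hard to parallelise directly.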
Journal Article•10.1016/J.PARCO.2012.03.001•
Compressed sensing and Cholesky decomposition on FPGAs and GPUs

Depeng Yang1, Gregory D. Peterson1, Husheng Li1•
University of Tennessee1
1 Aug 2012
TL;DR: Results show that the proposed Cholesky decompositions on FPGAs and GPUs are much faster than LAPACK and MAGMA for small matrices, and that accelerating CS signal reconstruction algorithms can achieve around a 15x speedup.
Abstract: Compressed sensing (CS) is a revolutionary signal acquisition theory, enabling signal acquisition at a rate that is below the Nyquist sampling rate. However, CS signal reconstruction algorithms are computationally expensive. One of the key computation steps in CS algorithms is to iteratively compute a Cholesky decomposition. Modern application acceleration devices, such as FPGAs and GPUs, can accelerate Cholesky decomposition and CS signal reconstruction computation. This paper presents high performance parallel Cholesky decomposition algorithms for GPU and FPGA implementation. For GPUs, an optimized Cholesky decomposition algorithm is developed with high parallelism, reduced data copying, and improved memory access. For FPGAs, a dedicated pipelined hardware architecture for Cholesky decomposition is designed. Only one pipelined triangular linear equation solver is needed for solving Cholesky decomposition and Cholesky decomposition-based linear equation systems. Moreover, CS signal reconstruction algorithms are accelerated on GPUs and FPGAs for fast signal recovery based on our iterative Cholesky decomposition. Results show that the proposed Cholesky decomposition on FPGAs and GPUs are much faster than LAPACK and MAGMA for small matrices. For accelerating CS signal reconstruction algorithms, our FPGA implementation can achieve around 15x speedup and our GPU implementation can achieve about a 38x speedup compared with the CPU using LAPACK and the hybrid CPU/GPU system with MAGMA.
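
For orientation, the Cholesky decomposition A = L L^T at the core of this work can be written as a short column-by-column routine. The sketch below is the standard serial algorithm in plain Python, not the paper's GPU or FPGA design; the function name is an assumption. The inner division step is the kind of triangular solve that the abstract's pipelined solver would handle.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove contributions of already computed columns.
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            t = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = t / L[j][j]
    return L


A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
L = cholesky(A)
```

For this matrix the factor is L = [[2, 0, 0], [6, 1, 0], [-8, 5, 3]].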
Journal Article•10.1016/J.PARCO.2011.10.008•
Using explicit platform descriptions to support programming of heterogeneous many-core systems

Martin Sandrieser1, Siegfried Benkner1, Sabri Pllana1•
University of Vienna1
1 Jan 2012
TL;DR: A platform description language (PDL) is proposed that captures key architectural patterns of commonly used heterogeneous computing systems, together with a prototype source-to-source compilation framework that utilizes PDL descriptors to transform sequential task-based programs with source code annotations into a form that is convenient for execution on heterogeneous many-core systems.
Abstract: Heterogeneous many-core systems constitute a viable approach for coping with power constraints in modern computer architectures and can now be found across the whole computing landscape ranging from mobile devices, to desktop systems and servers, all the way to high-end supercomputers and large-scale data centers. While these systems promise to offer superior performance-power ratios, programming heterogeneous many-core architectures efficiently has been shown to be notoriously difficult. Programmers typically are forced to take into account a plethora of low-level architectural details and usually have to resort to a combination of different programming models within a single application. In this paper we propose a platform description language (PDL) that makes it possible to capture key architectural patterns of commonly used heterogeneous computing systems. PDL architecture descriptions support both programmers and toolchains by providing platform-specific information in a well-defined and explicit manner. We have developed a prototype source-to-source compilation framework that utilizes PDL descriptors to transform sequential task-based programs with source code annotations into a form that is convenient for execution on heterogeneous many-core systems. Our framework relies on a component-based approach that accommodates different implementation variants of tasks, customized for different parts of a heterogeneous platform, and utilizes an advanced runtime system for exploiting parallelism through dynamic task scheduling. We show various usage scenarios of our PDL and demonstrate the effectiveness of our framework for a commonly used scientific kernel and a financial application on different configurations of a state-of-the-art CPU/GPU system.
Journal Article•10.1016/J.PARCO.2012.05.005•
OpenMP parallelism for fluid and fluid-particulate systems

Amit Amritkar1, Danesh K. Tafti1, Rui Liu2, Rick Kufrin2, Barbara Chapman3 •
Virginia Tech1, University of Illinois at Urbana–Champaign2, University of Houston3
1 Sep 2012
TL;DR: It is shown through weak and strong scaling studies that OpenMP performance can be made to match that of MPI on the SGI Altix systems for up to 256 cores.
Abstract: In order to exploit the flexibility of OpenMP in parallelizing large scale multi-physics applications where different modes of parallelism are needed for efficient computation, it is first necessary to be able to scale OpenMP codes as well as MPI on large core counts. In this research we have implemented fine grained OpenMP parallelism for a large CFD code GenIDLEST and investigated the performance from 1 to 256 cores using a variety of performance optimization and measurement tools. It is shown through weak and strong scaling studies that OpenMP performance can be made to match that of MPI on the SGI Altix systems for up to 256 cores. Data placement and locality were established to be key components in obtaining good scalability with OpenMP. It is also shown that a hybrid implementation on a dual core system gives the same performance as standalone MPI or OpenMP. Finally, it is shown that in irregular multi-physics applications which do not adhere solely to the SPMD (Single Process, Multiple Data) mode of computation, as encountered in tightly coupled fluid-particulate systems, the flexibility of OpenMP can have a big performance advantage over MPI.
Book Chapter•10.1007/978-3-642-36803-5_14•
Use of direct solvers in TFETI massively parallel implementation

Vaclav Hapla1, David Horák1, Michal Merta1•
University of Ostrava1
10 Jun 2012
TL;DR: The direct solvers available in PETSc on the Cray XE6 machine HECToR (PETSc, MUMPS, SuperLU) are compared regarding their performance in the two most time-consuming actions in TFETI: the pseudoinverse application and the coarse problem solution.
Abstract: The FETI methods blend iterative and direct solvers. The dual problem is solved iteratively using, e.g., the CG method; in each iteration, the auxiliary problems related to the application of an unassembled system matrix (subdomain problems' solutions and projector application in dual operator) are solved directly. The paper compares the direct solvers available in PETSc on the Cray XE6 machine HECToR (PETSc, MUMPS, SuperLU) regarding their performance in the two most time-consuming actions in TFETI: the pseudoinverse application and the coarse problem solution. For the numerical experiments, our novel TFETI implementation in the FLLOP (FETI Light Layer on top of PETSc) library was used.
Book Chapter•10.1007/978-3-642-36803-5_1•
Computational physics on graphics processing units

Ari Harju1, Topi Siro1, Filippo Federici Canova2, Samuli Hakala1, Teemu Rantalaiho3 •
Aalto University1, Tampere University of Technology2, Helsinki Institute of Physics3
10 Jun 2012
TL;DR: In this paper, the authors discuss advances made in the field of computational physics, focusing on classical molecular dynamics and quantum simulations for electronic structure calculations using the density functional theory, wave function techniques and quantum field theory.
Abstract: The use of graphics processing units for scientific computations is an emerging strategy that can significantly speed up various algorithms. In this review, we discuss advances made in the field of computational physics, focusing on classical molecular dynamics and quantum simulations for electronic structure calculations using the density functional theory, wave function techniques and quantum field theory.
Journal Article•10.1016/J.PARCO.2012.02.001•
Cluster-based optimized parallel video transcoding

[...]

Gerassimos Barlas1•
American University of Sharjah1
1 Apr 2012
TL;DR: An analytical approach to the optimization of a large collection of parallel transcoding techniques based on temporal partitioning, is pursued, and closed-form solutions to the partitioning/scheduling problem are derived for the most important of these methods, under CBR input media conditions.
Abstract: Video transcoding is a popular technique for delivering video content of varying quality and size to diverse audiences. In this paper an analytical approach to the optimization of a large collection of parallel transcoding techniques based on temporal partitioning, is pursued. The key elements in the design of such techniques are identified, allowing them to be enumerated and classified. Closed-form solutions to the partitioning/scheduling problem (and optimum operation sequencing where necessary) are derived for the most important of these methods, under CBR input media conditions. Subsequently, appropriate heuristics allow the solution of the partitioning problem under VBR input media conditions. The paper is concluded by an extensive battery of tests for the most significant strategies, on several feature-length video streams. The tests reveal not only how one of the proposed strategies, namely NPWF_VBR, strikes a nice balance between efficiency and distortion minimization on heterogeneous platforms, but also allow us to derive guidelines for transcoding solution deployment.
Book Chapter•10.1007/978-3-642-36803-5_13•
Distributed evolutionary computing system based on web browsers with javascript

[...]

Jerzy Duda1, Wojciech Dlubacz1•
AGH University of Science and Technology1
10 Jun 2012
TL;DR: The paper presents a distributed computing system that is based on evolutionary algorithms and uses a web browser on the client side, and shows that the system scales quite smoothly, taking additional advantage of a local search algorithm executed by some clients.
Abstract: The paper presents a distributed computing system that is based on evolutionary algorithms and uses a web browser on the client side. The evolutionary algorithm is coded in JavaScript embedded in a web page sent to the client. The code is optimized with regard to memory usage and communication efficiency between the server and the clients. The server side is also based on JavaScript, as a node.js server was used. The proposed system has been tested on the permutation flowshop scheduling problem, one of the most popular optimization benchmarks for heuristics studied in the literature. The results have shown that the system scales quite smoothly, taking additional advantage of a local search algorithm executed by some clients.
Journal Article•10.1016/J.PARCO.2012.03.005•
Improving performance of adaptive component-based dataflow middleware

[...]

Timothy D. R. Hartley1, Erik Saule1, Ümit V. Çatalyürek1•
Ohio State University1
1 Jun 2012
TL;DR: This work presents a novel framework, with which developers can easily create high-performance dataflow applications, without the tedious tuning process, and shows that this approach achieves good performance for a wide range of applications, with a much-reduced development cost.
Abstract: Making the best use of modern computational resources for distributed applications requires expert knowledge of low-level programming tools, or a productive high-level and high-performance programming framework. Unfortunately, even state-of-the-art high-level frameworks still require the developer to conduct a tedious manual tuning step to find the work partitioning which gives the best application execution performance. Here, we present a novel framework, with which developers can easily create high-performance dataflow applications, without the tedious tuning process. We compare the performance of our approach to that of three distributed programming frameworks which differ significantly in their programming paradigm, their support for multi-core CPUs and accelerators, and their load-balancing approach. These three frameworks are DataCutter, a component-based dataflow framework, KAAPI, a framework using asynchronous function calls, and MR-MPI, a MapReduce implementation. By highly optimizing the implementations of three applications on the four frameworks and comparing the execution time performance of the runtime engines, we show their strengths and weaknesses. We show that our approach achieves good performance for a wide range of applications, with a much-reduced development cost.
Proceedings Article•10.1109/INPAR.2012.6339610•
GPU accelerated nonlinear optimization in radio interferometric calibration

[...]

Sarod Yatawatta1, S. Kazemi2, Saleem Zaroubi2•
ASTRON1, Kapteyn Astronomical Institute2
13 May 2012
TL;DR: The GPU based acceleration of two well known nonlinear optimization routines: Levenberg-Marquardt (LM) and Limited Memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) in radio interferometric calibration is presented.
Abstract: We present the GPU based acceleration of two well known nonlinear optimization routines: Levenberg-Marquardt (LM) and Limited Memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) in radio interferometric calibration. Radio interferometric calibration is a heavily compute intensive operation where the same nonlinear optimization problem has to be solved over many time intervals, with different data. We achieve a speedup of about 3 times compared with conventional multi-core CPU based optimization by using GPU accelerated linear algebra routines (CULAtools,CUBLAS). We present details of our GPU accelerated optimization algorithms as well as timing comparisons with non-GPU based multi-core CPU routines.
Journal Article•10.1016/J.PARCO.2012.03.004•
Parallelizing SOR for GPGPUs using alternate loop tiling

[...]

Peng Di1, Hui Wu1, Jingling Xue1, Feng Wang2, Canqun Yang2 •
University of New South Wales1, National University of Defense Technology2
1 Jun 2012
TL;DR: This paper presents a new parallel SOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs and outperforms red-black SOR by making a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism.
Abstract: Gauss-Seidel and SOR, which are widely used smoothers in multigrid methods, are difficult to parallelize, particularly on GPGPUs due to the existence of DOACROSS data dependences. In this paper, we present a new parallel SOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained non-conventionally, by starting from a K-layer SOR method and then parallelizing it by applying a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a loop tiling technique called alternate tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by making a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain experts, compiler optimizations and performance tuning in maximizing the performance of PDE-like DOACROSS loops on GPGPUs.
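For orientation, the red-black SOR baseline that the abstract compares against can be sketched as follows. This is a minimal pure-Python illustration of the standard red-black ordering for the 2D Laplace equation, not the paper's K-layer or alternate-tiling method; the function names `make_grid` and `red_black_sor`, the relaxation factor, and the sweep count are assumptions chosen for the example.

```python
def make_grid(n):
    """Hypothetical helper: (n x n) grid whose boundary holds u = j / (n - 1)
    and whose interior starts at zero."""
    grid = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i in (0, n - 1) or j in (0, n - 1):
                grid[i][j] = j / (n - 1)
    return grid


def red_black_sor(u, omega=1.5, sweeps=200):
    """In-place red-black SOR for the 2D Laplace equation.

    Boundary values are held fixed. Points with (i + j) even form one
    colour and points with (i + j) odd form the other. Within one colour
    no point depends on another point of the same colour, so each
    half-sweep could be executed in parallel.
    """
    rows, cols = len(u), len(u[0])
    for _ in range(sweeps):
        for color in (0, 1):
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 != color:
                        continue
                    # Gauss-Seidel value from the four neighbours.
                    gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                 + u[i][j - 1] + u[i][j + 1])
                    # Over-relax towards the Gauss-Seidel value.
                    u[i][j] += omega * (gs - u[i][j])
    return u


if __name__ == "__main__":
    solution = red_black_sor(make_grid(9))
    print(solution[4])  # middle row of the converged grid
```

The two-colour split is what removes the DOACROSS dependence inside a half-sweep; the price is that each colour touches only half of the grid points per pass, which is the data-reuse limitation the abstract refers to.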
Proceedings Article•10.1109/INPAR.2012.6339597•
Policy-based tuning for performance portability and library co-optimization

[...]

Duane Merrill1, Michael Garland1, Andrew S. Grimshaw2•
Nvidia1, University of Virginia2
13 May 2012
TL;DR: This paper presents a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand, and enables flexible granularity coarsening.
Abstract: Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with “universally reasonable” performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.
Book Chapter•10.1007/978-3-642-36803-5_31•
Solution of multi-objective competitive facility location problems using parallel NSGA-II on large scale computing systems

[...]

Algirdas Lančinskas1, Julius Žilinskas1•
Vilnius University1
10 Jun 2012
TL;DR: Several strategies to parallelize the algorithm utilizing both the distributed and shared memory parallel programming models are presented, and results of experimental investigation carried out by solving the competitive facility location problem using up to 2048 processing units are presented.
Abstract: The multi-objective firm expansion problem on the competitive facility location model and an evolutionary algorithm suitable for solving multi-objective optimization problems are reviewed in the paper. Several strategies to parallelize the algorithm utilizing both the distributed and shared memory parallel programming models are presented. Results of experimental investigation carried out by solving the competitive facility location problem using up to 2048 processing units are presented and discussed.
Book Chapter•10.1007/978-3-642-36803-5_25•
Vectorized higher order finite difference kernels

[...]

Gerhard Zumbusch1•
University of Jena1
10 Jun 2012
TL;DR: The combination of vectorization and an interleaved data layout, spatial and temporal loop tiling algorithms, loop unrolling, and parameter tuning lead to efficient computational kernels in one to three spatial dimensions, truncation errors of order two to twelve, and isotropic and compact anisotropic stencils.
Abstract: Several highly optimized implementations of Finite Difference schemes are discussed. The combination of vectorization and an interleaved data layout, spatial and temporal loop tiling algorithms, loop unrolling, and parameter tuning lead to efficient computational kernels in one to three spatial dimensions, truncation errors of order two to twelve, and isotropic and compact anisotropic stencils. The kernels are implemented on and tuned for several processor architectures like recent Intel Sandy Bridge, Ivy Bridge and AMD Bulldozer CPU cores, all with AVX vector instructions as well as Nvidia Kepler and Fermi and AMD Southern and Northern Islands GPU architectures, as well as some older architectures for comparison. The kernels are either based on a cache aware spatial loop or on time-slicing to compute several time steps at once. Furthermore, vector components can either be independent, grouped in short vectors of SSE, AVX or GPU warp size or in larger virtual vectors with explicit synchronization. The optimal choice of the algorithm and its parameters depend both on the Finite Difference stencil and on the processor architecture.
Journal Article•10.1016/J.PARCO.2012.05.001•
Elastic computing: A portable optimization framework for hybrid computers

[...]

John Wernsing1, Greg Stitt1•
University of Florida1
1 Aug 2012
TL;DR: This paper introduces elastic computing, which is an optimization framework where application designers invoke specialized elastic functions that contain a knowledge-base of implementation alternatives and parallelization strategies that enables dynamic and transparent optimization for different resources and run-time parameters.
Abstract: Due to power limitations and escalating cooling costs, high-performance computing systems can no longer rely solely on faster clock frequencies and numerous microprocessor nodes to meet increasing performance demands. As an alternative approach, high-performance systems are increasingly integrating multi-core processors and heterogeneous accelerators such as GPUs and FPGAs. However, usage of such hybrid systems has been limited largely to device experts due to significantly increased application design complexity. To enable more transparent usage of hybrid systems, we introduce elastic computing, which is an optimization framework where application designers invoke specialized elastic functions that contain a knowledge-base of implementation alternatives and parallelization strategies. For each elastic function, a collection of optimization tools analyze numerous possible implementations which enables dynamic and transparent optimization for different resources and run-time parameters. In this paper, we present the enabling technologies of elastic computing, and evaluate those technologies on four different hybrid systems, including the Novo-G FPGA supercomputer. The results include detailed case studies of using elastic computing for time-domain convolution and sum of absolute difference image retrieval, which achieved speedups up to 206x.
Proceedings Article•10.1109/INPAR.2012.6339611•
Parallel speculative encryption of multiple AES contexts on GPUs

[...]

Wagner M. Nunan Zola1, Luis C. E. Bona1•
Federal University of Paraná1
13 May 2012
TL;DR: This work presents a high performance heterogeneous parallel method for encryption using GPUs that executes most of the encryption processes on the GPU and partially on CPU.
Abstract: This work presents a high performance heterogeneous parallel method for encryption using GPUs. Our heterogeneous design executes most of the encryption processes on the GPU and partially on the CPU. Aside from the AES 16-byte block size, our parallel AES CTR algorithm divides work into small logical data blocks.
Journal Article•10.1016/J.PARCO.2011.11.001•
Load balancing in homogeneous pipeline based applications

[...]

Andreu Moreno, Eduardo César1, A. Guevara1, Joan Sorribes1, Tomàs Margalef1 •
Autonomous University of Barcelona1
1 Mar 2012
TL;DR: The key idea is to have free computational resources by gathering the pipeline's fastest stages and then using these resources to replicate the slowest stages, and this work shows a new strategy for dynamically improving the performance of pipeline applications.
Abstract: We propose to use knowledge about a parallel application's structure that was acquired with the use of a skeleton based development strategy to dynamically improve its performance. Parallel/distributed programming provides the possibility of solving highly demanding computational problems. However, this type of application requires support tools in all phases of the development cycle because the implementation is extremely difficult, especially for non-expert programmers. This work shows a new strategy for dynamically improving the performance of pipeline applications. We call this approach Dynamic Pipeline Mapping (DPM), and the key idea is to have free computational resources by gathering the pipeline's fastest stages and then using these resources to replicate the slowest stages. We present two versions of this strategy, both with complexity O(Nlog(N)) on the number of pipe stages, and we compare them to an optimal mapping algorithm and to the Binary Search Closest (BSC) algorithm [1]. Our results show that the DPM leads to significant performance improvements, increasing the application throughput up to 40% on average.
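The reasoning behind replicating the slowest stages can be illustrated with a small throughput calculation. The sketch below is not the DPM algorithm from the paper: it omits the step of gathering fast stages, assumes that a stage replicated r times processes items r times as fast, and uses a simple greedy rule. The names `throughput` and `replicate_bottlenecks` are hypothetical.

```python
def throughput(stage_times, replicas):
    """Items per time unit of a pipeline, limited by its slowest stage.

    Assumption: a stage with r replicas has effective time t / r.
    """
    return 1.0 / max(t / r for t, r in zip(stage_times, replicas))


def replicate_bottlenecks(stage_times, spare):
    """Greedy illustration: give each spare processor to the current
    bottleneck stage (ties go to the earliest stage)."""
    replicas = [1] * len(stage_times)
    for _ in range(spare):
        bottleneck = max(range(len(stage_times)),
                         key=lambda i: stage_times[i] / replicas[i])
        replicas[bottleneck] += 1
    return replicas


if __name__ == "__main__":
    times = [1.0, 4.0, 2.0]
    plan = replicate_bottlenecks(times, spare=3)
    print(plan, throughput(times, [1, 1, 1]), throughput(times, plan))
```

Under these assumptions, three spare processors are assigned as replicas `[1, 3, 2]`, and the bottleneck stage time drops from 4 to 4/3 time units per item.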
...
