Top 181 papers presented at Parallel Computing in 2013

Book•10.1007/978-3-642-31046-1•

Space-Filling Curves

Michael Bader¹, Hans-Joachim Bungartz¹, Miriam Mehl¹•Institutions (1)

Technische Universität München¹

1 Jan 2013

TL;DR: The present book provides an introduction to using space-filling curves as tools in scientific computing.

247 citations

Journal Article•10.1016/J.PARCO.2013.03.002•

Cost-efficient task scheduling for executing large programs in the cloud

Sen Su¹, Jian Li¹, Qingjia Huang¹, Xiao Huang¹, Kai Shuang¹, Jie Wang²•Institutions (2)

Beijing University of Posts and Telecommunications¹, University of Massachusetts Lowell²

1 Apr 2013

TL;DR: This work presents a cost-efficient task-scheduling algorithm using two heuristic strategies: the first dynamically maps tasks to the most cost-efficient VMs based on the concept of Pareto dominance, and the second reduces the monetary costs of non-critical tasks.

Abstract: Executing a large program using clouds is a promising approach, as this class of programs may be decomposed into multiple sequences of tasks that can be executed on multiple virtual machines (VMs) in a cloud. Such sequences of tasks can be represented as a directed acyclic graph (DAG), where nodes are tasks and edges are precedence constraints between tasks. Cloud users pay for what their programs actually use according to the pricing models of the cloud providers. Early task-scheduling algorithms focused on minimizing makespan, without mechanisms to reduce the monetary cost incurred in the setting of clouds. We present a cost-efficient task-scheduling algorithm using two heuristic strategies. The first strategy dynamically maps tasks to the most cost-efficient VMs based on the concept of Pareto dominance. The second strategy, a complement to the first strategy, reduces the monetary costs of non-critical tasks. We carry out extensive numerical experiments on large DAGs generated at random as well as on real applications. The simulation results show that our algorithm can substantially reduce monetary costs while producing a makespan as good as the best known task-scheduling algorithm can provide.
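
The Pareto-dominance idea mentioned in this abstract can be illustrated with a small sketch. This is not the authors' algorithm: the tuple layout, the VM names, and the numbers are all hypothetical, and the sketch only shows how dominated (finish time, monetary cost) options for a single task might be filtered out.

```python
from typing import List, Tuple

# Hypothetical summary of one candidate VM for one task:
# (vm_name, finish_time, monetary_cost). Lower is better for both objectives.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative numbers only (assumed, not taken from the paper).
vms = [
    ("small", 10.0, 1.0),
    ("medium", 6.0, 2.0),
    ("large", 6.0, 3.5),   # same finish time as "medium" but more expensive
    ("xlarge", 4.0, 5.0),
]
front = pareto_front(vms)
front_names = [name for name, _, _ in front]
```

Here "large" is dropped because "medium" finishes at the same time for less money; a scheduler would then choose among the remaining non-dominated options according to its own cost-efficiency criterion.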

192 citations

Journal Article•10.1016/J.PARCO.2013.09.009•

A survey on resource allocation in high performance distributed computing systems

Hameed Hussain¹, Saif Ur Rehman Malik², Abdul Hameed², Samee U. Khan², Gage Bickler², Nasro Min-Allah¹, Muhammad Bilal Qureshi¹, Limin Zhang², Wang Yong-Ji³, Nasir Ghani⁴, Joanna Kolodziej, Albert Y. Zomaya⁵, Cheng-Zhong Xu⁶, Pavan Balaji⁷, Abhinav Vishnu⁸, Fredric Pinel⁹, Johnatan E. Pecero⁹, Dzmitry Kliazovich⁹, Pascal Bouvry⁹, Hongxiang Li¹⁰, Lizhe Wang³, Dan Chen¹¹, Ammar Rayes¹²•Institutions (12)

COMSATS Institute of Information Technology¹, North Dakota State University², Chinese Academy of Sciences³, University of South Florida⁴, University of Sydney⁵, Wayne State University⁶, Argonne National Laboratory⁷, Pacific Northwest National Laboratory⁸, University of Luxembourg⁹, University of Louisville¹⁰, China University of Geosciences (Wuhan)¹¹, Cisco Systems, Inc.¹²

1 Nov 2013

TL;DR: In this study, through analysis, a comprehensive survey for describing resource allocation in various HPCs is reported and the system classification is used to identify approaches followed by the implementation of existing resource allocation strategies that are widely presented in the literature.

Abstract: Classification of high performance computing (HPC) systems is provided. Current HPC paradigms and industrial application suites are discussed. State of the art in HPC resource allocation is reported. Hardware and software solutions are discussed for optimized HPC systems. Efficient resource allocation is a fundamental requirement in high performance computing (HPC) systems. Many projects are dedicated to large-scale distributed computing systems that have designed and developed resource allocation mechanisms with a variety of architectures and services. In our study, through analysis, a comprehensive survey describing resource allocation in various HPC systems is reported. The aim of the work is to aggregate the existing solutions for HPC under a joint framework, in order to provide a thorough analysis and characterization of the resource management and allocation strategies. Resource allocation mechanisms and strategies play a vital role in the performance improvement of all HPC classifications. Therefore, a comprehensive discussion of widely used resource allocation strategies deployed in HPC environments is required, which is one of the motivations of this survey. Moreover, we have classified the HPC systems into three broad categories, namely: (a) cluster, (b) grid, and (c) cloud systems, and define the characteristics of each class by extracting sets of common attributes. All of the aforementioned systems are cataloged into pure software and hybrid/hardware solutions. The system classification is used to identify approaches followed by the implementation of existing resource allocation strategies that are widely presented in the literature.

184 citations

Journal Article•10.1016/J.PARCO.2012.10.002•

Multi-level parallelism for incompressible flow computations on GPU clusters

Dana Jacobsen¹, Inanc Senocak¹•Institutions (1)

Boise State University¹

1 Jan 2013

TL;DR: The results for strong and weak scaling analysis of incompressible flow computations demonstrate that GPU clusters offer significant benefits for large data sets, and a dual-level MPI-CUDA implementation with maximum overlapping of computation and communication provides substantial benefits in performance.

Abstract: We investigate multi-level parallelism on GPU clusters with MPI-CUDA and hybrid MPI-OpenMP-CUDA parallel implementations, in which all computations are done on the GPU using CUDA. We explore efficiency and scalability of incompressible flow computations using up to 256 GPUs on a problem with approximately 17.2 billion cells. Our work addresses some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism that uses either MPI or MPI-OpenMP for communications. We present three different strategies to overlap computations with communications, and systematically assess their impact on parallel performance on two different GPU clusters. Our results for strong and weak scaling analysis of incompressible flow computations demonstrate that GPU clusters offer significant benefits for large data sets, and a dual-level MPI-CUDA implementation with maximum overlapping of computation and communication provides substantial benefits in performance. We also find that our tri-level MPI-OpenMP-CUDA parallel implementation does not offer a significant advantage in performance over the dual-level implementation on GPU clusters with two GPUs per node, but on clusters with higher GPU counts per node or with different domain decomposition strategies a tri-level implementation may exhibit higher efficiency than a dual-level implementation and needs to be investigated further.

75 citations

Journal Article•10.1016/J.PARCO.2013.09.005•

CUDA-enabled Sparse Matrix-Vector Multiplication on GPUs using atomic operations

Hoang-Vu Dang¹, Bertil Schmidt¹•Institutions (1)

University of Mainz¹

1 Nov 2013

TL;DR: A new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations is presented and compared to existing formats of the NVIDIA Cusp library using large sparse matrices.

Abstract: We propose the Sliced Coordinate Format (SCOO) for Sparse Matrix-Vector Multiplication on GPUs. An associated CUDA implementation which takes advantage of atomic operations is presented. We propose partitioning methods to transform a given sparse matrix into SCOO format. An efficient dual-GPU implementation which overlaps computation and communication is described. Extensive performance comparisons of SCOO with other formats on GPUs and CPUs are provided. Existing formats for Sparse Matrix-Vector Multiplication (SpMV) on the GPU are outperforming their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR formats for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Due to the lower performance of existing CUDA-enabled GPUs for atomic operations on double-precision floating-point numbers, the SCOO implementation for double precision does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, and 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, comparison to a Sandy Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL Library on an i7-2700K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision. Source code is available at https://github.com/danghvu/cudaSpmv.
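
For readers unfamiliar with the coordinate (COO) layout that SCOO builds on, the following is a minimal sequential sketch of SpMV over plain COO triples. It is not the SCOO format or the authors' CUDA code; the function name and the example matrix are assumptions made for illustration. The comment marks the accumulation that would require an atomic add if nonzeros were processed by concurrent GPU threads.

```python
from typing import List, Sequence


def spmv_coo(rows: Sequence[int], cols: Sequence[int], vals: Sequence[float],
             x: Sequence[float], n_rows: int) -> List[float]:
    """Compute y = A @ x where A is stored as COO triples (row, col, value)."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        # Several nonzeros may target the same y[r]. Sequentially this is a
        # plain accumulation; with one thread per nonzero it would have to be
        # an atomic add to avoid lost updates.
        y[r] += v * x[c]
    return y


# Hypothetical 3x3 example:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 3.0, 5.0]
y = spmv_coo(rows, cols, vals, [1.0, 2.0, 3.0], n_rows=3)
```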

50 citations

Journal Article•10.1016/J.PARCO.2013.03.001•

Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines

George Teodoro¹, Tony Pan¹, Tahsin Kurc¹, Jun Kong¹, Lee Cooper¹, Joel H. Saltz¹•Institutions (1)

Emory University¹

1 Apr 2013

TL;DR: This work develops and evaluates strategies for efficient computation and propagation of wavefronts using a multi-level queue structure that improves the utilization of fast memories in a GPU and reduces synchronization overheads and develops a tile-based parallelization strategy to support execution on multiple CPUs and GPUs.

Abstract: We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU-accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. The cooperative use of multiple CPUs and GPUs attains speedups of 50× and 85× with respect to single-core CPU executions for morphological reconstruction and Euclidean distance transform, respectively.
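
The propagation pattern described in this abstract can be sketched sequentially with a single FIFO queue. The code below is an illustrative, single-threaded version of queue-based grayscale morphological reconstruction on a 4-connected grid; it is not the authors' multi-level GPU queue, and the function name, the choice to seed the queue with every pixel, and the example data are assumptions.

```python
from collections import deque
from typing import List

Grid = List[List[int]]


def reconstruct(marker: Grid, mask: Grid) -> Grid:
    """Propagate marker values under the mask until no element changes.

    Assumes marker[i][j] <= mask[i][j] everywhere and a non-empty grid.
    """
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in marker]
    # Simplification: every pixel starts in the wavefront.
    queue = deque((i, j) for i in range(h) for j in range(w))
    while queue:
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w:
                # Propagation condition: the neighbor can be raised,
                # but never above its mask value.
                candidate = min(out[i][j], mask[ni][nj])
                if candidate > out[ni][nj]:
                    out[ni][nj] = candidate
                    queue.append((ni, nj))  # neighbor joins the wavefront
    return out


# Hypothetical one-row image: the peak of 5 spreads left up to the mask value
# and right only as far as the valley of height 1 allows.
mask = [[3, 5, 1, 7, 7]]
marker = [[0, 5, 0, 0, 0]]
result = reconstruct(marker, mask)
```

Which elements enter the queue depends on the data, which is what makes the access pattern irregular and motivates the specialized queue structures the paper evaluates.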

40 citations

Journal Article•10.1016/J.PARCO.2013.01.003•

Hierarchical QR factorization algorithms for multi-core clusters

Jack Dongarra¹, Mathieu Faverge¹, Thomas Herault¹, Mathias Jacquelin², Julien Langou³, Yves Robert⁴•Institutions (4)

University of Tennessee¹, French Institute for Research in Computer Science and Automation², University of Colorado Denver³, École normale supérieure de Lyon⁴

1 Apr 2013

TL;DR: The implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.

Abstract: This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms, which naturally enable good data locality for the sequential kernels executed by the cores (high sequential performance), a low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (a.k.a. "communication-avoiding"), it is natural to consider hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low-level" for decoupled highly parallel inter-node reductions, (2) "domino level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to building up performance and (ii) build insights on how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.

40 citations

Proceedings Article•10.3929/ETHZ-A-009922757•

A fault tolerant implementation of Multi-Level Monte Carlo methods

Stefan Pauli¹, Manuel Kohler¹, Peter Arbenz¹•Institutions (1)

ETH Zurich¹

1 Jan 2013

TL;DR: An MPI-parallelized fault tolerant MLMC version of an existing parallel MLMC code (ALSVID-UQ) is implemented, based on the User Level Failure Mitigation, a fault tolerant extension of MPI.

Abstract: The theory behind fault tolerant multi-level Monte Carlo (FT-MLMC) methods was recently developed and tested. These tests were made without a real fault tolerant implementation. We implemented an MPI-parallelized fault tolerant MLMC version of an existing parallel MLMC code (ALSVID-UQ). It is based on the User Level Failure Mitigation, a fault tolerant extension of MPI. We confirm our FT-MLMC theory by means of simulations of the two-dimensional stochastic Euler equations of gas dynamics.

38 citations

Journal Article•10.1016/J.PARCO.2013.04.010•

Fixed Latency On-Chip Interconnect for Hardware Spiking Neural Network Architectures

Sandeep Pande¹, Fearghal Morgan¹, Gerard J. M. Smit², T.M. Bruintjes², B Rutgers², Brian McGinley¹, Seamus Cawley¹, Jim Harkin³, Liam McDaid³•Institutions (3)

National University of Ireland, Galway¹, University of Twente², Intel³

1 Sep 2013

TL;DR: In this paper, a fixed spike transfer latency ring topology interconnect for spike communication between neural tiles, using a novel timestamped spike broadcast flow control scheme, is proposed.

Abstract: Information in a Spiking Neural Network (SNN) is encoded as the relative timing between spikes. Distortion in spike timings can impact the accuracy of SNN operation by modifying the precise firing time of neurons within the SNN. Maintaining the integrity of spike timings is crucial for reliable operation of SNN applications. A packet switched Network on Chip (NoC) infrastructure offers scalable connectivity for spike communication in hardware SNN architectures. However, shared resources in NoC architectures can result in unwanted variation in spike packet transfer latency. This packet latency jitter distorts the timing information conveyed on the synaptic connections in the SNN, resulting in unreliable application behaviour. This paper presents a SystemC simulation based analysis of the synaptic information distortion in NoC based hardware SNNs. The paper proposes a fixed spike transfer latency ring topology interconnect for spike communication between neural tiles, using a novel timestamped spike broadcast flow control scheme. The proposed architectural technique is evaluated using spike rates employed in previously reported mesh topology NoC based hardware SNN applications, which exhibited spike latency jitter over NoC paths. Results indicate that the proposed interconnect offers fixed spike transfer latency and eliminates the associated information distortion. The paper presents the micro-architecture of the proposed ring router. The FPGA validated ring interconnect architecture has been synthesised using 65nm low-power CMOS technology. Silicon area comparisons for various ring sizes are presented. Scalability of the proposed architecture has been addressed by employing a hierarchical NoC architecture.

35 citations

Journal Article•10.1016/J.PARCO.2013.08.010•

MRO-MPI: MapReduce overlapping using MPI and an optimized data exchange policy

Hisham Mohamed¹, Stéphane Marchand-Maillet¹•Institutions (1)

University of Geneva¹

1 Dec 2013

TL;DR: This implementation is based on running the map and the reduce functions concurrently in parallel by exchanging partial intermediate data between them in a pipeline fashion using MPI, which is an adapted structure of the MapReduce programming model for fast intensive data processing.

Abstract: MapReduce is a programming model proposed to simplify large-scale data processing. In contrast, the message passing interface (MPI) standard is extensively used for algorithmic parallelization, as it accommodates an efficient communication infrastructure. In the original implementation of MapReduce, the reduce function can only start processing following termination of the map function. If the map function is slow for any reason, this will affect the whole running time. In this paper, we propose MapReduce overlapping using MPI, which is an adapted structure of the MapReduce programming model for fast intensive data processing. Our implementation is based on running the map and the reduce functions concurrently in parallel by exchanging partial intermediate data between them in a pipeline fashion using MPI. At the same time, we maintain the usability and the simplicity of MapReduce. Experimental results based on three different applications (WordCount, Distributed Inverted Indexing and Distributed Approximate Similarity Search) show a good speedup compared to the earlier versions of MapReduce such as Hadoop and the available MPI-MapReduce implementations.

31 citations

Proceedings Article•10.3233/978-1-61499-381-0-263•

A space-time parallel solver for the three-dimensional heat equation

Robert Speck¹, Daniel Ruprecht¹, Matthew Emmett², Matthias Bolten³, Rolf Krause¹•Institutions (3)

University of Lugano¹, Lawrence Berkeley National Laboratory², University of Wuppertal³

1 Jan 2013

TL;DR: In this article, a combination of the time-parallel PFASST with a parallel multigrid method (PMG) in space and time is presented, resulting in a mesh-based solver for the three-dimensional heat equation with a uniquely high degree of efficient concurrency.

Abstract: The paper presents a combination of the time-parallel “parallel full approximation scheme in space and time” (PFASST) with a parallel multigrid method (PMG) in space, resulting in a mesh-based solver for the three-dimensional heat equation with a uniquely high degree of efficient concurrency. Parallel scaling tests are reported on the Cray XE6 machine “Monte Rosa” on up to 16,384 cores and on the IBM Blue Gene/Q system “JUQUEEN” on up to 65,536 cores. The efficacy of the combined spatial and temporal parallelization is shown by demonstrating that using PFASST in addition to PMG significantly extends the strong-scaling limit. Implications of using spatial coarsening strategies in PFASST’s multi-level hierarchy in large-scale parallel simulations are discussed.

Journal Article•10.1016/J.PARCO.2013.05.004•

Framework for a productive performance optimization

Harald Servat¹, Germán Llort¹, Kevin Huck², Judit Gimenez¹, Jesús Labarta¹•Institutions (2)

Polytechnic University of Catalonia¹, University of Oregon²

1 Aug 2013

TL;DR: A framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications and simplifies the semantics of the performance counters into metrics that refer to processor functional units is presented.

Abstract: Modern supercomputers deliver large computational power, but it is difficult for an application to exploit such power. One factor that limits the application performance is the single node performance. While many performance tools use the microprocessor performance counters to provide insights on serial node performance issues, the complex semantics of these counters pose an obstacle to an inexperienced developer. We present a framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications. The output of the framework is precise and it is capable of correlating performance inefficiencies with small regions of code within the application. The framework not only points to regions of code but also simplifies the semantics of the performance counters into metrics that refer to processor functional units. With such information the developer can focus on the identified code and improve it by knowing which processor execution unit is degrading the performance. To demonstrate the usefulness of the framework we apply it to three already optimized applications using realistic inputs and, according to the results, modify their source code. By doing modifications that require little effort, we successfully increase the applications' performance from 10% to 30% and thus shorten the time required to reach the solution and/or allow facing increased problem sizes.

Proceedings Article•

Load Balancing for Massively Parallel Computations with the Sparse Grid Combination Technique.

Mario Heene¹, Christoph Kowitz², Dirk Pflüger¹•Institutions (2)

University of Stuttgart¹, Technische Universität München²

1 Jan 2013

TL;DR: A load model for linear initial value runs with GENE is introduced for effective load balancing for the sparse grid combination technique, which will equip GENE for exascale computing.

Abstract: Massively parallel simulations of plasma microturbulence using GENE are facing the curse of dimensionality, since the discretization of the five-dimensional gyrokinetic equations requires a large number of grid points even for only moderate resolutions. The sparse grid combination technique can be used to tackle the curse of dimensionality. Being based on a superposition of anisotropic full grid solutions that can be computed independently of each other, it introduces a second layer of parallelism that will equip GENE for exascale computing. Since the anisotropy of the discretizations of the partial solutions results in massive load imbalances, effective scheduling is crucial in order to exploit this parallelism. In this paper, a load model for linear initial value runs with GENE is introduced for effective load balancing for the combination technique.

Journal Article•10.1016/J.PARCO.2012.04.001•

Performance comparison of parallel eigensolvers based on a contour integral method and a Lanczos method

Ichitaro Yamazaki¹,², Hiroto Tadano², Tetsuya Sakurai², Tsutomu Ikegami¹•Institutions (2)

National Institute of Advanced Industrial Science and Technology¹, University of Tsukuba²

1 Jun 2013

TL;DR: Develop performance models and present numerical results of solving large-scale eigenvalue problems arising from simulations of modeling accelerator cavities and identify the crossover point, where SSEig becomes faster than TRLan.

Abstract: We study the performance of a parallel nonlinear eigensolver SSEig which is based on a contour integral method. We focus on symmetric generalized eigenvalue problems (GEPs) of computing interior eigenvalues. We chose to focus on GEPs because we can then compare the performance of SSEig with that of a publicly-available software package TRLan, which is based on a thick restart Lanczos method. To solve this type of problems, SSEig requires the solution of independent linear systems with different shifts, while TRLan solves a sequence of linear systems with a single shift. Therefore, while SSEig typically has a computational cost greater than that of TRLan, it also has greater parallel scalability. To compare the performance of these two solvers, in this paper, we develop performance models and present numerical results of solving large-scale eigenvalue problems arising from simulations of modeling accelerator cavities. In particular, we identify the crossover point, where SSEig becomes faster than TRLan. The parallel performance of SSEig solving nonlinear eigenvalue problems is also studied.

Journal Article•10.1016/J.PARCO.2013.03.003•

A fast parallel algorithm for solving block-tridiagonal systems of linear equations including the domain decomposition method

Andrew V. Terekhov

1 Jun 2013

TL;DR: In this paper, a parallel algorithm for solving block-tridiagonal systems of equations is presented, which is an effective and simple set of procedures for solving engineering tasks on a supercomputer.

Abstract: In this study, we develop a new parallel algorithm for solving systems of linear algebraic equations with the same block-tridiagonal matrix but with different right-hand sides. The method is a generalization of the parallel dichotomy algorithm for solving systems of linear equations with tridiagonal matrices [1]. Using this approach, we propose a parallel realization of the domain decomposition method (the Schur complement method). The calculation of acoustic wave fields using the spectral-difference technique improves the efficiency of the parallel algorithms. A near-linear dependence of the speedup on the number of processors is attained using both several and several thousands of processors. This study is innovative because the parallel algorithm developed for solving block-tridiagonal systems of equations is an effective and simple set of procedures for solving engineering tasks on a supercomputer.
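
To make the setting of "same matrix, different right-hand sides" concrete, here is a minimal serial sketch for the scalar tridiagonal case using the classical Thomas algorithm, with the elimination coefficients computed once and reused for each right-hand side. It is not the parallel dichotomy algorithm or its block generalization from the paper; the function names are hypothetical, and the sketch assumes no pivoting is needed (for example, a diagonally dominant matrix).

```python
from typing import List, Sequence, Tuple


def factor_tridiagonal(a: Sequence[float], b: Sequence[float],
                       c: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Precompute elimination coefficients for a tridiagonal matrix.

    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused). Assumes no pivoting is required.
    """
    n = len(b)
    cp = [0.0] * n      # scaled super-diagonal
    denom = [0.0] * n   # pivots after elimination
    denom[0] = b[0]
    for i in range(n - 1):
        cp[i] = c[i] / denom[i]
        denom[i + 1] = b[i + 1] - a[i + 1] * cp[i]
    return cp, denom


def solve_factored(a: Sequence[float], cp: Sequence[float],
                   denom: Sequence[float], d: Sequence[float]) -> List[float]:
    """Solve for one right-hand side d using the precomputed coefficients."""
    n = len(d)
    y = [0.0] * n
    y[0] = d[0] / denom[0]
    for i in range(1, n):                 # forward substitution
        y[i] = (d[i] - a[i] * y[i - 1]) / denom[i]
    x = y[:]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = y[i] - cp[i] * x[i + 1]
    return x


# Hypothetical 3x3 system with diagonal 2 and off-diagonals -1.
a = [0.0, -1.0, -1.0]
b = [2.0, 2.0, 2.0]
c = [-1.0, -1.0, 0.0]
cp, denom = factor_tridiagonal(a, b, c)   # done once
x1 = solve_factored(a, cp, denom, [0.0, 0.0, 4.0])
x2 = solve_factored(a, cp, denom, [1.0, 0.0, 1.0])
```

The factorization cost is paid once, so each additional right-hand side needs only the two substitution sweeps.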

Journal Article•10.1016/J.PARCO.2013.08.012•

A map-reduce Lagrangian heuristic for multidimensional assignment problems with decomposable costs

Gregory Tauer¹, Rakesh Nagi²•Institutions (2)

University at Buffalo¹, University of Illinois at Urbana–Champaign²

1 Nov 2013

TL;DR: A Lagrangian relaxation based heuristic for the multi-dimensional assignment problem with decomposable costs that can be largely implemented in a map-reduce framework and thus easily distributed across a cluster of computers is described.

Abstract: Data association framed as multidimensional assignment with decomposable costs. Distribution of a Lagrangian relaxation heuristic using a map-reduce framework. Parallel computation of Lagrange multipliers. New feasibility procedure for the relaxed multidimensional assignment problem. High quality and scalable solutions to large data association problems. Data association is the problem of identifying when multiple data sources have observed the same entity. Central to this effort is the multidimensional assignment problem, which is often used to formulate data association problems. The nature of data association problems dictates that solution methods for the multidimensional assignment problem must return results promptly and work on large data sets. The contribution of this work is to describe a Lagrangian relaxation based heuristic for the multidimensional assignment problem with decomposable costs that can be largely implemented in a map-reduce framework and thus easily distributed across a cluster of computers. Distribution allows the heuristic to address the run time and large data requirements of data association. The developed algorithm is tested on a synthesized dataset and shown to achieve an optimality gap ranging from 0.00008% to 0.6% for dense (no filtering) problems having 10,000 observations. Distribution of the algorithm was found to offer a significant reduction in run time on 30,000-observation problems for an 8-node computing cluster with 96 processors over a single node with 12 processors.

Journal Article•10.1016/J.PARCO.2013.04.012•

A framework for argument-based task synchronization with automatic detection of dependencies

Carlos H. González, Basilio B. Fraguela

1 Sep 2013

TL;DR: This paper presents a library-based approach that enables arbitrary patterns of parallelism with minimal effort for the user and is the first generic approach to express parallelism that requires neither explicit synchronizations nor a detail of the dependencies of the parallel tasks.

Abstract: Synchronization in parallel applications can be achieved either implicitly or explicitly. Implicit synchronization is typical of programming environments that provide predefined, and often simple, patterns of parallelism such as data-parallel libraries and languages and skeletal operations. Nevertheless, more flexible approaches that allow arbitrary task-level parallel computations to be expressed without a predefined structure require in turn that the user explicitly specify the synchronization needed among the parallel tasks. In this paper we present a library-based approach that enables arbitrary patterns of parallelism with minimal effort for the user. Our proposal is the first generic approach to express parallelism we know of that requires neither explicit synchronizations nor a detailed specification of the dependencies of the parallel tasks. Our strategy relies on expressing the parallel tasks as functions that convey their dependencies implicitly by means of their arguments. These function arguments are analyzed by our library, called DepSpawn, when a parallel task is spawned in order to enforce its dependencies. Our experiments indicate that DepSpawn is very competitive, both in terms of performance and programmability, with respect to a widespread high-level approach like OpenMP.

Journal Article•10.1016/J.PARCO.2012.09.003•

LIBI: A framework for bootstrapping extreme scale software systems

J. D. Goehner¹, Dorian Arnold¹, Dong H. Ahn², Gregory L. Lee², B R de Supinski², Matthew Legendre², Barton P. Miller³, Martin Schulz²•Institutions (3)

University of New Mexico¹, Lawrence Livermore National Laboratory², University of Wisconsin-Madison³

1 Mar 2013

TL;DR: This paper describes the lightweight infrastructure-bootstrapping infrastructure (LIBI), both a bootstrapping API specification and a reference implementation and presents a performance evaluation of different process launching schemes based on the LIBI prototype.

Abstract: As the sizes of high-end computing systems continue to grow to massive scales, efficient bootstrapping for distributed software infrastructures is becoming a greater challenge. Distributed software infrastructure bootstrapping is the procedure of instantiating all processes of the distributed system on the appropriate hardware nodes and disseminating to these processes the information that they need to complete the infrastructure's start-up phase. In this paper, we describe the lightweight infrastructure-bootstrapping infrastructure (LIBI), both a bootstrapping API specification and a reference implementation. We describe a classification system for process launching mechanisms and then present a performance evaluation of different process launching schemes based on our LIBI prototype.
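Illustrative sketch (not from the paper): a toy cost model shows why the choice of launching scheme matters at scale. The two schemes and the unit-cost assumption (starting one process takes one step, and a launcher starts its children one at a time) are assumptions made for this example. They are not LIBI's classification or its API.

```python
def sequential_steps(n):
    """A single front end starts all n processes one after another."""
    return n


def tree_steps(n, fanout):
    """Each started process launches `fanout` children, level by level.

    Returns the number of steps until at least n processes are running,
    assuming all launchers of one level work concurrently.
    """
    started, frontier, steps = 0, 1, 0   # frontier starts as the front end
    while started < n:
        frontier *= fanout    # every launcher of this level starts its children
        started += frontier
        steps += fanout       # children are started one at a time
    return steps


n = 1024
costs = {
    "sequential": sequential_steps(n),
    "binary tree": tree_steps(n, 2),
    "32-ary tree": tree_steps(n, 32),
}
```

Under these assumptions the sequential scheme grows linearly with the process count, while tree-based launching grows roughly logarithmically.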

Book Chapter•10.1007/978-3-642-53962-6_13•

Parallelization of a DEM Code Based on CPU-GPU Heterogeneous Architecture

Xiaoqiang Yue¹, Hao Zhang², Congshu Luo¹, Shi Shu¹, Chunsheng Feng¹•Institutions (2)

Xiangtan University¹, Polytechnic University of Catalonia²

20 May 2013

TL;DR: It is shown that the final parallel code substantially accelerates Trubal, with an average speedup of 4.69 in computation time.

Abstract: Particulate flows are commonly encountered in both engineering and environmental applications. The discrete element method (DEM) has attracted considerable attention because it can predict the whole motion of a particulate flow by tracking every single particle. However, the computational capability of the method depends strongly on the numerical scheme as well as on the hardware environment. In this study, a DEM-based code named Trubal was parallelized. Numerical simulations were carried out to show the benefits of this work. The final parallel code substantially accelerates Trubal: simulating 6,000 particles on an NVIDIA Tesla C2050 card together with an Intel Core-Dual 2.93 GHz CPU yielded an average speedup of 4.69 in computation time.

Proceedings Article•10.3233/978-1-61499-381-0-703•

End-to-end Parallel Simulations with APES.

Harald Klimach¹, Kartik Jain², Sabine Roller²•Institutions (2)

RWTH Aachen University¹, Folkwang University of the Arts²

1 Jan 2013

Journal Article•10.1016/J.PARCO.2013.04.009•

Efficient routing techniques in heterogeneous 3D Networks-on-Chip

Michael Opoku Agyeman¹, Ali Ahmadinia¹, Alireza Shahrabi¹•Institutions (1)

Glasgow Caledonian University¹

1 Sep 2013

TL;DR: Experimental results show that the proposed architecture improves performance by up to 75% by replacing 2D static routers with adaptive 2D routers in heterogeneous 3D NoCs, while keeping the maximum clock frequency, power and energy consumption of the adaptive router nearly at the same level as the static router.

Abstract: Three-dimensional Networks-on-Chip (3D NoCs) have recently been proposed to address the on-chip communication demands of future highly dense 3D multi-core systems. Homogeneous 3D NoC topologies have many Through-Silicon Vias (TSVs), which have a costly and complex manufacturing process. Also, 3D routers use more memory and are more power hungry than conventional 2D routers. Alternatively, heterogeneous 3D NoCs combine both the area and performance benefits of 2D and 3D static router architectures by using a limited number of TSVs. To improve the performance of heterogeneous 3D NoCs, we propose an adaptive router architecture which balances the traffic in such NoCs. In particular, experimental results show that our proposed architecture improves performance by up to 75% by replacing 2D static routers with adaptive 2D routers in heterogeneous 3D NoCs, while keeping the maximum clock frequency, power and energy consumption of the adaptive router nearly at the same level as the static router.

Proceedings Article•10.3233/978-1-61499-381-0-753•

Divide and Conquer Parallelization of Finite Element Method Assembly

Loïc Thébault¹, Eric Petit¹, Marc Tchiboukdjian², Quang Dinh³, William Jalby¹•Institutions (3)

Versailles Saint-Quentin-en-Yvelines University¹, University of Grenoble², Dassault Aviation³

1 Jan 2013

TL;DR: This paper proposes and evaluates a Divide and Conquer (D&C) algorithm to efficiently parallelize the FEM assembly, and compares this hybrid approach to the pure domain decomposition and to a state-of-the-art hybrid approach using mesh coloring.

Abstract: Relying solely on domain decomposition and distributed memory parallelism can limit the performance on current supercomputers. At scale, a larger number of smaller domains can lead to an increased communication volume and to load balancing issues. Moreover, the decreasing memory per core is not compatible with the memory overhead of a finer domain decomposition. A popular alternative is to use shared memory parallelism in addition to the domain decomposition. In the context of the Finite Element Method (FEM), one of the challenging steps to parallelize in shared memory is the matrix assembly. In this paper, we propose and evaluate a Divide and Conquer (D&C) algorithm to efficiently parallelize the FEM assembly. We compare this hybrid approach using D&C to the pure domain decomposition and to a state-of-the-art hybrid approach using mesh coloring. Our target application is an industrial fluid dynamics code, developed by Dassault Aviation and parallelized with MPI domain decomposition. The original Fortran code has been modified with minimum intrusion. Our D&C approach uses task parallelism with Intel Cilk+. Preliminary results show good data locality and a 14% performance improvement on a 12-core, 2-socket Westmere-EP node.
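Illustrative sketch (not from the paper): one common way to apply divide and conquer to assembly is to bisect the element list recursively into two halves that touch disjoint nodes, plus a separator that is processed once both halves are done. The code below is a sequential Python sketch of that idea on a 1D mesh. The function names, the bisection rule and the unit element contribution are assumptions for illustration. The paper's implementation modifies Fortran code and uses Intel Cilk+ tasks, and its details are not given in the abstract.

```python
def split(elements):
    """Bisect elements into left, right and separator parts.

    Separator elements touch a node used by both halves, so after removing
    them the left and right parts update disjoint sets of nodes.
    """
    half = len(elements) // 2
    left, right = elements[:half], elements[half:]
    shared = ({n for e in left for n in e} &
              {n for e in right for n in e})
    sep = [e for e in elements if shared & set(e)]
    left = [e for e in left if not shared & set(e)]
    right = [e for e in right if not shared & set(e)]
    return left, right, sep


def add_contribution(elements, diag):
    for e in elements:
        for n in e:
            diag[n] += 1.0   # stand-in for a real element matrix entry


def assemble(elements, diag, leaf_size=2):
    if len(elements) <= leaf_size:
        add_contribution(elements, diag)
        return
    left, right, sep = split(elements)
    # left and right write to disjoint nodes: these two calls could be
    # spawned as parallel tasks without locks or atomics.
    assemble(left, diag, leaf_size)
    assemble(right, diag, leaf_size)
    # The separator is assembled only after both halves have finished.
    add_contribution(sep, diag)


# 1D mesh: 8 two-node elements over 9 nodes.
mesh = [(i, i + 1) for i in range(8)]
diag = [0.0] * 9
assemble(mesh, diag)
```

Because each element lands in exactly one of the three parts, the result equals that of a plain sequential loop over all elements.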

Proceedings Article•

Profiling Hybrid HMPP Applications with Score-P on Heterogeneous Hardware.

Marc Schlütter¹, Peter Philippen², Laurent Morin, Markus Geimer², Bernd Mohr¹•Institutions (2)

Forschungszentrum Jülich¹, Dresden University of Technology²

1 Jan 2013

TL;DR: The integration and combined use of Score-P and the CAPS compilers are presented as one approach to efficiently parallelizing and optimizing codes, and the PHMPP profiling interface, its implementation in Score-P, and the presentation of preliminary results in CUBE are described.

Abstract: In heterogeneous environments with multi-core systems and accelerators, programming and optimizing large parallel applications turns into a time-intensive and hardware-dependent challenge. To assist application developers in this process, a number of tools and high-level compilers have been developed. Directive-based programming models such as HMPP and OpenACC provide abstractions over low-level GPU programming models, such as CUDA or OpenCL. The compilers developed by CAPS automatically transform the pragma-annotated application code into low-level code, thereby allowing the parallelization and optimization for a given accelerator hardware. To analyze the performance of parallel applications, multiple partners in Germany and the US are jointly developing the community measurement infrastructure Score-P. Score-P gathers performance execution profiles, which can be presented and analyzed within the CUBE result browser, and collects detailed event traces to be processed by post-mortem analysis tools such as Scalasca and Vampir. In this paper we present the integration and combined use of Score-P and the CAPS compilers as one approach to efficiently parallelize and optimize codes. Specifically, we describe the PHMPP profiling interface, its implementation in Score-P, and the presentation of preliminary results in CUBE.

Journal Article•10.1016/J.PARCO.2013.04.006•

DDS: A deadlock detection-based scheduling algorithm for workflow computations in HPC systems with storage constraints

Yang Wang¹, Paul Lu²•Institutions (2)

University of New Brunswick¹, University of Alberta²

1 Aug 2013

TL;DR: A deadlock detection-based scheduling (DDS) algorithm that can achieve high performance by making the best use of the available storage resources and achieve higher performance than some deadlock avoidance methods in synthetic and real workflow computations.

Abstract: Workflow-based workloads usually consist of multiple instances of the same workflow, which are jobs with control or data dependencies, to carry out a well-defined scientific computation task, with each instance acting on its own input data. To maximize throughput performance, a high degree of concurrency is achievable by running multiple instances simultaneously. However, deadlock is a potential problem when storage is constrained. To address this problem, we design and evaluate a deadlock detection-based scheduling (DDS) algorithm that can achieve high performance by making the best use of the available storage resources. Our algorithm takes advantage of the dataflow information of the workflow to speculatively schedule each instance if the instant storage is sufficient for some constituent jobs, but not necessarily for the whole workflow instance. Whenever deadlock or a performance anomaly is detected, some selected in-progress workflow instances are rolled back to release storage for other blocked jobs. We develop a suite of strategies to select the victims and beneficiaries (instances or jobs) and evaluate their performance via a simulation-based study. Our results show that the DDS algorithm can adapt the job concurrency to the available storage resources and achieve higher performance than some deadlock avoidance methods in our synthetic and real workflow computations.
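Illustrative sketch (not from the paper): the detect-then-roll-back idea can be shown with a minimal storage model. The sketch assumes that no job is currently running, that each in-progress instance holds some storage and needs a known amount for its next job, and that the victim is the instance holding the most storage. The `Instance` type and this victim policy are hypothetical. The paper evaluates its own suite of selection strategies, which the abstract does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    held: int        # storage occupied by data this instance already produced
    next_need: int   # storage its next job must allocate before it can run


def is_deadlocked(instances, free):
    """True if no instance can start its next job with the free storage."""
    return bool(instances) and all(i.next_need > free for i in instances)


def resolve(instances, free):
    """Roll back victims until some blocked job fits.

    Returns the names of the rolled-back instances and the free storage
    afterwards. Victim policy (an assumption): largest holder first.
    """
    live = list(instances)
    victims = []
    while is_deadlocked(live, free):
        victim = max(live, key=lambda i: i.held)
        live.remove(victim)
        free += victim.held          # rollback releases the victim's storage
        victims.append(victim.name)
    return victims, free


# Capacity 100: 95 units are held, 5 are free, and every next job needs more.
running = [
    Instance("A", held=40, next_need=30),
    Instance("B", held=35, next_need=30),
    Instance("C", held=20, next_need=10),
]
deadlock_before = is_deadlocked(running, free=5)
victims, free_after = resolve(running, free=5)
```

Rolling back instance A frees enough storage for the remaining instances to make progress, at the price of redoing A's work later.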

Book Chapter•10.1007/978-3-642-53962-6_48•

Application of Improved Simulated Annealing Optimization Algorithms in Hardware/Software Partitioning of the Reconfigurable System-on-Chip

Yiming Jing, Jishun Kuang, Jiayi Du, Biao Hu

20 May 2013

TL;DR: The experimental results on a set of benchmarks show that the proposed Greedy Simulated Annealing Algorithm (GSAA) improves performance by 34.96% and 18.85% on average compared with a pure greedy algorithm and a pure simulated annealing algorithm, respectively.

Abstract: Hardware/software codesign is traditionally used to design embedded systems, and hardware/software partitioning is a key problem in it. In this paper, we propose the Greedy Simulated Annealing Algorithm (GSAA) to find an optimal or approximately optimal partition on a reconfigurable System-on-Chip (SoC) in embedded systems. The experimental results on a set of benchmarks show that the proposed GSAA improves performance by 34.96% and 18.85% on average compared with a pure greedy algorithm and a pure simulated annealing algorithm, respectively. Our algorithm is therefore an effective hardware/software partitioning algorithm.
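Illustrative sketch (not from the paper): one plausible way to combine the two heuristics is to seed simulated annealing with a greedy partition. The abstract does not say how GSAA combines them, so the cost model (total execution time under a hardware area budget), the greedy rule (best time saving per unit area first), the annealing schedule and the task data below are all assumptions made for this example.

```python
import math
import random

# Each task: (software time, hardware time, hardware area). Hypothetical data.
TASKS = [(10, 2, 4), (8, 3, 3), (6, 5, 1), (9, 1, 5), (4, 3, 2)]
BUDGET = 8  # available hardware area


def cost(part, tasks):
    """Total execution time; part[i] is True if task i runs in hardware."""
    return sum(hw if in_hw else sw for in_hw, (sw, hw, _) in zip(part, tasks))


def area(part, tasks):
    return sum(a for in_hw, (_, _, a) in zip(part, tasks) if in_hw)


def greedy(tasks, budget):
    """Move tasks to hardware in order of time saved per unit of area."""
    order = sorted(range(len(tasks)),
                   key=lambda i: (tasks[i][0] - tasks[i][1]) / tasks[i][2],
                   reverse=True)
    part, used = [False] * len(tasks), 0
    for i in order:
        sw, hw, a = tasks[i]
        if hw < sw and used + a <= budget:
            part[i], used = True, used + a
    return part


def anneal(tasks, budget, seed=0, temp=10.0, cooling=0.95, steps=500):
    """Simulated annealing that starts from the greedy partition."""
    rng = random.Random(seed)
    cur = greedy(tasks, budget)
    best = cur[:]
    for _ in range(steps):
        cand = cur[:]
        i = rng.randrange(len(tasks))
        cand[i] = not cand[i]                  # move one task across the boundary
        if area(cand, tasks) <= budget:        # reject infeasible partitions
            delta = cost(cand, tasks) - cost(cur, tasks)
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cur = cand
                if cost(cur, tasks) < cost(best, tasks):
                    best = cur[:]
        temp = max(temp * cooling, 1e-3)
    return best


start = greedy(TASKS, BUDGET)
best = anneal(TASKS, BUDGET)
```

Since the search keeps the best feasible partition seen so far, the result is never worse than the greedy seed.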

Proceedings Article•

Parallelizing Equation-Based Models for Simulation on Multi-Core Platforms by Utilizing Model Structure

Martin Sjölund¹, Mahder Gebremedhin¹, Peter Fritzson¹•Institutions (1)

Linköping University¹

1 Jan 2013

TL;DR: An automatic parallelization approach for Modelica models using Transmission Line Modeling (TLM), which re-uses the dependency analysis from the sequential translation step of OMC to introduce parallelism into the system.

Abstract: In today’s world of high-tech manufacturing and computer-aided design, simulation of models is at the heart of the whole manufacturing process. Trying to represent and study the variables of real-world models using simulation programs can turn out to be a very expensive and time-consuming task. On the other hand, advancements in modern multi-core CPUs promise remarkable computational power. Modern modeling environments provide different optimization and parallelization options to take advantage of the available computational power. Some of these parallelization approaches are based on automatically extracting parallelism with the help of the model compiler or translator. Another approach is to provide the model programmers with the necessary language constructs to express any potential parallelism in their models. In this paper we present an automatic parallelization approach for Modelica models using Transmission Line Modeling (TLM). TLM is suitable for parallel simulations because larger models can be partitioned into smaller independent sub-models. TLM introduces parallelism into the system by decoupling subsystems using delays greater than the step size of the numerical solver. A prototype has been implemented in the OpenModelica Compiler (OMC) framework. Our approach re-uses the dependency analysis from the sequential translation step of OMC. With the help of the dependency analysis information, the set of equations for a model is partitioned into a number of sub-systems. The resulting independent sub-systems are scheduled and executed in parallel. The run-time system for OMC has been improved to provide thread safety and handle parallelism while keeping the introduced overhead to a minimum for normal sequential operation and maintaining portability.
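Illustrative sketch (not from the paper): partitioning equations into independent sub-systems can be viewed as finding connected components, where two equations are coupled if they share a variable. Variables that arrive through a TLM delay are known from an earlier time step, so the sketch treats them as not coupling anything. The data layout, the function name and the equation names are assumptions for this example and do not reflect OMC's internal dependency analysis.

```python
def partition(equations, delayed=frozenset()):
    """Group equations into independent sub-systems.

    equations: dict mapping an equation name to the set of variables it uses.
    delayed:   variables provided through a TLM delay (treated as known inputs).
    """
    parent = {eq: eq for eq in equations}

    def find(eq):
        # Union-find lookup with path halving.
        while parent[eq] != eq:
            parent[eq] = parent[parent[eq]]
            eq = parent[eq]
        return eq

    owner = {}  # first equation seen for each coupling variable
    for eq, variables in equations.items():
        for v in variables - set(delayed):
            if v in owner:
                parent[find(eq)] = find(owner[v])   # shared variable: merge
            else:
                owner[v] = eq

    groups = {}
    for eq in equations:
        groups.setdefault(find(eq), []).append(eq)
    return sorted(sorted(g) for g in groups.values())


model = {
    "e1": {"x", "y"},
    "e2": {"y", "p_line"},
    "e3": {"p_line", "z"},
    "e4": {"z", "w"},
}
coupled = partition(model)                        # p_line couples everything
decoupled = partition(model, delayed={"p_line"})  # delay cuts the system in two
```

Each resulting group can then be handed to its own worker, since no group needs a value computed by another group within the same time step.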

Journal Article•10.1016/J.PARCO.2013.08.003•

P2P-based resource discovery in dynamic grids allowing multi-attribute and range queries

Agustín C. Caminero¹, Antonio Robles-Gómez¹, Salvador Ros¹, Roberto Hernandez¹, Llanos Tobarra¹•Institutions (1)

National University of Distance Education¹

1 Oct 2013

TL;DR: A new technique for resource discovery in grids that supports multi-attribute and range queries is presented.

Abstract: A key point for the efficient use of large grid systems is the discovery of resources, and this task becomes more complicated as the size of the system grows. In this case, large amounts of information on the available resources must be stored and kept up to date across the system so that it can be queried by users to find resources meeting specific requirements (e.g. a given operating system or available memory). Thus, three tasks must be performed: (1) information on resources must be gathered and processed, (2) such processed information has to be disseminated over the system, and (3) upon users' requests, the system must be able to discover resources meeting some requirements using the processed information. This paper presents a new technique for the discovery of resources in grids which can be used in the case of multi-attribute (e.g. {OS=Linux & memory=4GB}) and range queries (e.g. {50GB

Proceedings Article•

GPI2 for GPUs: A PGAS framework for efficient communication in hybrid clusters.

Lena Oden¹•Institutions (1)

Fraunhofer Institute for Industrial Mathematics¹

1 Jan 2013

Journal Article•10.1016/J.PARCO.2012.11.003•

Accurate prediction of the behavior of multithreaded applications in shared caches

Diego Andrade¹, Basilio B. Fraguela¹, Ramón Doallo¹•Institutions (1)

University of A Coruña¹

1 Jan 2013

TL;DR: This is the first analytical model that tackles the behavior of multithreaded applications on realistic shared caches without requiring profiling and the experimental results show that the model predictions are precise and very fast and that it can help a compiler or programmer choose the best parallelization strategy.

Abstract: Multicores are the norm nowadays and in many of them there are cores that share one or several levels of cache. The theoretical performance gain expected when several cores cooperate in the parallel execution of an application can be reduced in some cases by a cache access bottleneck, as the data accessed by them can interfere in the shared cache levels. In other cases the performance gain can be increased due to a greater reuse of the data loaded in the cache. This paper presents an analytical model that can predict the behavior of shared caches when executing applications parallelized at loop level. To the best of our knowledge, this is the first analytical model that tackles the behavior of multithreaded applications on realistic shared caches without requiring profiling. The experimental results show that the model predictions are precise and very fast and that the model can help a compiler or programmer choose the best parallelization strategy.

Journal Article•10.1016/J.PARCO.2013.04.011•

A hardware/software platform for QoS bridging over multi-chip NoC-based systems

Ashkan Beyranvand Nejad¹, Anca Molnos¹, Matias Escudero Martinez¹, Kees Goossens²•Institutions (2)

Delft University of Technology¹, Eindhoven University of Technology²

1 Sep 2013

TL;DR: The NoC protocol stack is explored to determine the best layer for implementing the off-chip bridge, a generic hardware architecture for the bridge is proposed, and a new software architecture is developed enabling seamless configuration and communication of multi-chip NoC-based SoCs.

Abstract: Recent embedded systems integrate a growing number of intellectual property cores into increasingly large designs. Implementation, prototyping, and verification of such large systems have become very challenging. One of the reasons is that chip/FPGA resources are limited, and therefore it is not always possible to implement the whole design in a traditional system-on-a-chip solution. The state of the art is to partition such systems into smaller sub-systems and implement each on a separate chip. Consequently, the separate chips/FPGAs must be interconnected. Since Networks-on-Chip (NoCs) have become common interconnection solutions in embedded designs, we propose to bridge NoC-based SoCs, enabling a generic multi-chip system interconnection. In this context, the contribution of this paper is threefold: (i) we explore the NoC protocol stack to determine the best layer for implementing the off-chip bridge, (ii) we propose a generic hardware architecture for the bridge, and (iii) we develop a new software architecture enabling seamless configuration and communication of multi-chip NoC-based SoCs. Finally, we demonstrate the performance, i.e., bandwidth and latency, of the bridge in a multi-FPGA platform, while the bridge guarantees QoS of traffic. The synthesis results indicate that the implementation area cost of the bridge is only 1% of a Xilinx Virtex6 FPGA.
