Top 81 papers presented at Parallel Computing in 2014

Journal Article•10.1016/J.PARCO.2014.03.012•

Sparse matrix multiplication: The distributed block-compressed sparse row library

Urban Borštnik¹, Joost VandeVondele², Valéry Weber¹, Jürg Hutter¹•Institutions (2)

1 May 2014

TL;DR: The DBCSR (Distributed Block Compressed Sparse Row) library for scalable sparse matrix–matrix multiplication and its use in the CP2K program for linear-scaling quantum-chemical calculations is presented.

Abstract: Efficient parallel multiplication of sparse matrices is key to enabling many large-scale calculations. This article presents the DBCSR (Distributed Block Compressed Sparse Row) library for scalable sparse matrix–matrix multiplication and its use in the CP2K program for linear-scaling quantum-chemical calculations. The library combines several approaches to implement sparse matrix multiplication in a way that performs well and is demonstrably scalable. Parallel communication has well-defined limits. Data volume decreases with O(1/√P) with increasing process counts P and every process communicates with at most O(√P) others. Local sparse matrix multiplication is handled efficiently using a combination of techniques: blocking elements together in an application-relevant way, an autotuning library for small matrix multiplications, cache-oblivious recursive multiplication, and multithreading. Additionally, on-the-fly filtering not only increases sparsity but also avoids performing calculations that fall below the filtering threshold. We demonstrate and analyze the performance of the DBCSR library and its various scaling behaviors.
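
Illustrative sketch (not taken from the paper or from the DBCSR API): the idea of blocked sparse multiplication with on-the-fly filtering can be shown with a small single-process example. The dictionary layout, the function names, and the rule of skipping a block product when the product of the two Frobenius norms falls below `eps` are all assumptions made for illustration.

```python
import math


def frobenius_norm(block):
    """Frobenius norm of a small dense block stored as a list of rows."""
    return math.sqrt(sum(x * x for row in block for x in row))


def matmul_dense(a, b):
    """Plain dense product of two small blocks."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def add_into(target, update):
    """Accumulate `update` into `target` in place."""
    for i, row in enumerate(update):
        for j, x in enumerate(row):
            target[i][j] += x


def block_sparse_multiply(a_blocks, b_blocks, eps=0.0):
    """Multiply block-sparse matrices given as {(block_row, block_col): block}.

    Hypothetical filter: a block product is skipped when
    norm(A_ik) * norm(B_kj) < eps, since that product bounds its norm.
    Returns the result blocks and the number of skipped block products.
    """
    # Index the blocks of B by block row so matching partners are found quickly.
    b_by_row = {}
    for (k, j), blk in b_blocks.items():
        b_by_row.setdefault(k, []).append((j, blk, frobenius_norm(blk)))

    c_blocks, skipped = {}, 0
    for (i, k), a_blk in a_blocks.items():
        a_norm = frobenius_norm(a_blk)
        for j, b_blk, b_norm in b_by_row.get(k, []):
            if a_norm * b_norm < eps:
                skipped += 1  # contribution is below the threshold
                continue
            product = matmul_dense(a_blk, b_blk)
            if (i, j) in c_blocks:
                add_into(c_blocks[(i, j)], product)
            else:
                c_blocks[(i, j)] = product
    return c_blocks, skipped
```

With `eps=0.0` no product is skipped; raising `eps` trades accuracy for fewer block multiplications and a sparser result.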

232 citations

Journal Article•10.1016/J.PARCO.2013.11.001•

The effect of communication and synchronization on Amdahl's law in multicore systems

Leonid Yavits¹, Amir Morad¹, Ran Ginosar¹•Institutions (1)

Technion – Israel Institute of Technology¹

1 Jan 2014

TL;DR: This work analyses the effects of sequential-to-parallel synchronization and inter-core communication on multicore performance, speedup and scaling from the perspective of Amdahl's law; the results show lower speedup than originally predicted in applications with a high degree of data sharing.

Abstract: This work analyses the effects of sequential-to-parallel synchronization and inter-core communication on multicore performance, speedup and scaling from the perspective of Amdahl's law. Analytical modeling supported by simulation leads to a modification of Amdahl's law, reflecting lower speedup than originally predicted because of these effects. In applications with a high degree of data sharing, leading to intense inter-core connectivity requirements, the workload should be executed on a smaller number of larger cores. Applications requiring intense sequential-to-parallel synchronization, even highly parallelizable ones, may better be executed by the sequential core. To improve the scalability and performance speedup of a multicore, it is as important to address the synchronization and connectivity intensities of parallel algorithms as their parallelization factor.
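
Illustrative sketch (not the paper's model): the classical Amdahl speedup for a parallel fraction f on n cores is 1 / ((1 - f) + f / n). The second function below adds a made-up overhead term to show how synchronization and communication costs can lower the predicted speedup; the parameter names and the linear-in-n communication cost are assumptions, not the authors' formula.

```python
def amdahl_speedup(f, n):
    """Classical Amdahl speedup for parallel fraction f on n cores."""
    return 1.0 / ((1.0 - f) + f / n)


def speedup_with_overhead(f, n, sync_cost=0.0, comm_cost_per_core=0.0):
    """Amdahl speedup with a hypothetical overhead term.

    Both costs are fractions of the sequential run time (assumed for
    illustration): sync_cost is paid once, and communication is assumed
    to grow linearly with the number of cores.
    """
    overhead = sync_cost + comm_cost_per_core * n
    return 1.0 / ((1.0 - f) + f / n + overhead)
```

Under these assumed costs, `speedup_with_overhead(0.99, 16, comm_cost_per_core=0.001)` exceeds the same call with 64 cores, which mirrors the abstract's point that communication-heavy workloads can favor fewer cores.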

85 citations

Journal Article•10.1145/2661651•

ACM transactions on parallel computing: An introduction

Phillip B. Gibbons¹•Institutions (1)

Intel¹

3 Oct 2014

TL;DR: In this paper, the authors present techniques and techniques for the extraction of a set of features from a single-dimensional model.Techniques:Techniques, techniques, and techniques.

Abstract: Techniques

69 citations

Journal Article•10.1016/J.PARCO.2014.09.001•

A survey of power and energy efficient techniques for high performance numerical linear algebra operations

Li Tan¹, Shashank Kothapalli¹, Longxiang Chen¹, Omar Hussaini¹, Ryan Bissiri¹, Zizhong Chen¹•Institutions (1)

University of California, Riverside¹

1 Dec 2014

TL;DR: This paper surveys the research on saving power and energy for numerical linear algebra algorithms in high performance scientific computing on supercomputers around the world and summarizes state-of-the-art techniques for achieving power and energy efficiency in each category individually.

Abstract: A thorough survey of energy efficient techniques for linear algebra operations. A complete list of valuable research efforts. An in-depth review of all aspects of feasible solutions. Detailed discussion and comprehensive comparison. Thought-provoking directions of future research are provided. Extreme scale supercomputers available before the end of this decade are expected to have 100 million to 1 billion computing cores. The power and energy efficiency issue has become one of the primary concerns of extreme scale high performance scientific computing. This paper surveys the research on saving power and energy for numerical linear algebra algorithms in high performance scientific computing on supercomputers around the world. We first stress the significance of numerical linear algebra algorithms in high performance scientific computing nowadays, followed by a background introduction on widely used numerical linear algebra algorithms and software libraries and benchmarks. We summarize commonly deployed power management techniques for reducing power and energy consumption in high performance computing systems by presenting power and energy models and two fundamental types of power management techniques: static and dynamic. Further, we review the research on saving power and energy for high performance numerical linear algebra algorithms from four aspects: profiling, trading off performance, static saving, and dynamic saving, and summarize state-of-the-art techniques for achieving power and energy efficiency in each category individually. Finally, we discuss potential directions of future work and summarize the paper.

43 citations

Journal Article•10.1016/J.PARCO.2014.03.002•

Parallel eigenvalue calculation based on multiple shift-invert Lanczos and contour integral based spectral projection method

Hasan Metin Aktulga¹, Lin Lin¹, Christopher Haine², Esmond G. Ng¹, Chao Yang¹•Institutions (2)

Lawrence Berkeley National Laboratory¹, Versailles Saint-Quentin-en-Yvelines University²

1 Jul 2014

TL;DR: The possibility of using multiple shift-invert Lanczos and contour integral based spectral projection method to compute a relatively large number of eigenvalues of a large sparse and symmetric matrix on distributed memory parallel computers is discussed.

Abstract: We discuss the possibility of using multiple shift-invert Lanczos and contour integral based spectral projection method to compute a relatively large number of eigenvalues of a large sparse and symmetric matrix on distributed memory parallel computers. The key to achieving high parallel efficiency in this type of computation is to divide the spectrum into several intervals in a way that leads to optimal use of computational resources. We discuss strategies for dividing the spectrum. Our strategies make use of an eigenvalue distribution profile that can be estimated through inertial counts and cubic spline fitting. Parallel sparse direct methods are used in both approaches. We use a simple cost model that describes the cost of computing k eigenvalues within a single interval in terms of the asymptotic cost of sparse matrix factorization and triangular substitutions. Several computational experiments are performed to demonstrate the effect of different spectrum division strategies on the overall performance of both multiple shift-invert Lanczos and the contour integral based method. We also show the parallel scalability of both approaches in the strong and weak scaling sense. In addition, we compare the performance of multiple shift-invert Lanczos and the contour integral based spectral projection method on a set of problems from density functional theory (DFT).

40 citations

Journal Article•10.1145/2661653•

A simple parallel cartesian tree algorithm and its application to parallel suffix tree construction

Julian Shun¹, Guy E. Blelloch¹•Institutions (1)

Carnegie Mellon University¹

3 Oct 2014

TL;DR: It is shown that bottom-up traversals of the multiway Cartesian tree on the interleaved suffix array and longest common prefix array of a string can be used to answer certain string queries.

Abstract: We present a simple linear work and space, and polylogarithmic time parallel algorithm for generating multiway Cartesian trees. We show that bottom-up traversals of the multiway Cartesian tree on the interleaved suffix array and longest common prefix array of a string can be used to answer certain string queries. By adding downward pointers in the tree (e.g. using a hash table), we can also generate suffix trees from suffix arrays on arbitrary alphabets in the same bounds. In conjunction with parallel suffix array algorithms, such as the skew algorithm, this gives a rather simple linear work parallel, O(n^ε) time (0 < ε < 1), algorithm for generating suffix trees over an integer alphabet Σ ⊆ {1, ..., n}, where n is the length of the input string. It also gives a linear work parallel algorithm requiring O(log² n) time with high probability for constant-sized alphabets. More generally, given a sorted sequence of strings and the longest common prefix lengths between adjacent elements, the algorithm will generate a patricia tree (compacted trie) over the strings. Of independent interest, we describe a work-efficient parallel algorithm for solving the all nearest smaller values problem using Cartesian trees, which is much simpler than the work-efficient parallel algorithm described in previous work. We also present experimental results comparing the performance of the algorithm to existing sequential implementations and a second parallel algorithm that we implement. We present comparisons for the Cartesian tree algorithm on its own and for constructing a suffix tree. The results show that on a variety of strings our algorithm is competitive with the sequential version on a single processor and achieves good speedup on multiple processors. We present experiments for three applications that require only the Cartesian tree, and also for searching using the suffix tree.
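
Illustrative sketch (a sequential baseline, not the parallel algorithm of the paper): the two problems named in the abstract, all nearest smaller values and Cartesian tree construction, have a well-known stack-based sequential solution. The function names are chosen here for illustration, and the tree built is the ordinary binary (min-rooted) Cartesian tree rather than the multiway variant.

```python
def nearest_smaller_left(values):
    """For each position, index of the nearest strictly smaller value to the left (-1 if none)."""
    result, stack = [], []  # stack holds indices whose values strictly increase
    for i, v in enumerate(values):
        while stack and values[stack[-1]] >= v:
            stack.pop()
        result.append(stack[-1] if stack else -1)
        stack.append(i)
    return result


def cartesian_tree_parents(values):
    """Parent index of every node in the min-rooted binary Cartesian tree (-1 for the root)."""
    parent = [-1] * len(values)
    stack = []  # indices on the rightmost path of the tree built so far
    for i, v in enumerate(values):
        last_popped = -1
        while stack and values[stack[-1]] > v:
            last_popped = stack.pop()
        if last_popped != -1:
            parent[last_popped] = i  # popped subtree becomes the left child of i
        if stack:
            parent[i] = stack[-1]    # i becomes the right child of the stack top
        stack.append(i)
    return parent
```

Each index is pushed and popped at most once, so both functions run in linear time.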

35 citations

Journal Article•10.1016/J.PARCO.2014.06.006•

Structure-adaptive parallel solution of sparse triangular linear systems

Ehsan Totoni¹, Michael T. Heath¹, Laxmikant V. Kale¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

1 Oct 2014

TL;DR: A novel parallel algorithm based on various heuristics that adapt to the structure of the matrix and extract parallelism that is unexploited by conventional methods is developed.

Abstract: Solving sparse triangular systems of linear equations is a performance bottleneck in many methods for solving more general sparse systems. Both for direct methods and for many iterative preconditioners, it is used to solve the system or improve an approximate solution, often across many iterations. Solving triangular systems is notoriously resistant to parallelism, however, and existing parallel linear algebra packages appear to be ineffective in exploiting significant parallelism for this problem. We develop a novel parallel algorithm based on various heuristics that adapt to the structure of the matrix and extract parallelism that is unexploited by conventional methods. By analyzing and reordering operations, our algorithm can often extract parallelism even for cases where most of the nonzero matrix entries are near the diagonal. Our main parallelism strategies are: (1) identify independent rows, (2) send data earlier to achieve greater overlap, and (3) process dense off-diagonal regions in parallel. We describe the implementation of our algorithm in Charm++ and MPI and present promising experimental results on up to 512 cores of BlueGene/P, using numerous sparse matrices from real applications.
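
Illustrative sketch (classical level scheduling, not the structure-adaptive heuristics of the paper): the first strategy in the abstract, identifying independent rows, can be pictured by grouping the rows of a lower triangular matrix into levels. Rows in the same level depend only on earlier levels and could be solved concurrently. The input format, a list of off-diagonal nonzero column indices per row, is an assumption made for this example.

```python
def row_levels(lower_rows):
    """Level of each row in a sparse lower triangular system.

    lower_rows[i] lists the column indices j < i where L[i][j] is nonzero.
    A row with no such entries has level 0; otherwise its level is one more
    than the highest level among the rows it depends on.
    """
    levels = []
    for deps in lower_rows:
        levels.append(1 + max((levels[j] for j in deps), default=-1))
    return levels


def group_by_level(levels):
    """Rows grouped by level; rows inside one group are mutually independent."""
    groups = {}
    for row, level in enumerate(levels):
        groups.setdefault(level, []).append(row)
    return [groups[level] for level in sorted(groups)]
```

For example, `group_by_level(row_levels([[], [], [0, 1], [2], [1]]))` yields three groups, so five rows need only three sequential steps.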

34 citations

Journal Article•10.1016/J.PARCO.2014.04.009•

Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators

Roman Wyrzykowski¹, Lukasz Szustak¹, Krzysztof Rojek¹•Institutions (1)

Częstochowa University of Technology¹

1 Aug 2014

TL;DR: A method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations.

Abstract: EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components. Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds, and better exploit the theoretical floating point efficiency of CPU–GPU platforms. The main contributions of the paper are:
• method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
• method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques;
• method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, with support provided by the developed GPU task scheduler allowing for the flexible management of available resources;
• approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which allows us to provide a portable implementation methodology across a variety of GPUs.
Hybrid platforms tested in this study contain different numbers of CPUs and GPUs – from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors of different vendors are employed in these systems – both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively.

29 citations

Journal Article•10.1016/J.PARCO.2014.03.005•

Energy profile of rollback-recovery strategies in high performance computing

Esteban Meneses¹, Osman Sarood¹, Laxmikant V. Kale¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

1 Oct 2014

TL;DR: A comparative evaluation and analysis of the energy consumption of three rollback-recovery protocols (checkpoint/restart, message logging and parallel recovery) is presented; the experimental evaluation shows that parallel recovery has the minimum execution time and energy consumption.

Abstract: An analytical model to understand and represent the energy consumption of rollback-recovery mechanisms. Insights on the main factors that affect the power draw in rollback-recovery protocols. An experimental evaluation of the energy consumed by three rollback-recovery techniques. Projections of the energy profile of the rollback-recovery strategies at extreme scale. Extreme-scale computing is set to provide the infrastructure for the advances and breakthroughs that will solve some of the hardest problems in science and engineering. However, resilience and energy concerns loom as two of the major challenges for machines at that scale. The number of components that will be assembled in the supercomputers plays a fundamental role in these challenges. First, a large number of parts will substantially increase the failure rate of the system compared to the failure frequency of current machines. Second, those components have to fit within the power envelope of the installation and keep the energy consumption within operational margins. Extreme-scale machines will have to incorporate fault tolerance mechanisms and honor the energy and power restrictions. Therefore, it is essential to understand how fault tolerance and energy consumption interplay. This paper presents a comparative evaluation and analysis of energy consumption of three different rollback-recovery protocols: checkpoint/restart, message logging and parallel recovery. Our experimental evaluation shows that parallel recovery has the minimum execution time and energy consumption. Additionally, we present an analytical model that projects that parallel recovery can reduce energy consumption by more than 37% compared to checkpoint/restart at extreme scale.

26 citations

Journal Article•10.1016/J.PARCO.2014.03.003•

Implementing QR factorization updating algorithms on GPUs

Robert Andrew¹, Nicholas J. Dingle¹•Institutions (1)

University of Manchester¹

1 Jul 2014

TL;DR: This work investigates the viability of implementing QR updating algorithms on GPUs and demonstrates that GPU-based updating for removing columns achieves speed-ups of up to 13.5x compared with full GPU QR factorization.

Abstract: Linear least squares problems are commonly solved by QR factorization. When multiple solutions need to be computed with only minor changes in the underlying data, knowledge of the difference between the old data set and the new can be used to update an existing factorization at reduced computational cost. We investigate the viability of implementing QR updating algorithms on GPUs and demonstrate that GPU-based updating for removing columns achieves speed-ups of up to 13.5x compared with full GPU QR factorization. We characterize the conditions under which other types of updates also achieve speed-ups.

25 citations

Journal Article•10.1016/J.PARCO.2013.12.003•

An efficient distributed randomized algorithm for solving large dense symmetric indefinite linear systems

Marc Baboulin¹, Dulceneia Becker², George Bosilca², Anthony Danalis², Jack Dongarra²•Institutions (2)

University of Paris-Sud¹, University of Tennessee²

1 Jul 2014

TL;DR: Efficient kernels for applying random butterfly transformations and a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and the scheduling as systems scale up are proposed.

Abstract: Randomized algorithms are gaining ground in high-performance computing applications as they have the potential to outperform deterministic methods, while still providing accurate results. We propose a randomized solver for distributed multicore architectures to efficiently solve large dense symmetric indefinite linear systems that are encountered, for instance, in parameter estimation problems or electromagnetism simulations. The contribution of this paper is to propose efficient kernels for applying random butterfly transformations and a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and the scheduling as systems scale up. Both the parallel distributed solver and the supporting runtime environment are innovative. To our knowledge, the randomization approach associated with this solver has never been used in public domain software for symmetric indefinite systems. The underlying runtime framework allows seamless data mapping and task scheduling, mapping its capabilities to the underlying hardware features of heterogeneous distributed architectures. The performance of our software is similar to that obtained for symmetric positive definite systems, but requires only half the execution time and half the amount of data storage of a general dense solver.

Journal Article•10.1016/J.PARCO.2014.10.002•

Couillard: Parallel programming via coarse-grained Data-flow Compilation

Leandro A. J. Marzulo¹, Tiago A. O. Alves², Felipe M. G. França², Vítor Santos Costa³•Institutions (3)

Rio de Janeiro State University¹, Federal University of Rio de Janeiro², University of Porto³

1 Dec 2014

TL;DR: Couillard is a full compiler that creates, from an annotated C program, a data-flow graph and the C code corresponding to each super-instruction.

Abstract: Data-flow is a natural approach to parallelism. However, describing dependencies and control between fine-grained data-flow tasks can be complex and present unwanted overheads. TALM (TALM is an Architecture and Language for Multi-threading) introduces a user-defined coarse-grained parallel data-flow model, where programmers identify code blocks, called super-instructions, to be run in parallel and connect them in a data-flow graph. TALM has been implemented as a hybrid Von Neumann/data-flow execution system: the Trebuchet. We have observed that TALM's usefulness largely depends on how programmers specify and connect super-instructions. Thus, we present Couillard, a full compiler that creates, based on an annotated C program, a data-flow graph and C code corresponding to each super-instruction. We show that our toolchain allows one to benefit from data-flow execution and explore sophisticated parallel programming techniques with little effort. To evaluate our system we have executed a set of real applications on a large multi-core machine. Comparison with popular parallel programming methods shows competitive speedups, while providing an easier parallel programming approach. More specifically, for an application that follows the wavefront method, running with big inputs, Trebuchet achieved up to 4.7% speedup over the novel Intel® TBB flow-graph approach and up to 44% over OpenMP.

Journal Article•10.1016/J.PARCO.2014.09.003•

Region templates

George Teodoro¹, Tony Pan², Tahsin Kurc³, Jun Kong², Lee Cooper², Scott Klasky³, Joel H. Saltz⁴•Institutions (4)

University of Brasília¹, Emory University², Oak Ridge National Laboratory³, Stony Brook University⁴

1 Dec 2014

TL;DR: An experimental evaluation on a state-of-the-art hybrid cluster using a microscopy imaging application of the region template abstraction shows that the abstraction adds negligible overhead and achieves good scalability and high data transfer rates.

Abstract: Region templates (RT) data management abstraction and runtime system is introduced. It provides a container for data structures used by spatiotemporal applications. RT supports execution on machines with CPU-GPU and unified data access interface. Different data I/O implementations targeting multiple memory levels are provided. Example application attains high processing rates and good scalability with RT. We introduce a region template abstraction and framework for the efficient storage, management and processing of common data types in analysis of large datasets of high resolution images on clusters of hybrid computing nodes. The region template abstraction provides a generic container template for common data structures, such as points, arrays, regions, and object sets, within a spatial and temporal bounding box. It allows for different data management strategies and I/O implementations, while providing a homogeneous, unified interface to applications for data storage and retrieval. A region template application is represented as a hierarchical dataflow in which each computing stage may be represented as another dataflow of finer-grain tasks. The execution of the application is coordinated by a runtime system that implements optimizations for hybrid machines, including performance-aware scheduling for maximizing the utilization of computing devices and techniques to reduce the impact of data transfers between CPUs and GPUs. An experimental evaluation on a state-of-the-art hybrid cluster using a microscopy imaging application shows that the abstraction adds negligible overhead (about 3%) and achieves good scalability and high data transfer rates. Optimizations in a high speed disk based storage implementation of the abstraction to support asynchronous data transfers and computation result in an application performance gain of about 1.13×. Finally, a processing rate of 11,730 4K×4K tiles per minute was achieved for the microscopy imaging application on a cluster with 100 nodes (300 GPUs and 1200 CPU cores). This computation rate enables studies with very large datasets.

Journal Article•10.1016/J.PARCO.2014.02.003•

A study of shared-memory parallelism in a multifrontal solver

Jean-Yves L'Excellent¹, Wissam M. Sid-Lakhdar¹•Institutions (1)

University of Lyon¹

1 Mar 2014

TL;DR: Shared-memory parallelism is introduced into a distributed-memory direct solver for sparse systems of linear equations in a way that avoids a deep redesign and fully benefits from the numerical kernels and features of the original code; simple approaches to take advantage of NUMA architectures and original optimizations to limit thread synchronization costs are proposed.

Abstract: We introduce shared-memory parallelism in a parallel distributed-memory solver, targeting multi-core architectures. Our concern in this paper is pure shared-memory parallelism, although the work will also impact distributed-memory parallelism. Our approach avoids a deep redesign and fully benefits from the numerical kernels and features of the original code. We use performance models to exploit coarse-grain parallelism in an OpenMP environment while, at the same time, also relying on third-party optimized multithreaded libraries. In this context, we propose simple approaches to take advantage of NUMA architectures, and original optimizations to limit thread synchronization costs. The performance gains are analyzed in detail on test problems from various application areas. Although the studied code is a direct solver for sparse systems of linear equations, the contributions of this paper are more general and could be useful in a wider range of situations.

Journal Article•10.1016/J.PARCO.2013.11.007•

A compiler infrastructure for embedded heterogeneous MPSoCs

Weihua Sheng¹, Stefan Schürmans¹, Maximilian Odendahl¹, Mark Bertsch¹, Vitaliy Volevach¹, Rainer Leupers¹, Gerd Ascheid¹•Institutions (1)

RWTH Aachen University¹

1 Feb 2014

TL;DR: This paper argues the need for and significance of positioning the language and tool design from the perspective of practicality to address the challenge of programming heterogeneous MPSoCs, and motivates, describes and justifies such a practical design of a compilation framework, named MAPS.

Abstract: Programming heterogeneous MPSoCs (Multi-Processor Systems on Chip) is a grand challenge for embedded SoC providers and users today. In this paper, we argue the need for and significance of positioning the language and tool design from the perspective of practicality to address this challenge. We motivate, describe and justify such a practical design of a compilation framework for heterogeneous MPSoCs targeting the domain of streaming applications, named MAPS (MPSoC Application Programming Studio). MAPS defines a clean, light-weight C language extension to capture streaming programming models. A retargetable source-to-source compiler is developed to provide key capabilities to construct practical compilation frameworks for real-world, complex MPSoC platforms. Our results have shown that MAPS is a promising compiler infrastructure that enables programming of heterogeneous MPSoCs and increases productivity of MPSoC software developers.

Journal Article•10.1016/J.PARCO.2014.07.003•

MPI for Big Data

Dominique LaSalle¹, George Karypis¹•Institutions (1)

University of Minnesota¹

1 Dec 2014

TL;DR: BDMPI enables the development of efficient out-of-core parallel distributed memory codes without the high engineering and algorithmic complexities associated with multiple levels of blocking.

Abstract: We present BDMPI, a system for running MPI programs transparently out-of-core. We evaluate BDMPI on two small compute clusters. BDMPI is significantly faster than competing frameworks. BDMPI performs within 30% of optimized out-of-core implementations. The processing of massive amounts of data on clusters with finite amount of memory has become an important problem facing the parallel/distributed computing community. While MapReduce-style technologies provide an effective means for addressing various problems that fit within the MapReduce paradigm, there are many classes of problems for which this paradigm is ill-suited. In this paper we present a runtime system for traditional MPI programs that enables the efficient and transparent out-of-core execution of distributed-memory parallel programs. This system, called BDMPI (the source code is available at http://glaros.dtc.umn.edu/gkhome/bdmpi/download), leverages the semantics of MPI's API to orchestrate the execution of a large number of MPI processes on much fewer compute nodes, so that the running processes maximize the amount of computation that they perform with the data fetched from the disk. BDMPI enables the development of efficient out-of-core parallel distributed memory codes without the high engineering and algorithmic complexities associated with multiple levels of blocking. BDMPI achieves significantly better performance than existing technologies on a single node as well as on a small cluster, and performs within 30% of optimized out-of-core implementations.

Proceedings Article•10.3233/978-1-61499-381-0-63•

An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors

Ashkan Tousimojarad¹, Wim Vanderbauwhede¹•Institutions (1)

University of Glasgow¹

1 Mar 2014

TL;DR: In this article, the authors propose a low-overhead heuristic based on the amount of time spent by each CPU doing some useful work, to fairly distribute the workloads among the cores in a multiprogramming environment.

Abstract: The emergence of multicore and manycore processors is set to change the parallel computing world. Applications are shifting towards increased parallelism in order to utilise these architectures efficiently. This leads to a situation where every application creates its desired number of threads, based on its parallel nature and the system resources available. Task scheduling in such a multithreaded multiprogramming environment is a significant challenge. In task scheduling, not only the order of execution but also the mapping of threads to the execution resources is of great importance. In this paper we state and discuss some fundamental rules based on results obtained from selected applications of the BOTS benchmarks on the 64-core TILEPro64 processor. We demonstrate how previously efficient mapping policies such as those of the SMP Linux scheduler become inefficient when the number of threads and cores grows. We propose a novel, low-overhead technique, a heuristic based on the amount of time spent by each CPU doing some useful work, to fairly distribute the workloads amongst the cores in a multiprogramming environment. Our novel approach could be implemented as a pragma similar to those in the new task-based OpenMP versions, or can be incorporated as a distributed thread mapping mechanism in future manycore programming frameworks. We show that our thread mapping scheme can outperform the native GNU/Linux thread scheduler in both single-programming and multiprogramming environments.
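
Illustrative sketch (an assumed simplification, not the authors' implementation): a mapping heuristic driven by per-core useful-work time can be pictured as always placing the next thread on the core that has accumulated the least busy time. In this toy version the busy time is advanced by an estimated thread cost supplied by the caller, which is a hypothetical stand-in for the measured values a real runtime would read.

```python
def pick_core(busy_time):
    """Index of the core with the least accumulated useful-work time (lowest index on ties)."""
    return min(range(len(busy_time)), key=lambda core: busy_time[core])


def map_threads(thread_costs, num_cores):
    """Greedy mapping of threads to cores by least accumulated busy time.

    thread_costs holds hypothetical per-thread work estimates; the return
    value is the chosen core for each thread and the final busy times.
    """
    busy_time = [0.0] * num_cores
    mapping = []
    for cost in thread_costs:
        core = pick_core(busy_time)
        mapping.append(core)
        busy_time[core] += cost
    return mapping, busy_time
```

The selection step is a single scan over the cores, which keeps the per-thread decision cost low in this sketch.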

Proceedings Article•

A Many-core Machine Model for Designing Algorithms with Minimum Parallelism Overheads

Sardar Anisul Haque¹, Marc Moreno Maza¹, Ning Xie¹•Institutions (1)

University of Western Ontario¹

1 Jan 2014

TL;DR: A Graham-Brent theorem is established for this model of multithreaded computation, combining fork-join and single-instruction-multiple-data parallelisms, with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures.

Abstract: We present a model of multithreaded computation, combining fork-join and single-instruction-multiple-data parallelisms, with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures. We establish a Graham-Brent theorem for this model so as to estimate execution time of programs running on a given number of streaming multiprocessors. We evaluate the benefits of our model with four fundamental algorithms from scientific computing. In each case, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter; moreover experimentation confirms the model's prediction.
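
For orientation, the classical Graham-Brent bound for a greedy scheduler is recalled below. This is the textbook form, not the refined statement established in the paper, which additionally accounts for parallelism overheads. Here $T_1$ denotes the total work, $T_\infty$ the span (critical-path length), $P$ the number of processors, and $T_P$ the running time on $P$ processors.

```latex
T_P \;\le\; \frac{T_1}{P} + T_\infty
```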

Journal Article•10.1016/J.PARCO.2014.07.005•

Dynamic core affinity for high-performance file upload on Hadoop Distributed File System

Joong-Yeon Cho¹, Hyun-Wook Jin¹, Min Lee², Karsten Schwan²•Institutions (2)

Konkuk University¹, Georgia Institute of Technology²

1 Dec 2014

TL;DR: This paper considers the dynamic changes in the processor load and the intensiveness of the file upload at run-time, and accordingly decides the core affinity for service threads, with the objective of maximizing the parallelism, data locality, and resource efficiency.

Abstract: We analyze the impact of core affinity on both network and disk I/O performance. Both parallelism and locality are important for tasks that access disk and network. We suggest a novel approach to dynamically decide the core affinity of HDFS threads. Our dynamic core affinity improves the file upload throughput by more than 30%. The MapReduce programming model, in which the data nodes perform both the data storing and the computation, was introduced for big-data processing. Thus, we need to understand the different resource requirements of data storing and computation tasks and schedule these efficiently over multi-core processors. In particular, the provision of high-performance data storing has become more critical because of the continuously increasing volume of data uploaded to distributed file systems and database servers. However, the analysis of the performance characteristics of the processes that store upstream data is very intricate, because both network and disk inputs/outputs (I/O) are heavily involved in their operations. In this paper, we analyze the impact of core affinity on both network and disk I/O performance and propose a novel approach for dynamic core affinity for high-throughput file upload. We consider the dynamic changes in the processor load and the intensiveness of the file upload at run-time, and accordingly decide the core affinity for service threads, with the objective of maximizing the parallelism, data locality, and resource efficiency. We apply the dynamic core affinity to Hadoop Distributed File System (HDFS). Measurement results show that our implementation can improve the file upload throughput of end applications by more than 30% as compared with the default HDFS, and provide better scalability.

Journal Article•10.1016/J.PARCO.2014.08.003•

Novel parallel method for association rule mining on multi-core shared memory systems

Lan Vu¹, Gita Alaghband¹•Institutions (1)

University of Colorado Denver¹

1 Dec 2014

TL;DR: This paper presents a novel parallel method, named ShaFEM, for finding frequent patterns, the most computationally intensive phase of ARM; it combines two mining strategies and applies the most appropriate one to each data subset of the database to efficiently adapt to the data characteristics and run fast on both sparse and dense databases.

Abstract: ShaFEM: a novel association rule mining method for multi-core shared memory systems. ShaFEM self-adapts to data characteristics to run fast on sparse and dense databases. ShaFEM uses two mining strategies and dynamically switches between them. ShaFEM applies a new lock-free solution and a new data structure named XFP-tree. ShaFEM is up to 5.8 times faster and uses up to 7.1 times less memory than the compared method. Association rule mining (ARM) is an important task in data mining with many practical applications. Current methods for association rule mining have shown unstable performance for different database types and under-utilize the benefits of multi-core shared memory machines. In this paper, we address these issues by presenting a novel parallel method for finding frequent patterns, the most computationally intensive phase of ARM. Our proposed method, named ShaFEM, combines two mining strategies and applies the most appropriate one to each data subset of the database to efficiently adapt to the data characteristics and run fast on both sparse and dense databases. In addition, our new lock-free design minimizes the synchronization needs and maximizes the data independence to enhance the scalability. The new structure lends itself well to dynamic job scheduling, resulting in a well-balanced load on the new multi-core shared memory architectures. We have evaluated ShaFEM on 12-core multi-socket servers and found that our method runs up to 5.8 times faster and consumes up to 7.1 times less memory than the state-of-the-art parallel method. For some test cases, ShaFEM can save up to 4.9 days of execution time over the compared method.

Journal Article•10.1016/J.PARCO.2014.02.002•

A comparison of CPU and GPU implementations for solving the Convection Diffusion equation using the local Modified SOR method

Yiannis Cotronis¹, Elias Konstantinidis¹, Maria Louka¹, Nikolaos M. Missirlis¹•Institutions (1)

National and Kapodistrian University of Athens¹

1 Jul 2014

TL;DR: A parallel form of the SOR method for the numerical solution of the Convection Diffusion equation, suitable for GPUs using CUDA and based on the fine grain parallelism model, is studied; two parallel CPU programs utilizing manual SSE2 and AVX vectorization were developed as performance references.

Abstract: In this paper we study a parallel form of the SOR method for the numerical solution of the Convection Diffusion equation suitable for GPUs using CUDA. To exploit the parallelism offered by GPUs we consider the fine grain parallelism model. This is achieved by considering the local relaxation version of SOR. More specifically, we use SOR with red-black ordering using two sets of parameters $\omega_{1,ij}$ and $\omega_{2,ij}$ for the 5-point stencil. The parameter $\omega_{1,ij}$ is associated with each red (i+j even) grid point (i,j), whereas the parameter $\omega_{2,ij}$ is associated with each black (i+j odd) grid point (i,j). The use of a parameter for each grid point avoids the global communication required in the adaptive determination of the best value of $\omega$ and also increases the convergence rate of the SOR method (Varga, 1962) [38] and (Young, 1971) [41]. We present our strategy and the results of our effort to exploit the computational capabilities of GPUs under the CUDA environment. Additionally, two parallel CPU programs utilizing manual SSE2 (Streaming SIMD Extensions 2) and AVX (Advanced Vector Extensions) vectorization were developed as performance references. The optimizations applied on the GPU version were also considered for the CPU version. Significant performance improvement was achieved with all three developed GPU kernels, which are differentiated by the degree of recomputation and thus by the flops per element access ratio.
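
The red-black ordering with one relaxation parameter per grid point can be sketched in a few lines. The sketch below is an illustration only and makes two simplifying assumptions: it solves the Poisson model problem $-\Delta u = f$ rather than the Convection Diffusion equation, and it runs sequentially in plain Python rather than as a CUDA kernel. The function name `red_black_sor` and its argument layout are hypothetical.

```python
def red_black_sor(u, f, h, omega1, omega2, sweeps):
    """Red-black SOR on a 5-point stencil with per-point relaxation parameters.

    u, f, omega1, omega2 are lists of lists with the same shape.
    omega1[i][j] is used at red points (i + j even), omega2[i][j] at black
    points (i + j odd). Boundary values of u are kept fixed.
    """
    n, m = len(u), len(u[0])
    u = [row[:] for row in u]  # work on a copy, leave the input untouched
    for _ in range(sweeps):
        # colour 0 = red, colour 1 = black; points of one colour only depend
        # on points of the other colour, so each half-sweep is fully parallel
        for colour, omega in ((0, omega1), (1, omega2)):
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 != colour:
                        continue
                    gauss_seidel = 0.25 * (
                        u[i - 1][j] + u[i + 1][j]
                        + u[i][j - 1] + u[i][j + 1]
                        + h * h * f[i][j]
                    )
                    w = omega[i][j]
                    u[i][j] = (1.0 - w) * u[i][j] + w * gauss_seidel
    return u
```

Because all points of one colour can be updated independently, the two inner loops over `i` and `j` are the part that would map onto one GPU thread per grid point in a fine grain implementation.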

Journal Article•10.1016/J.PARCO.2014.09.012•

An adaptive and hierarchical task scheduling scheme for multi-core clusters

Yizhuo Wang¹, Yang Zhang², Yan Su¹, Xiaojun Wang¹, Xu Chen¹, Weixing Ji¹, Feng Shi¹•Institutions (2)

Beijing Institute of Technology¹, Hebei University of Science and Technology²

1 Dec 2014

TL;DR: An adaptive and hierarchical task scheduling scheme (AHS) for multi-core clusters is introduced, in which work-stealing and work-sharing are adaptively used to achieve load balancing; it outperforms existing schemes by 11-21.4% on the benchmarks studied.

Abstract: An adaptive and hierarchical task scheduling scheme (AHS) is proposed. Work-sharing is used in conjunction with work-stealing. An initial partitioning is performed with respect to the pattern of task parallelism. A practical implementation of AHS is described. The theoretical, simulation and experimental studies of AHS are presented. Work-stealing and work-sharing are two basic paradigms for dynamic task scheduling. This paper introduces an adaptive and hierarchical task scheduling scheme (AHS) for multi-core clusters, in which work-stealing and work-sharing are adaptively used to achieve load balancing. Work-stealing has been widely used in task-based parallel programming languages and models, especially on shared memory systems. However, high inter-node communication costs hinder work-stealing from being directly performed on distributed memory systems. AHS addresses this issue with the following techniques: (1) initial partitioning, which reduces the inter-node task migrations; (2) hierarchical scheduling scheme, which performs work-stealing inside a node before going across the node boundary and adopts work-sharing to overlap computation and communication at the inter-node level; and (3) hierarchical and centralized control for inter-node task migration, which improves the efficiency of victim selection and termination detection. We evaluated AHS and existing work-stealing schemes on a 16-node multi-core cluster. Experimental results show that AHS outperforms existing schemes by 11-21.4% for the benchmarks studied in this paper.
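
The ordering described in technique (2), looking for work inside the node before crossing the node boundary, can be sketched as follows. This is a simplified illustration, not AHS itself: it omits initial partitioning, inter-node work-sharing and centralized control, and the names `next_task`, `queues` and `node_of` are hypothetical.

```python
from collections import deque


def next_task(worker, node_of, queues):
    """Return (task, source) for `worker`, or (None, None) if no work is left.

    queues maps each worker to its deque of tasks; node_of maps each worker
    to the id of the node it runs on.
    """
    if queues[worker]:
        # Own work first, taken from the most recently pushed end.
        return queues[worker].pop(), "local"

    same_node = [w for w in queues
                 if w != worker and node_of[w] == node_of[worker]]
    other_nodes = [w for w in queues if node_of[w] != node_of[worker]]

    # Try cheap intra-node victims before expensive inter-node ones.
    for level, victims in (("intra-node", same_node),
                           ("inter-node", other_nodes)):
        victim = max(victims, key=lambda w: len(queues[w]), default=None)
        if victim is not None and queues[victim]:
            # Steal the oldest task from the most loaded victim.
            return queues[victim].popleft(), level
    return None, None
```

Choosing the most loaded victim is one possible policy; random victim selection is a common alternative and would slot into the same structure.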

Proceedings Article•10.3233/978-1-61499-381-0-564•

Global Communication Schemes for the Sparse Grid Combination Technique

Phillip Hupp, Riko Jacob¹, Mario Heene², Dirk Pflüger², Markus Hegland³•Institutions (3)

ETH Zurich¹, University of Stuttgart², Australian National University³

1 Jan 2014

Journal Article•10.1016/J.PARCO.2013.09.002•

Large scale micro finite element analysis of 3D bone poroelasticity

Erhan Turan¹, Peter Arbenz¹•Institutions (1)

ETH Zurich¹

1 Jul 2014

TL;DR: A finite element solver based on Biot's consolidation equations has been developed for poroelasticity problems related to osteoporotic human bones and a mixed formulation is used to discretize the geometries taken from medical imaging.

Abstract: In this paper, a solver for poroelasticity problems related to osteoporotic human bones is discussed. Osteoporosis is a major health problem that compromises the integrity of bones. A good understanding of the disease requires an accurate simulation of the physics. For that purpose, a finite element solver based on Biot's consolidation equations has been developed. A mixed formulation is used to discretize the geometries taken from medical imaging. The resulting indefinite linear systems are solved by Krylov space methods supplemented by variants of Schur complement-based block preconditioners.

Journal Article•10.1016/J.PARCO.2014.07.002•

Petascale large eddy simulation of jet engine noise based on the truncated SPIKE algorithm

Yingchong Situ¹, Chandra S. Martha¹, Matthew E. Louis¹, Zhiyuan Li¹, Ahmed H. Sameh¹, Gregory A. Blaisdell¹, Anastasios S. Lyrintzis¹•Institutions (1)

Purdue University¹

1 Oct 2014

TL;DR: A new, scalable parallelization scheme with a three-dimensional computational space partitioning is discussed; it uses the truncated SPIKE algorithm to solve the governing equations accurately and limits one-sided biased differentiation to just the physical boundaries.

Abstract: We apply the truncated SPIKE algorithm to petascale simulation of jet engine noise. We derive a tridiagonal linear system solver from the truncated SPIKE algorithm. The new solver has optimal weak scalability not found in traditional methods. Experimental data show the new solver to be much faster than traditional methods. With the emergence of petascale computing platforms, high-fidelity computational aeroacoustics (CAA) simulation has become a feasible, robust and accurate tool that complements theoretical and empirical approaches in the prediction of sound levels generated by aircraft airframes and engines. Differentiating itself from the broader discipline of computational fluid dynamics, CAA is particularly challenging as it demands high accuracy, good spectral resolution, and low dispersion and diffusion errors from the underlying numerical methods. Large eddy simulation based on space-implicit high-order compact finite difference schemes has been shown to meet such stringent requirements. In this paper, we discuss a new, scalable parallelization scheme with a three-dimensional computational space partitioning. Unlike many traditional multiblock computational fluid dynamics (CFD) methods, our partitioning is non-overlapping. We use the truncated SPIKE algorithm to solve the governing equations accurately and limit one-sided biased differentiation to just the physical boundaries. We present experimental performance data collected on Kraken and Ranger, two near-petascale computing platforms.

Journal Article•10.1016/J.PARCO.2014.06.007•

Distributed text search using suffix arrays

Diego Arroyuelo¹, Carolina Bonacic², Veronica Gil-Costa¹, Mauricio Marin³, Gonzalo Navarro³•Institutions (3)

Yahoo!¹, University of Santiago, Chile², University of Chile³

1 Oct 2014

TL;DR: The theoretical and experimental performance studies show that the proposed strategies for building efficient and scalable on-line search services based on suffix arrays are suitable solutions for achieving high query throughput at low operational costs.

Abstract: We introduce distributed-memory text-search algorithms based on suffix arrays. We analyze the proposed search algorithms to determine their theoretical speed-up. We do intensive experimentation on high-performance clusters of processors. We also assess their practicality using web search engines as a case study. Text search is a classical problem in Computer Science, with many data-intensive applications. For this problem, suffix arrays are among the most widely known and used data structures, enabling fast searches for phrases, terms, substrings and regular expressions in large texts. Potential application domains for these operations include large-scale search services, such as Web search engines, where it is necessary to efficiently process intensive-traffic streams of on-line queries. This paper proposes strategies to enable such services by means of suffix arrays. We introduce techniques for deploying suffix arrays on clusters of distributed-memory processors and then study the processing of multiple queries on the distributed data structure. Even though the cost of individual search operations in sequential (non-distributed) suffix arrays is low in practice, the problem of processing multiple queries on distributed-memory systems, so that hardware resources are used efficiently, is relevant to services aimed at achieving high query throughput at low operational costs. Our theoretical and experimental performance studies show that our proposals are suitable solutions for building efficient and scalable on-line search services based on suffix arrays.
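
As background for the sequential building block mentioned above, the sketch below shows substring search on a suffix array using two binary searches. It is an illustration only: the construction is the naive sort-based one, the distributed deployment studied in the paper is not modelled, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions at which `pattern` occurs in `text`."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])
```

Each query costs a logarithmic number of string comparisons, which is why individual searches are cheap and the interesting problem becomes scheduling many queries across machines.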

Journal Article•10.1145/2588788•

Enhancing Performance Optimization of Multicore/Multichip Nodes with Data Structure Metrics

Ashay Rane¹, James Browne¹•Institutions (1)

University of Texas at Austin¹

1 May 2014

TL;DR: A low-overhead tool (MACPO) is presented that captures memory traces and computes metrics for the memory access behavior of source-level (C, C++, Fortran) data structures, using more realistic cache models for computation of latency metrics than those used by previous tools.

Abstract: Program performance optimization is usually based solely on measurements of execution behavior of code segments using hardware performance counters. However, memory access patterns are critical performance-limiting factors for today's multicore chips, where performance is highly memory bound. Therefore, diagnoses and selection of optimizations based only on measurements of the execution behavior of code segments are incomplete because they do not incorporate knowledge of memory access patterns and behaviors. This article presents a low-overhead tool (MACPO) that captures memory traces and computes metrics for the memory access behavior of source-level (C, C++, Fortran) data structures. MACPO explicitly targets the measurement and metrics important to performance optimization for multicore chips. The article also presents a complete process for integrating measurement and analyses of code execution with measurements and analyses of memory access patterns and behaviors for performance optimization, specifically targeting multicore chips and multichip nodes of clusters. MACPO uses more realistic cache models for computation of latency metrics than those used by previous tools. Evaluation of the effectiveness of adding memory access behavior characteristics of data structures to performance optimization was done on subsets of the ASCI, NAS and Rodinia parallel benchmarks and two versions of one application program from a domain not represented in these benchmarks. Adding characteristics of the behavior of data structures enabled easier diagnoses of bottlenecks and more accurate selection of appropriate optimizations than with only code-centric behavior measurements. The performance gains ranged from a few percent to 38 percent.

Journal Article•10.1016/J.PARCO.2013.11.006•

X10-FT: Transparent fault tolerance for APGAS language and runtime

Zhijun Hao¹, Chenning Xie², Haibo Chen², Binyu Zang²•Institutions (2)

Fudan University¹, Shanghai Jiao Tong University²

1 Feb 2014

TL;DR: This paper designs and implements a fault-tolerance framework called X10-FT that leverages renowned techniques in distributed systems, such as distributed file systems and Paxos, together with specific solutions based on the characteristics of the APGAS model, for checkpointing and consensus, allowing the system to transparently handle machine failures at different granularities.

Abstract: The asynchronous partitioned global address space (APGAS) model is a programming model that aims to unify programming on multicore machines and clusters with good productivity. However, it currently lacks support for fault tolerance (FT), such that a single transient failure may render hours to months of computation useless. In this paper, we thoroughly analyze the feasibility of providing fault tolerance for the APGAS model and make the first attempt to add fault tolerance support to an APGAS language called X10. Based on the analysis, we design and implement a fault-tolerance framework called X10-FT that leverages renowned techniques in distributed systems, like distributed file systems and Paxos, as well as specific solutions based on the characteristics of the APGAS model, to make checkpoints and reach consensus. This allows the system to transparently handle machine failures at different granularities. Using the features of the APGAS model, we extend the X10 compiler to automatically locate execution points to checkpoint program states without any intervention from programmers. Evaluation using a set of benchmarks shows that the cost for fault tolerance is modest.

Proceedings Article•10.3233/978-1-61499-381-0-584•

A Parallel Fault Tolerant Combination Technique

Brendan Harding¹, Markus Hegland¹•Institutions (1)

Australian National University¹

1 Jan 2014

Proceedings Article•10.3233/978-1-61499-381-0-203•

Heterogeneous Sparse Matrix Computations on Hybrid GPU/CPU Platforms

Valeria Cardellini¹, Alessandro Fanfarillo¹, Salvatore Filippone¹•Institutions (1)

University of Rome Tor Vergata¹

1 Jan 2014

TL;DR: This paper details how design patterns for sparse matrix computations make it easy to adapt to a heterogeneous GPU/CPU platform using several sparse matrix formats in order to achieve the best performance; it then analyzes static load balancing strategies for devising a suitable data decomposition and proposes its own approach.

Abstract: Hybrid GPU/CPU clusters are becoming very popular in the scientific computing community, as attested by the number of such systems present in the Top 500 list. In this paper, we address one of the key algorithms for scientific applications: the computation of sparse matrix-vector products that lies at the heart of iterative solvers for sparse linear systems. We detail how design patterns for sparse matrix computations enable us to easily adapt to such a heterogeneous GPU/CPU platform using several sparse matrix formats in order to achieve best performance; then, we analyze static load balancing strategies for devising a suitable data decomposition and propose our approach. We discuss our experience in using different sparse matrix formats and data partitioning algorithms with a number of computational experiments executed on three different hybrid GPU/CPU platforms.
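
One simple form of static load balancing for a sparse matrix-vector product is to split the rows so that each device receives a share of the nonzeros proportional to its speed. The sketch below illustrates that idea only; it is not the decomposition proposed in the paper, and the function name `split_rows_by_nnz` and the parameter `gpu_share` (the fraction of work assigned to the GPU, assumed to be measured beforehand) are hypothetical.

```python
def split_rows_by_nnz(row_nnz, gpu_share):
    """Return k such that rows [0, k) go to the GPU and rows [k, n) to the CPU.

    row_nnz[i] is the number of nonzeros in row i. The split is chosen so the
    GPU block holds as close as possible to gpu_share of all nonzeros.
    """
    total = sum(row_nnz)
    target = gpu_share * total
    running = 0
    for k, nnz in enumerate(row_nnz):
        if running + nnz > target:
            # Decide whether cutting before or after row k is closer to target.
            if target - running <= running + nnz - target:
                return k
            return k + 1
        running += nnz
    return len(row_nnz)
```

Balancing by nonzeros rather than by row count matters for matrices with irregular rows, since the cost of a sparse matrix-vector product grows with the number of stored entries.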
