Top 89 papers presented at Parallel Computing in 2018

Journal Article•10.1016/J.PARCO.2018.06.009•

Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures

Mehmet Deveci¹, Christian Robert Trott¹, Sivasankaran Rajamanickam¹•Institutions (1)

1 Oct 2018

TL;DR: This paper develops parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures and develops a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem.

Abstract: Sparse matrix-matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.
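The two-phase idea and the role of the accumulator mentioned in the abstract can be illustrated with a minimal, sequential sketch. This is not kkSpGEMM or its implementation: the CSR tuple layout and the function names `spgemm_symbolic` and `spgemm_numeric` are assumptions made for illustration, and a Python dict stands in for a hash-map accumulator.

```python
from typing import Dict, List, Tuple

# A CSR matrix as (row pointers, column indices, values). Illustrative layout only.
CSR = Tuple[List[int], List[int], List[float]]


def spgemm_symbolic(a: CSR, b: CSR) -> List[int]:
    """Phase 1: count the nonzeros in each row of C = A * B (structure only)."""
    a_ptr, a_idx, _ = a
    b_ptr, b_idx, _ = b
    row_nnz = []
    for i in range(len(a_ptr) - 1):
        cols = set()
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k = a_idx[p]
            cols.update(b_idx[b_ptr[k]:b_ptr[k + 1]])
        row_nnz.append(len(cols))
    return row_nnz


def spgemm_numeric(a: CSR, b: CSR, row_nnz: List[int]) -> CSR:
    """Phase 2: compute the values, reusing the row sizes from phase 1."""
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    c_ptr = [0]
    for n in row_nnz:
        c_ptr.append(c_ptr[-1] + n)
    c_idx = [0] * c_ptr[-1]
    c_val = [0.0] * c_ptr[-1]
    for i in range(len(a_ptr) - 1):
        acc: Dict[int, float] = {}  # hash-map accumulator for row i of C
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k = a_idx[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a_val[p] * b_val[q]
        # Write the accumulated row into the preallocated output, sorted by column.
        for offset, j in enumerate(sorted(acc)):
            c_idx[c_ptr[i] + offset] = j
            c_val[c_ptr[i] + offset] = acc[j]
    return c_ptr, c_idx, c_val
```

When the sparsity pattern stays fixed and only the values change, the result of `spgemm_symbolic` can be computed once and passed to repeated `spgemm_numeric` calls, which is the kind of reuse a two-phase interface makes possible.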

54 citations

Journal Article•10.1016/J.PARCO.2018.02.001•

DVFS-aware application classification to improve GPGPUs energy efficiency

Joao Guerreiro¹, Aleksandar Ilic¹, Nuno Roma¹, Pedro Tomás¹•Institutions (1)

Instituto Superior Técnico¹

1 Feb 2018

TL;DR: Novel DVFS-aware performance and power classification models are proposed that correlate application characteristics with GPU architecture features; by analysing the utilization of graphics and memory components at a single voltage and frequency level, they predict the impact of DVFS on the execution time, power, and energy consumption of GPGPU applications.

Abstract: The increasing importance of GPUs as high-performance accelerators and the power and energy constraints of computing systems make it fundamental to develop techniques for energy efficiency maximization of GPGPU applications. Among several potential techniques, dynamic voltage and frequency scaling (DVFS) stands out as one of the most promising approaches. Hence, novel DVFS-aware performance and power classification models are herein proposed that correlate application characteristics and GPU architecture features. In particular, by analysing the utilization of graphics and memory components at a single voltage and frequency level, the proposed classification methodologies are able to predict the impact of DVFS on the execution time, power, and energy consumption of GPGPU applications. The accuracy of the proposed approach is validated on two modern NVIDIA GPUs from the Maxwell and Pascal generations, by relying on 35 benchmarks from the Rodinia, Polybench, Parboil, SHOC and CUDA SDK suites. Experimental results show that the proposed approach can typically predict the optimal operating frequencies of graphics and memory subsystems, attaining up to 36% energy savings (average of 16%), which correspond to an average deviation of 0.74% regarding the optimal case. Moreover, when considering a maximum performance penalty of 10%, up to 26% energy savings are still attained.
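The selection criterion behind the reported savings (lowest energy, optionally under a bounded performance penalty) can be sketched as follows. This sketch only picks among already-known time and power values; it does not reproduce the paper's classification models, which predict those values from a single voltage and frequency level. The function name `best_config` and the dictionary layout are assumptions for illustration.

```python
from typing import Dict, Tuple

# (core MHz, memory MHz) -> (execution time in s, average power in W).
# Hypothetical layout chosen for this sketch.
Runs = Dict[Tuple[int, int], Tuple[float, float]]


def best_config(runs: Runs, baseline: Tuple[int, int],
                max_slowdown: float = 0.10) -> Tuple[Tuple[int, int], float]:
    """Return the lowest-energy configuration whose slowdown relative to the
    baseline stays within max_slowdown, plus its fractional energy saving."""
    base_time, base_power = runs[baseline]
    base_energy = base_time * base_power  # energy = time * average power
    feasible = {
        cfg: t * p
        for cfg, (t, p) in runs.items()
        if t <= base_time * (1.0 + max_slowdown)
    }
    best = min(feasible, key=feasible.get)
    return best, 1.0 - feasible[best] / base_energy
```

The baseline always satisfies the slowdown bound, so `feasible` is never empty. Raising `max_slowdown` widens the candidate set and can only keep or improve the energy saving.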

48 citations

Journal Article•10.1145/3201897•

ThreadScan: Automatic and Scalable Memory Reclamation

Dan Alistarh¹, William M. Leiserson², Alexander Matveev², Nir Shavit²•Institutions (2)

Institute of Science and Technology Austria¹, Massachusetts Institute of Technology²

1 May 2018

TL;DR: Instead of manually tracking access to memory locations as done in techniques like hazard pointers, or restricting shared accesses to specific epoch boundaries as in RCU, the algorithm, called ThreadScan, leverages operating system signaling to automatically detect which memory locations are being accessed by concurrent threads.

Abstract: The concurrent memory reclamation problem is that of devising a way for a deallocating thread to verify that no other concurrent threads hold references to a memory block being deallocated. To date, in the absence of automatic garbage collection, there is no satisfactory solution to this problem; existing tracking methods like hazard pointers, reference counters, or epoch-based techniques like RCU are either prohibitively expensive or require significant programming expertise to the extent that implementing them efficiently can be worthy of a publication. None of the existing techniques are automatic or even semi-automated. In this article, we take a new approach to concurrent memory reclamation. Instead of manually tracking access to memory locations as done in techniques like hazard pointers, or restricting shared accesses to specific epoch boundaries as in RCU, our algorithm, called ThreadScan, leverages operating system signaling to automatically detect which memory locations are being accessed by concurrent threads. Initial empirical evidence shows that ThreadScan scales surprisingly well and requires negligible programming effort beyond the standard use of Malloc and Free.

40 citations

Journal Article•10.1145/3155292•

Partitioning Models for Scaling Parallel Sparse Matrix-Matrix Multiplication

Kadir Akbudak¹, Oguz Selvitopi¹, Cevdet Aykanat¹•Institutions (1)

Bilkent University¹

3 Jan 2018

TL;DR: Results show that by reducing the latency cost in addition to the bandwidth cost through the communication hypergraph models, the parallel SpGEMM time can be further improved by up to 32%, and that the proposed bipartite graph models are attractive alternatives to their hypergraph counterparts because of their lower partitioning overhead.

Abstract: We investigate outer-product–parallel, inner-product–parallel, and row-by-row-product–parallel formulations of sparse matrix-matrix multiplication (SpGEMM) on distributed memory architectures. For each of these three formulations, we propose a hypergraph model and a bipartite graph model for distributing SpGEMM computations based on one-dimensional (1D) partitioning of input matrices. We also propose a communication hypergraph model for each formulation for distributing communication operations. The computational graph and hypergraph models adopted in the first phase aim at minimizing the total message volume and balancing the computational loads of processors, whereas the communication hypergraph models adopted in the second phase aim at minimizing the total message count and balancing the message volume loads of processors. That is, the computational partitioning models reduce the bandwidth cost and the communication hypergraph models reduce the latency cost. Our extensive parallel experiments on up to 2048 processors for a wide range of realistic SpGEMM instances show that although the outer-product–parallel formulation scales better, the row-by-row-product–parallel formulation is more viable due to its significantly lower partitioning overhead and competitive scalability. For computational partitioning models, our experimental findings indicate that the proposed bipartite graph models are attractive alternatives to their hypergraph counterparts because of their lower partitioning overhead. Finally, we show that by reducing the latency cost in addition to the bandwidth cost through the communication hypergraph models, the parallel SpGEMM time can be further improved by up to 32%.

37 citations

Journal Article•10.1016/J.PARCO.2018.03.007•

Superlinear speedup phenomenon in parallel 3D Discrete Element Method (DEM) simulations of complex-shaped particles

Beichuan Yan¹, Richard A. Regueiro¹•Institutions (1)

University of Colorado Boulder¹

1 Jul 2018

TL;DR: The strong scaling measurements show that cache miss rate is sensitive to the memory consumption shrinkage per processor, and the last level cache contributes most significantly to the strong superlinear speedup among all of the three cache levels, and this is also revealed in the weak scaling measurements.

Abstract: Strong superlinear speedup has been discovered in large scale simulations of parallel 3D DEM for complex-shaped particles, which is based on an algorithm of spatial domain decomposition, and exhibits the “high-CPU-low-memory” characteristics. The interpretation of this phenomenon requires a careful examination of the speedup theory and practice in the field of parallel computing. The superlinear speedup is investigated from three perspectives: (i) memory footprint per process, (ii) cache miss rates of L1, L2 and L3 level caches, and (iii) uniprocessor performance, using a wide range of problem sizes (across five orders of magnitude of simulation scale regarding number of particles) and numbers of compute nodes (1–2048 nodes) on DoD supercomputers. The Performance-API (PAPI) is employed in the source code to measure cache miss rate and FLOPS. The strong scaling measurements show that cache miss rate is sensitive to the memory consumption shrinkage per processor, and the last level cache (LLC) contributes most significantly to the strong superlinear speedup among all of the three cache levels, and this is also revealed in the weak scaling measurements. The findings are associated with the inherently perfect scalability of 3D DEM: its memory scalability function is a nonlinearly decreasing function of the number of processors. In addition, a constant (non-increasing) uniprocessor FLOPS performance with respect to problem size can also contribute to the superlinear speedup. The superlinear speedup is a common phenomenon for large scale 3D DEM simulations of complex-shaped particles, and the larger the scale, the stronger the superlinear speedup. DEM researchers should take advantage of this effect to speed up their parallel simulations.
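For readers unfamiliar with the terminology, the standard strong-scaling definitions are speedup S(p) = T(1) / T(p) and parallel efficiency E(p) = S(p) / p; a run is superlinear when S(p) > p, that is, E(p) > 1. The short sketch below encodes these textbook definitions. The function names and the timings in the usage note are hypothetical and are not measurements from the paper.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Strong-scaling speedup S(p) = T(1) / T(p)."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, procs: int) -> float:
    """Parallel efficiency E(p) = S(p) / p."""
    return speedup(t_serial, t_parallel) / procs


def is_superlinear(t_serial: float, t_parallel: float, procs: int) -> bool:
    """True when the speedup exceeds the processor count."""
    return speedup(t_serial, t_parallel) > procs
```

With a hypothetical serial time of 100 s and a parallel time of 10 s on 8 processes, `speedup` returns 10.0 and `efficiency` returns 1.25, so `is_superlinear` is true.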

24 citations

Journal Article•10.1016/J.PARCO.2018.05.003•

MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU

Srinivasan Ramesh¹, Aurèle Mahéo¹, Sameer Shende¹, Allen D. Malony¹, Hari Subramoni², Amit Ruhela², Dhabaleswar K. Panda²•Institutions (2)

University of Oregon¹, Ohio State University²

1 Sep 2018

TL;DR: An infrastructure is proposed that extends existing components (TAU, MVAPICH2, and BEACON) to take advantage of the MPI_T interface and offer runtime introspection, online monitoring, recommendation generation, and autotuning capabilities.

Abstract: The desire for high performance on scalable parallel systems is increasing the complexity and tunability of MPI implementations. The MPI Tools Information Interface (MPI_T) introduced as part of the MPI 3.0 standard provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper level to detect scalability issues. The interface also provides a mechanism to fine-tune the performance of the MPI library dynamically at runtime. In this paper, we propose an infrastructure that extends existing components (TAU, MVAPICH2, and BEACON) to take advantage of the MPI_T interface and offer runtime introspection, online monitoring, recommendation generation, and autotuning capabilities. We validate our design by developing optimizations for a combination of production and synthetic applications. Using our infrastructure, we implement an autotuning policy for AmberMD (a molecular dynamics package) that monitors and reduces the internal memory footprint of the MVAPICH2 MPI library without affecting performance. For applications such as MiniAMR whose collective communication is latency sensitive, our infrastructure is able to generate recommendations to enable hardware offloading of collectives supported by MVAPICH2. By implementing this recommendation, the MPI time for MiniAMR at 224 processes reduces by 15%.

22 citations

Proceedings Article•10.1145/3210377.3210413•

Lock-free Contention Adapting Search Trees

Kjell Winblad, Konstantinos Sagonas, Bengt Jonsson

11 Jul 2018

TL;DR: This work presents the first lock-free key-value store with linearizable range query support that dynamically adapts its synchronization granularity and shows that LFCA trees are able to perform substantially better than data structures with a fixed synchronization granularity over a wide range of scenarios due to their ability to adapt to the scenario at hand.

Abstract: Concurrent key-value stores with range query support are crucial for the scalability and performance of many applications. Existing lock-free data structures of this kind use a fixed synchronization granularity. Using a fixed synchronization granularity in a concurrent key-value store with range query support is problematic as the best performing synchronization granularity depends on a number of factors that are difficult to predict, such as the level of contention and the number of items that are accessed by range queries. We present the first linearizable lock-free key-value store with range query support that dynamically adapts its synchronization granularity. This data structure is called the lock-free contention adapting search tree (LFCA tree). An LFCA tree automatically performs local adaptations of its synchronization granularity based on heuristics that take contention and the performance of range queries into account. We show that the operations of LFCA trees are linearizable, that the lookup operation is wait-free, and that the remaining operations (insert, remove and range query) are lock-free. Our experimental evaluation shows that LFCA trees achieve more than twice the throughput of related lock-free data structures in many scenarios. Furthermore, LFCA trees are able to perform substantially better than data structures with a fixed synchronization granularity over a wide range of scenarios due to their ability to adapt to the scenario at hand.

21 citations

Journal Article•10.1016/J.PARCO.2018.06.001•

Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations

James Lin¹,², Zhigeng Xu², Linjin Cai², Akira Nukada¹, Satoshi Matsuoka¹•Institutions (2)

Tokyo Institute of Technology¹, Shanghai Jiao Tong University²

1 Sep 2018

TL;DR: The methodology developed in this study, which infers a processor’s micro-architecture design from micro-benchmark results, can also be applied to other processors lacking public information.

Abstract: The lack of adequate public information on the micro-architecture of China’s SW26010 processor prevents global researchers from improving application performance on the TaihuLight supercomputer. This study aims to illuminate the uncharted area of SW26010 in order to provide important information for performance optimizations and modeling. First, we developed a micro-benchmark suite, swCandle, to evaluate the key micro-architectural features. The benchmark results revealed some unanticipated findings beyond the publicly available data. For instance, the broadcast mode of register communications has the same latency as the peer-to-peer mode. Second, we applied the roofline model, with the key parameters obtained with swCandle, to identify the key programming challenge of SW26010. Third, based on the micro-benchmark results and the roofline model analysis, we proposed a systematic guideline for performance optimizations on SW26010 and instantiated the guideline with two cases. The methodology we developed in this study, which infers a processor’s micro-architecture design from micro-benchmark results, can also be applied to other processors lacking public information.
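The roofline model referred to in the abstract bounds attainable performance by the smaller of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. The sketch below states that general formula only; the function names are illustrative, and the numbers in the usage note are hypothetical placeholders, not SW26010 parameters measured with swCandle.

```python
def roofline_gflops(peak_gflops: float, bandwidth_gbs: float,
                    intensity_flops_per_byte: float) -> float:
    """Attainable performance = min(peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (flops/byte) at which a kernel stops being
    memory-bound and becomes compute-bound."""
    return peak_gflops / bandwidth_gbs
```

For a hypothetical machine with a 1000 Gflop/s peak and 50 GB/s of bandwidth, the ridge point is 20 flops/byte: a kernel at 4 flops/byte is capped at 200 Gflop/s, while one at 40 flops/byte can reach the 1000 Gflop/s peak.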

20 citations

Journal Article•10.1016/J.PARCO.2017.11.003•

Benchmarking the GPU memory at the warp level

Minquan Fang¹, Jianbin Fang², Weimin Zhang², Haifang Zhou², Jianxing Liao¹, Yuangang Wang¹•Institutions (2)

Huawei¹, National University of Defense Technology²

1 Jan 2018

TL;DR: This work benchmarks GPU memories at the warp level, summarizes optimization guidelines for the different types of memories, and builds an optimization framework for GPU memories.

Abstract: Graphics processing units (GPUs) are widely used in scientific computing because of their high performance and energy efficiency. Nonetheless, GPUs feature a hierarchical memory system, and optimizing code for it requires an in-depth understanding from programmers. For this, we often measure the capability (latency or bandwidth) of the memory system with micro-benchmarks. Prior works focus on the latency of a single thread to disclose the unrevealed information. This per-thread measurement cannot reflect the actual process of a program execution, because the smallest executable unit of parallelism on a GPU comprises 32 threads (a warp of threads). This motivates us to benchmark the GPU memory system at the warp level. In this paper, we benchmark the GPU memory system to quantify the capability of parallel accessing and broadcasting. Such warp-level measurements are performed on shared memory, constant memory, global memory and texture memory. Further, we discuss how to replace local memory with registers, how to avoid bank conflicts of shared memory, and how to maximize global memory bandwidth with alternative data types. By analyzing the experimental results, we summarize the optimization guidelines for different types of memories, and build an optimization framework on GPU memories. Taking a case study of maximum noise fraction rotation in dimension reduction of hyperspectral images, we demonstrate that our framework is applicable and effective. Our work discloses the characteristics of GPU memories at the warp level, and leads to optimization guidelines. The warp-level benchmarking results can facilitate the process of designing parallel algorithms, modeling and optimizing GPU programs. To the best of our knowledge, this is the first benchmarking effort at the warp level for the GPU memory system.

19 citations

Journal Article•10.1016/J.PARCO.2017.10.004•

Accelerating the SVD two stage bidiagonal reduction and divide and conquer using GPUs

Mark Gates¹, Stanimire Tomov¹, Jack Dongarra¹,²,³•Institutions (3)

University of Tennessee¹, University of Manchester², Oak Ridge National Laboratory³

1 May 2018

TL;DR: An accelerated SVD is developed that employs a two stage reduction to bidiagonal form and a number of other algorithms highly optimized for GPUs, providing a significant speedup compared to existing multi-core and GPU-based SVD implementations.

Abstract: The increasing gap between memory bandwidth and computation speed motivates the choice of algorithms to take full advantage of today’s high performance computers. For dense matrices, the classic algorithm for the singular value decomposition (SVD) uses a one stage reduction to bidiagonal form, which is limited in performance by the memory bandwidth. To overcome this limitation, a two stage reduction to bidiagonal has been gaining popularity. It first reduces the matrix to band form using high performance Level 3 BLAS, then reduces the band matrix to bidiagonal form. As accelerators such as GPUs and co-processors are becoming increasingly widespread in high-performance computing, a question of great interest to many SVD users is how much the employment of a two stage reduction, as well as other current best practices in GPU computing, can accelerate this important routine. To fulfill this interest, we have developed an accelerated SVD employing a two stage reduction to bidiagonal and a number of other algorithms that are highly optimized for GPUs. Notably, we also parallelize and accelerate the divide and conquer algorithm used to solve the subsequent bidiagonal SVD. By accelerating all phases of the SVD algorithm, we provide a significant speedup compared to existing multi-core and GPU-based SVD implementations. In particular, using a P100 GPU, we illustrate a performance of up to 804 Gflop/s in double precision arithmetic to compute the full SVD of a 20k × 20k matrix in 90 seconds, which is 8.9× faster than MKL on two 10 core Intel Haswell E5-2650 v3 CPUs, 3.7× over the multi-core PLASMA two stage version, and 2.6× over the previously accelerated one stage MAGMA version.

18 citations

Journal Article•10.1016/J.PARCO.2017.11.007•

Block SS–CAA: A complex moment-based parallel nonlinear eigensolver using the block communication-avoiding Arnoldi procedure

Akira Imakura¹, Tetsuya Sakurai¹•Institutions (1)

University of Tsukuba¹

1 May 2018

TL;DR: Numerical experiments indicate that the proposed block SS–CAA method has higher performance compared with traditional complex moment-based nonlinear eigensolvers, i.e., the block SS–Hankel and Beyn methods.

Abstract: Complex moment-based parallel eigensolvers have been actively studied owing to their high parallel efficiency. In this paper, we propose a block SS–CAA method, which is a complex moment-based parallel nonlinear eigensolver that makes use of the block communication-avoiding Arnoldi procedure. Numerical experiments indicate that the proposed method has higher performance compared with traditional complex moment-based nonlinear eigensolvers, i.e., the block SS–Hankel and Beyn methods.

Journal Article•10.1145/3291525•

A High-Quality and Fast Maximal Independent Set Implementation for GPUs

Martin Burtscher¹, Sindhu Devale¹, Sahar Azimi¹, Jayadharini Jaiganesh¹, Evan Powers¹•Institutions (1)

Texas State University¹

8 Dec 2018

TL;DR: This article introduces ECL-MIS, a maximal independent set implementation that works well on GPUs and includes key optimizations to speed up computation, reduce the memory footprint, and increase the set size.

Abstract: Computing a maximal independent set is an important step in many parallel graph algorithms. This article introduces ECL-MIS, a maximal independent set implementation that works well on GPUs. It includes key optimizations to speed up computation, reduce the memory footprint, and increase the set size. Its CUDA implementation requires fewer than 30 kernel statements, runs asynchronously, and produces a deterministic result. It outperforms the maximal independent set implementations of Pannotia, CUSP, and IrGL on each of the 16 tested graphs of various types and sizes. On a Titan X GPU, ECL-MIS is between 3.9 and 100 times faster (11.5 times, on average). ECL-MIS running on the GPU is also faster than the parallel CPU codes Ligra, Ligra+, and PBBS running on 20 Xeon cores, which it outperforms by 4.1 times, on average. At the same time, ECL-MIS produces maximal independent sets that are up to 52% larger (over 10%, on average) compared to these preexisting CPU and GPU implementations. Whereas these codes produce maximal independent sets that are, on average, about 15% smaller than the largest possible such sets, ECL-MIS sets are less than 6% smaller than the maximum independent sets.
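As background for the terms used in the abstract, a set of vertices is independent if no two members are adjacent, and maximal if no further vertex can be added without breaking independence. The sequential sketch below is not ECL-MIS and makes no claim about its algorithm: the low-degree-first visiting order is a heuristic assumed here for illustration, and the graph is assumed to be undirected with no self-loops.

```python
from typing import Dict, Set


def maximal_independent_set(adj: Dict[int, Set[int]]) -> Set[int]:
    """Greedy, deterministic MIS: visit vertices by (degree, id) and select
    every vertex that has no already-selected neighbor."""
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    selected: Set[int] = set()
    excluded: Set[int] = set()
    for v in order:
        if v not in excluded:
            selected.add(v)
            excluded.add(v)
            excluded.update(adj[v])  # neighbors can no longer join the set
    return selected


def is_maximal_independent(adj: Dict[int, Set[int]], s: Set[int]) -> bool:
    """Check independence (no edge inside s) and maximality (every vertex
    outside s has at least one neighbor in s)."""
    independent = all(adj[v].isdisjoint(s) for v in s)
    maximal = all(v in s or not adj[v].isdisjoint(s) for v in adj)
    return independent and maximal
```

On a star graph, visiting low-degree vertices first selects all the leaves instead of the single center, which shows why the visiting order affects the size of the resulting set.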

Journal Article•10.1016/J.PARCO.2018.03.001•

Introducing the explicitly many-processor approach

Janos Vegh¹•Institutions (1)

University of Miskolc¹

1 Jul 2018

TL;DR: The goal of the approach is to enhance the single-processor performance using cooperating cores, rather than to introduce a new method for parallelization, and the resulting computing stack is upward compatible with the presently existing one.

Abstract: The deeper reasons for the present stalling in computing are scrutinized, and to enhance the single-processor performance, a new approach explicitly considering the presence of several computing units is introduced, as opposed to the presently exclusively used, 70-year-old single-processor approach. The appearance of many-core processors, having many processing units in close vicinity to each other, requires re-thinking some principles of computing. The goal of the approach is to enhance the single-processor performance using cooperating cores, rather than to introduce a new method for parallelization. Technically, it introduces a new control layer above the cores, a new intermediate execution unit called quasi-thread, a modified compiling method and object code for transferring parallelization information from the development system to the processor, and an on-demand self-organizing processor architecture. The resulting processors have a more effective and more “green” architecture, considerably increased single-thread performance, allow for more deterministic real-time behaviour, new scheduling principles for multitasking, less operating system overhead, etc. Surprisingly, the resulting computing stack is upward compatible with the presently existing one.

Journal Article•10.1016/J.PARCO.2017.12.003•

Comparing load-balancing algorithms for MapReduce under Zipfian data skews

Joanna Berlińska¹, Maciej Drozdowski²•Institutions (2)

Adam Mickiewicz University in Poznań¹, Poznań University of Technology²

1 Feb 2018

TL;DR: It is concluded that no single algorithm is a panacea and hybrid approaches are necessary in countering data skew in MapReduce computations, and algorithms of four different types for balancing computational effort in reduce-heavy MapReduce jobs are proposed.

Abstract: In this paper, we analyze the applicability of various load-balancing methods in countering data skew in MapReduce computations. A MapReduce job consists of several phases: mapping, shuffling data, sorting and reducing. The distribution of the work in the last three phases is data-driven, and unequal distribution of the data keys may cause imbalance in the computation completion times and prolonged execution of the whole job. We propose algorithms of four different types for balancing computational effort in reduce-heavy MapReduce jobs and evaluate their performance under various degrees of data skew and system parameters. By applying an innovative method of visualizing algorithm dominance conditions, we are able to determine conditions under which certain load-balancing algorithms are capable of scheduling MapReduce computations well. We conclude that no single algorithm is a panacea and hybrid approaches are necessary.
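The effect of key skew on reducer imbalance can be sketched with a toy comparison. The abstract does not name the four algorithm types, so this sketch does not reproduce them: it contrasts a static key-to-reducer mapping (round-robin by key index, standing in for hash partitioning) with the classic longest-processing-time-first (LPT) greedy heuristic, which is swapped in here purely for illustration. All function names are assumptions.

```python
import heapq
from typing import List


def zipf_loads(num_keys: int, s: float = 1.0) -> List[float]:
    """Relative per-key work under a Zipfian skew: key k gets 1 / k**s."""
    return [1.0 / (k ** s) for k in range(1, num_keys + 1)]


def static_assign(loads: List[float], reducers: int) -> List[float]:
    """Key i always goes to reducer i mod r, regardless of its load."""
    totals = [0.0] * reducers
    for i, load in enumerate(loads):
        totals[i % reducers] += load
    return totals


def lpt_assign(loads: List[float], reducers: int) -> List[float]:
    """LPT greedy: give the heaviest remaining key to the least-loaded reducer."""
    heap = [(0.0, r) for r in range(reducers)]
    heapq.heapify(heap)
    totals = [0.0] * reducers
    for load in sorted(loads, reverse=True):
        current, r = heapq.heappop(heap)
        totals[r] = current + load
        heapq.heappush(heap, (totals[r], r))
    return totals


def makespan(totals: List[float]) -> float:
    """The reduce phase finishes when the most loaded reducer finishes."""
    return max(totals)
```

LPT is a heuristic and does not guarantee an optimal assignment; under heavy skew a single dominant key can bound the makespan no matter how the remaining keys are placed.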

Journal Article•10.1016/J.PARCO.2018.05.007•

MPI windows on storage for HPC applications

Sergio Rivas-Gomez¹, Roberto Gioiosa², Ivy Bo Peng², Gokcen Kestor², Sai Narasimhamurthy, Erwin Laure¹, Stefano Markidis¹•Institutions (2)

Royal Institute of Technology¹, Oak Ridge National Laboratory²

1 Sep 2018

TL;DR: The design and implementation of MPI storage windows are described, their benefits for out-of-core execution, parallel I/O and fault-tolerance are presented, and the integration of heterogeneous window allocations is explored.

Abstract: Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as a unique interface for programming memory and storage. We describe the design and implementation of MPI storage windows, and present their benefits for out-of-core execution, parallel I/O and fault-tolerance. In addition, we explore the integration of heterogeneous window allocations, where memory and storage share a unified virtual address space. When performing large, irregular memory operations, we verify that MPI windows on local storage incur a 55% performance penalty on average. When using a Lustre parallel file system, “asymmetric” performance is observed with over 90% degradation in writing operations. Nonetheless, experimental results of a Distributed Hash Table, the HACC I/O kernel mini-application, and a novel MapReduce implementation based on the use of MPI one-sided communication, indicate that the overall penalty of MPI windows on storage can be negligible in most cases in real-world applications.

Proceedings Article•10.1145/3281070.3281071•

PIRA: performance instrumentation refinement automation

Jan-Patrick Lehr¹, Alexander Hück¹, Christian Bischof¹•Institutions (1)

Technische Universität Darmstadt¹

6 Nov 2018

TL;DR: PIRA automates the generation of an initial performance overview measurement and gradually refines it, based on the recorded runtime information, to help a performance analyst with the time consuming and largely manual, yet mechanical, task of selecting which functions to capture in subsequent measurements.

Abstract: In this paper we present PIRA – an infrastructure for automatic instrumentation refinement for performance analysis. It automates the generation of an initial performance overview measurement and gradually refines it, based on the recorded runtime information. This can help a performance analyst with the time consuming and largely manual, yet mechanical, task of selecting which functions to capture in subsequent measurements. PIRA implements an existing aggregation strategy that heuristically determines which functions to include for initial overview measurements. Moreover, it implements a newly developed heuristic to incorporate profile information and expand instrumentation in hot-spot regions only. The approach is evaluated on different benchmarks, including the SU2 multi-physics solver package. PIRA is able to generate instrumentation configurations that contain the application’s hot-spot, but generate significantly less overhead when compared to the Score-P reference measurement.

Journal Article•10.1016/J.PARCO.2017.11.002•

HPC formulations of optimization algorithms for tensor completion

Shaden Smith¹, Jongsoo Park², George Karypis¹•Institutions (2)

University of Minnesota¹, Facebook²

1 May 2018

TL;DR: This work studies three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). It explores opportunities for parallelism on shared- and distributed-memory systems and addresses challenges such as memory- and operation-efficiency, load balance, cache locality, and communication.

Abstract: Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.

Journal Article•10.1016/J.PARCO.2018.10.002•

A Local Parallel Communication Algorithm for Polydisperse Rigid Body Dynamics

Sebastian Eibl¹, Ulrich Rüde¹•Institutions (1)

University of Erlangen-Nuremberg¹

8 Feb 2018

TL;DR: In this paper, the authors propose a new communication and synchronization algorithm that can handle the parallelization without size restrictions on the particles, and demonstrate the scalability of the previous and new communication algorithms up to almost two million parallel processes and for handling ten billion (10¹⁰) geometrically resolved particles on a state-of-the-art petascale supercomputer.

Abstract: The simulation of large ensembles of particles is usually parallelized by partitioning the domain spatially and using message passing to communicate between the processes handling neighboring subdomains. The particles are represented as individual geometric objects and are assigned to the subdomains. Handling collisions and migrating particles between subdomains, as required for proper parallel execution, requires a complex communication protocol. Typically, the parallelization is restricted to handling only particles that are smaller than a subdomain. In many applications, however, particle sizes may vary drastically with some of them being larger than a subdomain. In this article we propose a new communication and synchronization algorithm that can handle the parallelization without size restrictions on the particles. Despite the additional complexity and extended functionality, the new algorithm introduces only minimal overhead. We demonstrate the scalability of the previous and the new communication algorithms up to almost two million parallel processes and for handling ten billion (10¹⁰) geometrically resolved particles on a state-of-the-art petascale supercomputer. Different scenarios are presented to analyze the performance of the new algorithm and to demonstrate its capability to simulate polydisperse scenarios, where large individual particles can extend across several subdomains.

Journal Article•10.1016/J.PARCO.2018.02.002•

Computation of the 100 quadrillionth hexadecimal digit of π on a cluster of Intel Xeon Phi processors

Daisuke Takahashi¹•Institutions (1)

University of Tsukuba¹

1 Jul 2018

TL;DR: The 100 quadrillionth hexadecimal digit of π was computed on a 512-node cluster of Intel Xeon Phi processors with an elapsed time of 641 h 29 min that includes the time required for verification.

Abstract: This paper presents the computation of a specific hexadecimal digit of π by using a Bailey–Borwein–Plouffe (BBP)-type formula on a cluster of Intel Xeon Phi processors. The BBP-type formula can be computed using modular exponentiation. We use Montgomery multiplication for the modular multiplication, which is the most time-consuming part of the modular exponentiation. We vectorize multiple modular exponentiations and multiple integer divisions by using Intel Advanced Vector Extensions 512 (Intel AVX-512) instructions. A parallel implementation of the BBP-type formula is presented. The 100 quadrillionth hexadecimal digit of π was computed on a 512-node cluster of Intel Xeon Phi processors with an elapsed time of 641 h 29 min that includes the time required for verification.
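The digit-extraction idea can be illustrated with a short sketch. It assumes the original BBP formula, π = Σ 16⁻ᵏ (4/(8k+1) − 2/(8k+4) − 1/(8k+5) − 1/(8k+6)), which may differ from the BBP-type formula used in the paper, and it relies on Python's built-in three-argument `pow` for modular exponentiation rather than Montgomery multiplication or AVX-512 vectorization. The function names are hypothetical, and double-precision arithmetic limits the sketch to modest digit positions.

```python
def bbp_series(j: int, n: int) -> float:
    """Fractional part of the sum over k of 16^(n-k) / (8k + j)."""
    s = 0.0
    # Terms with k <= n: reduce 16^(n-k) modulo the denominator first,
    # so only the fractional contribution of each term is accumulated.
    for k in range(n + 1):
        denom = 8 * k + j
        s = (s + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n shrink geometrically; stop once they are negligible.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        s += term
        k += 1
    return s % 1.0


def pi_hex_digit(n: int) -> str:
    """Hexadecimal digit of pi at 0-based position n after the point."""
    x = (4 * bbp_series(1, n) - 2 * bbp_series(4, n)
         - bbp_series(5, n) - bbp_series(6, n)) % 1.0
    return "0123456789ABCDEF"[int(16 * x)]
```

For example, joining `pi_hex_digit(n)` for `n` in `range(8)` gives `243F6A88`, the first eight hexadecimal digits after the point. Each position is computed independently of the others, which is what makes this kind of computation easy to distribute across nodes.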

Journal Article•10.1016/J.PARCO.2017.11.009•

PSelInv – A distributed memory parallel algorithm for selected inversion: The non-symmetric case

Mathias Jacquelin¹, Lin Lin¹,², Chao Yang¹•Institutions (2)

Lawrence Berkeley National Laboratory¹, University of California, Berkeley²

1 May 2018

TL;DR: In this paper, the authors generalize the parallel selected inversion algorithm called PSelInv to sparse non-symmetric matrices, where the selection is confined by the sparsity pattern of the matrix Aᵀ.

Abstract: This paper generalizes the parallel selected inversion algorithm called PSelInv to sparse non-symmetric matrices. We assume a general sparse matrix A has been decomposed as PAQ = LU on a distributed memory parallel machine, where L and U are lower and upper triangular matrices, respectively, and P and Q are permutation matrices. The PSelInv method computes selected elements of A⁻¹. The selection is confined by the sparsity pattern of the matrix Aᵀ. Our algorithm does not assume any symmetry properties of A, and our parallel implementation is memory efficient, in the sense that the computed elements of A⁻ᵀ overwrite the sparse matrix L + U in situ. PSelInv involves a large number of collective data communication activities within different processor groups of various sizes. In order to minimize idle time and improve load balancing, tree-based asynchronous communication is used to coordinate all such collective communication. Numerical results demonstrate that PSelInv can scale efficiently to 6,400 cores for a variety of matrices.
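As a reading aid, the relation between the stated factorization and the inverse can be written out. This is a standard identity that follows from the decomposition, assuming A is invertible and using the fact that permutation matrices are orthogonal; it is not the PSelInv algorithm itself, which computes only selected entries and does not form the full inverse.

```latex
PAQ = LU
\;\Longrightarrow\;
A = P^{T} L U Q^{T}
\;\Longrightarrow\;
A^{-1} = Q\, U^{-1} L^{-1} P,
\qquad
A^{-T} = P^{T} L^{-T} U^{-T} Q^{T}.
```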

Journal Article•10.1016/J.PARCO.2018.06.005•

Hybrid parallelization of a multi-tree path search algorithm: Application to highly-flexible biomolecules

Alejandro Estaña, Kevin Molloy, Marc Vaisset, Nathalie Sibille, Thierry Simeon, Pau Bernadó, Juan Cortés

1 Sep 2018

TL;DR: A hybrid strategy for the efficient parallelization of a multi-tree variant of TRRT, called Multi-TRRT, enabling an efficient execution in (possibly large) computer clusters and showing a near-linear speedup for a wide range of cluster configurations.

Abstract: The study of the conformational energy landscape of a molecule is essential for the understanding of its physicochemical properties. This requires the exploration of a continuous, high-dimensional space to identify the most probable conformations and the transition paths between them. The problem is computationally difficult, in particular for highly-flexible biomolecules such as Intrinsically Disordered Proteins (IDPs). In recent years, a robotics-inspired algorithm called Transition-based Rapidly-exploring Random Tree (TRRT) has been proposed to solve this problem, and has been shown to provide good results with small and middle-sized biomolecules. Aiming to treat larger systems, we propose a hybrid strategy for the efficient parallelization of a multi-tree variant of TRRT, called Multi-TRRT, enabling an efficient execution in (possibly large) computer clusters. The parallel algorithm uses OpenMP multi-threading for computation inside each multi-core processor and MPI to perform the communication between processors. Results show a near-linear speedup for a wide range of cluster configurations. Although the paper mainly deals with the application of the proposed parallel algorithm to the investigation of biomolecules, the explanations concerning the methods are general, aiming to inspire future work on the parallelization of related algorithms.

Journal Article•10.1016/J.PARCO.2018.05.005•

Characterizing MPI matching via trace-based simulation

Kurt B. Ferreira¹, Scott Levy¹, Kevin Pedretti¹, Ryan E. Grant¹•Institutions (1)

Sandia National Laboratories¹

1 Sep 2018

TL;DR: This paper introduces a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution and presents data about search lengths and the duration that messages spend in the queues waiting to be matched.

Abstract: With the increased scale expected on future leadership-class systems, detailed information about the resource usage and performance of MPI message matching provides important insights into how to maintain application performance on next-generation systems. However, obtaining MPI message matching performance data is often not possible without significant effort. A common approach is to instrument an MPI implementation to collect relevant statistics. While this approach can provide important data, collecting matching data at runtime perturbs the application’s execution, including its matching performance, and is highly dependent on the MPI library’s matchlist implementation. In this paper, we introduce a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution. Using a number of key parallel workloads and microbenchmarks, we demonstrate that this simulator approach can rapidly and accurately characterize matching behavior. Specifically, we use our simulator to collect several important statistics about the operation of the MPI posted and unexpected queues. For example, we present data about search lengths and the duration that messages spend in the queues waiting to be matched. Data gathered using this simulation-based approach have significant potential to aid hardware designers in determining resource allocation for MPI matching functions and provide application and middleware developers with insight into the scalability issues associated with MPI message matching.
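The posted and unexpected queues mentioned above can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions (a single communicator, linear list-based queues, matching on source and tag only); the class and method names are hypothetical, and it is not the authors' simulator.

```python
from dataclasses import dataclass, field

ANY = -1  # stands in for MPI_ANY_SOURCE / MPI_ANY_TAG in this toy model


def _matches(recv, msg):
    """A posted receive (source, tag) matches a message (source, tag)."""
    return all(r == ANY or r == m for r, m in zip(recv, msg))


@dataclass
class MatchSimulator:
    posted: list = field(default_factory=list)      # receives awaiting a message
    unexpected: list = field(default_factory=list)  # messages awaiting a receive
    search_lengths: list = field(default_factory=list)

    def _search(self, queue, pred):
        """Scan the queue in order, record how many entries were inspected."""
        for pos, entry in enumerate(queue):
            if pred(entry):
                self.search_lengths.append(pos + 1)
                del queue[pos]
                return True
        self.search_lengths.append(len(queue))
        return False

    def post_recv(self, source, tag):
        """A receive first searches the unexpected queue, else it is posted."""
        recv = (source, tag)
        if not self._search(self.unexpected, lambda msg: _matches(recv, msg)):
            self.posted.append(recv)

    def arrive(self, source, tag):
        """An incoming message first searches the posted queue, else it is queued."""
        msg = (source, tag)
        if not self._search(self.posted, lambda recv: _matches(recv, msg)):
            self.unexpected.append(msg)
```

Replaying a recorded sequence of `post_recv` and `arrive` events through such a model yields search-length statistics without touching the running application, which is the general appeal of a trace-based approach.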

Journal Article•10.1016/J.PARCO.2018.07.003•

The OpenACC data model: Preliminary study on its major challenges and implementations

Michael Wolfe¹, Seyong Lee², Jung-Won Kim², Xiaonan Tian¹, Rengan Xu, Barbara Chapman³, Sunita Chandrasekaran⁴•Institutions (4)

Nvidia¹, Oak Ridge National Laboratory², Stony Brook University³, University of Delaware⁴

1 Oct 2018

TL;DR: Details and issues in implementing the OpenACC data model in three different OpenACC compilers, including managing page tables, asynchronous data transfers, asynchronous memory allocate and free, host data construct, aliasing on a data directive, reusing device memory, partially present data, and adjacent data are described.

Abstract: This paper describes how the OpenACC data model is implemented in current OpenACC compilers, ranging from research compilers (OpenUH and OpenARC) to a commercial compiler (the PGI OpenACC compiler). First, we summarize various memory architectures in today’s accelerator systems. We then describe details and issues in implementing the OpenACC data model in three different OpenACC compilers. This includes managing page tables, asynchronous data transfers, asynchronous memory allocate and free, host data construct, aliasing on a data directive, reusing device memory, partially present data, and adjacent data. We also discuss ongoing work to manage large, complex dynamic data structures. We measured the present table lookups, device memory allocation, pinned memory allocation, and managed memory in the three OpenACC compilers using eight OpenACC applications (seven from the SPEC ACCEL benchmark suite and a shock-hydrodynamics mini-application called LULESH).

Journal Article•10.1016/J.PARCO.2018.06.004•

Petascale scramjet combustion simulation on the Tianhe-2 heterogeneous supercomputer

Yonggang Che¹, Meifang Yang¹, Chuanfu Xu¹, Yutong Lu¹•Institutions (1)

National University of Defense Technology¹

1 Sep 2018

TL;DR: This paper ports a real-world supersonic combustion simulation application to a heterogeneous architecture consisting of multi-core CPUs and Intel Many Integrated Core (MIC) coprocessors, adding scalable OpenMP parallelization to make use of the large number of cores on the CPUs and MIC coprocessors.

Abstract: Combustion simulation is complex and computationally expensive as it involves integration of fundamental chemical kinetics and multidimensional Computational Fluid Dynamics (CFD) models. This paper presents our efforts porting a real-world supersonic combustion simulation application to a heterogeneous architecture consisting of multi-core CPUs and Intel Many Integrated Core (MIC) coprocessors. Scalable OpenMP parallelization is added to make use of the large number of cores on CPUs and MIC coprocessors. Single-thread performance optimizations are addressed to improve the computational efficiency. A CPU and MIC collaborative algorithm, along with a series of techniques to improve the data transfer efficiency and load balance, is applied. Performance evaluation is performed on the Tianhe-2 supercomputer. The results show that on a single node, the optimized CPU-only version is 8.33 times faster than the baseline version, and the CPU + MIC heterogeneous version is again 3.07 times faster than the optimized CPU-only version. The resulting codes effectively scale to 5120 nodes (998,400 cores) on a mesh with 27.46 Giga cells. Given that the total number of floating-point operations is reduced by about 10 times after our optimizations, the heterogeneous version still achieves a sustained double precision floating-point performance of 0.46 Pflops on 5120 nodes. This demonstrates petascale heterogeneous computing capabilities for real-world supersonic combustion problems.

Proceedings Article•10.1145/3281070.3281072•

PyGA: a Python to FPGA compiler prototype

Yohann Uguen¹, Eric Petit²•Institutions (2)

University of Lyon¹, Intel²

6 Nov 2018

TL;DR: The study shows that, to avoid overhead that cannot be compensated otherwise, tightly coupled accelerator designs such as Intel Xeon+FPGA are necessary to address larger code bases with finer-grain kernels. It also shows that, without FPGA-specific programming effort, HLS compilation and runtime work remains to be done to be competitive with modern multi-core CPUs.

Abstract: Field Programmable Gate Arrays (FPGAs) are a widely available configurable hardware design that is commonly used in many domain-specific applications. However, the complexity of their programming interface currently restricts their usage to highly qualified programmers dedicated to FPGAs. In order to democratize FPGAs, many efforts are concentrating on High-Level Synthesis (HLS): the process of compiling a high-level language to hardware. In that context we propose PyGA, a proof of concept of a Python to FPGA compiler based on the Numba Just-In-Time (JIT) compiler for Python and the Intel FPGA SDK for OpenCL. It allows any Python user to use an FPGA card as an accelerator for Python seamlessly. As expected, early performance results are encouraging, but not competitive with a compiled CPU version. The study shows that, to avoid overhead that cannot be compensated otherwise, tightly coupled accelerator designs such as Intel Xeon+FPGA are necessary to address larger code bases with finer-grain kernels. It also shows that, without FPGA-specific programming effort, HLS compilation and runtime work remains to be done to be competitive with modern multi-core CPUs.

Proceedings Article•10.3233/978-1-61499-843-3-642•

Implementation of the K-Means Algorithm on Heterogeneous Devices: A Use Case Based on an Industrial Dataset

Ying hao Xu, Miquel Vidal¹, Beñat Arejita, Javier Megchun Díaz, Carlos Alvarez², Daniel Jiménez-González¹, Xavier Martorell², Filippo Mantovani¹•Institutions (2)

Barcelona Supercomputing Center¹, Polytechnic University of Catalonia²

1 Jan 2018

TL;DR: This paper presents and analyzes a heterogeneous implementation of an industrial use case based on K-means that targets symmetric multiprocessing (SMP), GPUs and FPGAs and shows that uses less energy when considering only the active power of the execution.

Abstract: This work is partially supported by the European Union H2020 project AXIOM (grant agreement n. 645496), HiPEAC (grant agreement n. 687698), and Mont-Blanc (grant agreements n. 288777, 610402 and 671697), the Spanish Government Programa Severo Ochoa (SEV-2015-0493), the Spanish Ministry of Science and Technology (TIN2015-65316-P) and the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051).

Journal Article•10.1016/J.PARCO.2018.03.006•

Machine Learning in Multi-Agent Systems using Associative Arrays

Przemysław Spychalski¹, Ryszard Arendt¹•Institutions (1)

Gdańsk University of Technology¹

1 Jul 2018

TL;DR: A new machine learning algorithm for multi-agent systems is introduced, which serves as a less complex and more efficient substitute for artificial neural networks and Bayesian networks, as confirmed by performance measurements.

Abstract: In this paper, a new machine learning algorithm for multi-agent systems is introduced. The algorithm is based on associative arrays and thus serves as a less complex and more efficient substitute for artificial neural networks and Bayesian networks, which is confirmed by performance measurements. Implementing the machine learning algorithm in a multi-agent system for aided design of selected control systems improved performance by reducing the processing time of requests that had previously been acknowledged and stored in the learning module. This article contains an insight into different machine learning algorithms and includes the classification of learning techniques regarding the criteria depicted by multi-agent systems. The publication is also an attempt to provide the answer to a question posed by Shoham, Powers and Grenager: “If multi-agent learning is the answer, what is the question?”

Proceedings Article•10.3233/978-1-61499-843-3-605•

Scientific Workflows on Clouds with Heterogeneous and Preemptible Instances

Fabio Tordini¹, Marco Aldinucci¹, Paolo Viviani¹, Ivan Merelli², Pietro Liò³•Institutions (3)

University of Turin¹, National Research Council², University of Cambridge³

1 Jan 2018

TL;DR: The design of a framework (based on Mesos) aimed at achieving a cost-effective and efficient usage of heterogeneous Processing Elements (PEs) for workflow execution, which supports hybrid cloud bursting over preemptible cloud Virtual Machines is discussed.

Abstract: The cloud environment is increasingly appealing for the HPC community, which has always dealt with scientific applications. However, there is still some skepticism about moving from traditional physical infrastructures to virtual HPC clusters. This mistrust probably originates from some well-known factors, including the effective economy of using cloud services, data and software availability, and the longstanding matter of data stewardship. In this work we discuss the design of a framework (based on Mesos) aimed at achieving a cost-effective and efficient usage of heterogeneous Processing Elements (PEs) for workflow execution, which supports hybrid cloud bursting over preemptible cloud Virtual Machines.

Journal Article•10.1016/J.PARCO.2018.05.006•

Hardware topology management in MPI applications through hierarchical communicators

Brice Goglin, Emmanuel Jeannot, Farouk Mansouri, Guillaume Mercier

1 Aug 2018

TL;DR: Several additions to the MPI standard for building new MPI communicators corresponding to hardware hierarchy levels are detailed, providing the user with tools to address hardware topology and locality issues while improving application performance.

Abstract: The MPI standard is a major contribution in the landscape of parallel programming. Since its inception in the mid-90s it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. On the other hand, providing abstract mechanisms to manipulate the hardware topology is also fundamental. The MPI standard in its current state, however, and despite recent evolutions, is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard for building new MPI communicators corresponding to hardware hierarchy levels. These additions provide the user with tools to address hardware topology and locality issues while improving application performance.

Journal Article•10.1145/3291529•

Scheduling Dynamic Parallel Workload of Mobile Devices with Access Guarantees

Antonio Fernández Anta¹, Dariusz R. Kowalski², Miguel A. Mosteiro³, Prudence W. H. Wong²•Institutions (3)

IMDEA¹, University of Liverpool², Pace University³

8 Dec 2018

TL;DR: This work studies the solvability of Station Assignment under an adversary that controls the arrival and departure of clients, limited to maximum rate and burstiness of such arrivals.

Abstract: We study a dynamic resource-allocation problem that arises in various parallel computing scenarios, such as mobile cloud computing, cloud computing systems, Internet of Things systems, and others. Generically, we model the architecture as client mobile devices and static base stations. Each client “arrives” to the system to upload data to base stations by radio transmissions and then “leaves.” The problem, called Station Assignment, is to assign clients to stations so that every client uploads their data under some restrictions, including a target subset of stations, a maximum delay between transmissions, a volume of data to upload, and a maximum bandwidth for each station. We study the solvability of Station Assignment under an adversary that controls the arrival and departure of clients, limited to maximum rate and burstiness of such arrivals. We show upper and lower bounds on the rate and burstiness for various client arrival schedules and protocol classes. To the best of our knowledge, this is the first time that Station Assignment is studied under adversarial arrivals and departures.
