Scispace (Formerly Typeset)
Showing papers presented at "Parallel Computing in 2016"
Proceedings Article•10.5281/ZENODO.49285•
petsc: Portable, Extensible Toolkit for Scientific Computation

[...]

Barry Smith1, stefanozampini1, tisaac, SurtaiHan, Satish Balay1, Karl Rupp1, Victor Minden2, sarich, vijaysm, Hong Zhang1, Peter R. Brune1, Jed Brown3, VictorEijkhout, Lisandro Dalcin4, markadams, Matthew G. Knepley5, Dmitry Karpeyev1, Lois Curfman McInnes1, Fande Kong •
Argonne National Laboratory1, Stanford University2, University of Colorado Boulder3, King Abdullah University of Science and Technology4, Rice University5
7 Apr 2016

78 citations

Journal Article•10.1016/J.PARCO.2016.05.012•
GPU-accelerated Hungarian algorithms for the Linear Assignment Problem

[...]

Ketan Date1, Rakesh Nagi1•
University of Illinois at Urbana–Champaign1
1 Sep 2016
TL;DR: An efficient parallelization of the augmenting path search phase of the Hungarian algorithm is described, which reveals that the GPU-accelerated versions are extremely efficient in solving large problems, as compared to their CPU counterparts.
Abstract: Linear Assignment is one of the most fundamental problems in operations research. A creative parallelization of a Hungarian-like algorithm on a GPU cluster. Efficient parallelization of the augmenting path search step. Large problems with 1.6 billion variables can be solved. It is probably the fastest LAP solver using a GPU. In this paper, we describe parallel versions of two different variants (classical and alternating tree) of the Hungarian algorithm for solving the Linear Assignment Problem (LAP). We have chosen Compute Unified Device Architecture (CUDA) enabled NVIDIA Graphics Processing Units (GPU) as the parallel programming architecture because of its ability to perform intense computations on arrays and matrices. The main contribution of this paper is an efficient parallelization of the augmenting path search phase of the Hungarian algorithm. Computational experiments on problems with up to 25 million variables reveal that the GPU-accelerated versions are extremely efficient in solving large problems, as compared to their CPU counterparts. Tremendous parallel speedups are achieved for problems with up to 400 million variables, which are solved within 13 seconds on average. We also tested multi-GPU versions of the two variants on up to 16 GPUs, which show decent scaling behavior for problems with up to 1.6 billion variables and dense cost matrix structure.

67 citations
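
For readers unfamiliar with the Linear Assignment Problem, the following is a minimal sequential sketch of a Hungarian-style solver based on shortest augmenting paths. It is not the authors' CUDA implementation; the function name `solve_lap` and the example matrix are illustrative assumptions. Each pass of the outer loop performs one augmenting path search, which is the phase the abstract says the paper parallelizes.

```python
def solve_lap(cost):
    """Minimum-cost assignment of rows to columns (requires rows <= columns).

    Sequential O(n^2 * m) Hungarian-style method with row/column potentials.
    Returns (assignment, total) where assignment[i] is the column of row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)      # row potentials (1-based)
    v = [0] * (m + 1)      # column potentials (1-based)
    p = [0] * (m + 1)      # p[j] = row currently matched to column j (0 = free)
    way = [0] * (m + 1)    # back-pointers along the augmenting path

    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        # Augmenting path search: grow the tree until a free column is reached.
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Update potentials so at least one new edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip matched/unmatched edges along the path found.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1

    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


example_cost = [[4, 1, 3],
                [2, 0, 5],
                [3, 2, 2]]
example_assignment, example_total = solve_lap(example_cost)
```

A dense n-by-n cost matrix gives n² decision variables, which is how the abstract's "variables" counts relate to matrix size.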

Journal Article•10.1016/J.PARCO.2016.08.005•
Massively parallel lattice–Boltzmann codes on large GPU clusters

[...]

Enrico Calore1, Alessandro Gabbana1, Jiri Kraus2, Elisa Pellegrini1, Sebastiano Fabio Schifano1, Raffaele Tripiccione1 •
University of Ferrara1, Nvidia2
1 Oct 2016
TL;DR: In this paper, the authors describe a massively parallel code for a state-of-the-art thermal lattice-Boltzmann method for large-scale studies of convective turbulence.
Abstract: This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Versions of this code have been already used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task as codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally we compare the results of our GPU code with those measured on other currently available high performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops as well as a design and optimization methodology that can be used for the development of other high performance applications for computational physics.

64 citations

Journal Article•10.1145/3015144•
Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication

[...]

Grey Ballard1, Alex Druinsky2, Nicholas Knight3, Oded Schwartz4•
Sandia National Laboratories1, Lawrence Berkeley National Laboratory2, New York University3, Hebrew University of Jerusalem4
26 Dec 2016
TL;DR: In this article, a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM) is proposed, which correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation.
Abstract: We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is nonzero structure dependent, meaning that we seek the best algorithm for the given input matrices. In addition to our three-dimensional fine-grained model, we also propose coarse-grained one-dimensional and two-dimensional models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms. Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space.

59 citations

Journal Article•10.1016/J.PARCO.2015.10.005•
Cinema image-based in situ analysis and visualization of MPAS-ocean simulations

[...]

Patrick O'Leary1, James Ahrens2, Sebastien Jourdain1, Scott Wittenburg1, David Rogers2, Mark R. Petersen2 •
Kitware1, Los Alamos National Laboratory2
1 Jul 2016
TL;DR: This paper created an in situ exploration visualization of an MPAS-Ocean simulation leveraging a third option based on Cinema, which is a novel framework for highly interactive, image-based in situ analysis and visualization that promotes exploration.
Abstract: We created an in situ exploration visualization of an MPAS-Ocean simulation. We leveraged compositing in Cinema to provide interactive exploration. We decreased the storage footprint of the analysis and visualization results. Due to power and I/O constraints associated with extreme scale scientific simulations, in situ analysis and visualization will become a critical component to scientific exploration and discovery. Current analysis and visualization options at extreme scale are presented in opposition: write files to disk for interactive, exploratory analysis, or perform in situ analysis to save data products about phenomena that a scientist knows about in advance. In this paper, we demonstrate extreme scale visualization of MPAS-Ocean simulations leveraging a third option based on Cinema, which is a novel framework for highly interactive, image-based in situ analysis and visualization that promotes exploration.

51 citations

Journal Article•10.1145/2894746•
X10 and APGAS at Petascale

[...]

Olivier Tardieu1, Benjamin Herta1, David Cunningham1, David Grove1, Prabhanjan Kambadur1, Vijay Saraswat1, Avraham Shinnar1, Mikio Takeuchi1, Mandana Vaziri1, Wei Zhang1 •
IBM1
15 Mar 2016
TL;DR: It is demonstrated that X10 delivers solid performance at petascale by running (weak scaling) eight application kernels on an IBM Power 775 supercomputer utilizing up to 55,680 Power7 cores (for 1.7 Pflop/s of theoretical peak performance).
Abstract: X10 is a high-performance, high-productivity programming language aimed at large-scale distributed and shared-memory parallel applications. It is based on the Asynchronous Partitioned Global Address Space (APGAS) programming model, supporting the same fine-grained concurrency mechanisms within and across shared-memory nodes. We demonstrate that X10 delivers solid performance at petascale by running (weak scaling) eight application kernels on an IBM Power 775 supercomputer utilizing up to 55,680 Power7 cores (for 1.7 Pflop/s of theoretical peak performance). For the four HPC Class 2 Challenge benchmarks, X10 achieves 41% to 87% of the system’s potential at scale (as measured by IBM’s HPCC Class 1 optimized runs). We also implement K-Means, Smith-Waterman, Betweenness Centrality, and Unbalanced Tree Search (UTS) for geometric trees. Our UTS implementation is the first to scale to petaflop systems. We describe the advances in distributed termination detection, distributed load balancing, and use of high-performance interconnects that enable X10 to scale out to tens of thousands of cores. We discuss how this work is driving the evolution of the X10 language, core class libraries, and runtime systems.

47 citations

Journal Article•10.1016/J.PARCO.2015.10.015•
Atomic Detail Visualization of Photosynthetic Membranes with GPU-Accelerated Ray Tracing

[...]

John E. Stone1, Melih Sener1, Kirby L. Vandivort1, Angela M. Barragan1, Abhishek Singharoy1, Ivan Teo1, João V. Ribeiro1, Barry Isralewitz1, Bo Liu1, Boon Chong Goh1, James C. Phillips1, Craig MacGregor-Chatwin2, Matthew P. Johnson2, Lena F. Kourkoutis3, C. Neil Hunter2, Klaus Schulten1 •
University of Illinois at Urbana–Champaign1, University of Sheffield2, Cornell University3
1 Jul 2016
TL;DR: The techniques that were used to build, simulate, analyze, and visualize the structures shown in the movies are described, and cases where scientific needs spurred the development of new parallel algorithms that efficiently harness GPU accelerators and petascale computers are highlighted.
Abstract: The cellular process responsible for providing energy for most life on Earth, namely, photosynthetic light-harvesting, requires the cooperation of hundreds of proteins across an organelle, involving length and time scales spanning several orders of magnitude over quantum and classical regimes. Simulation and visualization of this fundamental energy conversion process pose many unique methodological and computational challenges. We present, in two accompanying movies, light-harvesting in the photosynthetic apparatus found in purple bacteria, the so-called chromatophore. The movies are the culmination of three decades of modeling efforts, featuring the collaboration of theoretical, experimental, and computational scientists. We describe the techniques that were used to build, simulate, analyze, and visualize the structures shown in the movies, and we highlight cases where scientific needs spurred the development of new parallel algorithms that efficiently harness GPU accelerators and petascale computers.

45 citations

Journal Article•10.1016/J.PARCO.2016.06.004•
Accelerating sparse Cholesky factorization on GPUs

[...]

Steven C. Rennich1, Darko Stosic1, Timothy A. Davis2•
Nvidia1, Texas A&M University2
1 Nov 2016
TL;DR: A left-looking supernodal Cholesky factorization algorithm is described that permits improved utilization of the GPU when factoring sparse matrices and shows a 2x speedup over the best alternative CPU-only performance.
Abstract: Sparse direct factorization is a critical component of scientific computing. GPU acceleration of it is difficult due to algorithmic irregularity and PCIe performance. We present an algorithm for GPU acceleration which resolves these issues. The algorithm shows a 2x speedup vs. the best alternative CPU-only performance. Aspects of this Cholesky case should be useful in other algorithms (LDL, LU, ...). Sparse factorization is a fundamental tool in scientific computing. As the major component of a sparse direct solver, it represents the dominant computational cost for many analyses. For factorizations which involve sufficient dense math, the substantial computational capability provided by GPUs (Graphics Processing Units) can help alleviate this cost. However, for many other cases, the prevalence of small/irregular dense math and the relatively slow communication between the host and device over the PCIe bus make it challenging to significantly accelerate sparse factorization using the GPU. In this paper we describe a left-looking supernodal Cholesky factorization algorithm which permits improved utilization of the GPU when factoring sparse matrices. The central idea is to stream subtrees of the elimination tree through the GPU and perform the factorization of each subtree entirely on the GPU. This avoids the majority of the PCIe communication without the need for a complex task scheduler. Importantly, within these subtrees, many independent, small, dense operations are batched to minimize kernel launch overhead and many of these batched kernels are executed concurrently to maximize device utilization. Performance results for commonly studied matrices are presented along with suggested actions for further optimization.

44 citations

Journal Article•10.1016/J.PARCO.2016.04.002•
The landscape of GPGPU performance modeling tools

[...]

Souley Madougou1, Ana Lucia Varbanescu1, Cees de Laat1, Rob V. van Nieuwpoort•
University of Amsterdam1
1 Aug 2016
TL;DR: The landscape of modern GPUs' performance limiters and optimization opportunities is sketched, and the specific features of the relevant contributions in this field are highlighted, along with the optimization and design spaces they explore.
Abstract: Sketch and taxonomies of current performance modeling landscape for GPGPU. A thorough description of 10 different approaches to GPU performance modeling. Empirical evaluation of models' performance using three kernels and four GPUs. Discussion of the strengths and weaknesses of the studied model classes. GPUs are gaining fast adoption as high-performance computing architectures, mainly because of their impressive peak performance. Yet most applications only achieve small fractions of this performance. While both programmers and architects have clear opinions about the causes of this performance gap, finding and quantifying the real problems remains a topic for performance modeling tools. In this paper, we sketch the landscape of modern GPUs' performance limiters and optimization opportunities, and dive into details on modeling attempts for GPU-based systems. We highlight the specific features of the relevant contributions in this field, along with the optimization and design spaces they explore. We further use typical kernel examples with various computation and memory access patterns to assess the efficacy and usability of a set of promising approaches. We conclude that the available GPU performance modeling solutions are very sensitive to applications and platform changes, and require significant efforts for tuning and calibration when new analyses are required.

41 citations

Journal Article•10.1016/J.PARCO.2016.10.001•
Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems

[...]

Seher Acer1, Oguz Selvitopi1, Cevdet Aykanat1•
Bilkent University1
1 Nov 2016
TL;DR: A comprehensive and generic framework is proposed to minimize multiple and different volume-based communication cost metrics for sparse matrix dense matrix multiplication (SpMM); it can simultaneously optimize different cost metrics besides total volume in a single partitioning phase.
Abstract: A generic model to scale sparse matrix dense matrix multiplication (SpMM). SpMM suffers from high communication volume overhead. Different volume-based metrics such as maximum volume besides total volume. Simultaneous minimization of volume-based communication cost metrics. Portable models based on graph and hypergraph partitioning. We propose a comprehensive and generic framework to minimize multiple and different volume-based communication cost metrics for sparse matrix dense matrix multiplication (SpMM). SpMM is an important kernel that finds application in computational linear algebra and big data analytics. On distributed memory systems, this kernel is usually characterized by its high communication volume requirements. Our approach targets irregularly sparse matrices and is based on both graph and hypergraph partitioning models that rely on the widely adopted recursive bipartitioning paradigm. The proposed models are lightweight, portable (can be realized using any graph and hypergraph partitioning tool) and can simultaneously optimize different cost metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc., in a single partitioning phase. They allow one to define and optimize as many custom volume-based metrics as desired through a flexible formulation. Experiments on a wide range of about a thousand matrices show that the proposed models drastically reduce the maximum communication volume compared to the standard partitioning models that only address the minimization of total volume. The improvements obtained on volume-based partition quality metrics using our models are validated with parallel SpMM as well as parallel multi-source BFS experiments on two large-scale systems. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively achieve reductions of 14% and 22% in runtime, on average. Compared to the state-of-the-art partitioner UMPa, our graph model is overall 14.5× faster and achieves an average improvement of 19% in the partition quality on instances that are bounded by maximum volume. For parallel BFS, we show on graphs with more than a billion edges that the scalability can significantly be improved with our models compared to a recently proposed two-dimensional partitioning model.

38 citations
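
To make the volume-based metrics named in the abstract concrete, the sketch below computes total volume, maximum send volume, and maximum receive volume for a given partition. It assumes a simple setting that the abstract does not spell out: a square sparse matrix A, a one-dimensional row-wise partition, and a dense matrix B with `s` columns whose row j lives on the same part as row j of A. The function name `spmm_volumes` and the example data are hypothetical, and this only evaluates a partition; it does not implement the authors' partitioning models.

```python
from collections import defaultdict


def spmm_volumes(nonzeros, part, num_parts, s=1):
    """Communication volumes (in words) for Y = A * B under a row-wise partition.

    nonzeros : iterable of (i, j) positions of nonzeros in A
    part     : part[i] is the part that owns row i of A and row i of B
    s        : number of columns of the dense matrix B
    """
    # needs[j] = parts other than the owner that require row j of B
    needs = defaultdict(set)
    for i, j in nonzeros:
        if part[i] != part[j]:
            needs[j].add(part[i])

    send = [0] * num_parts
    recv = [0] * num_parts
    for j, parts in needs.items():
        send[part[j]] += s * len(parts)   # owner sends row j once per needing part
        for q in parts:
            recv[q] += s                  # each needing part receives s words
    return {"total": sum(send), "max_send": max(send), "max_recv": max(recv)}


# Hypothetical 4x4 example distributed over 3 parts, with B having 2 columns.
example_nonzeros = [(0, 0), (1, 1), (2, 2), (3, 3),
                    (0, 2), (0, 3), (1, 3), (2, 3), (3, 0)]
example_part = [0, 0, 1, 2]
example_volumes = spmm_volumes(example_nonzeros, example_part, num_parts=3, s=2)
```

Two partitions with the same total volume can differ in their maximum send or receive volume, which is why the abstract treats these as separate metrics to optimize.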

Journal Article•10.1145/2858656•
MASA: A Multiplatform Architecture for Sequence Aligners with Block Pruning

[...]

Edans Flavius de Oliveira Sandes1, Guillermo Miranda2, Xavier Martorell3, Eduard Ayguadé3, George Teodoro1, Alba Cristina Magalhaes Alves de Melo1 •
University of Brasília1, Barcelona Supercomputing Center2, Polytechnic University of Catalonia3
24 Feb 2016
TL;DR: This article proposes and evaluates MASA, a flexible and customizable software architecture that enables the execution of biological sequence alignment applications with three variants (local, global, and semiglobal) in multiple hardware/software platforms with block pruning, which is able to reduce significantly the amount of data processed.
Abstract: Biological sequence alignment is a very popular application in Bioinformatics, used routinely worldwide. Many implementations of biological sequence alignment algorithms have been proposed for multicores, GPUs, FPGAs and CellBEs. These implementations are platform-specific; porting them to other systems requires considerable programming effort. This article proposes and evaluates MASA, a flexible and customizable software architecture that enables the execution of biological sequence alignment applications with three variants (local, global, and semiglobal) in multiple hardware/software platforms with block pruning, which is able to reduce significantly the amount of data processed. To attain our flexibility goals, we also propose a generic version of block pruning and developed multiple parallelization strategies as building blocks, including a new asynchronous dataflow-based parallelization, which may be combined to implement efficient aligners in different platforms. We provide four MASA aligner implementations for multicores (OmpSs and OpenMP), GPU (CUDA), and Intel Phi (OpenMP), showing that MASA is very flexible. The evaluation of our generic block pruning strategy shows that it significantly outperforms the previously proposed block pruning, being able to prune up to 66.5% of the cells when using the new dataflow-based parallelization strategy.
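
As background for the "local" alignment variant mentioned in the abstract, here is a minimal Smith-Waterman scoring sketch. It assumes a linear gap penalty and illustrative scoring values, and it does not implement block pruning or any of MASA's parallelization strategies; the name `local_alignment_score` is hypothetical. The "cells" that block pruning skips are entries of the dynamic programming matrix computed in the inner loop.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (Smith-Waterman, linear gaps).

    Only two rows of the DP matrix are kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


example_score = local_alignment_score("GGACGTCC", "TTACGTAA")
```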
Journal Article•10.1016/J.PARCO.2016.01.008•
A GPU-based Branch-and-Bound algorithm using IntegerVectorMatrix data structure

[...]

Jan Gmys1, Mohand-Said Mezmaz1, Nouredine Melab2, Daniel Tuyttens1•
University of Mons1, French Institute for Research in Computer Science and Automation2
1 Nov 2016
TL;DR: This paper revisits the IVM-based B&B algorithm on the GPU, addressing the irregularity of the algorithm in terms of workload, memory access patterns and control flow, and focuses on reducing thread divergence by making a judicious choice for the mapping of threads onto the data.
Abstract: First Branch-and-Bound (B&B) algorithm entirely deployed on GPU. B&B-tree management with IntegerVectorMatrix data structure instead of linked-list. 3.3 times faster than conventional GPU-accelerated B&B based on a linked-list. Branch divergence reduction for B&B on GPU. Work stealing for parallel B&B on GPU. Branch-and-Bound (B&B) algorithms are tree-based exploratory methods for solving combinatorial optimization problems exactly to optimality. These problems are often large in size and known to be NP-hard to solve. The construction and exploration of the B&B-tree are performed using four operators: branching, bounding, selection and pruning. Such algorithms are irregular, which makes their parallel design and implementation on GPU challenging. Existing GPU-accelerated B&B algorithms perform only a part of the algorithm on the GPU and rely on the transfer of pools of subproblems across the PCI Express bus to the device. To the best of our knowledge, the algorithm presented in this paper is the first GPU-based B&B algorithm that performs all four operators on the device and subsequently avoids the data transfer bottleneck between CPU and GPU. The implementation on GPU is based on the IntegerVectorMatrix (IVM) data structure which is used instead of a conventional linked-list to store and manage the pool of subproblems. This paper revisits the IVM-based B&B algorithm on the GPU, addressing the irregularity of the algorithm in terms of workload, memory access patterns and control flow. In particular, the focus is put on reducing thread divergence by making a judicious choice for the mapping of threads onto the data. Compared to a GPU-accelerated B&B based on a linked-list, the algorithm presented in this paper solves a set of standard flowshop instances on average 3.3 times faster.
Journal Article•10.1016/J.PARCO.2016.01.011•
Seismic wave propagation simulations on low-power and performance-centric manycores

[...]

Márcio Castro1, Emilio Francesquini2, Fabrice Dupros, Hideo Aochi, Philippe O. A. Navaux3, Jean-François Méhaut4 •
Universidade Federal de Santa Catarina1, State University of Campinas2, Universidade Federal do Rio Grande do Sul3, University of Grenoble4
1 May 2016
TL;DR: The experimental results show that MPPA-256 has the best energy efficiency, consuming at least 77% less energy than the other evaluated platforms, whereas the performance of the solution for the Xeon Phi is on par with a state-of-the-art solution for GPUs.
Abstract: The large processing requirements of seismic wave propagation simulations make High Performance Computing (HPC) architectures a natural choice for their execution. However, to keep both the current pace of performance improvements and the power consumption under a strict power budget, HPC systems must be more energy efficient than ever. As a response to this need, energy-efficient and low-power processors began to make their way into the market. In this paper we employ a novel low-power processor, the MPPA-256 manycore, to perform seismic wave propagation simulations. It has 256 cores connected by a NoC, no cache-coherence and only a limited amount of on-chip memory. We describe how its particular architectural characteristics influenced our solution for an energy-efficient implementation. As a counterpoint to the low-power MPPA-256 architecture, we employ Xeon Phi, a performance-centric manycore. Although both processors share some architectural similarities, the challenges to implement an efficient seismic wave propagation kernel on these platforms are very different. In this work we compare the performance and energy efficiency of our implementations for these processors to proven and optimized solutions for other hardware platforms such as general-purpose processors and a GPU. Our experimental results show that MPPA-256 has the best energy efficiency, consuming at least 77% less energy than the other evaluated platforms, whereas the performance of our solution for the Xeon Phi is on par with a state-of-the-art solution for GPUs.
Journal Article•10.1145/2938426•
Locality-Based Network Creation Games

[...]

Davide Bilò1, Luciano Gualà2, Stefano Leucci3, Guido Proietti3•
University of Sassari1, University of Rome Tor Vergata2, University of L'Aquila3
18 Jul 2016
TL;DR: In this article, the authors consider a more compelling scenario in which players have only limited information about the network in which they are embedded, and explore the game-theoretic and computational implications of assuming that players have a complete knowledge of the network structure only up to a given radius k, which is one of the most qualified local-knowledge models used in distributed computing.
Abstract: Network creation games have been extensively studied, both by economists and computer scientists, due to their versatility in modeling individual-based community formation processes. These processes, in turn, are the theoretical counterpart of several economics, social, and computational applications on the Internet. In their several variants, these games model the tension of a player between the player’s two antagonistic goals: to be as close as possible to the other players and to activate a cheapest possible set of links. However, the generally adopted assumption is that players have a common and complete information about the ongoing network, which is quite unrealistic in practice. In this article, we consider a more compelling scenario in which players have only limited information about the network in which they are embedded. More precisely, we explore the game-theoretic and computational implications of assuming that players have a complete knowledge of the network structure only up to a given radius k, which is one of the most qualified local-knowledge models used in distributed computing. In this respect, we define a suitable equilibrium concept, and we provide a comprehensive set of upper and lower bounds to the price of anarchy for the entire range of values of k and for the two classic variants of the game, namely, those in which a player’s cost (besides the activation cost of the owned links) depends on the maximum/sum of all distances to the other nodes in the network, respectively. These bounds are assessed through an extensive set of experiments.
Journal Article•10.1145/2896850•
Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling

[...]

Tim Kaler1, William C. Hasenplaugh1, Tao B. Schardl1, Charles E. Leiserson1•
Massachusetts Institute of Technology1
18 Jul 2016
TL;DR: Prism-R is presented, a variation of Prism that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations, and is only marginally slower than Prism.
Abstract: A data-graph computation, popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi, is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex’s prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This article introduces Prism, a chromatic-scheduling algorithm for executing dynamic data-graph computations. Prism uses a vertex coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. A multibag data structure is used by Prism to maintain a dynamic set of active vertices as an unordered set partitioned by color. We analyze Prism using work-span analysis. Let G = (V, E) be a degree-Δ graph colored with χ colors, and suppose that Q ⊆ V is the set of active vertices in a round. Define size(Q) = |Q| + ∑_{v∈Q} deg(v), which is proportional to the space required to store the vertices of Q using a sparse-graph layout. We show that a P-processor execution of Prism performs updates in Q using O(χ(lg(Q/χ) + lg Δ) + lg P) span and Θ(size(Q) + P) work. These theoretical guarantees are matched by good empirical performance. To isolate the effect of the scheduling algorithm on performance, we modified GraphLab to incorporate Prism and studied seven application benchmarks on a 12-core multicore machine. Prism executes the benchmarks 1.2 to 2.1 times faster than GraphLab’s nondeterministic lock-based scheduler while providing deterministic behavior. This article also presents Prism-R, a variation of Prism that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. Prism-R satisfies the same theoretical bounds as Prism, but its implementation is more involved, incorporating a multivector data structure to maintain a deterministically ordered set of vertices partitioned by color. Despite its additional complexity, Prism-R is only marginally slower than Prism. On the seven application benchmarks studied, Prism-R incurs a 7% geometric mean overhead relative to Prism.
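
The idea of chromatic scheduling described in the abstract can be illustrated with a small sequential sketch. It assumes a greedy coloring and plain Python lists as per-color buckets, standing in for the multibag, which is not reproduced here. All names (`greedy_coloring`, `chromatic_run`, `active_size`) and the min-label example are hypothetical and are not taken from Prism's code. Vertices of one color share no edge, so the inner loop over a bucket is the part a parallel runtime could execute concurrently without locks.

```python
def greedy_coloring(adj):
    """Assign each vertex the smallest color not used by an already-colored neighbor."""
    color = {}
    for v in sorted(adj):
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color


def active_size(Q, adj):
    """size(Q) = |Q| + sum of deg(v) over v in Q, as defined in the abstract."""
    return len(Q) + sum(len(adj[v]) for v in Q)


def chromatic_run(adj, update, initial_active):
    """Run rounds until no vertex is active; returns the number of rounds."""
    color = greedy_coloring(adj)
    num_colors = max(color.values()) + 1 if color else 0
    active = set(initial_active)
    rounds = 0
    while active:
        buckets = [[] for _ in range(num_colors)]
        for v in active:
            buckets[color[v]].append(v)
        next_active = set()
        for bucket in buckets:            # colors are processed one after another
            for v in sorted(bucket):      # same-colored vertices are independent
                next_active |= update(v)  # update returns vertices to activate next round
        active = next_active
        rounds += 1
    return rounds


# Example: propagate the minimum label through each connected component.
example_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
example_label = {v: v for v in example_adj}


def example_update(v):
    best = min([example_label[v]] + [example_label[u] for u in example_adj[v]])
    if best < example_label[v]:
        example_label[v] = best
        return set(example_adj[v])
    return set()


example_rounds = chromatic_run(example_adj, example_update, example_adj.keys())
```

Because the color order and the set of active vertices per round are fixed by the input, the final labels do not depend on how the vertices inside one bucket are interleaved.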
Journal Article•10.1016/J.PARCO.2016.03.001•
A global perspective of atmospheric carbon dioxide concentrations

[...]

William M. Putman1, Lesley Ott1, Anton Darmenov1, Arlindo daSilva1•
Goddard Space Flight Center1
1 Jul 2016
TL;DR: A high-resolution (7 km) non-hydrostatic global mesoscale simulation using the Goddard Earth Observing System (GEOS-5) model is used to visualize the flow and fluxes of carbon dioxide throughout the year.
Abstract: Visualizing global transport of carbon dioxide in the atmosphere over one year. High-resolution global models produce fine details of the weather and constituents. Carbon dioxide is strongly affected by local emissions and large-scale weather. Global models and observational data improve carbon dioxide flux estimates. A high-resolution (7 km) non-hydrostatic global mesoscale simulation using the Goddard Earth Observing System (GEOS-5) model is used to visualize the flow and fluxes of carbon dioxide throughout the year. Carbon dioxide (CO2) is the most important greenhouse gas affected by human activity. About half of the CO2 emitted from fossil fuel combustion remains in the atmosphere, contributing to rising temperatures, while the other half is absorbed by natural land and ocean carbon reservoirs. Despite the importance of CO2, many questions remain regarding the processes that control these fluxes and how they may change in response to a changing climate. This visualization shows how column CO2 mixing ratios are strongly affected by local emissions and large-scale weather systems. In order to fully understand carbon flux processes, observations and atmospheric models must work closely together to determine when and where observed CO2 came from. Together, the combination of high-resolution data and models will guide climate models towards more reliable predictions of future conditions.
Journal Article•10.1016/J.PARCO.2016.04.001•
Fast Matlab compatible sparse assembly on multicore computers

Stefan Engblom1, Dimitar Lukarski1•
Uppsala University1
1 Aug 2016
TL;DR: This paper develops a fast sparse assembly algorithm, the operation that creates a compressed matrix from raw index data, and shows how the implementation can be parallelized to utilize the power of modern multicore computers.
Abstract: We develop and implement in this paper a fast sparse assembly algorithm, the fundamental operation which creates a compressed matrix from raw index data. Since it is often a quite demanding and sometimes critical operation, it is of interest to design a highly efficient implementation. We show how to do this, and moreover, we show how our implementation can be parallelized to utilize the power of modern multicore computers. Our freely available code, fully Matlab compatible, achieves a speedup of about 5× on a typical 6-core machine and 10× on a dual-socket 16-core machine compared to the built-in serial implementation.
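As a rough illustration of what sparse assembly means, the following Python sketch builds a compressed sparse column (CSC) matrix from raw (row, column, value) triplets and sums duplicate entries, which is also what Matlab's `sparse(i, j, s)` does. It is a serial sketch and not the authors' algorithm; the function name `assemble_csc` and the 0-based indexing are assumptions made for this example.

```python
def assemble_csc(rows, cols, vals, n_rows, n_cols):
    """Build CSC arrays (col_ptr, row_idx, data) from 0-based triplets.

    Duplicate (row, col) pairs are summed. Hypothetical helper, for illustration only.
    """
    if not (len(rows) == len(cols) == len(vals)):
        raise ValueError("rows, cols and vals must have equal length")
    for i, j in zip(rows, cols):
        if not (0 <= i < n_rows and 0 <= j < n_cols):
            raise IndexError(f"entry ({i}, {j}) is out of range")

    # Counting sort by column: start[j] is the position where column j begins.
    start = [0] * (n_cols + 1)
    for j in cols:
        start[j + 1] += 1
    for j in range(n_cols):
        start[j + 1] += start[j]

    # Scatter the triplet positions into their column buckets.
    fill = start[:-1]  # slicing copies, so start itself is left unchanged
    order = [0] * len(cols)
    for k, j in enumerate(cols):
        order[fill[j]] = k
        fill[j] += 1

    # Within each column, sort by row and merge duplicates by summation.
    col_ptr, row_idx, data = [0], [], []
    for j in range(n_cols):
        bucket = sorted(order[start[j]:start[j + 1]], key=lambda k: rows[k])
        for k in bucket:
            if len(row_idx) > col_ptr[-1] and row_idx[-1] == rows[k]:
                data[-1] += vals[k]
            else:
                row_idx.append(rows[k])
                data.append(vals[k])
        col_ptr.append(len(row_idx))
    return col_ptr, row_idx, data
```

The per-column loop is independent across columns, which is one natural place where a multicore version could split the work; how the paper actually parallelizes the operation is not stated in the abstract.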
Journal Article•10.1145/2858652•
Leveraging Hardware Message Passing for Efficient Thread Synchronization

Darko Petrović1, Thomas Ropars1, André Schiper1•
École Polytechnique Fédérale de Lausanne1
29 Jan 2016
TL;DR: This article proposes mp-server and HybComb: the former is a straightforward adaptation of the server approach to hardware message passing, whereas the latter is a novel hybrid combining algorithm. Queue and stack implementations based on these algorithms largely outperform their most efficient shared-memory-only counterparts.
Abstract: As the level of parallelism in manycore processors keeps increasing, providing efficient mechanisms for thread synchronization in concurrent programs is becoming a major concern. On cache-coherent shared-memory processors, synchronization efficiency is ultimately limited by the performance of the underlying cache coherence protocol. This article studies how hardware support for message passing can improve synchronization performance. Considering the ubiquitous problem of mutual exclusion, we devise novel algorithms for (i) classic locking, where application threads obtain exclusive access to a shared resource prior to executing their critical sections (CSes), and (ii) delegation, where CSes are executed by special threads. For classic locking, our HybLock algorithm uses a mix of shared memory and hardware message passing, which introduces the idea of hybrid synchronization algorithms. For delegation, we propose mp-server and HybComb: the former is a straightforward adaptation of the server approach to hardware message passing, whereas the latter is a novel hybrid combining algorithm. Evaluation on Tilera's TILE-Gx processor shows that HybLock outperforms the best known classic locks. Furthermore, mp-server can execute contended CSes with unprecedented throughput, as stalls related to cache coherence are removed from the critical path. HybComb can achieve comparable performance while avoiding the need to dedicate server cores. Consequently, our queue and stack implementations, based on the new synchronization algorithms, largely outperform their most efficient shared-memory-only counterparts.
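The delegation idea described above, where critical sections are executed by a dedicated thread rather than by the caller, can be sketched in Python as follows. This is only an analogy: a software `queue.Queue` stands in for hardware message passing, and the class `DelegationServer` and the helper `run_demo` are hypothetical names for this example, not the mp-server or HybComb algorithms.

```python
import queue
import threading
from concurrent.futures import Future


class DelegationServer:
    """A dedicated thread that runs critical sections on behalf of clients."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Requests are processed one at a time, so critical sections never overlap.
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            critical_section, reply = item
            try:
                reply.set_result(critical_section())
            except Exception as exc:
                reply.set_exception(exc)

    def execute(self, critical_section):
        """Send a critical section to the server and wait for its result."""
        reply = Future()
        self._requests.put((critical_section, reply))
        return reply.result()

    def shutdown(self):
        self._requests.put(None)
        self._thread.join()


def run_demo(n_threads=4, n_increments=250):
    """Increment a shared counter from several client threads without a lock."""
    server = DelegationServer()
    state = {"counter": 0}

    def increment():
        state["counter"] += 1
        return state["counter"]

    def client():
        for _ in range(n_increments):
            server.execute(increment)

    threads = [threading.Thread(target=client) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    server.shutdown()
    return state["counter"]
```

Because only the server thread touches `state`, the shared data is never accessed by two threads at once; the cost is that one thread is dedicated to serving, which is the trade-off the abstract says HybComb avoids.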
Journal Article•10.1145/2938378•
Competitively Scheduling Tasks with Intermediate Parallelizability

Sungjin Im1, Benjamin Moseley2, Kirk Pruhs3, Eric Torng4•
University of California, Merced1, Washington University in St. Louis2, University of Pittsburgh3, Michigan State University4
18 Jul 2016
TL;DR: A scheduling algorithm Intermediate-SRPT is introduced, and it is shown that it is O(log P)-competitive with respect to average flow time when scheduling jobs whose parallelizability is intermediate between being fully parallelizable and sequential.
Abstract: We introduce a scheduling algorithm Intermediate-SRPT, and show that it is O(log P)-competitive with respect to average flow time when scheduling jobs whose parallelizability is intermediate between being fully parallelizable and sequential. Here, the parameter P denotes the ratio of the maximum job size to the minimum. We also show a general matching lower bound on the competitive ratio. Our analysis builds on an interesting combination of potential function and local competitiveness arguments.
Journal Article•10.1016/J.PARCO.2015.11.001•
Bit-parallel approximate pattern matching

Tuan Tu Tran1, Yongchao Liu2, Bertil Schmidt1•
University of Mainz1, Georgia Institute of Technology2
1 May 2016
TL;DR: Mappings of the Wu-Manber algorithm onto two accelerator types, Xeon Phi coprocessors and CUDA-enabled GPUs, are investigated; both achieve around two orders-of-magnitude speedups compared to a single-threaded CPU implementation.
Abstract: Advanced SIMD features on GPUs and Xeon Phis promote efficient long pattern search. A tiled approach to accelerating the Wu-Manber algorithm on GPUs has been proposed. Both the GPU and Xeon Phi yield two orders-of-magnitude speedup over one CPU core. The GPU-based version with tiling runs up to 2.9× faster than the Xeon Phi version. Approximate pattern matching (APM) aims to find the occurrences of a pattern inside a subject text while allowing a limited number of errors. It has been widely used in many application areas such as bioinformatics and information retrieval. Bit-parallel APM takes advantage of the intrinsic parallelism of bitwise operations inside a machine word. This approach typically encodes non-deterministic finite automaton (NFA) states or value differences between adjacent cells of a dynamic programming matrix in the form of bit arrays. Wu-Manber (WM) is a well-known bit-parallel APM algorithm, which simulates an NFA and gains parallel efficiency by performing multiple state updates within a machine word. An important parameter is the machine word size (e.g. 32 or 64 bits for CPUs). Due to increasing vector capabilities, efficient mapping of bit-parallel APM algorithms onto modern high performance computing architectures is an interesting research topic. Prominent examples are Xeon Phi coprocessors and CUDA-enabled GPUs, which provide words of size 512 bits (by means of vector registers) and 1024 bits (by means of warps), respectively. In this paper, we investigate mappings of the WM algorithm onto these two accelerator types. Both architectures are able to achieve around two orders-of-magnitude speedups compared to a single-threaded CPU implementation. Moreover, our tile-based implementation on a GeForce Titan graphics card runs up to 2.9× faster than our implementation on an Intel Xeon Phi 5110P. Source code is available at http://xbitpar.sourceforge.net.
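To make the bit-parallel NFA simulation concrete, here is a minimal Python sketch in the style of the Wu-Manber algorithm (a Shift-And automaton with one bit vector per allowed error count). It is a textbook-style formulation under stated assumptions, not the paper's tiled GPU or Xeon Phi code: the function name `approximate_search` is hypothetical, errors are counted as insertions, deletions and substitutions, and Python's arbitrary-precision integers hide the machine-word-size limit that the paper is concerned with.

```python
def approximate_search(pattern, text, k):
    """Return 0-based end positions in `text` where `pattern` matches with <= k errors."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < len(pattern)")

    # masks[c] has bit j+1 set when pattern[j] == c; bit 0 stands for the empty prefix.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (j + 1))

    full = (1 << (m + 1)) - 1   # keep only bits 0..m
    accept = 1 << m             # bit m: the whole pattern has been matched

    # state[d] bit j is set when pattern[:j] matches a suffix of the text read
    # so far with at most d errors. Before any text, pattern[:j] needs j errors.
    state = [(1 << (d + 1)) - 1 for d in range(k + 1)]

    ends = []
    for pos, c in enumerate(text):
        b = masks.get(c, 0)
        old_prev = state[0]
        state[0] = (((state[0] << 1) & b) | 1) & full
        for d in range(1, k + 1):
            old_cur = state[d]
            state[d] = (
                ((old_cur << 1) & b)     # text character matches the next pattern character
                | old_prev               # extra character in the text
                | (old_prev << 1)        # substituted character
                | (state[d - 1] << 1)    # pattern character missing from the text
                | 1                      # the empty prefix always matches
            ) & full
            old_prev = old_cur
        if state[k] & accept:
            ends.append(pos)
    return ends
```

For example, `approximate_search("abc", "axc", 1)` reports position 2, since "axc" differs from the pattern by one substitution. Each update of `state[d]` handles all pattern positions at once with a handful of word operations, which is the parallelism that wider vector registers and warps extend.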
Journal Article•10.1145/2948973•
Sixteen Heuristics for Joint Optimization of Performance, Energy, and Temperature in Allocating Tasks to Multi-Cores

Hafiz Fahad Sheikh1, Ishfaq Ahmad1•
University of Texas at Arlington1
2 Aug 2016
TL;DR: 16 heuristics that utilize various methods for task-to-core allocation and frequency selection are proposed, and a methodical classification scheme is presented which not only categorizes the proposed heuristics but can also accommodate additional heuristics.
Abstract: Three-way joint optimization of performance (P), energy (E), and temperature (T) in scheduling parallel tasks to multiple cores poses a challenge that is staggering in its computational complexity. The goal of the PET optimized scheduling (PETOS) problem is to minimize three quantities: the completion time of a task graph, the total energy consumption, and the peak temperature of the system. Algorithms based on conventional multi-objective optimization techniques can be designed for solving the PETOS problem. But their execution times are exceedingly high and hence their applicability is restricted merely to problems of modest size. Exacerbating the problem is the solution space that is typically a Pareto front since no single solution can be strictly best along all three objectives. Thus, not only is the absolute quality of the solutions important but “the spread of the solutions” along each objective and the distribution of solutions within the generated tradeoff front are also desired. A natural alternative is to design efficient heuristic algorithms that can generate good solutions as well as good spreads -- note that most of the prior work in energy-efficient task allocation is predominantly single- or dual-objective oriented. Given a directed acyclic graph (DAG) representing a parallel program, a heuristic encompasses policies as to what tasks should go to what cores and at what frequency should that core operate. Various policies, such as greedy, iterative, and probabilistic, can be employed. However, the choice and usage of these policies can influence a heuristic towards a particular objective and can also profoundly impact its performance. This article proposes 16 heuristics that utilize various methods for task-to-core allocation and frequency selection. This article also presents a methodical classification scheme which not only categorizes the proposed heuristics but can also accommodate additional heuristics. 
Extensive simulation experiments compare these algorithms while shedding light on their strengths and tradeoffs.
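The remark that the solution space is typically a Pareto front can be illustrated with a small Python sketch that filters candidate schedules down to the non-dominated ones. The tuples are hypothetical (completion time, energy, peak temperature) values, all to be minimized, and the function names are assumptions for this example; none of the 16 heuristics is reproduced here.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates: (completion time, energy, peak temperature).
candidates = [(10, 5, 70), (12, 4, 65), (11, 6, 72)]
front = pareto_front(candidates)
```

Here `(11, 6, 72)` is dominated by `(10, 5, 70)`, while the remaining two trade completion time against energy and temperature, so neither can be called strictly best.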
Journal Article•10.1016/J.PARCO.2016.06.005•
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

Emanuel H. Rubensson1, Elias Rudberg1•
Uppsala University1
1 Sep 2016
TL;DR: In this paper, a distributed quadtree matrix representation is proposed for block-sparse matrix-matrix multiplication on distributed memory clusters, where data locality is exploited without prior information about matrix sparsity pattern.
Abstract: We present a method for parallel block-sparse matrix-matrix multiplication. A distributed quadtree matrix representation allows exploitation of data locality. The quadtree structure is implemented using the Chunks and Tasks programming model. Data locality is exploited without prior information about matrix sparsity pattern. Constant communication per node on average is achieved in weak scaling tests. We present a method for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. By using a quadtree matrix representation, data locality is exploited without prior information about the matrix sparsity pattern. A distributed quadtree matrix representation is straightforward to implement due to our recent development of the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined with the Chunks and Tasks model leads to favorable weak and strong scaling of the communication cost with the number of processes, as shown both theoretically and in numerical experiments. Matrices are represented by sparse quadtrees of chunk objects. The leaves in the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by the matrix library and may occur at any level in the hierarchy and/or within the submatrix leaves. If graphics processing units (GPUs) are available, both CPUs and GPUs are used for leaf-level multiplication work, thus making use of the full computing capacity of each node. The performance is evaluated for matrices with different sparsity structures, including examples from electronic structure calculations. Compared to methods that do not exploit data locality, our locality-aware approach reduces communication significantly, achieving essentially constant communication per node in weak scaling tests.
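The quadtree representation can be sketched in a few lines of Python. In this single-process illustration a matrix is `None` (an all-zero block), a dense leaf block (a list of rows), or a 4-tuple of child quadrants `(q00, q01, q10, q11)`; zero branches are skipped at whatever level they occur. The encoding and the function names are assumptions for this example, both operands are assumed to share the same tree shape, and the Chunks and Tasks runtime, distribution, and block-sparse leaves of the paper are not modeled.

```python
def add(a, b):
    """Add two quadtree matrices; None stands for an all-zero block."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, tuple):
        return tuple(add(x, y) for x, y in zip(a, b))
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def multiply(a, b):
    """Multiply two quadtree matrices, skipping zero branches at every level."""
    if a is None or b is None:
        return None  # zero times anything is zero: no work, no data touched
    if isinstance(a, tuple):
        a00, a01, a10, a11 = a
        b00, b01, b10, b11 = b
        c = (
            add(multiply(a00, b00), multiply(a01, b10)),
            add(multiply(a00, b01), multiply(a01, b11)),
            add(multiply(a10, b00), multiply(a11, b10)),
            add(multiply(a10, b01), multiply(a11, b11)),
        )
        return None if all(q is None for q in c) else c
    # Leaf level: plain dense block product.
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Usage: a block-diagonal matrix times a block upper-triangular matrix.
identity = [[1, 0], [0, 1]]
block = [[1, 2], [3, 4]]
left = (identity, None, None, block)
right = (block, block, None, identity)
product = multiply(left, right)
```

Each of the eight recursive products in a node is an independent piece of work on a small set of sub-blocks, which hints at why a task-based model can exploit locality without knowing the sparsity pattern in advance.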
Journal Article•10.1145/2938389•
Experimental Analysis of Space-Bounded Schedulers

Harsha Vardhan Simhadri1, Guy E. Blelloch2, Jeremy T. Fineman3, Phillip B. Gibbons4, Aapo Kyrola2 •
Lawrence Berkeley National Laboratory1, Carnegie Mellon University2, Georgetown University3, Intel4
28 Jun 2016
TL;DR: The results indicate that space-bounded schedulers reduce the number of L3 cache misses compared to work-stealing schedulers by 25% to 65% for most of the benchmarks, but incur up to 27% additional scheduler and load-imbalance overhead.
Abstract: The running time of nested parallel programs on shared-memory machines depends in significant part on how well the scheduler mapping the program to the machine is optimized for the organization of caches and processor cores on the machine. Recent work proposed “space-bounded schedulers” for scheduling such programs on the multilevel cache hierarchies of current machines. The main benefit of this class of schedulers is that they provably preserve locality of the program at every level in the hierarchy, which can result in fewer cache misses and better use of bandwidth than the popular work-stealing scheduler. On the other hand, compared to work stealing, space-bounded schedulers are inferior at load balancing and may have greater scheduling overheads, raising the question as to the relative effectiveness of the two schedulers in practice. In this article, we provide the first experimental study aimed at addressing this question. To facilitate this study, we built a flexible experimental framework with separate interfaces for programs and schedulers. This enables a head-to-head comparison of the relative strengths of schedulers in terms of running times and cache miss counts across a range of benchmarks. (The framework is validated by comparisons with the Intel® Cilk™ Plus work-stealing scheduler.) We present experimental results on a 32-core Xeon® 7560 comparing work stealing, hierarchy-minded work stealing, and two variants of space-bounded schedulers on both divide-and-conquer microbenchmarks and some popular algorithmic kernels. Our results indicate that space-bounded schedulers reduce the number of L3 cache misses compared to work-stealing schedulers by 25% to 65% for most of the benchmarks, but incur up to 27% additional scheduler and load-imbalance overhead. Only for memory-intensive benchmarks can the reduction in cache misses overcome the added overhead, resulting in up to a 25% improvement in running time for synthetic benchmarks and about 20% improvement for algorithmic kernels. We also quantify runtime improvements varying the available bandwidth per core (the “bandwidth gap”) and show up to 50% improvements in the running times of kernels as this gap increases fourfold. As part of our study, we generalize prior definitions of space-bounded schedulers to allow for more practical variants (while still preserving their guarantees) and explore implementation tradeoffs.
Journal Article•10.1145/2858650•
Well-Structured Futures and Cache Locality

Maurice Herlihy1, Zhiyu Liu1•
Brown University1
9 Feb 2016
TL;DR: If futures are used in a simple, disciplined way, with each future touched only once, then parallel executions with work stealing can incur at most O(CPT∞²) additional cache misses, a substantial improvement over the worst case.
Abstract: In fork-join parallelism, a sequential program is split into a directed acyclic graph of tasks linked by directed dependency edges, and the tasks are executed, possibly in parallel, in an order consistent with their dependencies. A popular and effective way to extend fork-join parallelism is to allow threads to create futures. A thread creates a future to hold the results of a computation, which may or may not be executed in parallel. That result is returned when some thread touches that future, blocking if necessary until the result is ready. Recent research has shown that although futures can, of course, enhance parallelism in a structured way, they can have a deleterious effect on cache locality. In the worst case, futures can incur Ω(PT∞ + tT∞) deviations, which implies Ω(CPT∞ + CtT∞) additional cache misses, where C is the number of cache lines, P is the number of processors, t is the number of touches, and T∞ is the computation span. Since cache locality has a large impact on software performance on modern multicores, this result is troubling. In this article, we show that if futures are used in a simple, disciplined way, then the situation is much better: if each future is touched only once, either by the thread that created it or by a later descendant of the thread that created it, then parallel executions with work stealing can incur at most O(CPT∞²) additional cache misses, a substantial improvement. This structured use of futures is characteristic of many (but not all) parallel applications.
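The single-touch discipline can be shown with a short Python sketch: a thread creates one future, continues with its own work, and then touches that future exactly once. The function names `parallel_sum` and `demo` are hypothetical. Note that Python's `concurrent.futures` thread pool is not a work-stealing scheduler, so this illustrates only the usage pattern, not the cache-miss bounds from the article.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, executor):
    """Sum `values` using one future that is touched exactly once by its creator."""
    mid = len(values) // 2
    left = executor.submit(sum, values[:mid])  # create the future
    right = sum(values[mid:])                  # the creating thread keeps working
    return left.result() + right               # the single touch; blocks if not ready


def demo():
    with ThreadPoolExecutor(max_workers=2) as executor:
        return parallel_sum(list(range(1000)), executor)
```

A pattern that would fall outside this discipline is passing `left` to several unrelated threads that each call `left.result()`, since the future would then be touched more than once.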
Journal Article•10.1016/J.PARCO.2016.01.001•
A secure and efficient file protecting system based on SHA3 and parallel AES

Xiongwei Fei1, Kenli Li1, Yang Wangdong1, Keqin Li2•
Hunan University1, State University of New York System2
1 Feb 2016
TL;DR: SEFPS based on advanced SHA3 and parallel AES can provide the protection of both confidentiality and integrity, and produce high performance by GPU parallelism or/and CPU parallelism, and can be used in computers no matter whether equipped with Nvidia GPUs or not.
Abstract: We propose a secure and efficient file protecting system (SEFPS) that uses currently excellent SHA3 and parallel AES. We prove the security of SEFPS in terms of confidentiality and integrity. We design and implement the three variants of SEFPS adopting GPU parallelism or/and CPU parallelism. We evaluate the SEFPS' performance on two representative platforms. There are many private or confidential files stored in computers or transferred on the Internet. People worry about and even fear security problems such as stealing, breaking, and forging, and urgently need a file protecting system that is both secure and highly efficient (for a better experience). Thus, we propose and implement a secure and efficient file protecting system (SEFPS) to meet this demand. SEFPS, based on advanced SHA3 (Secure Hash Algorithm 3) and parallel AES (Advanced Encryption Standard), can provide the protection of both confidentiality and integrity, and produce high performance by GPU (Graphics Processing Unit) parallelism or/and CPU (Central Processing Unit) parallelism. Correspondingly, SEFPS has three variants for GPU parallelism, CPU parallelism, and both of them. The first variant includes CPU Parallel Protecting (CPP) and CPU Parallel Unprotecting (CPUP). The second variant includes GPU Parallel Protecting (GPP) and GPU Parallel Unprotecting (GPUP). The third variant includes Hybrid Parallel Protecting (HPP) and Hybrid Parallel Unprotecting (HPUP). We design and implement them, and evaluate their performance on two representative platforms by some experiments. HPP and HPUP outperform GPP and GPUP, respectively, and outperform CPP and CPUP more. For those computers not equipped with Nvidia GPUs, CPP and CPUP can be employed, because they still outperform CSP and CSUP, respectively, where CSP and CSUP denote the serial implementation of SEFPS for protecting and for unprotecting, respectively. Moreover, we also prove the security of SEFPS in terms of integrity and confidentiality. Thus, SEFPS is a secure and efficient file protecting system, and can be used in computers no matter whether equipped with Nvidia GPUs or not.
Journal Article•10.1016/J.PARCO.2016.05.015•
Application power profiling on IBM Blue Gene/Q

Sean Wallace1, Zhou Zhou1, Venkatram Vishwanath2, Susan Coghlan2, John R. Tramm2, Zhiling Lan1, Michael E. Papka3 •
Illinois Institute of Technology1, Argonne National Laboratory2, Northern Illinois University3
1 Sep 2016
TL;DR: It is shown that MonEQ is lightweight, has extremely low overhead, is incredibly flexible, and has advanced features which end users can take advantage of; the paper also describes how seemingly simple changes in scale or network topology can have dramatic effects on power consumption.
Abstract: We describe our power profiling library called MonEQ, built on the IBM-provided API. We integrate MonEQ into several benchmarks to show the data it produces. Applications have different power profiles based on usage of domains in node. There is no difference in power consumption in network for a given network topology. Scale can reduce power consumption but is far less important than execution time. The power consumption of state-of-the-art supercomputers, because of their complexity and unpredictable workloads, is extremely difficult to estimate. Accurate and precise results, as are now possible with the latest generation of IBM Blue Gene/Q, are therefore a welcome addition to the landscape. Only recently have end users been afforded the ability to access the power consumption of their applications. However, just because it's possible for end users to obtain this data does not mean it's a trivial task. This emergence of new data is therefore not only understudied, but also not fully understood. In this paper, we describe our open source power profiling library called MonEQ, built on the IBM-provided Environmental Monitoring (EMON) API. We show that it's lightweight, has extremely low overhead, is incredibly flexible, and has advanced features which end users can take advantage of. We then integrate MonEQ into several benchmarks and show the data it produces and what analysis of this data can teach us. Going one step further we also describe how seemingly simple changes in scale or network topology can have dramatic effects on power consumption. To this end, previously well understood applications will now have new facets of potential analysis.
Journal Article•10.1016/J.PARCO.2015.12.004•
Exploiting task and data parallelism in ILUPACK's preconditioned CG solver on NUMA architectures and many-core accelerators

José Ignacio Aliaga1, Rosa M. Badia2, Maria Barreda1, Matthias Bollhöfer3, Ernesto Dufrechou4, Pablo Ezzatti4, Enrique S. Quintana-Ortí1 •
James I University1, Spanish National Research Council2, Braunschweig University of Technology3, University of the Republic4
1 May 2016
TL;DR: This work presents specialized implementations of the preconditioned iterative linear system solver in ILUPACK for Non-Uniform Memory Access (NUMA) platforms and many-core hardware co-processors based on the Intel Xeon Phi and graphics accelerators.
Abstract: Specialized implementations of ILUPACK's iterative solver for NUMA platforms. Specialized implementations of ILUPACK's iterative solver for many-core accelerators. Exploitation of task parallelism via OmpSs runtime (dynamic schedule). Exploitation of task parallelism via MPI (static schedule). Exploitation of data parallelism for GPUs. We present specialized implementations of the preconditioned iterative linear system solver in ILUPACK for Non-Uniform Memory Access (NUMA) platforms and many-core hardware co-processors based on the Intel Xeon Phi and graphics accelerators. For the conventional x86 architectures, our approach exploits task parallelism via the OmpSs runtime as well as a message-passing implementation based on MPI, respectively yielding a dynamic and static schedule of the work to the cores, with different numeric semantics to those of the sequential ILUPACK. For the graphics processor we exploit data parallelism by off-loading the computationally expensive kernels to the accelerator while keeping the numeric semantics of the sequential case.
Journal Article•10.1016/J.PARCO.2016.01.003•
Hybrid MPI-thread parallelization of adaptive mesh operations

Dan Ibanez1, Ian Dunn1, Mark S. Shephard1•
Rensselaer Polytechnic Institute1
1 Feb 2016
TL;DR: The term phased message passing is used to denote the communication interface based on this termination detection, which is then used to implement parallel operations for adaptive unstructured meshes, and the performance of resulting applications is compared to pure MPI operation.
Abstract: Development of a hybrid MPI-thread programming system called PCU. Inter-thread message passing, including non-blocking collectives. A novel, scalable termination detection technique for communication rounds. Hybrid parallel scalability to 16K cores on an IBM Blue Gene/Q. Many of the world's leading supercomputer architectures are a hybrid of shared memory and network-distributed memory. Such an architecture lends itself to a hybrid MPI-thread programming model. We first present an implementation of inter-thread message passing based on the MPI and pthread libraries. In addition, we present an efficient implementation of termination detection for communication rounds. We use the term phased message passing to denote the communication interface based on this termination detection. This interface is then used to implement parallel operations for adaptive unstructured meshes, and the performance of resulting applications is compared to pure MPI operation. We also present new workflows enabled by the ability to vary the number of threads during runtime.
Journal Article•10.1016/J.PARCO.2016.05.009•
Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Anthony Agelastos1, Benjamin A. Allan1, Jim Brandt1, Ann C. Gentile1, Sophia Lefantzi1, Steve Monk1, Jeff Ogden1, Mahesh Rajan1, Joel O. Stevenson1 •
Sandia National Laboratories1
1 Oct 2016
TL;DR: Both system and application profiling results based on data obtained through synchronized system wide monitoring on a production HPC cluster at Sandia National Laboratories are presented.
Abstract: Monitoring can provide meaningful system and application profiling in production. Visual and analytical characterizations can inform usage and procurement decisions. Resource utilization scoring provides simple but informative characterizations. Continuous, synchronous, high-fidelity, whole-system monitoring is required. A detailed understanding of HPC applications' resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.
Journal Article•10.1016/J.PARCO.2016.04.004•
Reducing latency cost in 2D sparse matrix partitioning models

R. Oguz Selvitopi1, Cevdet Aykanat1•
Bilkent University1
1 Sep 2016
TL;DR: This work proposes two new models based on 2D checkerboard and jagged partitioning that aim at minimizing total message count while maintaining a balance on communication volume loads of processors and analyzes practical performance of 2D models on this scale.
Abstract: Sparse matrix partitioning is a common technique used for improving performance of parallel linear iterative solvers. Compared to solvers used for symmetric linear systems, solvers for nonsymmetric systems offer more potential for addressing different multiple communication metrics due to the flexibility of adopting different partitions on the input and output vectors of sparse matrix-vector multiplication operations. In this regard, there exist works based on one-dimensional (1D) and two-dimensional (2D) fine-grain partitioning models that effectively address both bandwidth and latency costs in nonsymmetric solvers. In this work, we propose two new models based on 2D checkerboard and jagged partitioning. These models aim at minimizing total message count while maintaining a balance on communication volume loads of processors; hence, they address both bandwidth and latency costs. We evaluate all partitioning models on two nonsymmetric system solvers implemented using the widely adopted PETSc toolkit and conduct extensive experiments using these solvers on a modern system (a BlueGene/Q machine) successfully scaling them up to 8K processors. Along with the proposed models, we put practical aspects of eight evaluated models (two 1D- and six 2D-based) under thorough analysis. To the best of our knowledge, this is the first work that analyzes practical performance of 2D models on this scale. Among evaluated models, the models that rely on 2D jagged partitioning obtain the most promising results by striking a balance between minimizing bandwidth and latency costs.