TL;DR: It is argued that skewed distributions in natural graphs also necessitate differentiated processing on high-degree and low-degree vertices, and PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems, is introduced.
Abstract: Natural graphs with skewed distributions raise unique challenges to distributed graph computation and partitioning. Existing graph-parallel systems usually use a “one-size-fits-all” design that uniformly processes all vertices, which either suffer from notable load imbalance and high contention for high-degree vertices (e.g., Pregel and GraphLab) or incur high communication cost and memory consumption even for low-degree vertices (e.g., PowerGraph and GraphX). In this article, we argue that skewed distributions in natural graphs also necessitate differentiated processing on high-degree and low-degree vertices. We then introduce PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems. Specifically, PowerLyra uses centralized computation for low-degree vertices to avoid frequent communications and distributes the computation for high-degree vertices to balance workloads. PowerLyra further provides an efficient hybrid graph partitioning algorithm (i.e., hybrid-cut) that combines edge-cut (for low-degree vertices) and vertex-cut (for high-degree vertices) with heuristics. To improve cache locality of inter-node graph accesses, PowerLyra further provides a locality-conscious data layout optimization. PowerLyra is implemented based on the latest GraphLab and can seamlessly support various graph algorithms running in both synchronous and asynchronous execution modes. A detailed evaluation on three clusters using various graph-analytics and MLDM (Machine Learning and Data Mining) applications shows that PowerLyra outperforms PowerGraph by up to 5.53X (from 1.24X) and 3.26X (from 1.49X) for real-world and synthetic graphs, respectively, and is much faster than other systems like GraphX and Giraph, yet with much less memory consumption. A porting of hybrid-cut to GraphX further confirms the efficiency and generality of PowerLyra.
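The hybrid-cut idea in the abstract above (edge-cut for low-degree vertices, vertex-cut for high-degree vertices) can be illustrated with a minimal sketch. The abstract does not give the placement rule, so everything here is an assumption made for illustration: the function name `hybrid_cut`, the use of in-degree to classify vertices, the `threshold` parameter, and modulo hashing of integer vertex IDs. It is not PowerLyra's implementation.

```python
from collections import Counter


def hybrid_cut(edges, num_parts, threshold):
    """Assign each directed edge (src, dst) to a partition.

    Assumption (not taken from the abstract): a vertex counts as
    high-degree when its in-degree exceeds `threshold`, and hashing
    is a plain modulo on integer vertex IDs.
    """
    in_degree = Counter(dst for _, dst in edges)
    assignment = {}
    for src, dst in edges:
        if in_degree[dst] > threshold:
            # High-degree target: spread its in-edges across partitions
            # by hashing the source (vertex-cut style).
            part = src % num_parts
        else:
            # Low-degree target: keep all of its in-edges on one
            # partition by hashing the target (edge-cut style).
            part = dst % num_parts
        assignment[(src, dst)] = part
    return assignment


if __name__ == "__main__":
    # Vertex 0 has four in-edges (high-degree); vertex 5 has two (low-degree).
    demo_edges = [(1, 0), (2, 0), (3, 0), (4, 0), (0, 5), (1, 5)]
    for edge, part in hybrid_cut(demo_edges, num_parts=2, threshold=3).items():
        print(edge, "-> partition", part)
```

In this sketch the in-edges of a low-degree vertex always land together, so a computation on that vertex needs no communication to gather its inputs, while the work for a high-degree vertex is spread over several partitions.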
TL;DR: Polynomial and pseudo-polynomial time schedulability tests are presented for determining whether a given task set can be scheduled by EDF or DM to always meet deadlines on a specified number of processors.
Abstract: A model is considered for representing recurrent precedence-constrained tasks that are to execute on multiprocessor platforms. A recurrent task is specified as a directed acyclic graph (DAG), a period, and a relative deadline. Each vertex of the DAG represents a sequential job, while the edges of the DAG represent precedence constraints between these jobs. All the jobs of the DAG are released simultaneously and need to complete execution within the specified relative deadline of their release. Each task may release jobs in this manner an unbounded number of times, with successive releases occurring at least the specified period apart. Conditional control structures are also allowed. The scheduling problem is to determine whether a set of such recurrent tasks can be scheduled to always meet all deadlines upon a specified number of identical processors. This problem is shown to be computationally intractable, but amenable to efficient approximate solutions. Earliest Deadline First (EDF) and Deadline Monotonic (DM) are shown to be good approximate global scheduling algorithms. Polynomial and pseudo-polynomial time schedulability tests, of differing effectiveness, are presented for determining whether a given task set can be scheduled by EDF or DM to always meet deadlines on a specified number of processors.
TL;DR: A GPU parallelization of a matrix-free geometric multigrid iterative solver targeting moderate and high polynomial degrees, with support for general curved and adaptively refined hexahedral meshes with hanging nodes, is developed.
Abstract: This article presents matrix-free finite-element techniques for efficiently solving partial differential equations on modern many-core processors, such as graphics cards. We develop a GPU parallelization of a matrix-free geometric multigrid iterative solver targeting moderate and high polynomial degrees, with support for general curved and adaptively refined hexahedral meshes with hanging nodes. The central algorithmic component is the matrix-free operator evaluation with sum factorization. We compare the node-level performance of our implementation running on an Nvidia Pascal P100 GPU to a highly optimized multicore implementation running on comparable Intel Broadwell CPUs and an Intel Xeon Phi. Our experiments show that the GPU implementation is approximately 1.5 to 2 times faster across four different scenarios of the Poisson equation and a variety of element degrees in 2D and 3D. The lowest time to solution per degree of freedom is recorded for moderate polynomial degrees between 3 and 5. A detailed performance analysis highlights the capabilities of the GPU architecture and the chosen execution model with threading within the element, particularly with respect to the evaluation of the matrix-vector product. Atomic intrinsics are shown to provide a fast way for avoiding the possible race conditions in summing the elemental residuals into the global vector associated to shared vertices, edges, and surfaces. In addition, the solver infrastructure allows for using mixed-precision arithmetic that performs the multigrid V-cycle in single precision with an outer correction in double precision, increasing throughput by up to 83%.
TL;DR: This work presents an autotuning system for optimizing I/O performance, addressing the challenge that good I/O performance for a broad range of applications on diverse HPC platforms is hard to obtain, in part because of complex interdependencies between I/O middleware and hardware.
Abstract: Parallel input/output (I/O) is an essential component of modern high-performance computing (HPC). Obtaining good I/O performance for a broad range of applications on diverse HPC platforms is a major challenge, in part, because of complex interdependencies between I/O middleware and hardware. The parallel file system and I/O middleware layers all offer optimization parameters that can, in theory, result in better I/O performance. Unfortunately, the right combination of parameters is highly dependent on the application, HPC platform, problem size, and concurrency. Scientific application developers do not have the time or expertise to take on the substantial burden of identifying good parameters for each problem configuration. They resort to using system defaults, a choice that frequently results in poor I/O performance. We expect this problem to be compounded on exascale-class machines, which will likely have a deeper software stack with hierarchically arranged hardware resources. We present as a solution to this problem an autotuning system for optimizing I/O performance, I/O performance modeling, I/O tuning, and I/O patterns. We demonstrate the value of this framework across several HPC platforms and applications at scale.
TL;DR: A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.
Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. We build the performance model for hash-table and heap-based algorithms, which supports the recipe. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and achieving an overall performance boost for the whole HipMCL application by 2.6X.
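To make the finding about unsorted output concrete, here is a minimal sketch of a row-wise, hash-table-based SpGEMM. The storage format (one `dict` of column to value per row), the function name `spgemm_hash`, and the `sort_output` flag are assumptions for illustration; this is not the authors' optimized KNL implementation. A Python `dict` plays the role of the per-row hash table, and sorting becomes an optional extra pass per output row.

```python
def spgemm_hash(a_rows, b_rows, sort_output=False):
    """Compute C = A * B row by row with a hash-table accumulator.

    a_rows, b_rows: lists with one dict {column: value} per matrix row
    (a hypothetical format chosen for brevity).
    Returns a list of (column, value) lists, one per row of C.
    """
    c_rows = []
    for a_row in a_rows:
        acc = {}  # hash table: output column -> accumulated value
        for k, a_val in a_row.items():
            # Scale row k of B by A[i, k] and merge it into the accumulator.
            for j, b_val in b_rows[k].items():
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        entries = list(acc.items())
        if sort_output:
            # Extra per-row pass that is skipped when the caller accepts
            # unsorted column indices.
            entries.sort()
        c_rows.append(entries)
    return c_rows


if __name__ == "__main__":
    A = [{0: 1.0, 1: 2.0}, {1: 3.0}]
    B = [{1: 4.0}, {0: 5.0, 1: 6.0}]
    print(spgemm_hash(A, B))                    # columns in insertion order
    print(spgemm_hash(A, B, sort_output=True))  # columns in ascending order
```

The unsorted variant returns each row in whatever order the hash table yields, which is enough for consumers that only iterate over the nonzeros.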
TL;DR: The enhanced features of TAMPI not only simplify the development of hybrid MPI+OpenMP applications that use blocking or non-blocking MPI primitives but they also naturally overlap computation and communication phases, which improves application performance and scalability by removing artificial dependencies across communication tasks.
Abstract: In this paper we present the Task-Aware MPI library (TAMPI) that integrates both blocking and non-blocking MPI primitives with task-based programming models. The TAMPI library leverages two new runtime APIs to improve both programmability and performance of hybrid applications. The first API allows to pause and resume the execution of a task depending on external events. This API is used to improve the interoperability between blocking MPI communication primitives and tasks. When an MPI operation executed inside a task blocks, the task running is paused so that the runtime system can schedule a new task on the core that became idle. Once the blocked MPI operation is completed, the paused task is put again on the runtime system’s ready queue, so eventually it will be scheduled again and its execution will be resumed. The second API defers the release of dependencies associated with a task completion until some external events are fulfilled. This API is composed only of two functions, one to bind external events to a running task and another function to notify about the completion of external events previously bound. TAMPI leverages this API to bind non-blocking MPI operations with tasks, deferring the release of their task dependencies until both task execution and all its bound MPI operations are completed. Our experiments reveal that the enhanced features of TAMPI not only simplify the development of hybrid MPI+OpenMP applications that use blocking or non-blocking MPI primitives but they also naturally overlap computation and communication phases, which improves application performance and scalability by removing artificial dependencies across communication tasks.
TL;DR: Analysis shows that data I/O can take up to 90% of the total training time of scalable deep learning; LMDBIO, an optimized I/O plugin with six optimizations over LMDB, improves overall application performance by up to 65-fold on a 9,216-core system.
Abstract: Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This situation is especially true for training models where the computation can be effectively parallelized, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO—an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.
TL;DR: The latest optimizations of the ELPA-library, a well-established dense direct eigenvalue solver library widely used in computational materials science, are described for Intel Skylake processors with AVX-512 and for GPU-accelerated HPC systems, together with a redesigned modular API and an autotuning functionality.
Abstract: The solution of (generalized) eigenvalue problems for symmetric or Hermitian matrices is a common subtask of many numerical calculations in electronic structure theory or materials science. Depending on the scientific problem, solving the eigenvalue problem can easily amount to a sizeable fraction of the whole numerical calculation, and quite often is even the dominant part by far. For researchers in the field of computational materials science, an efficient and scalable solution of the eigenvalue problem is thus of major importance. The ELPA-library (Eigenvalue SoLvers for Petaflop-Applications) is a well-established dense direct eigenvalue solver library, which has proven to be very efficient and scalable up to very large core counts. It is in a wide-spread production use on a large variety of HPC systems worldwide, and is applied by many codes in the field of materials science. In this paper, we describe the latest optimizations of the ELPA-library for new HPC architectures of the Intel Skylake processor family with an AVX-512 SIMD instruction set, or for HPC systems accelerated with recent GPUs. Apart from those direct hardware-related optimizations, we also describe a complete redesign of the API in a modern modular way, which, apart from a much simpler and more flexible usability, leads to a new path to access system-specific performance optimizations. In order to ensure optimal performance for a particular scientific setting or a specific HPC system, the new API allows the user to influence in a straightforward way the internal details of the algorithms and of performance-critical parameters used in the ELPA-library. On top of that, we introduce an autotuning functionality, which allows for finding the best settings in a self-contained automated way, without the need of much user effort. 
In situations where many eigenvalue problems with similar settings have to be solved consecutively, the autotuning process of the ELPA-library can be done “on-the-fly”, without the need of preceding the simulation with an “artificial” autotuning step. Practical applications from materials science which rely on reaching a numerical convergence limit by so-called self-consistency iterations can profit from the on-the-fly autotuning. On some examples of scientific interest, simulated with the FHI-aims application, the advantages of the latest optimizations of the ELPA-library are demonstrated.
TL;DR: Motivated by the detection of cerebrovascular accidents by microwave tomographic imaging, this work addresses an inverse problem solved by a minimization algorithm (for example, gradient-based) whose successive iterations repeatedly solve a direct problem, namely Maxwell's equations in the time-harmonic regime, using domain decomposition methods adapted to parallel architectures.
Abstract: The motivation of this work is the detection of cerebrovascular accidents by microwave tomographic imaging. This requires the solution of an inverse problem relying on a minimization algorithm (for example, gradient-based), where successive iterations consist in repeated solutions of a direct problem. The reconstruction algorithm is extremely computationally intensive and makes use of efficient parallel algorithms and high-performance computing. The feasibility of this type of imaging is conditioned on one hand by an accurate reconstruction of the material properties of the propagation medium and on the other hand by a considerable reduction in simulation time. Fulfilling these two requirements will enable a very rapid and accurate diagnosis. From the mathematical and numerical point of view, this means solving Maxwell's equations in time-harmonic regime by appropriate domain decomposition methods, which are naturally adapted to parallel architectures.
TL;DR: Tapir is a compiler intermediate representation that embeds recursive fork-join parallelism into a mainstream compiler's IR, enabling existing serial optimizations to work with parallel control constructs with only minor changes to the compiler.
Abstract: Tapir (pronounced TAY-per) is a compiler intermediate representation (IR) that embeds recursive fork-join parallelism, as supported by task-parallel programming platforms such as Cilk and OpenMP, into a mainstream compiler’s IR. Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations on and across parallel control constructs. Remedying this situation has generally been thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir leverages the “serial-projection property,” which is commonly satisfied by task-parallel programs, to handle the semantics of these programs without an extensive rework of the compiler. For recursive fork-join programs that satisfy the serial-projection property, Tapir enables effective compiler optimization of parallel programs with only minor changes to existing compiler analyses and code transformations. Tapir uses the serial-projection property to order logically parallel fine-grained tasks in the program’s control-flow graph. This ordered representation of parallel tasks allows the compiler to optimize parallel codes effectively with only minor modifications. For example, to implement Tapir/LLVM, a prototype of Tapir in the LLVM compiler, we added or modified fewer than 3,000 lines of LLVM’s half-million-line core middle-end functionality. These changes sufficed to enable LLVM’s existing compiler optimizations for serial code (including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination) to work with parallel control constructs such as parallel loops and Cilk’s cilk_spawn keyword. Tapir also supports parallel optimizations, such as loop scheduling, which restructure the parallel control flow of the program. 
By making use of existing LLVM optimizations and new parallel optimizations, Tapir/LLVM can optimize recursive fork-join programs more effectively than traditional compilation methods. On a suite of 35 Cilk application benchmarks, Tapir/LLVM produces more efficient executables for 30 benchmarks, with faster 18-core running times for 26 of them, compared to a nearly identical compiler that compiles parallel linguistic constructs the traditional way.
TL;DR: This work achieves predictability on multi-cores by employing the predictable execution model (PREM), which splits execution into a sequence of memory and compute phases, and schedules these such that only a single core is executing a memory phase at a time.
Abstract: Many applications require both high performance and predictable timing. High-performance can be provided by COTS Multi-Core System on Chips (MPSoC), however, as cores in these systems share main memory, they are susceptible to interference from each other, which is a problem for timing predictability. We achieve predictability on multi-cores by employing the predictable execution model (PREM), which splits execution into a sequence of memory and compute phases, and schedules these such that only a single core is executing a memory phase at a time. We present a toolchain consisting of a compiler and a scheduling tool. Our compiler uses region and loop based analysis and performs tiling to transform application code into PREM-compliant binaries. In addition to enabling predictable execution, the compiler transformation optimizes accesses to the shared main memory. The scheduling tool uses a state-of-the-art heuristic algorithm and is able to schedule industrial-size instances. For smaller instances, we compare the results of the algorithm with optimal solutions found by solving an integer linear programming model. Furthermore, we solve the problem of scheduling execution on multiple cores while preventing interference of memory phases. We evaluate our toolchain on Advanced Driver Assistance System (ADAS) application workloads running on an NVIDIA Tegra X1 embedded system-on-chip (SoC). The results show that our approach maintains similar average performance to the original (unmodified) program code and execution, while reducing variance of completion times by a factor of 9 with the identified optimal solutions and by a factor of 5 with schedules generated by our heuristic scheduler.
TL;DR: Task-based parallel programming, which has been successful on shared memory systems, is extended to distributed memory systems using a hierarchical decomposition of tasks and data that accommodates the different levels of hardware.
Abstract: Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored approaches. Task-based parallel programming has been successful both in simplifying the programming and in exploiting the available hardware parallelism for shared memory systems. In this paper we focus on how to extend task-parallel programming to distributed memory systems. We use a hierarchical decomposition of tasks and data in order to accommodate the different levels of hardware. We test the proposed programming model on two different applications, a Cholesky factorization, and a solver for the Shallow Water Equations. We also compare the performance of our implementation with that of other frameworks for distributed task-parallel programming, and show that it is competitive.
TL;DR: The experiments on NVIDIA’s K40 and P100 architectures reveal that the variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality.
Abstract: In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix–vector multiplication kernel that transforms the linear systems’ right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA’s K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver.
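The block-Jacobi setup and application described above can be sketched on the CPU as follows. This is a sequential, pure-Python illustration under stated assumptions: it inverts each diagonal block with Gauss-Jordan elimination using explicit row swaps (ordinary partial pivoting), which is a simplification and not the implicit pivoting strategy of the paper, and the function names `gje_inverse`, `batched_inverse`, and `apply_block_jacobi` are hypothetical.

```python
def gje_inverse(a):
    """Invert a small dense matrix with Gauss-Jordan elimination.

    Uses explicit row swaps (partial pivoting); the implicit pivoting
    described in the abstract is NOT reproduced here.
    """
    n = len(a)
    # Build the augmented matrix [A | I].
    m = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        # Choose the row with the largest magnitude in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        # Eliminate the column from every other row (above and below).
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col]
                m[r] = [rv - f * cv for rv, cv in zip(m[r], m[col])]
    return [row[n:] for row in m]


def batched_inverse(blocks):
    """Invert a batch of independent blocks whose sizes may differ."""
    return [gje_inverse(block) for block in blocks]


def apply_block_jacobi(inv_blocks, rhs_blocks):
    """Apply the preconditioner: one small matrix-vector product per block."""
    return [[sum(a * x for a, x in zip(row, rhs)) for row in inv]
            for inv, rhs in zip(inv_blocks, rhs_blocks)]


if __name__ == "__main__":
    blocks = [[[2.0]], [[4.0, 7.0], [2.0, 6.0]]]  # a 1x1 and a 2x2 block
    inverses = batched_inverse(blocks)
    print(apply_block_jacobi(inverses, [[4.0], [1.0, 0.0]]))
```

Each block is independent of the others, which is what makes the batched formulation a natural fit for one GPU warp or thread group per block.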
TL;DR: In this article, the authors consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs, and derive a performance model and show that the new developments can be tuned to obtain performance within 90% of the optimal for any of the architectures of interest.
Abstract: Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
TL;DR: Two new designs for MPI runtimes are proposed: a pipelined chain (PC) design for MPI_Bcast that provides efficient intra- and inter-node communication of GPU buffers, and a Topology-Aware pipelined chain (TA-PC) design for systems with multiple GPUs that fully exploits all the PCIe links available within a multi-GPU node.
Abstract: Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and GPU clusters with a relatively smaller number of nodes, efficient communication schemes need to be designed for such systems. This coupled with new application workloads brought forward by Deep Learning (DL) frameworks like Caffe and Microsoft Cognitive Toolkit (CNTK) pose additional design constraints due to very large message communication of GPU buffers during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have emerged to deal with DL workloads. In this paper, we address these new challenges for MPI runtimes and propose two new designs to deal with them: (1) A pipelined chain (PC) design for MPI_Bcast that provides efficient intra- and inter-node communication of GPU buffers, and (2) A Topology-Aware pipelined chain (TA-PC) design for systems with multiple GPUs to fully exploit all the available PCIe links available within a multi-GPU node. To highlight the benefits of our designs, we present an in-depth performance landscape for the proposed MPI_Bcast (MPI) designs, our earlier NCCL-based MPI_Bcast (MPI+NCCL1) design, and ncclBroadcast (NCCL2) design. The proposed designs offer up to 14 × and 16.6 × better performance than MPI+NCCL1 based solutions for intra- and inter-node broadcast latency, respectively. With the recent introduction of NCCL2 (inter-node capable) library, we have enhanced our performance results by adding comparisons for the proposed MPI_Bcast designs as well as ncclBroadcast (NCCL2) design. We report up to 10 × better performance for small and medium message sizes and comparable performance for large message sizes. We also observed that the TA-PC design is up to 50% better than the PC design for MPI_Bcast to 64 GPUs. Furthermore, we provide application level performance comparison using a CUDA-Aware version of CNTK called CA-CNTK. 
The proposed MPI_Bcast designs provide up to 7% improvement over MPI+NCCL based solutions for data parallel training of the VGG network on 128 GPUs. We present our performance evaluation on three GPU clusters with diverse characteristics: (1) KESCH; a dense multi-GPU system with 8 K80 GPU cards per node, (2) RI2; with a single K80 GPU card per node, and (3) Owens; with a single P100 GPU per node.
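A toy model helps show why a pipelined chain suits very large broadcast messages. The sketch below is a generic step-counting simulation, not the PC or TA-PC design itself: it assumes every link forwards exactly one chunk per step, ignores topology and latency, and uses the hypothetical name `pipelined_chain_steps`.

```python
def pipelined_chain_steps(num_ranks, num_chunks):
    """Count steps until the last rank in a chain holds every chunk.

    Assumed model: rank 0 is the root and already holds all chunks;
    in each step, every rank receives the next chunk from its
    predecessor if the predecessor is ahead of it.
    """
    received = [num_chunks] + [0] * (num_ranks - 1)
    steps = 0
    while received[-1] < num_chunks:
        snapshot = list(received)  # all transfers in a step use the old state
        for rank in range(1, num_ranks):
            if snapshot[rank - 1] > snapshot[rank]:
                received[rank] += 1
        steps += 1
    return steps


if __name__ == "__main__":
    ranks, chunks = 4, 3
    pipelined = pipelined_chain_steps(ranks, chunks)
    # Without pipelining, each hop forwards the whole message (all chunks)
    # before the next hop can start.
    unpipelined = (ranks - 1) * chunks
    print(pipelined, unpipelined)
```

Under this model the pipelined chain finishes in `num_ranks + num_chunks - 2` chunk-steps, while forwarding the whole message hop by hop costs `(num_ranks - 1) * num_chunks`, so the gap widens as messages grow.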
TL;DR: This article proposes the EagerMap algorithm to determine task mappings, which is based on a greedy heuristic to match application communication patterns to hardware hierarchies and which can also consider the task load.
Abstract: Communication between tasks and load imbalance have been identified as a major challenge for the performance and energy efficiency of parallel applications. A common way to improve communication is to increase its locality, that is, to reduce the distances of data transfers, prioritizing the usage of faster and more efficient local interconnections over remote ones. Regarding load imbalance, cores should execute a similar amount of work. An important problem to be solved in this context is how to determine an optimized mapping of tasks to cluster nodes and cores that increases the overall locality and load balancing. In this article, we propose the EagerMap algorithm to determine task mappings, which is based on a greedy heuristic to match application communication patterns to hardware hierarchies and which can also consider the task load. Compared to previous algorithms, EagerMap is faster, scales better, and supports more types of computer systems, while maintaining the same or better quality of the determined task mapping. EagerMap is therefore an interesting choice for task mapping on a variety of modern parallel architectures.
TL;DR: This article focuses on the Clairvoyant Dynamic Bin Packing problem, and provides a matching lower bound of Ω (√ log μ) on the competitive ratio of any online algorithm, thus closing the gap with regard to this problem.
Abstract: In this article, we focus on the Clairvoyant Dynamic Bin Packing (DBP) problem, which extends the Classical Online Bin Packing problem in that items arrive and depart over time and the departure time of an item is known upon its arrival. The problem naturally arises when handling cloud-based networks. We focus specifically on the MinUsageTime objective function, which aims to minimize the overall usage time of all bins that are opened during the packing process. Earlier work has shown a O(log μ / log log μ) upper bound on the algorithm’s competitiveness, where μ is defined as the ratio between the maximal and minimal durations of all items. We improve the upper bound by giving a O(√ log μ)-competitive algorithm. We then provide a matching lower bound of Ω (√ log μ) on the competitive ratio of any online algorithm, thus closing the gap with regard to this problem. We then focus on what we call the class of aligned inputs and give a O(log log μ)-competitive algorithm for this case, beating the lower bound of the general case by an exponential factor. Surprisingly enough, the analysis of our algorithm that we present is closely related to various properties of binary strings.
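The MinUsageTime objective and the ratio μ defined above can be made concrete with a small worked example. The sketch below packs items with plain First Fit, which is a stand-in chosen for brevity and not the competitive algorithm of the article; the item format `(arrival, departure, size)` and the function names are assumptions.

```python
def union_length(intervals):
    """Total length covered by a collection of [start, end) intervals."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def first_fit_usage_time(items, capacity=1.0):
    """Pack items (arrival, departure, size) with First Fit.

    Returns the MinUsageTime objective value: the summed time during
    which each opened bin holds at least one item.
    """
    bins = []
    for arrival, departure, size in sorted(items):
        for contents in bins:
            # Items already in a bin can only depart, so the load at the
            # arrival time is the largest load the new item will meet.
            load = sum(s for _, d, s in contents if d > arrival)
            if load + size <= capacity:
                contents.append((arrival, departure, size))
                break
        else:
            bins.append([(arrival, departure, size)])
    return sum(union_length([(a, d) for a, d, _ in contents])
               for contents in bins)


def duration_ratio(items):
    """mu: ratio between the maximal and minimal item durations."""
    durations = [d - a for a, d, _ in items]
    return max(durations) / min(durations)


if __name__ == "__main__":
    demo = [(0, 10, 0.5), (1, 2, 0.5), (1, 3, 0.5), (5, 6, 0.5)]
    print(first_fit_usage_time(demo), duration_ratio(demo))
```

In the demo, the long item keeps the first bin open for 10 time units and one short item forces a second bin open for 2 units, giving a total usage time of 12 with μ = 10.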
TL;DR: Aggregating several CPU cores to execute larger parallel tasks improves load balancing between CPUs and accelerators; experimental results show that this approach outperforms state-of-the-art implementations on two architectures: a modern heterogeneous machine and the Intel Xeon Phi Knights Landing.
Abstract: Hybrid computing platforms are now commonplace, featuring a large number of CPU cores and accelerators. This trend makes balancing computations between these heterogeneous resources performance critical. In this paper we propose aggregating several CPU cores in order to execute larger parallel tasks and improve load balancing between CPUs and accelerators. Additionally, we present our approach to exploit internal parallelism within tasks, by combining two runtime system schedulers: a global runtime system to schedule the main task graph and a local one to cope with internal task parallelism. We demonstrate the relevance of our approach in the context of the dense Cholesky factorization kernel implemented on top of the StarPU task-based runtime system. We present experimental results showing that our solution outperforms state-of-the-art implementations on two architectures: a modern heterogeneous machine and the Intel Xeon Phi Knights Landing.
TL;DR: An optimization problem that arises in the energy management of data centers and, more generally, multiprocessor environments is formulated and an optimal solution, minimizing energy consumption, can be computed in polynomial time by a combinatorial algorithm.
Abstract: We formulate and study an optimization problem that arises in the energy management of data centers and, more generally, multiprocessor environments. Data centers host a large number of heterogeneous servers. Each server has an active state and several standby/sleep states with individual power consumption rates. The demand for computing capacity varies over time. Idle servers may be transitioned to low-power modes so as to rightsize the pool of active servers. The goal is to find a state transition schedule for the servers that minimizes the total energy consumed. On a small scale, the same problem arises in multicore architectures with heterogeneous processors on a chip. One has to determine active and idle periods for the cores so as to guarantee a certain service and minimize the consumed energy. For this power/capacity management problem, we develop two main results. We use the terminology of the data center setting. First, we investigate the scenario that each server has two states: an active state and a sleep state. We show that an optimal solution, minimizing energy consumption, can be computed in polynomial time by a combinatorial algorithm. The algorithm resorts to a single-commodity minimum-cost flow computation. Second, we study the general scenario that each server has an active state and multiple standby/sleep states. We devise a τ-approximation algorithm that relies on a two-commodity minimum-cost flow computation. Here, τ is the number of different server types. A data center has a large collection of machines but only a relatively small number of different server architectures. Moreover, in the optimization, one can assign servers with comparable energy consumption to the same class. Technically, both of our algorithms involve nontrivial flow modification procedures. In particular, given a fractional two-commodity flow, our algorithm executes advanced rounding and flow packing routines.
TL;DR: In this paper, a client-side, log-assisted, straggler-aware I/O scheduler architecture is proposed to mitigate the impact of storage server stragglers.
Abstract: Object-based parallel file systems have emerged as promising storage solutions for high-performance computing (HPC) systems. Although object storage provides a flexible interface, scheduling highly concurrent I/O requests that access a large number of objects remains a challenging problem, especially when stragglers (storage servers that are significantly slower than others) exist in the system. An efficient I/O scheduler needs to avoid possible stragglers to achieve low latency and high throughput. In this paper, we introduce a log-assisted, straggler-aware I/O scheduling approach to mitigate the impact of storage server stragglers. The contribution of this study is threefold. First, we introduce a client-side, log-assisted, straggler-aware I/O scheduler architecture to tackle the storage straggler issue in HPC systems. Second, we present three scheduling algorithms, based on this architecture, that make efficient decisions for scheduling I/Os while avoiding stragglers. Third, we evaluate the proposed I/O scheduler using simulations, and the simulation results confirm the promise of the newly introduced straggler-aware I/O scheduler.
TL;DR: This work shows how to take advantage of the periodic nature of HPC applications to develop efficient periodic scheduling strategies for their I/O transfers, and proves that this scheduler has the advantage of being de-centralized and thus overcoming the overhead of online schedulers, but also that it performs better than the other solutions.
Abstract: With the ever-growing need for data in HPC applications, congestion at the I/O level becomes critical in supercomputers. Architectural enhancements such as burst buffers and pre-fetching are added to machines but are not sufficient to prevent congestion. Recent online I/O scheduling strategies have been put in place, but they add an additional congestion point and overheads in the computation of applications. In this work, we show how to take advantage of the periodic nature of HPC applications to develop efficient periodic scheduling strategies for their I/O transfers. Our strategy computes once, during the job scheduling phase, a pattern that defines the I/O behavior for each application, after which the applications run independently, performing their I/O at the specified times. Our strategy limits the amount of congestion at the I/O node level and can be easily integrated into current job schedulers. We validate this model through extensive simulations and experiments on an HPC cluster by comparing it to state-of-the-art online solutions, showing that not only does our scheduler have the advantage of being de-centralized and thus overcoming the overhead of online schedulers, but also that it performs better than the other solutions, improving the application dilation up to 16% and the maximum system efficiency up to 18%.
TL;DR: In this article, a simple and distributed algorithm for graph clustering is proposed, which can be implemented in a poly-logarithmic number of rounds and recovers a partition of the graph close to optimal.
Abstract: Graph clustering is a fundamental computational problem with a number of applications in algorithm design, machine learning, data mining, and analysis of social networks. Over the past decades, researchers have proposed a number of algorithmic design methods for graph clustering. Most of these methods, however, are based on complicated spectral techniques or convex optimisation and cannot be directly applied for clustering many networks that occur in practice, whose information is often collected on different sites. Designing a simple and distributed clustering algorithm is of great interest and has comprehensive applications for processing big datasets. In this article, we present a simple and distributed algorithm for graph clustering: For a wide class of graphs that are characterised by a strong cluster-structure, our algorithm finishes in a poly-logarithmic number of rounds and recovers a partition of the graph close to optimal. One of the main procedures behind our algorithm is a sampling scheme that, given a dense graph as input, produces a sparse subgraph that provably preserves the cluster-structure of the input. Compared with previous sparsification algorithms that require Laplacian solvers or involve combinatorial constructions, this procedure is easy to implement in a distributed setting and runs fast in practice.
TL;DR: In this article, the authors describe a new approach to calculate the Custom Correlation Coefficient (CCC) between Single Nucleotide Polymorphisms (SNPs) across a population, suitable for parallel systems equipped with graphics processing units (GPUs) or Intel Xeon Phi processors.
Abstract: The massive quantities of genomic data being made available through gene sequencing techniques are enabling breakthroughs in genomic science in many areas such as medical advances in the diagnosis and treatment of diseases. Analyzing this data, however, is a computational challenge insofar as the computational costs of the relevant algorithms can grow with quadratic, cubic or higher complexity, leading to the need for leadership scale computing. In this paper we describe a new approach to calculations of the Custom Correlation Coefficient (CCC) between Single Nucleotide Polymorphisms (SNPs) across a population, suitable for parallel systems equipped with graphics processing units (GPUs) or Intel Xeon Phi processors. We describe the mapping of the algorithms to accelerated processors, techniques used for eliminating redundant calculations due to symmetries, and strategies for efficient mapping of the calculations to many-node parallel systems. Results are presented demonstrating high per-node performance and near-ideal parallel scalability with rates of more than nine quadrillion (9 × 10^15) elementwise comparisons achieved per second with the latest optimized code on the ORNL Titan system, this being orders of magnitude faster than rates achieved using other codes and platforms as reported in the literature. Also it is estimated that as many as 90 quadrillion (90 × 10^15) comparisons per second may be achievable on the upcoming ORNL Summit system, an additional 10X performance increase. In a companion paper we describe corresponding techniques applied to calculations of the Proportional Similarity metric for comparative genomics applications.
TL;DR: This work uses OpenMP directives to target hardware accelerators (GPUs) on Summit, a newly deployed supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), demonstrating simplified access to GPU devices for users of the authors' astrophysics code GenASiS and useful speedup on a sample fluid dynamics problem.
Abstract: We use OpenMP to target hardware accelerators (GPUs) on Summit, a newly deployed supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), demonstrating simplified access to GPU devices for users of our astrophysics code GenASiS and useful speedup on a sample fluid dynamics problem. We modify our workhorse class for data storage to include members and methods that significantly streamline the persistent allocation of and association to GPU memory. Users offload computational kernels with OpenMP target directives that are rather similar to constructs already familiar from multi-core parallelization. In this initial example we ask, “With a given number of Summit nodes, how fast can we compute with and without GPUs?”, and find total wall time speedups of ∼ 12X. We also find reasonable weak scaling up to 8000 GPUs (1334 Summit nodes). We make available the source code from this work at https://github.com/GenASiS/GenASiS_Basics .
TL;DR: In this article, the authors evaluated the use of MPI+OpenMP tasking on top of this framework and proved that this approach is able to achieve a good scaling even when computing a relatively low workload (number of neurons) per node.
Abstract: The simulation of the behavior of the human brain is one of the most ambitious challenges today, with no end of important applications. We can find many different initiatives in the USA, Europe and Japan which attempt to achieve such a challenging target. In this work, we focus on the most important European initiative (the Human Brain Project) and on one of the models developed in this project. This tool simulates the spikes triggered in a neural network by computing the voltage capacitance on the neurons’ morphology, being one of the most precise simulators today. In the present work, we have evaluated the use of MPI+OpenMP tasking on top of this framework. We prove that this approach is able to achieve a good scaling even when computing a relatively low workload (number of neurons) per node. One of our targets consists of not only achieving a highly scalable implementation, but also developing a tool with a high degree of abstraction without losing control and performance by using MPI+OpenMP tasking. The main motivation of this work is the evaluation of this cutting-edge simulation on multi-morphology neural networks. The simulation of a high number of neurons, which are completely different among them, is an important challenge. In fact, in the multi-morphology simulations, we find an important imbalance between the nodes, mainly due to the differences in the neurons, which causes an important under-utilization of the available resources. In this work, we present and evaluate mechanisms to deal with this and considerably reduce the time of this kind of simulation.
TL;DR: This work designs, analyzes, and compares the performance of a set of heuristic techniques to maximize system performance when scheduling dynamically arriving parallel tasks onto a high performance computing (HPC) system that is oversubscribed and energy constrained.
Abstract: The worth of completing parallel tasks is modeled using utility functions, which monotonically decrease with time and represent the importance and urgency of a task. These functions define the utility earned by a task at the time of its completion. The performance of a computing system is measured as the total utility earned by all completed tasks over some interval of time (e.g., 24 h). We have designed, analyzed, and compared the performance of a set of heuristic techniques to maximize system performance when scheduling dynamically arriving parallel tasks onto a high performance computing (HPC) system that is oversubscribed and energy constrained. We consider six utility-aware heuristics and four existing heuristics for comparison. A new concept of temporary place-holders is compared with scheduling using permanent reservations. We also present a novel energy filtering technique that constrains the maximum energy-per-resource used by each task. We conducted a simulation study to evaluate the performance of these heuristics and techniques in multiple energy-constrained oversubscribed HPC environments. We also conducted an experiment with a subset of the heuristics on a physical testbed system for one scheduling scenario. We demonstrate that our proposed utility-aware resource management heuristics are able to significantly outperform existing techniques.
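The performance measure described above (total utility earned by completed tasks) can be illustrated with a small sketch. The abstract does not give the shape of the utility functions, so the flat-then-linear decay below is purely an assumption, as are the names `linear_utility` and `total_utility`.

```python
def linear_utility(max_utility, start_decay, zero_at):
    """Build a utility function of completion time (hypothetical shape).

    The value stays at max_utility until start_decay, then decays
    linearly and reaches zero at zero_at, so it never increases with time.
    """
    def utility(t):
        if t <= start_decay:
            return max_utility
        if t >= zero_at:
            return 0.0
        return max_utility * (zero_at - t) / (zero_at - start_decay)
    return utility


def total_utility(completed):
    """Sum the utility earned over (utility_function, completion_time) pairs."""
    return sum(u(t) for u, t in completed)
```

With `u = linear_utility(10.0, 2.0, 6.0)`, a task finishing at time 1 earns 10.0, one finishing at time 4 earns 5.0, and one finishing at time 7 earns nothing, for a system total of 15.0.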
TL;DR: A novel architecture, called BARAN, that can either improve on-chip network performance or reduce its power consumption (depending on the specific implementation chosen), not both at the same time, when virtual channels are underutilized is proposed.
Abstract: Virtual channels are employed to improve the throughput under high traffic loads in Networks-on-Chips (NoCs). However, they can impose non-negligible overheads on performance by prolonging clock cycle time, especially under low traffic loads where the impact of virtual channels on performance is trivial. In this article, we propose a novel architecture, called BARAN, that can either improve on-chip network performance or reduce its power consumption (depending on the specific implementation chosen), not both at the same time, when virtual channels are underutilized; that is, the average number of virtual channel allocation requests per cycle is lower than the number of total virtual channels. We also introduce a reconfigurable arbitration logic within the BARAN architecture that can be configured to have multiple latencies and, hence, multiple slack times. The increased slack times are then used to reduce the supply voltage of the routers or increase their clock frequency in order to reduce power consumption or improve the performance of the whole NoC system. The power-centric design of BARAN reduces NoC power consumption by 43.4% and 40.6% under CMP and GPU workloads, on average, respectively, compared to a baseline architecture while imposing negligible area and performance overheads. The performance-centric design of BARAN reduces the average packet latency by 45.4% and 42.1%, on average, under CMP and GPU workloads, respectively, compared to the baseline architecture while increasing power consumption by 39.7% and 43.7%, on average. Moreover, the performance-centric BARAN postpones the network saturation rate by 11.5% under uniform random traffic compared to the baseline architecture.
TL;DR: This paper covers the design, prototype implementation of LibPNBC (based on LibNBC), and MPI-4 standardization status of persistent nonblocking collective operations, and provides early performance results, using a modified version of NBCBench and an example application illustrating the potential performance enhancements for such operations.
Abstract: Advantages of nonblocking collective communication in MPI have been established over the past quarter century, even predating MPI-1. For regular computations with fixed communication patterns, significant additional optimizations can be revealed through the use of persistence (planned transfers) not currently available in the MPI-3 API except for a limited form of point-to-point persistence (aka half-channels) standardized since MPI-1. This paper covers the design, prototype implementation of LibPNBC (based on LibNBC), and MPI-4 standardization status of persistent nonblocking collective operations. We provide early performance results, using a modified version of NBCBench and an example application (based on 3D conjugate gradient) illustrating the potential performance enhancements for such operations. Persistent operations enable MPI implementations to make intelligent choices about algorithm and resource utilization once and amortize this decision cost across many uses in a long-running program. Evidence that this approach is of value is provided. As with non-persistent nonblocking collective operations, strong progress and blocking completion notification are jointly needed to maximize the benefit of such operations (e.g., to support overlap of communication with computation and/or other communication). Further enhancement of the current reference implementation, as well as additional opportunities to enhance performance through the application of these new APIs, comprise future work.
TL;DR: This work presents a numerical stability analysis that describes and quantifies the impact of local rounding error propagation on the maximal attainable accuracy of the multi-term recurrences in the preconditioned pipelined BiCGStab method.
Abstract: Pipelined Krylov subspace methods avoid communication latency by reducing the number of global synchronization bottlenecks and by hiding global communication behind useful computational work. In exact arithmetic, pipelined Krylov subspace algorithms are equivalent to classic Krylov subspace methods and generate identical series of iterates. However, as a consequence of the reformulation of the algorithm to improve parallelism, pipelined methods may suffer from severely reduced attainable accuracy in a practical finite precision setting. This work presents a numerical stability analysis that describes and quantifies the impact of local rounding error propagation on the maximal attainable accuracy of the multi-term recurrences in the preconditioned pipelined BiCGStab method. Theoretical expressions for the gaps between the true and computed residual as well as other auxiliary variables used in the algorithm are derived, and the elementary dependencies between the gaps on the various recursively computed vector variables are analyzed. The norms of the corresponding propagation matrices and vectors provide insight into the possible amplification of local rounding errors throughout the algorithm. Stability of the pipelined BiCGStab method is compared numerically to that of pipelined CG on a symmetric benchmark problem. Furthermore, numerical evidence supporting the effectiveness of employing a residual replacement type strategy to improve the maximal attainable accuracy for the pipelined BiCGStab method is provided.