TL;DR: The recursive algebraic coloring engine (RACE) is a novel coloring algorithm and open-source library implementation that eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead.
Abstract: The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today’s multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation that eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially. By means of a parameterized roofline model, we analyze the SymmSpMV performance in detail and discuss outliers. While we focus on SymmSpMV in this article, our algorithm and software are applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring.
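To illustrate the conflicting updates mentioned in the abstract, here is a minimal serial sketch of SymmSpMV in Python. It assumes the matrix is stored as its upper triangle (diagonal included) in CSR arrays; the function name `symm_spmv` and the array names are illustrative and are not taken from the RACE library.

```python
def symm_spmv(n, row_ptr, col_idx, vals, x):
    """Compute y = A @ x where only the upper triangle of symmetric A is stored in CSR."""
    y = [0.0] * n
    for row in range(n):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            col = col_idx[k]
            a = vals[k]
            # Contribution of the stored entry A[row, col].
            y[row] += a * x[col]
            if col != row:
                # Contribution of the mirrored entry A[col, row]. In a run
                # parallelized over rows, this scattered write to y[col]
                # can collide with a write from another thread.
                y[col] += a * x[row]
    return y


# 3x3 symmetric example: A = [[2, 1, 0], [1, 3, 4], [0, 4, 5]]
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 1, 2, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
y = symm_spmv(3, row_ptr, col_idx, vals, [1.0, 1.0, 1.0])
```

Rows processed concurrently must not write to the same entries of `y`; this is the kind of dependency the abstract says distance-k coloring resolves.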
TL;DR: A compressed sparse row (CSR) format suitable for unbalanced matrices is presented, and a load-balancing kernel for the coordinate (COO) matrix format is provided and extended to a hybrid algorithm that stores part of the matrix in the SIMD-friendly Ellpack (ELL) format.
Abstract: Efficient processing of irregular matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for irregular matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high-performance sparse matrix-vector product (SpMV) kernels for NVIDIA Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in the SIMD-friendly Ellpack (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the SuiteSparse Matrix Collection, we compare the performance against SpMV kernels provided by NVIDIA’s cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
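The hybrid ELL/COO idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ELL width is passed in as a hypothetical parameter `ell_width`, whereas the abstract derives the split from an analysis of the nonzeros-per-row distribution, and the function names are invented for this example.

```python
def split_hybrid(rows, ell_width):
    """Split a sparse matrix into an ELL part and a COO part.

    rows: list of rows, each a list of (col, value) pairs.
    ell_width: entries per row kept in ELL (hypothetical parameter).
    """
    ell_cols, ell_vals, coo = [], [], []
    for r, entries in enumerate(rows):
        head, tail = entries[:ell_width], entries[ell_width:]
        pad = ell_width - len(head)
        # Pad short rows with explicit zeros so every ELL row has the same length.
        ell_cols.append([c for c, _ in head] + [0] * pad)
        ell_vals.append([v for _, v in head] + [0.0] * pad)
        # Overflow entries of long rows go to the COO part.
        coo.extend((r, c, v) for c, v in tail)
    return ell_cols, ell_vals, coo


def hybrid_spmv(ell_cols, ell_vals, coo, x):
    """Compute y = A @ x from the ELL part, then add the COO contributions."""
    y = [sum(v * x[c] for c, v in zip(cols, vals))
         for cols, vals in zip(ell_cols, ell_vals)]
    for r, c, v in coo:
        y[r] += v * x[c]
    return y


rows = [
    [(0, 1.0)],
    [(0, 2.0), (1, 3.0), (2, 4.0), (3, 5.0)],
    [(2, 6.0)],
    [(1, 7.0), (3, 8.0)],
]
ell_cols, ell_vals, coo = split_hybrid(rows, ell_width=2)
y = hybrid_spmv(ell_cols, ell_vals, coo, [1.0, 1.0, 1.0, 1.0])
```

A small `ell_width` keeps padding low but pushes more entries into the COO part; a large one does the opposite. That is the trade-off between padding and divergence that the abstract describes.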
TL;DR: GPOP induces locality in memory accesses by increasing granularity of execution to vertex subsets called ’partitions’, thereby dramatically improving the cache performance of a variety of graph algorithms, and achieves high scalability by enabling completely lock and atomic free computation.
Abstract: The past decade has seen the development of many shared-memory graph processing frameworks intended to reduce the effort of developing high-performance parallel applications. However, many of these frameworks, based on Vertex-centric or Edge-centric paradigms, suffer from several issues, such as poor cache utilization, irregular memory accesses, heavy use of synchronization primitives, or theoretical inefficiency, that deteriorate overall performance and scalability. Recently, we proposed a cache and memory-efficient partition-centric paradigm for computing PageRank [26]. In this article, we generalize this approach to develop a novel Graph Processing Over Parts (GPOP) framework that is cache efficient, scalable, and work efficient. GPOP induces locality in memory accesses by increasing granularity of execution to vertex subsets called “parts,” thereby dramatically improving the cache performance of a variety of graph algorithms. It achieves high scalability by enabling completely lock-free and atomic-free computation. GPOP’s built-in analytical performance model enables it to use a hybrid of source and part-centric communication modes in a way that ensures work efficiency in each iteration, while simultaneously boosting high bandwidth sequential memory accesses. Finally, the GPOP framework is designed with programmability in mind. It completely abstracts away underlying parallelism and programming model details from the user and provides an easy-to-program set of APIs with the ability to selectively continue the active vertex set across iterations. Such functionality is useful for many graph algorithms but not intrinsically supported by the current frameworks. We extensively evaluate the performance of GPOP for a variety of graph algorithms, using several large datasets. We observe that GPOP incurs up to 9×, 6.8×, and 5.5× fewer L2 cache misses compared to Ligra, GraphMat, and Galois, respectively.
In terms of execution time, GPOP is up to 19×, 9.3×, and 3.6× faster than Ligra, GraphMat, and Galois, respectively.
TL;DR: The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular applications and BSP applications to nontrivial irregular algorithms, and the performance of the implementations is compared.
Abstract: Nodes with multiple GPUs are becoming the platform of choice for high-performance computing. However, most applications are written using bulk-synchronous programming models, which may not be optimal for irregular algorithms that benefit from low-latency, asynchronous communication. This article proposes constructs for asynchronous multi-GPU programming and describes their implementation in a thin runtime environment called Groute. Groute also implements common collective operations and distributed work-lists, enabling the development of irregular applications without substantial programming effort. We demonstrate that this approach achieves state-of-the-art performance and exhibits strong scaling for a suite of irregular applications on eight-GPU and heterogeneous systems, yielding over 7× speedup for some algorithms.
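TL;DR: A new accuracy-preserving parallel algorithm employing compact schemes is presented for direct numerical simulation of the Navier-Stokes equations, where accuracy preservation means that the parallel compact scheme attains the same level of accuracy as the sequential code using the same scheme.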
Abstract: A new accuracy-preserving parallel algorithm employing compact schemes is presented for direct numerical simulation of the Navier-Stokes equations. Here the connotation of accuracy preservation is having the same level of accuracy obtained by the proposed parallel compact scheme, as the sequential code with the same compact scheme. Additional loss of accuracy in parallel compact schemes arises due to necessary boundary closures at sub-domain boundaries. An attempt to circumvent this has been done in the past by the use of Schwarz domain decomposition and compact filters in “A new compact scheme for parallel computing using domain decomposition,” J. Comput. Phys. 220, 2 (2007), 654--677, where a large number of overlap points was necessary to reduce error. A parallel compact scheme with staggered grids has been used to report direct numerical simulation of transition and turbulence by the Schwarz domain decomposition method. In the present research, we propose a new parallel algorithm with two benefits. First, the number of overlap points is reduced to a single common boundary point between any two neighboring sub-domains, thereby saving the number of points used, with resultant speed-up. Second, with a proper design, errors arising due to sub-domain boundary closure schemes are reduced to a user designed error tolerance, bringing the new parallel scheme on par with sequential computing. Error reduction is achieved by using global spectral analysis, introduced in “Analysis of central and upwind compact schemes,” J. Comput. Phys. 192, 2, (2003) 677--694, which analyzes any discrete computing method in the full domain integrally. 
The design of the parallel compact scheme is explained, followed by a demonstration of the accuracy of the method by solving benchmark flows: (1) periodic two-dimensional Taylor-Green vortex problem; (2) flow inside two-dimensional square lid-driven cavity (LDC) at high Reynolds number; and (3) flow inside a non-periodic three-dimensional cubic LDC with the staggered grid arrangement.
TL;DR: In this paper, a shared-memory parallel algorithm for maximal clique enumeration (MCE) is presented, which is provably work-efficient relative to a state-of-the-art sequential algorithm.
Abstract: Maximal Clique Enumeration (MCE) is a fundamental graph mining problem and is useful as a primitive in identifying dense structures in a graph. Due to the high computational cost of MCE, parallel methods are imperative for dealing with large graphs. We present shared-memory parallel algorithms for MCE, with the following properties: (1) the parallel algorithms are provably work-efficient relative to a state-of-the-art sequential algorithm, (2) the algorithms have a provably small parallel depth, showing they can scale to a large number of processors, and (3) our implementations on a multicore machine show good speedup and scaling behavior with increasing number of cores and are substantially faster than prior shared-memory parallel algorithms for MCE; for instance, on certain input graphs, while prior works either ran out of memory or did not complete in five hours, our implementation finished within a minute using 32 cores. We also present work-efficient parallel algorithms for maintaining the set of all maximal cliques in a dynamic graph that is changing through the addition of edges.
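For readers unfamiliar with MCE, the following is a minimal sequential sketch in Python. The abstract does not name the sequential baseline it builds on; Bron-Kerbosch with pivoting is used here as an assumption because it is a common choice, and the function name `maximal_cliques` is illustrative. The sketch shows only the enumeration itself, not the parallel algorithms of the paper.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques with Bron-Kerbosch and pivoting.

    adj: dict mapping each vertex to the set of its neighbours.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored.
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates,
        # so that only non-neighbours of the pivot need to be branched on.
        pivot = max(p | x, key=lambda u: len(p & adj[u]))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Triangle 0-1-2 with a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = maximal_cliques(adj)
```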
TL;DR: KiWi is the first atomic KV-map to efficiently support simultaneous large scans and real-time access, treating scans as first-class citizens and organizing the data structure around them.
Abstract: We present KiWi, the first atomic KV-map to efficiently support simultaneous large scans and real-time access. The key to achieving this is treating scans as first-class citizens and organizing the data structure around them. KiWi provides wait-free scans, whereas its put operations are lightweight and lock-free. It optimizes memory management jointly with data structure access. We implement KiWi and compare it to state-of-the-art solutions. Compared to other KV-maps providing atomic scans, KiWi performs either long scans or concurrent puts an order of magnitude faster. Its scans are twice as fast as non-atomic ones implemented via iterators in the Java skiplist.
TL;DR: Modifications are described to the basic lock-free algorithm so that the amortized complexity of any operation in the modified algorithm can be bounded by the sum of the tree height and the point contention to within a constant factor while preserving the other desirable features of the algorithm.
Abstract: We present a lock-free algorithm for concurrent manipulation of a binary search tree (BST) in an asynchronous shared memory system that supports search, insert, and delete operations. In addition to read and write instructions, our algorithm uses (single-word) compare-and-swap (CAS) and bit-test-and-set (BTS) read-modify-write (RMW) instructions, both of which are commonly supported by many modern processors including Intel 64 and AMD64. In contrast to most of the existing concurrent algorithms for a binary search tree, our algorithm is edge-based rather than node-based. When compared to other concurrent algorithms for a binary search tree, modify (insert and delete) operations in our algorithm (a) work on a smaller section of the tree, (b) execute fewer RMW instructions, or (c) use fewer dynamically allocated objects. In our experiments, our lock-free algorithm significantly outperformed all other algorithms for a concurrent binary search tree especially when the contention was high. We also describe modifications to our basic lock-free algorithm so that the amortized complexity of any operation in the modified algorithm can be bounded by the sum of the tree height and the point contention to within a constant factor while preserving the other desirable features of our algorithm.
TL;DR: The proposed parallel implementation of MOPSO using a master-slave model provides up to 157 times speedup compared to the corresponding CPU implementation, which can be widely used in real world optimization problems.
Abstract: During the last couple of years, parallel MOPSO (Multi-objective Particle Swarm Optimization) with two or more objectives has gained a lot of attention in the literature on GPU computing. A number of implementations have been published for MOPSO on a GPU. However, none of them have been able to capture good enough Pareto fronts fast. In addition, the authors have pointed out their limitations in various aspects such as archive handling, picking up fewer nondominated solutions and so on. Previous literature also lacks evaluation of its MOPSO implementation with large swarms and high dimensional problems. This paper presents a faster implementation of parallel MOPSO on a GPU based on the CUDA architecture. We achieved our faster implementation by using coalescing memory access, a fast pseudorandom number generator, Thrust library, CUB library, an atomic function, parallel archiving and so on. The proposed parallel implementation of MOPSO using a master-slave model provides up to 157 times speedup compared to the corresponding CPU implementation. As the proposed implementation performs very highly even with increased size of problem dimensionality and swarm population, it can be widely used in real world optimization problems.
TL;DR: A novel dynamic power management strategy for the value-based algorithms that maximizes the system productivity, resource utilization, and job completion rate by utilizing application power-performance models to reallocate power from running jobs to newly arrived jobs.
Abstract: High performance computing (HPC) systems are confronting the challenge of improving their productivity under a system-wide power constraint in the exascale era. To measure the productivity of an HPC job, researchers have proposed to assign a monotonically decreasing time-dependent value function, called job-value, to that job. These job-value functions are used by the value-based scheduling algorithms to maximize the system productivity, where system productivity is the accumulation of job-value for the completed jobs. In this study, we first show that the relative performance of the competing state-of-the-art static power allocation strategies changes with the level of the power constraint when applied to the value-based algorithms. We then investigate the limitations of these static strategies by relating the job completion rate to the resource utilization, and expose that there is a non-negligible amount of unused resources for the scheduler to utilize. Even though the system is oversubscribed, these unused resources are insufficient to schedule new high-value jobs. Based on this observation, we propose a novel dynamic power management strategy for the value-based algorithms. Our dynamic allocation policy maximizes the system productivity, resource utilization, and job completion rate by utilizing application power-performance models to reallocate power from running jobs to newly arrived jobs. We simulate a large-scale system that uses job arrival traces from a real HPC system. We demonstrate that the dynamic variant of each value-based algorithm earns up to 16% higher productivity and completes 13% more jobs compared to its static variants when power becomes a highly constrained resource in the system.
TL;DR: The Emu Chick prototype implements migratory memory-side processing in a novel hardware system, which moves lightweight thread contexts to near-memory cores before the beginning of each remote memory read.
Abstract: The Emu Chick prototype implements migratory memory-side processing in a novel hardware system. Rather than transferring large amounts of data across the system interconnect, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each remote memory read. Previous work has characterized the performance of the Chick prototype in terms of memory bandwidth and programming differences from more typical, non-migratory platforms, but there has not yet been an analysis of algorithms on this system. This work evaluates irregular algorithms that could benefit from the lightweight, memory-side processing of the Chick and demonstrates techniques and optimization strategies for achieving performance in sparse matrix-vector multiply operation (SpMV), breadth-first search (BFS), and graph alignment across up to eight distributed nodes encompassing 64 nodelets in the Chick system. We also define and justify relative metrics to compare prototype FPGA-based hardware with established ASIC architectures. The Chick currently supports up to 68x scaling for graph alignment, 80 MTEPS for BFS on balanced graphs, and 50% of measured STREAM bandwidth for SpMV.
TL;DR: A communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges is described.
Abstract: We describe the application of a communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges. Our variable-precision strategy, using a customized precision format based on mantissa segmentation (CPMS), abandons the IEEE 754 single- and double-precision number representation formats employed in the standard implementation of PageRank, and instead handles the data in memory using a customized floating-point format. The customized format enables fast data access in different accuracy, prevents overflow/underflow by preserving the IEEE 754 double-precision exponent, and efficiently avoids data duplication, since all bits of the original IEEE 754 double-precision mantissa are preserved in memory, but re-organized for efficient reduced precision access. With this approach, the truncated values (omitting significand bits), as well as the original IEEE double-precision values, can be retrieved without duplicating the data in different formats. Our numerical experiments on an NVIDIA V100 GPU (Volta architecture) and a server equipped with two Intel Xeon Platinum 8168 CPUs (48 cores in total) expose that, compared with a standard IEEE double-precision implementation, the CPMS-based PageRank completes about 10% faster if high-accuracy output is needed, and about 30% faster if reduced output accuracy is acceptable.
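The idea of keeping all bits of a double in memory while allowing reduced-precision reads can be illustrated with a small Python sketch. The split into two 32-bit segments and the names `split_double` and `join_segments` are assumptions made for this example; the actual CPMS layout in the paper may use a different segmentation.

```python
import struct


def split_double(value):
    """Split an IEEE 754 double into two 32-bit segments.

    The head holds the sign, the full 11-bit exponent, and the top 20
    mantissa bits; the tail holds the remaining 32 mantissa bits.
    """
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits >> 32, bits & 0xFFFFFFFF


def join_segments(head, tail=0):
    """Rebuild a double; omitting the tail gives a truncated-mantissa value."""
    return struct.unpack("<d", struct.pack("<Q", (head << 32) | tail))[0]


head, tail = split_double(0.1)
full = join_segments(head, tail)   # reads both segments
reduced = join_segments(head)      # reads only the head segment
```

Because the head keeps the full double-precision exponent, the reduced read has the same range as the original value and only loses low-order mantissa bits. Reading both segments returns the original double without storing a second copy.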
TL;DR: Whole Tale is a web-based, open-source platform for reproducible research supporting the creation, sharing, execution, and verification of "Tales" for the scientific research community.
Abstract: Whole Tale is a web-based, open-source platform for reproducible research supporting the creation, sharing, execution, and verification of "Tales" for the scientific research community. Tales are executable research objects that capture the code, data, and environment along with narrative and workflow information needed to re-create computational results from scientific studies. Creating reproducible research objects that enable reproducibility, transparency, and re-execution for computational experiments requiring significant compute resources or utilizing massive data is an especially challenging open problem. We describe opportunities, challenges, and solutions to facilitating reproducibility for data- and compute-intensive research, that we call "Tales at Scale," using the Whole Tale computing platform. We highlight challenges and solutions in frontend responsiveness needs, gaps in current middleware design and implementation, network restrictions, containerization, and data access. Finally, we discuss challenges in packaging computational experiment implementations for portable data-intensive Tales and outline future work.
TL;DR: An approximate solution of the maximum weight matching problem is adopted, based on a recently proposed parallel algorithm referred to as the Suitor algorithm, and it is shown to yield good-quality coarse matrices for the authors' AMG on GPUs.
Abstract: We describe main issues and design principles of an efficient implementation, tailored to recent generations of Nvidia Graphics Processing Units (GPUs), of an Algebraic MultiGrid (AMG) preconditioner previously proposed by one of the authors and already available in the open-source package BootCMatch: Bootstrap algebraic multigrid based on Compatible weighted Matching for standard CPUs. The AMG method relies on a new approach for coarsening sparse symmetric positive definite (s.p.d.) matrices, named coarsening based on compatible weighted matching. It exploits maximum weight matching in the adjacency graph of the sparse matrix, driven by the principle of compatible relaxation, providing a suitable aggregation of unknowns which goes beyond the limits of the usual heuristics applied in the current methods. We adopt an approximate solution of the maximum weight matching problem, based on a recently proposed parallel algorithm, referred to as the Suitor algorithm, and show that it allows us to obtain good quality coarse matrices for our AMG on GPUs. We exploit inherent parallelism of modern GPUs in all the kernels involving sparse matrix computations both for the setup of the preconditioner and for its application in a Krylov solver, outperforming preconditioners available in the original sequential CPU code as well as the single node Nvidia AmgX library. Results for a large set of linear systems arising from discretization of scalar and vector partial differential equations (PDEs) are discussed.
TL;DR: Numerically validated experimental results show that ROC can achieve several orders of magnitude improvement over state-of-the-art GPUs when speed, power, and size are taken into account, and it comes within approximately 90% precision of current numerical solvers.
Abstract: Due to the end of Moore’s law and Dennard scaling, we are entering a new era of processors. Computing systems are increasingly facing power and performance challenges due to both device- and circuit-related challenges with resistive and capacitive charging. Non-von Neumann architectures are needed to support future computations through innovative post-Moore’s law architectures. To enable these emerging architectures with high-performance and at ultra-low power, both parallel computation and inter-node communication on-the-chip can be supported using photons. To this end, we introduce ROC, a reconfigurable optical computer that can solve partial differential equations (PDEs). PDE solvers form the basis for many traditional simulation problems in science and engineering that are currently performed on supercomputers. Instead of solving problems iteratively, the proposed engine uses a resistive mesh architecture to solve a PDE in a single iteration (one-shot). Instead of using actual electrical circuits, the physical underlying hardware emulates such structures using a silicon-photonics mesh that splits light into separate pathways, allowing it to add or subtract optical power analogous to programmable resistors. The time to obtain the PDE solution then only depends on the time-of-flight of a photon through the programmed mesh, which can be on the order of 10’s of picoseconds given the millimeter-compact integrated photonic circuit. Numerically validated experimental results show that, over multiple configurations, ROC can achieve several orders of magnitude improvement over state-of-the-art GPUs when speed, power, and size are taken into account. Further, it comes within approximately 90% precision of current numerical solvers. As such, ROC can be a viable reconfigurable, approximate computer with the potential for more precise results when replacing silicon-photonics building blocks with nanoscale photonic lumped-elements.
TL;DR: New efficient parallel algorithms on Coarse Grained Multicomputer model (CGM) to solve the LCS problem with multiple substring-exclusion constraints, which has great importance in many fields especially in bioinformatics, are described.
Abstract: A variant of the Longest Common Subsequence (LCS) problem is the LCS problem with multiple substring-exclusion constraints (M-STR-EC-LCS), which has great importance in many fields, especially in bioinformatics. This problem consists of computing the LCS of two strings X and Y, of length n and m respectively, that excludes a set of d constraints P = {P_1, P_2, …, P_d} of total length r. Recently, Wang et al. proposed a sequential solution based on the dynamic programming technique that requires O(nmr) execution time and space. To the best of our knowledge, there are no parallel solutions for this problem. This paper describes new efficient parallel algorithms on the Coarse Grained Multicomputer (CGM) model to solve this problem. Firstly, we propose a multi-level Directed Acyclic Graph (DAG) that determines the correct evaluation order of sub-problems in order to avoid redundancy due to overlap. Secondly, we propose two CGM parallel algorithms based on our DAG. The first algorithm is based on a regular partitioning of the DAG and requires O(nmr/p) execution time with O(p) communication rounds, where p is the number of processors used. Its main drawback is the high idle time of processors: because of the dependencies between the nodes in the DAG, many processors become idle over time. The second algorithm uses an irregular partitioning of the DAG that minimizes this idle time by allowing the processors to stay active as long as possible. It requires O(nmr/p) execution time with O(kp) communication rounds, where k is a constant integer used to set up the irregular partitioning. Both algorithms require O(r|Σ|/p) preprocessing time, where |Σ| is the size of the alphabet. The experimental results show good agreement with the theoretical predictions.
TL;DR: This paper explores acceleration of DNNs using BCM on a state-of-the-art GPU, identifying the challenges posed by using BCMs, and performing both general and GPU-specific optimizations that impact the decomposition and interaction of individual operations, and the overall GPU kernel design.
Abstract: Training a Deep Neural Network (DNN) is a significant computing task since it places high demands on computing resources and memory bandwidth. Many approaches have been proposed to compress the network, while maintaining high model accuracy, reducing the computational demands associated with large-scale DNN training. One attractive approach is to leverage Block Circulant Matrices (BCM), compressing the linear transformation layers, e.g., convolutional and fully-connected layers, that heavily rely on performing General Matrix Multiplications (GEMM). By using BCMs, we can reduce the weight storage for a linear transformation layer from O(N^2) to O(N). BCMs are also more efficient in terms of computational complexity, improving algorithmic complexity from O(N^2) to O(N log N). Previous work has only evaluated DNNs using BCMs targeting FPGAs for inference. There has been little prior work that considers the potential benefits of using BCMs for accelerating DNN training on GPUs. In this paper, we explore acceleration of DNNs using BCM on a state-of-the-art GPU. First, we identify the challenges posed by using BCMs. Next, we perform both general and GPU-specific optimizations that impact: (i) the decomposition and interaction of individual operations, and (ii) the overall GPU kernel design. We modify the algorithmic steps to remove redundant computations, while maintaining mathematical integrity. We also leverage multiple GPU kernel optimizations, considering performance factors, such as occupancy, data sharing/reuse patterns, and memory coalescing. We evaluate the performance of DNN training on an NVIDIA Tesla V100, providing insights into the benefits of our proposed kernel optimizations on a state-of-the-art GPU. Based on our results, we can achieve average speedups of 1.31× and 2.79× for the convolutional layers and fully-connected layers, respectively, for AlexNet.
We can also achieve average speedups of 1.33× and 3.66× for the convolutional layers and fully-connected layers, respectively, for VGGNet-16.
TL;DR: It is proved that AIR is a backward stable algorithm and can bring up to 2-3× speedups over a mixed precision iterative refinement, depending on the characteristics of the hardware.
Abstract: The increased degree of concurrent operations by lower precision arithmetic enables high performance for iterative refinement. Most related work presents statically defined mixed precision arithmetic approaches, while adapting the level of arithmetic precision dynamically in a loop with one-bit granularity can further improve the performance. This paper presents the Arbitrary Dynamic Precision Iterative Refinement algorithm (AIR), which minimizes the total significand bit-width needed for iterative refinement. AIR detects the number of cancellation bits dynamically per iteration and uses the information to provide the least sufficient significand bit-width for the next iteration. We prove that AIR is a backward stable algorithm and can bring up to 2-3× speedups over a mixed precision iterative refinement, depending on the characteristics of the hardware. Our software demonstration shows that AIR requires only 83% of the significand bits required by mixed precision iterative refinement when solving linear systems to double precision accuracy in backward error with 32×32 standard normally distributed matrices.
TL;DR: The proposed MPF partitioning can successfully improve the strong scalability and reduce the communication cost of a block-based stencil AMR application.
Abstract: In distributed implementations of memory-bound stencil AMR applications, the inter-node communication time often represents a major performance bottleneck. Thus minimizing communication is an objective as important as maintaining a good load balance. We propose a new domain partitioning method for block-based AMR applications based on the multi-phase-field (MPF) model. The MPF model for polycrystalline growth minimizes the interfacial energy and forms a convex shape for each crystal grain. In our method, each phase of the MPF model represents a computational sub-domain of each MPI process. We apply the proposed partitioning method to a block-based AMR application for an interface capturing on multiple GPUs. We measured the strong scalability up to 256 GPUs on the TSUBAME3.0 supercomputer at Tokyo Institute of Technology. The proposed MPF partitioning can successfully improve the strong scalability and reduce the communication cost of a block-based stencil AMR application.
TL;DR: A GPU-accelerated parallel algorithm for the axial Multidimensional Assignment Problem with Decomposable Costs (MDADC), which is one of the most fundamental formulations for data association, is described.
Abstract: In this paper, we describe a GPU-accelerated parallel algorithm for the axial Multidimensional Assignment Problem with Decomposable Costs (MDADC), which is one of the most fundamental formulations for data association. MDADC is known to be NP-hard and is large-dimensioned in most realistic cases; hence, heuristic solutions with qualified optimality gaps are the best one can hope for, given the state of knowledge. The main contribution of this paper is an efficient parallelization of the Lagrangian subgradient search algorithm specifically targeted at Graphics Processing Units (GPUs) based on the NVIDIA Compute Unified Device Architecture (CUDA). A GPU-accelerated Linear Assignment Problem (LAP) solver is leveraged in concert with the Lagrangian scheme for further speed-up. We also implemented a multi-GPU variant of this algorithm, which maintains a good speedup profile when tested on problems with 31 billion variables on up to 128 GPUs.
TL;DR: A novel method of grouping paths that refines the measurement of the path similarity when grouping target paths, dynamically increases the number of benchmark paths in a group, and groups the remaining ones based on the similarity between a path and each of these benchmark paths with the purpose of a large similarity between each pair of paths in the same group.
Abstract: Genetic algorithms can be employed to automatically generate desired test data, with the advantage of freeing up manpower. For the path coverage criterion, the problem of test data generation needs to be transformed into an optimization problem before applying genetic algorithms. However, when the number of paths to be covered is large, the transformed optimization problem will be very complicated. Correspondingly, the difficulty of problem solving will be greatly increased. In view of this, the complex optimization problem is divided into a number of sub-optimization problems by grouping paths. However, the existing method of grouping paths has not fully taken the characteristic of multiple processes existing in a parallel program into consideration. As a result, inappropriate paths will be put into the same group, which heavily restricts the efficiency of test data generation. To overcome the above drawback, this study proposes a novel method of grouping paths. This method refines the measurement of the path similarity when grouping target paths, dynamically increases the number of benchmark paths in a group, and groups the remaining ones based on the similarity between a path and each of these benchmark paths, with the purpose of a large similarity between each pair of paths in the same group. The proposed method is applied to test nine typical programs, and compared with the method of randomly grouping paths and the existing method of grouping paths. The experimental results show that paths in the same group obtained by the proposed method have a larger similarity, which is beneficial to efficiently generating test data that satisfy the path coverage criterion.
TL;DR: The results show that by mitigating the data skew, ImRP can decrease the execution time of stream jobs substantially compared with some other partition strategies, especially when the input batch is heavily skewed.
Abstract: Spark Streaming is an extension of the core Spark engine that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It treats a stream as a series of deterministic batches and handles them as regular jobs. However, for a stream job responsible for a batch, data skew (i.e., the imbalance in the amount of data allocated to each reduce task) can degrade the job performance significantly because of load imbalance. In this paper, we propose an improved range partitioner (ImRP) to alleviate the reduce skew for stream jobs in Spark Streaming. Unlike previous work, ImRP does not require any pre-run sampling of input data; it generates the data partition scheme from the intermediate data distribution estimated from the previous batch processing, using an EWMA (Exponentially Weighted Moving Average) prediction model. To lighten the data skew, ImRP presents a novel method of calculating the partition borders optimally, and a mechanism of splitting the border key clusters when the semantics of shuffle operators permit. In addition, ImRP considers the integrated partition size and the heterogeneity of computing environments when balancing the load among reduce tasks. We implement ImRP in Spark-3.0 and evaluate its performance on four representative benchmarks: wordCount, sort, pageRank, and LDA. The results show that by mitigating the data skew, ImRP can decrease the execution time of stream jobs substantially compared with some other partition strategies, especially when the input batch is heavily skewed.
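As a rough illustration of the idea, the sketch below predicts the per-key load of the next batch with an EWMA over previous batches and then picks range borders greedily from that prediction. It is not ImRP's border algorithm (the abstract describes an optimal method that is not reproduced here); the function names, the smoothing factor `alpha`, and the sample batches are all assumptions.

```python
def ewma_update(prev, observed, alpha=0.5):
    """Blend the previous prediction with newly observed per-key record counts."""
    if not prev:
        # No history yet: the first observation is the prediction.
        return {k: float(v) for k, v in observed.items()}
    keys = set(prev) | set(observed)
    return {k: alpha * observed.get(k, 0) + (1 - alpha) * prev.get(k, 0.0)
            for k in keys}

def range_borders(predicted, num_parts):
    """Greedy borders: close a range once its predicted load reaches the target."""
    target = sum(predicted.values()) / num_parts
    borders, load = [], 0.0
    for key in sorted(predicted):
        load += predicted[key]
        if load >= target and len(borders) < num_parts - 1:
            borders.append(key)  # keys <= border go to the current partition
            load = 0.0
    return borders

pred = {}
for batch in [{"a": 8, "b": 2, "c": 2}, {"a": 10, "b": 1, "c": 3, "d": 2}]:
    pred = ewma_update(pred, batch)

borders = range_borders(pred, num_parts=2)
```

In this toy input the hot key "a" alone exceeds the per-partition target, so whole-key range borders cannot balance the load. That is the situation in which splitting border key clusters, as the abstract mentions, becomes relevant.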
TL;DR: Experimental results indicate that a single-process many-core speedup of 12.89x can be achieved through two manual optimizations, a software cache and a reduced main-memory access frequency, applied to an intermediate version of the code generated by the compiler.
Abstract: The Sunway TaihuLight, with a theoretical peak performance of 125 PFlop/s, is now ranked third in the TOP500 list. It provides a high-level programming model named Sunway OpenACC, which extends the OpenACC 2.0 standard with some customized extensions. We assess the performance of the extended programming model and the SW26010 heterogeneous many-core processor for running molecular dynamics (MD) simulations of solid covalent crystals using many-body potentials, such as the Tersoff potentials. Considering the special architecture of the SW26010 processor, we port the MD simulation of silicon crystals using Sunway OpenACC under the guidance of the extended Amdahl’s law. Since the Sunway OpenACC compiler cannot deal with the performance bottleneck of the MD simulation of silicon crystals, we implement two primary optimizations, designing a software cache and minimizing the access frequency of the main memory, on an intermediate version of the code generated by the compiler. Experimental results indicate that a single-process many-core speedup of 12.89x can be achieved by using manual optimization strategies. Compared with the execution time of the serial version on an Intel Xeon E5-2620 v4 processor, a speedup of 8.7x can be achieved.
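TL;DR: CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows, is presented; it combines dynamic replication with erasure coding based on data access patterns to tolerate in-memory data failures while maintaining low latency and high storage efficiency.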
Abstract: The dramatic increase in the scale of current and planned high-end HPC systems is leading to new challenges, such as the growing costs of data movement and IO, and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing the entire application workflows on the HPC system, have emerged as an attractive approach to address data-related challenges by moving computations closer to the data, and staging-based frameworks have been effectively used to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can impact latency and/or result in significant storage overheads. In this article, we present CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. It also leverages multiple levels of replication and erasure coding to support diverse data resiliency requirements. Furthermore, the article presents optimizations for load balancing and conflict-avoiding encoding, and a low-overhead, lazy data recovery scheme. We have implemented the CoREC runtime, deployed it with the DataSpaces staging service on leadership-class computing machines, and present an experimental evaluation in the article. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.
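To make the storage trade-off between replication and erasure coding concrete, here is a minimal sketch using single XOR parity, the simplest erasure code, which tolerates the loss of one block. The abstract does not say which codes CoREC uses, so the scheme, the function names, and the sample data are all assumptions for illustration only.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

def overhead_replication(copies):
    """Extra storage, relative to the data size, for n-way replication."""
    return copies - 1

def overhead_parity(data_blocks, parity_blocks=1):
    """Extra storage, relative to the data size, for k data + m parity blocks."""
    return parity_blocks / data_blocks

data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)

# Simulate losing block 1 and rebuilding it from the survivors plus parity.
recovered = xor_blocks([data[0], data[2], parity])
```

Three-way replication stores two extra copies (200% overhead), whereas three data blocks plus one parity block add about 33%. The price is that recovery has to read and combine the surviving blocks, which is the latency cost the abstract alludes to.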
TL;DR: This paper proposes an energy-efficient parallel Advanced Encryption Standard (AES) algorithm that can be used to encrypt data on heterogeneous platforms to save energy, especially for computers with thousands of heterogeneous nodes.
Abstract: Encryption plays an important role in protecting data, especially data transferred on the Internet. However, encryption is computationally expensive, and this leads to high energy costs. Parallel encryption solutions using more CPU/GPU cores can achieve high performance. If parallel encryption solutions are also designed for energy efficiency, this problem can be alleviated effectively. Because multi-core CPUs/GPUs and encryption are now pervasive, reducing the energy cost of parallel encryption has become an unavoidable problem. In this paper, we propose an energy-efficient parallel Advanced Encryption Standard (AES) algorithm for CPU-GPU heterogeneous platforms. These platforms, such as the Green 500 computers, are popular in both high-performance and general computing. The proposed algorithm parallelizes AES across both GPUs and CPUs and balances the workload between them based on their computing capacities. This approach also uses the Nvidia Management Library (NVML) to adjust GPU frequencies, overlaps data transfers and computation, and fully utilizes GPU computing resources to reduce energy consumption as much as possible. Experiments conducted on a platform with one K20M GPU and two Xeon E5-2640 v2 CPUs show that this approach can reduce energy consumption by 74% compared to the CPU-only parallel AES algorithm and by 21% compared to the GPU-only parallel AES algorithm on the same platform. Its energy efficiency is 4.66 MB/Joule on average, higher than that of both the CPU-only parallel AES algorithm (1.15 MB/Joule) and the GPU-only parallel AES algorithm (3.65 MB/Joule). As an energy-efficient parallel AES solution, it can be used to encrypt data on heterogeneous platforms to save energy, especially for computers with thousands of heterogeneous nodes.
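The capacity-based workload split and the relation between the reported efficiency and energy figures can be sketched as follows. The proportional-split rule and the throughput numbers are assumptions (the abstract only says the workload is balanced by computing capacity); the MB/Joule values are the ones quoted above.

```python
def split_by_capacity(total_bytes, capacities):
    """Assign each device a share of the data proportional to its throughput."""
    total_cap = sum(capacities.values())
    shares = {dev: total_bytes * cap // total_cap
              for dev, cap in capacities.items()}
    # Hand any rounding remainder to the fastest device.
    fastest = max(capacities, key=capacities.get)
    shares[fastest] += total_bytes - sum(shares.values())
    return shares

def energy_reduction(baseline_mb_per_joule, improved_mb_per_joule):
    """Fraction of energy saved when encrypting the same amount of data."""
    return 1.0 - baseline_mb_per_joule / improved_mb_per_joule

# Hypothetical throughputs in MB/s; these are not taken from the paper.
shares = split_by_capacity(1000, {"cpu": 300, "gpu": 700})

vs_cpu = energy_reduction(1.15, 4.66)  # hybrid vs. CPU-only
vs_gpu = energy_reduction(3.65, 4.66)  # hybrid vs. GPU-only
```

With the quoted average efficiencies this gives savings of roughly 75% and 22%, close to the reported 74% and 21%. The small gap is plausible if the paper's percentages were measured per experiment rather than derived from the averages.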
TL;DR: The interface QMPI is introduced, which addresses the limitations of PMPI and aims to be its successor; the paper presents the use cases and requirements that necessitate the development of QMPI, as well as a proposed design for a QMPI prototype, followed by its implementation and evaluation.
Abstract: As modern HPC applications and systems advance to exascale, their complexity and the need for more efficient resource utilization increase. This demands more advanced monitoring, analysis, and optimization approaches. Therefore, the Message Passing Interface (MPI), which is the most common parallel programming system for HPC applications, must enable these advanced approaches. Although the existing MPI Profiling Interface, PMPI, provides comprehensive tool support, it is no longer sufficient to enable these advanced approaches. In particular, PMPI does not support simultaneous or collaborative monitoring solutions from multiple different agents or sources. In this paper we introduce our interface QMPI, which addresses these limitations of PMPI and aims to be a successor to it. This paper presents the use cases and requirements that necessitate the development of QMPI, as well as a proposed design for a QMPI prototype, followed by its implementation and evaluation.
TL;DR: Partition-Shuffle-merge (PS-merge), a parallel k-way in-place merging algorithm for shared-memory multicore CPUs that combines sequence partitioning with the in-place perfect shuffle, is proposed and shows a significant improvement in overall execution time over bitonic merge, a parallel binary merge tree, and lazy-merge.
Abstract: Huge data volumes (big data) and the emergence of new parallel architectures are leading to a revisiting of classic computer science topics. The motivation for revisiting parallel k-way in-place merging is primarily the unsuitability of the current state-of-the-art parallel algorithms for shared-memory multicore CPUs. These architectures can be profitably employed to solve this problem in parallel. Recently, Intel introduced a parallel Standard Template Library (STL) implementation for multicore CPUs, but it has no merge function with the in-place property. We propose Partition-Shuffle-merge (PS-merge) to address this problem. PS-merge combines sequence partitioning with the in-place perfect shuffle to address the k-way merge task. First, each sequence is divided into t equal-sized partitions or ranges. Thus, each partition is spread over at most k sequences. Then, perfect shuffle is used as a replacement for the classic block rearrangement. Finally, range subpartitions are merged using a sequential in-place merging algorithm. Because PS-merge produces the standard merging format, we evaluate it against the state-of-the-art methods: bitonic merge, a parallel binary merge tree, and lazy-merge. PS-merge shows a significant improvement in overall execution time.
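The partition-then-merge structure can be illustrated with a deliberately simplified sketch. It is out-of-place, cuts the sequences at assumed splitter values instead of the equal-sized partitions described above, and uses `heapq.merge` where PS-merge uses a perfect shuffle followed by a sequential in-place merge. The function name and the splitter choice are hypothetical.

```python
import heapq
from bisect import bisect_right

def partitioned_kway_merge(seqs, splitters):
    """Merge sorted sequences range by range.

    Every value range is independent of the others, so in a parallel
    setting each range could be handled by its own thread.
    `splitters` must be sorted in ascending order.
    """
    # For each sequence, find where every splitter cuts it.
    bounds = [[0] + [bisect_right(s, sp) for sp in splitters] + [len(s)]
              for s in seqs]
    out = []
    for r in range(len(splitters) + 1):
        # Pieces of all k sequences that fall into value range r.
        pieces = [s[b[r]:b[r + 1]] for s, b in zip(seqs, bounds)]
        out.extend(heapq.merge(*pieces))
    return out

seqs = [[1, 4, 9, 12], [2, 3, 10], [5, 6, 7, 8, 11]]
merged = partitioned_kway_merge(seqs, splitters=[4, 8])
```

Because the ranges cover disjoint value intervals, concatenating the per-range results yields the fully merged sequence without any further cross-range step.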
TL;DR: A novel algorithm – Triangulation of Independent Partitions in Parallel (TIPP) which divides the domain into many independent partitions that can be triangulated in parallel is introduced, and this new design improves both performance and scalability.
Abstract: Because of the importance of Delaunay Triangulation in science and engineering, researchers have devoted extensive attention to parallelizing this fundamental algorithm. However, generating unstructured meshes for extremely large point sets remains a barrier for scientists working with large scale or high resolution datasets. In our previous paper, we introduced a novel algorithm – Triangulation of Independent Partitions in Parallel (TIPP) which divides the domain into many independent partitions that can be triangulated in parallel. However, using only a single master process introduced a performance bottleneck and inhibited scalability. In this paper, we refine our description of the original TIPP algorithm, and also extend TIPP to employ multiple master processes, distributing computational load across several machines. This new design improves both performance and scalability, and can produce 20 billion triangles using only 10 commodity nodes in under 30 minutes.
TL;DR: A sparse matrix vector multiplication (SpMV) kernel that uses a novel sparse matrix storage format and delivers superior performance for unstructured matrices on Intel x86 processors is presented.
Abstract: We present a sparse matrix vector multiplication (SpMV) kernel that uses a novel sparse matrix storage format and delivers superior performance for unstructured matrices on Intel x86 processors. Our kernel exploits the properties of our storage format to enhance load balancing, SIMD efficiency, and data locality. We evaluate the performance of our kernel on a dual 24-core Skylake Xeon Platinum 8160 using 82 HPC and 36 scale-free unstructured matrices from 42 application areas. For HPC matrices, our kernel achieves a speed improvement of up to 19.5x over the MKL Inspector-executor SpMV kernel (1.6x on average). For scale-free matrices, the speed improvement is up to 2.6x (1.3x on average).
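For reference, the operation being optimized can be written down with the standard compressed sparse row (CSR) format. This is a plain baseline sketch, not the novel storage format of the paper (which the abstract does not describe); the helper names and the sample matrix are assumptions.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's nonzeros
    return values, col_idx, row_ptr

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y.append(acc)
    return y

dense = [[4, 0, 0, 1],
         [0, 0, 2, 0],
         [0, 3, 0, 0],
         [5, 0, 0, 6]]
values, col_idx, row_ptr = dense_to_csr(dense)
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0])
```

The rows hold different numbers of nonzeros and `x` is read through `col_idx`, which hints at why load balancing, SIMD efficiency, and data locality are the properties the abstract singles out.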