TL;DR: This paper presents techniques and tools for automatic vectorization of MATLAB loops, evaluates them on 9 benchmarks using two interpreters and two JIT-based platforms, and shows that automatic vectorization is extremely effective for the interpreters on most benchmarks and moderately effective on some benchmarks in the JIT context.
Abstract: Dynamic array-based languages such as MATLAB provide a wide range of built-in operations which can be efficiently applied to all elements of an array. Historically, MATLAB and Octave programmers have been advised to manually transform loops to equivalent “vectorized” computations in order to maximize performance. In this paper we present the techniques and tools to perform automatic vectorization, including handling for loops with calls to user-defined functions. We evaluate the technique on 9 benchmarks using two interpreters and two JIT-based platforms and show that automatic vectorization is extremely effective for the interpreters on most benchmarks, and moderately effective on some benchmarks in the JIT context.
TL;DR: LightHouse, a GPU code-generator for a graph language named Green-Marl for which a multicore CPU backend already exists, is proposed, which allows a user to seamlessly generate both the multicore as well as the GPU backends from the same specification of a graph algorithm.
Abstract: We propose LightHouse, a GPU code-generator for a graph language named Green-Marl for which a multicore CPU backend already exists. This allows a user to seamlessly generate both the multicore as well as the GPU backends from the same specification of a graph algorithm. This restriction of not modifying the language poses several challenges as we work with an existing abstract syntax tree of the language, which is not tailored to GPUs. LightHouse overcomes these challenges with various optimizations such as reducing the number of atomics and collapsing loops. We illustrate its effectiveness by generating efficient CUDA codes for four graph analytic algorithms, and comparing performance against their multicore OpenMP versions generated by Green-Marl. In particular, our generated CUDA code performs comparably to 4- to 64-threaded OpenMP versions for different algorithms.
TL;DR: A pragma-driven compiler is used to recognize the special structures and optimization needs of stencil computations and thereby to automatically generate low-level code that efficiently utilizes the data placement and management support of a C++ runtime on top of NUMA API.
Abstract: This paper presents a system for automatically supporting the optimization of stencil kernels on emerging Non-Uniform Memory Access (NUMA) many-core architectures, through a combined compiler + runtime approach. In particular, we use a pragma-driven compiler to recognize the special structures and optimization needs of stencil computations and thereby to automatically generate low-level code that efficiently utilizes the data placement and management support of a C++ runtime on top of NUMA API, a programming interface to the NUMA policy supported by the Linux kernel. Our results show that through automated specialization of code generation, this approach provides a combined benefit of performance, portability, and productivity for developers.
TL;DR: This work proposes to use fine grain synchronization, with event-driven multithreading, to efficiently take advantage of shared-memory systems that feature tens of processing elements, and studies a naive 5-point 2D stencil kernel.
Abstract: Current shared-memory systems can feature tens of processing elements. The old assumption that coarse-grain synchronization is enough in a shared-memory system thus becomes invalid. To efficiently take advantage of such systems, we propose to use fine-grain synchronization, with event-driven multithreading. To illustrate our point, we study a naive 5-point 2D stencil kernel. We provide several synchronization variants using our fine-grain multithreading environment, and compare them to a naive coarse-grain implementation using OpenMP. We conducted experiments on three different many-core compute nodes, with speedups ranging from 1.2\(\times \) to 1.75\(\times \).
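For reference, a naive 5-point 2D stencil sweep can be sketched as below. This is a minimal sequential Python sketch, not the paper's kernel or its fine-grain runtime: the function name `stencil_step`, the equal averaging weights, and the copied-boundary treatment are all assumptions made for illustration.

```python
def stencil_step(grid):
    """Apply one naive 5-point stencil sweep and return a new grid.

    Each interior cell becomes the average of itself and its four
    neighbours (assumed weights); boundary cells are copied unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # out-of-place update
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return new
```

In this out-of-place form every cell of a sweep reads only the previous grid, so the cells within one sweep are independent and the ordering constraint sits between consecutive sweeps.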
TL;DR: The implementation of automatic deep copy in a Fortran compiler targeting a heterogeneous system with GPUs is described, the runtime overheads of the deep copies are measured, techniques to reduce this overhead are proposed, and the efficacy of these techniques is evaluated.
Abstract: In systems with multiple memories, software may need to explicitly copy data from one memory location to another. This copying is required to enable access or to unlock performance, and it is especially important in heterogeneous systems. When the data includes pointers to other data, the copying process has to recursively follow the pointers to perform a deep copy of the entire data structure. It is tedious and error-prone to require users to manually program the deep copy code for each pointer-based data structure used. Instead, a compiler and runtime system can automatically handle deep copies if they can identify pointers in the data and determine the size and type of data pointed to by each pointer. This is possible if the language provides reflection capabilities, or uses smart pointers that encapsulate this information, e.g. Fortran pointers that intrinsically include dope vectors to describe the data pointed to. In this paper, we describe our implementation of automatic deep copy in a Fortran compiler targeting a heterogeneous system with GPUs. We measure the runtime overheads of the deep copies, propose techniques to reduce this overhead, and evaluate the efficacy of these techniques.
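The recursive pointer-following described above can be illustrated with a small sketch. It is written in Python rather than Fortran and does not reflect the paper's compiler or runtime: the `Node` class and `deep_copy` function are hypothetical names, and the `copied` table is an assumed way to handle shared or cyclic pointers.

```python
class Node:
    """Hypothetical pointer-based structure: a payload plus pointers to other nodes."""

    def __init__(self, data, children=None):
        self.data = data
        self.children = children if children is not None else []


def deep_copy(node, copied=None):
    """Recursively copy `node` and everything reachable from it.

    `copied` maps id(original) -> clone, so shared or cyclic pointers
    are copied once and keep their sharing in the result.
    """
    if node is None:
        return None
    if copied is None:
        copied = {}
    if id(node) in copied:
        return copied[id(node)]
    clone = Node(list(node.data))
    copied[id(node)] = clone  # register before recursing to handle cycles
    clone.children = [deep_copy(child, copied) for child in node.children]
    return clone
```

The sketch can walk the structure only because each node exposes its own pointers and payload, which mirrors the requirement above that the system be able to identify pointers and the size and type of what they point to.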
TL;DR: A case is made for harnessing available parallelism in multicore systems to improve performance as well as the quality of function approximation, and a number of tasks that function approximation schemes can offload to available parallel cores are discussed.
Abstract: Approximating functions in applications that can tolerate some inaccuracy in their results can deliver substantial performance gains. This paper makes a case for harnessing available parallelism in multicore systems to improve performance as well as the quality of function approximation. To that end, we discuss a number of tasks that the function approximation schemes can offload to available parallel cores. We also discuss how leveraging parallelism can help provide guarantees about results and dynamically improve approximations. Finally, we present experimental results of a function approximation scheme.
TL;DR: A new parallel algorithm for approximate breadth-first ordering of an unweighted graph is introduced by using bounded asynchrony to parametrically control both the performance and error of the algorithm, leading to higher performance at the expense of a loss of precision.
Abstract: We introduce a new parallel algorithm for approximate breadth-first ordering of an unweighted graph by using bounded asynchrony to parametrically control both the performance and error of the algorithm. This work is based on the \(k\)-level asynchronous (KLA) paradigm that trades expensive global synchronizations in the level-synchronous model for local synchronizations in the asynchronous model, which may result in redundant work. Instead of correcting errors introduced by asynchrony and redoing work as in KLA, in this work we control the amount of work that is redone and thus the amount of error allowed, leading to higher performance at the expense of a loss of precision. Results of an implementation of this algorithm are presented on up to 32,768 cores, showing 2.27x improvement over the exact KLA algorithm and 3.8x improvement over the level-synchronous version with minimal error on several graph inputs.
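For context, the exact level-synchronous baseline mentioned above can be sketched sequentially as follows. This is not the approximate algorithm or the KLA implementation from the paper; the function name `bfs_levels` and the adjacency-dictionary input are assumptions for illustration.

```python
def bfs_levels(adj, source):
    """Return {vertex: level} using level-synchronous BFS.

    `adj` maps each vertex to a list of neighbours. The whole frontier
    is finished before the next level starts; in a parallel version this
    boundary is where the global synchronization would sit.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, []):
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

Each iteration of the `while` loop corresponds to one synchronized level, which is the per-level cost that asynchronous approaches try to reduce.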
TL;DR: This research introduces a concurrency-aware solution, the memory Sluice Gate Theory, for solving the outstanding memory wall problem and shows that, with sufficient data access concurrency and hardware support, the memory wall impact can be reduced to the minimum.
Abstract: In addition to locality, data access concurrency has emerged as a pillar factor of memory performance. In this research, we introduce a concurrency-aware solution, the memory Sluice Gate Theory, for solving the outstanding memory wall problem. Sluice gates are designed to control data transfer at each memory layer dynamically, and a global control algorithm, named layered performance matching, is developed to match the data transfer request/supply at each memory layer, thus matching the overall performance between the CPU and memory system. Formal theoretical analyses are given to show that, with sufficient data access concurrency and hardware support, the memory wall impact can be reduced to the minimum. Experimental testing confirms the theoretical findings.
TL;DR: It is shown that compilers, both front-ends and back-ends, may generate unstructured CFGs from structured program sources, which necessitates mechanisms to obtain structured CFGs from unstructured ones.
Abstract: Structured programs are believed to be easier to understand, and compiler friendly [5, 10, 45]. However, compilers do not process the source programs directly; they instead work on control flow graphs (CFGs) of the programs. Unfortunately, there is little formalization of structured CFGs. This paper shows how the lack of formalization has led to varying interpretations of structured CFGs. The paper next presents a new formalization of structured CFGs which eliminates the ambiguity. Structured CFGs gain importance as they ease compiler optimizations and decompilation, and help reduce the performance degradation caused by thread divergence on SIMD units. The paper elaborates on these benefits. It also shows that compilers, both front-ends and back-ends, may generate unstructured CFGs from structured program sources, which necessitates mechanisms to obtain structured CFGs from unstructured ones.
TL;DR: An automatic local memory management method by compilers through multi-dimensional data decomposition techniques to fit working sets onto limited size local memory and suitable block management structures to create application specific fixed size data transfer blocks is proposed.
Abstract: Embedded multicore processors for hard real-time applications like automobile engine control require the usage of local memory on each processor core to precisely meet the real-time deadline constraints, since cache memory cannot satisfy the deadline requirements due to cache misses. To utilize local memory, programmers or compilers need to explicitly manage data movement and data replacement for local memory considering the limited size. However, such management is extremely difficult and time consuming for programmers. This paper proposes an automatic local memory management method by compilers through (i) multi-dimensional data decomposition techniques to fit working sets onto limited size local memory, (ii) suitable block management structures, called Adjustable Blocks, to create application specific fixed size data transfer blocks, (iii) multi-dimensional templates to preserve the original multi-dimensional representations of the decomposed multi-dimensional data that are mapped onto one-dimensional Adjustable Blocks, (iv) block replacement policies from liveness analysis of the decomposed data, and (v) code size reduction schemes to generate shorter codes. The proposed local memory management method is implemented on the OSCAR multi-grain and multi-platform compiler and evaluated on the Renesas RP2 8-core embedded homogeneous multicore processor equipped with local and shared memory. Evaluations on 5 programs including multimedia and scientific applications show promising results. For instance, speedups on 8 cores compared to single core execution using off-chip shared memory on an AAC encoder program, an MPEG2 encoder program, Tomcatv, and Swim are improved from 7.14 to 20.12, 1.97 to 7.59, 5.73 to 7.38, and 7.40 to 11.30, respectively, when using local memory with the proposed method. These evaluations indicate the usefulness and the validity of the proposed local memory management method on real embedded multicore processors.
TL;DR: A dynamic tiling scheme is presented which is energy-conscious: it uses a machine model which does not require symmetric memory anywhere in the hierarchy, which makes it very robust to chip variance and allows all possible resources to be utilized, which is necessary for future near-threshold voltage designs.
Abstract: As multi and many core chips steadily increase their core count, we observe a phenomenon we call memory hierarchy capacity per capita inversion. To overcome this inversion while remaining energy-efficient, we present a dynamic tiling scheme which we apply to solve the classic Matrix Multiply algorithm. The tiling scheme follows a Hilbert-Inspired Curve strategy to minimize data movement energy, while still allowing for slack and variance within the computation and memory usage of a chip. Our algorithm is energy-conscious: it uses a machine model which does not require symmetric memory (in size or addressing) anywhere in the hierarchy. It only concerns itself with the energy consumption of all memories. This property makes it very robust to chip variance and allows all possible resources to be utilized, which is necessary for future near-threshold voltage designs. Initial results, obtained on a future many-core simulator targeting the Traleika Glacier architecture, give initial estimates of memory reads and writes to all parts of the chip as well as relative energy consumption.
TL;DR: This work focuses on tuning irregular, task-parallel applications and/or the GPU architecture in order to achieve similar benefits for irregular applications running on GPUs.
Abstract: GPGPUs deliver high speedup for regular applications while remaining energy efficient. In recent years, there has been much focus on tuning irregular, task-parallel applications and/or the GPU architecture in order to achieve similar benefits for irregular applications running on GPUs. While most of the previous works have focused on minimizing the effect of control and memory divergence, which are prominent in irregular applications and which degrade the performance, there has been less attention paid to decreasing cache pressure and hence improving performance of applications given the small cache sizes on GPUs.
TL;DR: This paper introduces an abstraction for approximating the message passing behaviors of MPI programs that is more precise than prior work and is applicable to a wide variety of applications.
Abstract: Static analysis discovers provably true properties about behaviors of programs that are useful in optimization, debugging and verification. Sequential static analysis techniques fail to interpret the message passing semantics of MPI and lack the ability to optimize or check the message passing behaviors of MPI programs. In this paper, we introduce an abstraction for approximating the message passing behaviors of MPI programs that is more precise than prior work and is applicable to a wide variety of applications. Our approach builds on the compositional paradigm where we transparently extend MPI support to sequential analyses through composition with our MPI analyses. This is the first framework where the data flow analysis is carried out in parallel on a cluster, with the message-carried data flow facts for refining inter-process data flow analysis states. We detail ParFuse – a framework that supports such parallel and compositional analysis of MPI programs, report its scalability and detail the prospects of extending our work for more powerful analyses.
TL;DR: A collaborator’s QA algorithm for Dynamic Contrast-Enhanced MRI data written in MATLAB is presented as a case study for exploring parallel programming in MATLAB and Julia; to the best of the authors’ knowledge, this comparison of Julia’s performance in a parallel, application-level program is novel.
Abstract: The last several decades have been marked by dramatic increases in the use of diagnostic medical imaging and improvements in the modalities themselves. As such, more data is being generated at an ever increasing rate. However, in the case of Magnetic Resonance Imaging (MRI), analysis and reports remain semi-quantitative, despite reported advantages of quantitative analysis (QA), due to prohibitive execution times. We present a collaborator’s QA algorithm for Dynamic Contrast-Enhanced (DCE) MRI data written in MATLAB as a case study for exploring parallel programming in MATLAB and Julia. Parallelization resulted in a 7.66x speedup in MATLAB and a 72x speedup in Julia. To the best of our knowledge, this comparison of Julia’s performance in a parallel, application-level program is novel. On the basis of these results and our experiences while programming in each language, our collaborator now prototypes in MATLAB and then ports to Julia when performance is critical.
TL;DR: This work presents a new set of loop transformations to expose and maximize data locality in loop-nests with non-uniform reuse patterns and exposes loop level parallelism, by grouping together independent iterations, thus improving performance of both serial and parallel execution.
Abstract: This work presents a new set of loop transformations to expose and maximize data locality in loop-nests with non-uniform reuse patterns. The proposed set of transformations uses the norms of the Polyhedral Model to represent loop-nests and then leverages such a representation to partition the iteration space into polygonally shaped partitions with maximum locality. However, the partitioning algorithm tends to produce partitions with complex geometry (shape) and with a progressively smaller number of iterations, which, in practice, introduces much run-time overhead. This work also focuses on containing the number of partitions and properly managing their geometry at run-time, to contain unnecessary overhead. The proposed transformations also expose loop level parallelism, by grouping together independent iterations, thus improving performance of both serial and parallel execution. In parallel execution, a selective mapping of partitions to threads based on the type of reuse these partitions exhibit is proposed.
TL;DR: QUARC, a framework for the optimized compilation of domain-specific extensions to C++, focuses on stencil-like computations on arrays with an arbitrary number of dimensions and uses a template meta-programming front end to define a high-level array language.
Abstract: We present QUARC, a framework for the optimized compilation of domain-specific extensions to C++. Driven by needs for programmer productivity and portable performance for lattice QCD, the framework focuses on stencil-like computations on arrays with an arbitrary number of dimensions. QUARC uses a template meta-programming front end to define a high-level array language. Unlike approaches that generate scalarized loop nests in the front end, the instantiation of QUARC templates retains high-level abstraction suitable for optimization at the object (array) level. The back end compiler (CLANG/LLVM) is extended to implement array transformations such as transposition, reshaping, and partitioning for parallelism and for memory locality prior to scalarization. We present the design and implementation.
TL;DR: A novel approach to judiciously map parallel programs to hardware in dynamic contexts and goals is proposed; it collects a set of mapping policies to determine the best number of threads for specific contexts and binds threads to cores for increased affinity.
Abstract: In modern day computing, the performance of parallel programs is bound by the dynamic execution context that includes inherent program behavior, resource requirements, co-scheduled programs sharing the system resources, hardware failures and input data. Besides this dynamic context, the optimization goals are increasingly becoming multi-objective and dynamic, such as minimizing execution time while maximizing energy efficiency. Efficiently mapping the parallel threads onto the hardware cores is crucial to achieve these goals. This paper proposes a novel approach to judiciously map parallel programs to hardware in dynamic contexts and goals. It uses a simple yet novel technique: collecting a set of mapping policies to determine the best number of threads for specific contexts. It then binds threads to cores for increased affinity. In addition, this approach determines the optimal DVFS levels for these cores to achieve higher energy efficiency. In an extensive evaluation against state-of-the-art techniques, this scheme outperforms them by 1.08x to 1.21x, and outperforms the OpenMP default by 1.39x.
TL;DR: This work analyzes the performance behavior based on tuning various elements in each collection for the LULESH application using CnC, and proposes guidelines to emulate the techniques used to obtain high performance while improving programmability.
Abstract: Programmers are faced with many challenges for obtaining performance on machines with increasingly capable, yet increasingly complex hardware. A trend towards task-parallel and asynchronous many-task programming models aims to alleviate the burden of parallel programming on a vast array of current and future platforms. One such model, Concurrent Collections (CnC), provides a programming paradigm that emphasizes the separation of concerns: domain experts concentrate on their algorithms and correctness, whereas performance experts handle mapping and tuning to a target platform. Deep understanding of parallel constructs and behavior is not necessary to write parallel applications that will run on various multi-threaded and multi-core platforms when using the CnC model. However, performance can vary greatly depending on the granularity of tasks and data declared by the programmer. These program-specific decisions are not part of the CnC tuning capabilities and must be tuned in the program. We analyze the performance behavior based on tuning various elements in each collection for the LULESH application using CnC. We demonstrate the effects of different techniques to modify task and data granularity in CnC collections. Our fully tiled CnC implementation outperforms the OpenMP counterpart by 3\(\times \) for 48 processors. Finally, we propose guidelines to emulate the techniques used to obtain high performance while improving programmability.
TL;DR: In this paper, the authors present software strategies that enable explicit management of resilience to errors in the system, in addition to locality of reference in the complex memory hierarchies of future HPC systems.
Abstract: The path to exascale computational capabilities in high-performance computing (HPC) systems is challenged by the inadequacy of present software technologies to adapt to the rapid evolution of architectures of supercomputing systems. The constraints of power have driven system designs to include increasingly heterogeneous architectures and diverse memory technologies and interfaces. Future systems are also expected to experience an increased rate of errors, such that the applications will no longer be able to assume correct behavior of the underlying machine. To enable the scientific community to succeed in scaling their applications, and to harness the capabilities of exascale systems, we need software strategies that enable explicit management of resilience to errors in the system, in addition to locality of reference in the complex memory hierarchies of future HPC systems.
TL;DR: This work presents the contention avoiding concurrent priority queue (CA-PQ), a data structure that functions as a linearizable concurrent priority queue with traditional semantics under low contention, but activates contention avoiding techniques that give it more relaxed semantics when high contention is detected.
Abstract: Efficient and scalable concurrent priority queues are crucial for the performance of many multicore applications, e.g. for task scheduling and the parallelization of various algorithms. Linearizable concurrent priority queues with traditional semantics suffer from an inherent sequential bottleneck in the head of the queue. This bottleneck is the motivation for some recently proposed priority queues with more relaxed semantics. We present the contention avoiding concurrent priority queue (CA-PQ), a data structure that functions as a linearizable concurrent priority queue with traditional semantics under low contention, but activates contention avoiding techniques that give it more relaxed semantics when high contention is detected. CA-PQ avoids contention in the head of the queue by removing items in bulk from the global data structure, which also allows it to often serve DelMin operations without accessing memory that is modified by several threads. We show that CA-PQ scales well. Its cache friendly design is twice as fast as state-of-the-art concurrent priority queues on several instances of a parallel shortest path benchmark.
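The bulk-removal idea can be illustrated with a single-threaded toy model. This sketch is not CA-PQ: it has no concurrency, no contention detection, and no linearizable mode. The class name `BulkBufferPQ`, the `bulk_size` parameter, and the use of Python's `heapq` as a stand-in for the global structure are all assumptions.

```python
import heapq


class BulkBufferPQ:
    """Single-threaded illustration of bulk removal from a shared heap."""

    def __init__(self, bulk_size=4):
        self.global_heap = []   # stands in for the shared data structure
        self.local_buffer = []  # stands in for one thread's private buffer
        self.bulk_size = bulk_size

    def insert(self, item):
        heapq.heappush(self.global_heap, item)

    def del_min(self):
        if not self.local_buffer:
            # One access to the shared heap fetches up to bulk_size items.
            n = min(self.bulk_size, len(self.global_heap))
            self.local_buffer = [heapq.heappop(self.global_heap) for _ in range(n)]
            self.local_buffer.reverse()  # smallest item is now at the end
        if not self.local_buffer:
            raise IndexError("del_min from empty queue")
        return self.local_buffer.pop()
```

While the buffer is non-empty, `del_min` never touches `global_heap`, which models serving DelMin without accessing shared memory. The cost is relaxed ordering: an item inserted after a bulk fetch may be smaller than items still waiting in the buffer.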
TL;DR: This paper applies loop and data transformations for sparse matrix computations in a polyhedral compilation framework to a real application code, the LOBPCG solver, which performs a Sparse Matrix Multi-Vector computation at each iteration, and presents the transformation derivation for this application code and resulting performance.
Abstract: Sparse matrix computations are widely used in iterative solvers; they are notoriously memory bound and typically yield poor performance on modern architectures. A common optimization strategy for such computations is to rely on specialized representations that exploit the nonzero structure of the sparse matrix in an application-specific way. Recent research has developed loop and data transformations for sparse matrix computations in a polyhedral compilation framework. In this paper, we apply these and additional loop transformations to a real application code, the LOBPCG solver, which performs a Sparse Matrix Multi-Vector (SpMM) computation at each iteration. The paper presents the transformation derivation for this application code and resulting performance. The compiler-generated code attains a speedup of up to 8.26\(\times \) on 8 threads on an Intel Haswell and 30 GFlops; it outperforms a state-of-the-art manually-written Fortran implementation by 3%.
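As a baseline illustration of the SpMM kernel itself, the sketch below multiplies a sparse matrix by a dense multi-vector. It assumes the common CSR (compressed sparse row) layout, which the abstract does not specify, and it is neither the compiler-generated code nor the LOBPCG implementation; `spmm_csr` and its parameter names are hypothetical.

```python
def spmm_csr(indptr, indices, values, X):
    """Multiply a CSR sparse matrix by a dense multi-vector X.

    indptr/indices/values are the usual CSR arrays; X is a list of rows,
    each holding one entry per right-hand-side vector.
    """
    n_rows = len(indptr) - 1
    n_vecs = len(X[0])
    Y = [[0.0] * n_vecs for _ in range(n_rows)]
    for i in range(n_rows):
        # Nonzeros of row i live in positions indptr[i] .. indptr[i+1]-1.
        for k in range(indptr[i], indptr[i + 1]):
            a, row_x = values[k], X[indices[k]]
            for j in range(n_vecs):
                Y[i][j] += a * row_x[j]
    return Y
```

The access `X[indices[k]]` is indirect, so reads of `X` follow the nonzero pattern of the matrix; each loaded nonzero is then reused across all `n_vecs` columns in the innermost loop.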
TL;DR: The polyhedral model is extended to enable analysis of explicitly parallel SPMD programs and a new approach for static detection of data races in SPMD programs using the extended polyhedral model is provided.
Abstract: Despite its age, SPMD (Single Program Multiple Data) parallelism continues to be one of the most popular parallel execution models in use today, as exemplified by OpenMP for multicore systems and CUDA and OpenCL for accelerator systems. The basic idea behind the SPMD model, which makes it different from task-parallel models, is that all logical processors (worker threads) execute the same program with sequential code executed redundantly and parallel code executed cooperatively. In this paper, we extend the polyhedral model to enable analysis of explicitly parallel SPMD programs and provide a new approach for static detection of data races in SPMD programs using the extended polyhedral model. We evaluate our approach using 34 OpenMP programs from the OmpSCR and PolyBench-ACC benchmark suites. (PolyBench-ACC derives from the PolyBench benchmark suite and provides OpenMP, OpenACC, CUDA, OpenCL and HMPP implementations.)