TL;DR: GPUs enabled by NVIDIA’s Compute Unified Device Architecture (CUDA) have become accessible to mainstream programming, but CUDA does not provide a viable interface for creating dynamic tasks and handling load-balancing issues, so any such support must be orchestrated entirely by the CUDA programmer.
Abstract: GPUs enabled by NVIDIA’s Compute Unified Device Architecture (CUDA) have become accessible to mainstream programming. An abundance of simple computational cores and high memory bandwidth make GPUs ideal candidates for data-parallel applications. However, their potential for executing applications that combine task and data parallelism has not been explored in detail. CUDA does not provide a viable interface for creating dynamic tasks and handling load-balancing issues. Any such support has to be orchestrated entirely by the CUDA programmer today.
TL;DR: An automatic, static program transformation is presented that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck.
Abstract: We present an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the PIPS/Par4All compiler. In the generated code, redundant communications due to data reuse between kernel executions are avoided. Instructions that initiate transfers are scheduled effectively at compile time. We present experimental results obtained with Polybench 2.0, some Rodinia benchmarks, and a real numerical simulation. We obtain an average speedup of 4 to 5 when compared to a naive parallelization using a modern GPU with Par4All, HMPP, and PGI, and 3.5 when compared to an OpenMP version using a 12-core multiprocessor.
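The effect of avoiding redundant transfers between kernel executions can be illustrated with a small transfer count. The sketch below is not the PIPS/Par4All pass; it is a simplified model that assumes the host never touches an array between kernels, and the names `naive_transfers`, `lazy_transfers`, and the kernel tuples are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical kernel description: (arrays read, arrays written).
Kernel = Tuple[Set[str], Set[str]]


def naive_transfers(kernels: List[Kernel]) -> int:
    """Copy every input in before each kernel and every output back after it."""
    return sum(len(reads) + len(writes) for reads, writes in kernels)


def lazy_transfers(kernels: List[Kernel], host_reads: Set[str]) -> int:
    """Copy in only arrays with no device copy yet; copy back only what the
    host reads after the last kernel (assumes no host access in between)."""
    on_device: Set[str] = set()
    written: Set[str] = set()
    count = 0
    for reads, writes in kernels:
        count += len(reads - on_device)   # host-to-device copies still needed
        on_device |= reads | writes       # data now resident on the device
        written |= writes
    count += len(written & host_reads)    # delayed device-to-host copies
    return count


kernels = [({"a", "b"}, {"c"}), ({"c"}, {"d"}), ({"d", "a"}, {"e"})]
naive = naive_transfers(kernels)          # 8 transfers
lazy = lazy_transfers(kernels, {"e"})     # 3 transfers
print(naive, lazy)
```

In this toy sequence the intermediate arrays `c` and `d` never leave the device, which is the kind of reuse the abstract refers to.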
TL;DR: This paper describes an implementation of a non-blocking concurrent hash trie based on single-word compare-and-swap instructions in a shared-memory system and shows that the implementation is linearizable and lock-free.
Abstract: This paper describes an implementation of a non-blocking concurrent hash trie based on single-word compare-and-swap instructions in a shared-memory system. Insert, lookup and remove operations modifying different parts of the hash trie can be run completely independently. Remove operations ensure that the unneeded memory is freed and that the trie is kept compact. A pseudocode for these operations is presented and a proof of correctness is given – we show that the implementation is linearizable and lock-free. Finally, benchmarks are presented that compare concurrent hash trie operations against the corresponding operations on other concurrent data structures.
TL;DR: The OpenCL framework is implemented and it is shown that the original OpenCL semantics naturally fits to the GPU cluster environment, and the framework achieves both high performance and ease of programming.
Abstract: In this paper, we propose an OpenCL framework for GPU clusters. The target cluster architecture consists of a single host node and multiple compute nodes. They are connected by an interconnection network, such as Gigabit and InfiniBand switches. Each compute node consists of multiple GPUs. Each GPU becomes an OpenCL compute device. The host node executes the host program in an OpenCL application. Our OpenCL framework provides an illusion of a single system for the user. It allows the application to utilize GPUs in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. We show that the original OpenCL semantics naturally fits to the GPU cluster environment, and the framework achieves both high performance and ease of programming. We implement the OpenCL framework and evaluate its performance on a GPU cluster that consists of one host and eight compute nodes using six OpenCL benchmark applications.
TL;DR: A dynamic framework is introduced to help application developers detect instances of false sharing as well as identify the data objects in an OpenMP code that cause the problem.
Abstract: Writing a parallel shared memory application that achieves good performance and scales well as the number of threads increases can be challenging. One of the reasons is that as threads proliferate, the contention among shared resources increases and this may cause performance degradation. In particular, multi-threaded applications can suffer from the false sharing problem, which can degrade the performance of an application significantly. The work in this paper focuses on detecting performance bottlenecks caused by false sharing in OpenMP applications. We introduce a dynamic framework to help application developers detect instances of false sharing as well as identify the data objects in an OpenMP code that cause the problem. The framework that we have developed leverages features of the OpenMP collector API to interact with the OpenMP compiler’s runtime library and utilizes the information from hardware counters. We demonstrate the usefulness of this framework on actual applications that exhibit poor scaling because of false sharing. To show the benefit of our technique, we manually modify the identified problem code by adjusting the alignment of the data that are causing false sharing; we then compare the performance with the original version.
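Why adjusting data alignment removes false sharing can be seen with simple address arithmetic. The sketch below assumes a 64-byte cache line and per-thread counters laid out from an aligned base address; both are illustrative assumptions, and `lines_shared` is a hypothetical helper, not part of the framework described above.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 64  # assumed line size; real hardware may differ


def lines_shared(num_threads, stride_bytes, line_bytes=CACHE_LINE_BYTES):
    """Return {cache-line index: [thread ids]} for every line that holds
    the per-thread data of more than one thread (base address taken as 0)."""
    lines = defaultdict(list)
    for tid in range(num_threads):
        lines[(tid * stride_bytes) // line_bytes].append(tid)
    return {line: tids for line, tids in lines.items() if len(tids) > 1}


packed = lines_shared(8, 8)    # 8-byte counters packed back to back
padded = lines_shared(8, 64)   # each counter padded to its own line
print(packed)  # all eight threads write to line 0
print(padded)  # no line is shared
```

With packed counters every write by one thread invalidates the line for the other seven; padding each counter to the line size gives every thread a private line at the cost of extra memory.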
TL;DR: The increased number of execution units in many-core processors is driving numerous paradigm changes in parallel systems, and previous techniques are being rendered obsolete unless they can also provide results efficiently.
Abstract: The increased number of execution units in many-core processors is driving numerous paradigm changes in parallel systems. Previous techniques that focused solely upon obtaining correct results are being rendered obsolete unless they can also provide results efficiently.
TL;DR: Two algorithms are presented that improve the state-of-the-art performance for TMs that support the concurrent execution of locks and transactions and demonstrate that an algorithm’s concurrent throughput potential does not always lead to realized performance gains.
Abstract: Transactional memory (TM) is a promising alternative to mutual exclusion. In spite of this, it may be unrealistic for TM programs to be devoid of locks due to their abundant use in legacy software systems. Consequently, for TMs to be practical they may need to manage the interaction of transactions and locks when they access the same shared-memory. This paper presents two algorithms, one coarse-grained and one fine-grained, that improve the state-of-the-art performance for TMs that support the concurrent execution of locks and transactions. We also discuss the programming language constructs that are necessary to implement such algorithms and present analyses that compare and contrast our approach with prior work. Our analyses demonstrate that, (i) in general, our proposed coarse- and fine-grained algorithms improve program concurrency but (ii) an algorithm’s concurrent throughput potential does not always lead to realized performance gains.
TL;DR: This paper presents a compiler/runtime system that translates OpenMP programs into message passing variants and executes them on clusters of up to 64 processors; the translator is evaluated on representative benchmarks and its output is compared against hand-written MPI variants.
Abstract: OpenMP is an explicit parallel programming model that offers reasonable productivity. Its memory model assumes a shared address space, and hence the direct translation, as done by common OpenMP compilers, requires an underlying shared-memory architecture. Many lab machines include tens of processors, are built from commodity components, and thus have distributed address spaces. Despite many efforts to provide higher productivity for these platforms, the most common programming model uses message passing, which is substantially more tedious to program than shared-address-space models. This paper presents a compiler/runtime system that translates OpenMP programs into message passing variants and executes them on clusters of up to 64 processors. We build on previous work that provided a proof of concept of such translation. The present paper describes compiler algorithms and runtime techniques that provide the automatic translation of a first class of OpenMP applications: those that exhibit regular write array subscripts and repetitive communication. We evaluate the translator on representative benchmarks of this class and compare their performance against hand-written MPI variants. In all but one case, our translated versions perform close to the hand-written variants.
TL;DR: Parallel processing and power reduction by OSCAR compiler are efficient for multiple application executions, and in execution of multiple light computational load applications, power consumption increases only 12% for one application.
Abstract: This paper evaluates the automatic power reduction scheme of the OSCAR automatic parallelizing compiler, which has power reduction control capability, when multiple media applications parallelized by the OSCAR compiler are executed simultaneously on RP2, an 8-core multicore processor developed by Renesas Electronics, Hitachi, and Waseda University. The OSCAR compiler enables hierarchical multigrain parallel processing and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for each processor core through the OSCAR multi-platform API. The RP2 has eight SH4A processor cores, each of which has power control mechanisms such as DVFS, clock gating, and power gating. First, multiple applications with relatively light computational load are executed simultaneously on the RP2. The average power consumption of eight power-controlled AAC encoder programs, each executed on one processor, was reduced by 47% (to 1.01W) against one AAC encoder execution on one processor without power control (from 1.89W). Second, when multiple intermediate computational load applications are executed, the power consumption of an AAC encoder executed on four processors with power reduction control was reduced by 57% (to 0.84W) against an AAC encoder execution on one processor (from 1.95W). The power consumption of one MPEG2 decoder on four processors with power reduction control was reduced by 49% (to 1.01W) against one MPEG2 decoder execution on one processor (from 1.99W). Finally, when a high computational load application and an intermediate computational load application are executed simultaneously, the consumed power was reduced by 21% by using twice the number of cores for each application. This paper confirms that parallel processing and power reduction by the OSCAR compiler are efficient for multiple application executions. In the execution of multiple light computational load applications, power consumption increases by only 12% for one application. With parallel processing applied to intermediate computational load applications, the power consumption of executing one application on one processor core (1.49W) is almost the same as that of two applications on eight processor cores (1.46W).
TL;DR: The Inspector/Executor is well-known for parallelizing loops with irregular access patterns that cannot be analyzed statically, but they are often specifically designed for array codes and are in general not applicable in object oriented just-in-time compilation.
Abstract: The Inspector/Executor is well-known for parallelizing loops with irregular access patterns that cannot be analyzed statically. The downsides of existing inspectors are that it is hard to amortize their high run-time overheads by actually executing the loop in parallel, that they can only be applied to loops with dependencies that do not change during their execution and that they are often specifically designed for array codes and are in general not applicable in object oriented just-in-time compilation.
TL;DR: CellCilk is presented, a prototype implementation of Cilk for heterogeneous multicores with a host/accelerator design, using the Cell BE in particular, and introduces a new keyword (spu_spawn) for task creation on the accelerator cores.
Abstract: The potential of heterogeneous multicores, like the Cell BE, can only be exploited if the host and the accelerator cores are used in parallel and if the specific features of the cores are considered. Parallel programming, especially when applied to irregular task-parallel problems, is challenging in itself. However, heterogeneous multicores add to that complexity due to their memory hierarchy and specialized accelerators. As a solution for these issues we present CellCilk, a prototype implementation of Cilk for heterogeneous multicores with a host/accelerator design, using the Cell BE in particular. CellCilk introduces a new keyword (spu_spawn) for task creation on the accelerator cores. Task scheduling and load balancing are done by a novel dynamic cross-hierarchy work-stealing regime. Furthermore, the CellCilk runtime employs a garbage collection mechanism for distributed data structures that are created during scheduling. On benchmarks we achieve a good speedup and reasonable runtimes, even when compared to manually parallelized codes.
TL;DR: It is shown that careful dependence analysis may allow a fine-grained treatment of synchronizations and reveal redundant computation at the instruction-instance level; compared to existing translations, the new approach can yield speedups of integer factors.
Abstract: GPU-to-CPU translation may extend the execution of Graphics Processing Unit (GPU) programs to multi-/many-core CPUs, and hence enable cross-device task migration and promote whole-system synergy. This paper describes some of our findings on the treatment of GPU synchronizations during the translation process. We show that careful dependence analysis may allow a fine-grained treatment of synchronizations and reveal redundant computation at the instruction-instance level. Based on thread-level dependence graphs, we present a method to enable such fine-grained treatment automatically. Experiments demonstrate that, compared to existing translations, the new approach can yield speedups of integer factors.
TL;DR: This paper compares the performance of hardware and software synchronization schemes to barrier synchronization, and shows that either hardware or software based producer/consumer synchronization can provide applications with superior performance.
Abstract: In the early 1980s, shared memory mini-super-computers had buses and memory whose speeds were relatively fast compared to processor speeds. This led to the widespread use of various producer/consumer (post/wait) synchronization schemes for enforcing data dependences within parallel doacross loops. The rise of the “killer micro”, instruction sets optimized for serial programs, and rapidly increasing processor clock rates driven by Moore’s law, led to special purpose synchronization instructions being replaced by software barriers combined with loop fission (to allow the barriers to enforce dependences.) One cost of this approach is poorer cache behavior because variables on which a dependence exists are now accessed in separate loops. With the advent of the multicore era, producer/consumer synchronization again appears plausible. In this paper we compare the performance of hardware and software synchronization schemes to barrier synchronization, and show that either hardware or software based producer/consumer synchronization can provide applications with superior performance.
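The post/wait pattern for a doacross loop can be sketched in software as follows. This is only an illustration of the idea using Python's `threading.Event`, not the hardware or software schemes measured in the paper; the loop body and the name `doacross_running_sum` are assumptions made for the example.

```python
import threading


def doacross_running_sum(b):
    """Run a[i] = a[i-1] + b[i]**2 with one thread per iteration.
    Each iteration waits on the event of its predecessor (wait) and
    signals its own event once a[i] is written (post)."""
    n = len(b)
    a = [0] * n
    done = [threading.Event() for _ in range(n)]

    def iteration(i):
        local = b[i] * b[i]          # independent work, overlaps across threads
        if i > 0:
            done[i - 1].wait()       # wait: a[i-1] must have been produced
            a[i] = a[i - 1] + local
        else:
            a[i] = local
        done[i].set()                # post: a[i] is now available

    threads = [threading.Thread(target=iteration, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a


result = doacross_running_sum([1, 2, 3, 4])
print(result)
```

Because the dependence is enforced per element, the loop does not have to be split into a producing loop and a consuming loop separated by a barrier, so the value being passed stays in one loop body.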
TL;DR: Sluice is described, a library based method for the specification of streaming constructs in JavaScript applications, and it is shown that by taking advantage of the streaming model, this methodology can result in significant performance improvements for compute intensive workloads.
Abstract: JavaScript has long been the dominant language for client-side web development. The size and complexity of client-side JavaScript programs continues to grow and now includes applications such as games, office suites, and image editing tools traditionally developed using high performance languages. More recently, developers have been expanding the use of JavaScript with standards and implementations for server-side JavaScript. These trends are driving a need for high performance JavaScript implementations. While the performance of JavaScript implementations is improving, support for creating parallel applications that can take advantage of now ubiquitous parallel hardware remains primitive. Pipeline, data, and task parallelism are ways of breaking a program into multiple units of work that can be executed concurrently by parallel hardware. These concepts are made explicit in the stream processing model of parallelization. Using the streaming model, an algorithm is divided into a set of small independent tasks called kernels that are linked together using first-in first-out data channels. The advantage of this approach is that it allows a compiler to effectively map computations to a variety of hardware while freeing programmers from the burden of synchronizing tasks or orchestrating communication between them. In this paper we describe Sluice, a library based method for the specification of streaming constructs in JavaScript applications. While the use of such a library makes concurrency explicit, it does not easily result in parallel execution. We show, however, that by taking advantage of the streaming model, we can dynamically re-compile Sluice programs to target a high performance, multi-threaded stream processing runtime layer. The stream processing layer executes computations in a different process and the offloaded tasks communicate with the original program using fast shared memory buffers. We show that this methodology can result in significant performance improvements for compute intensive workloads.
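The streaming model described above, kernels linked by first-in first-out channels, can be sketched generically. The example below is written in Python with `queue.Queue` and threads; it does not reproduce the Sluice API, and `start_kernel`, `run_pipeline`, and the end-of-stream marker are hypothetical names chosen for illustration.

```python
import queue
import threading

_END = object()  # end-of-stream marker passed down the pipeline


def start_kernel(fn, inbox, outbox):
    """Start one kernel: read items from inbox, apply fn, write to outbox."""
    def run():
        while True:
            item = inbox.get()
            if item is _END:
                outbox.put(_END)     # forward the marker and stop
                return
            outbox.put(fn(item))
    thread = threading.Thread(target=run)
    thread.start()
    return thread


def run_pipeline(source, stages):
    """Connect the stages with FIFO queues and collect the final outputs."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [start_kernel(fn, queues[i], queues[i + 1])
               for i, fn in enumerate(stages)]
    for item in source:
        queues[0].put(item)
    queues[0].put(_END)
    results = []
    while True:
        item = queues[-1].get()
        if item is _END:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


squares_of_successors = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * x])
print(squares_of_successors)
```

The kernels never synchronize with each other directly; all coordination happens through the queues, which is what lets a runtime move kernels to other threads or processes without changing the program.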
TL;DR: This paper concentrates on optimizing the partition manager by reducing the number of long latency transactions, which is necessary to reduce the overhead of the OPELL framework.
Abstract: Limits on applications and hardware technologies have put a stop to the frequency race during the 2000s. Designs now can be divided into homogeneous and heterogeneous ones. Homogeneous types are the easiest to use since most toolchains and system software do not need too much of a rewrite. On the other end of the spectrum, there are the type two heterogeneous designs. These designs offer tremendous computational raw power, but at the cost of hardware features that might be necessary or even essential for certain types of system software and programming languages. An example of this architectural design is the Cell processor which exhibits both a heavy core and a group of simple cores designed as a computational engine. Even though the Cell processor is very well known for its accomplishments, it is also well known for its low programmability. Among many efforts to increase its programmability, there is the Open OPELL project. This framework tries to port the OpenMP programming model to the Cell architecture. The OPELL framework is composed of four components: a single source toolchain, a very light SPU kernel, a software cache and a partition / code overlay manager. To reduce the overhead, each of these components can be further optimized. This paper concentrates on optimizing the partition manager by reducing the number of long latency transactions. The contributions of this work are as follows.
TL;DR: This article considers the parallel programming model of communicating parallel tasks (CM-tasks), which allows both task-internal communication as well as communication between concurrently executed tasks at arbitrary points of their execution, and proposes a corresponding scheduling algorithm and describes how the scheduling is supported by a transformation tool.
Abstract: Task-based approaches are popular for the development of parallel programs for several reasons. They provide a decoupling of the parallel specification from the scheduling and mapping to the execution resources of a specific hardware platform, thus allowing a flexible and individual mapping. For platforms with a distributed address space, the use of parallel tasks, instead of sequential tasks, adds the additional advantage of a structuring of the program into communication domains that can help to reduce the overall communication overhead. In this article, we consider the parallel programming model of communicating parallel tasks (CM-tasks), which allows both task-internal communication as well as communication between concurrently executed tasks at arbitrary points of their execution. We propose a corresponding scheduling algorithm and describe how the scheduling is supported by a transformation tool. An experimental evaluation of several application programs shows that using the CM-task model may lead to significant performance improvements compared to other parallel execution schemes.
TL;DR: A GPU usually delivers the highest performance when it is fully utilized, that is, programs running on it are taking full advantage of all the GPU resources, so an ideal program will keep both the ALU units and the memory units busy for the duration of the runtime.
Abstract: A GPU usually delivers the highest performance when it is fully utilized, that is, programs running on it are taking full advantage of all the GPU resources. Two main types of resources on the GPU are the compute engine, i.e., the ALU units, and the data mover, i.e., the memory units. This means that an ideal program will keep both the ALU units and the memory units busy for the duration of the runtime. The vast majority of GPU applications, however, either utilize ALU units but leave memory units idle, which is called ALU bound, or use the memory units but idle ALUs, which is called memory bound, and rarely attempt to take full advantage of both at the same time.
TL;DR: An automated solution based on static construction of access summaries is presented, in which the reference non-linearity problem can be solved for a large number of reference patterns by extracting arbitrarily-shaped predicates that can validate the reference monotonicity property and thus prove loop independence.
Abstract: Array references indexed by non-linear expressions or subscript arrays represent a major obstacle to compiler analysis and to automatic parallelization. Most previous proposed solutions either enhance the static analysis repertoire to recognize more patterns, to infer array-value properties, and to refine the mathematical support, or apply expensive run time analysis of memory reference traces to disambiguate these accesses. This paper presents an automated solution based on static construction of access summaries, in which the reference non-linearity problem can be solved for a large number of reference patterns by extracting arbitrarily-shaped predicates that can (in)validate the reference monotonicity property and thus (dis)prove loop independence. Experiments on six benchmarks show that our general technique for dynamic validation of the monotonicity property can cover a large class of codes, incurs minimal run-time overhead and obtains good speedups.
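The link between monotonicity and loop independence can be shown with a much simpler runtime check than the predicates the paper extracts. The sketch below assumes a loop of the form `out[idx[i]] = vals[i]`: if the subscript array `idx` is strictly increasing, no two iterations write the same element, so the iterations can run in any order. The function names are hypothetical.

```python
def strictly_increasing(idx):
    """Sufficient (not necessary) test that idx holds no repeated subscript."""
    return all(x < y for x, y in zip(idx, idx[1:]))


def scatter(a, idx, vals, order):
    """Execute out[idx[i]] = vals[i] for i in the given iteration order."""
    out = list(a)
    for i in order:
        out[idx[i]] = vals[i]
    return out


a = [0] * 6
vals = [10, 20, 30]

mono_idx = [0, 2, 5]   # strictly increasing: iterations are independent
same = scatter(a, mono_idx, vals, range(3)) == scatter(a, mono_idx, vals, reversed(range(3)))

dup_idx = [0, 2, 2]    # repeated subscript: iteration order changes the result
differs = scatter(a, dup_idx, vals, range(3)) != scatter(a, dup_idx, vals, reversed(range(3)))

print(strictly_increasing(mono_idx), same)
print(strictly_increasing(dup_idx), differs)
```

A runtime could evaluate such a predicate once before the loop and choose between a parallel and a sequential version of it.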
TL;DR: An abstraction to alleviate the difficulty of programming with threads is proposed, which makes available a virtual time in which events in different program time-lines are sequentialized.
Abstract: We propose an abstraction to alleviate the difficulty of programming with threads. This abstraction is not directly usable by application programmers. Instead, application-visible behavior is defined through a semantical plugin, and invoked via a language or library that uses the plugin. The main benefit is that parallel language runtimes become simpler to implement, because they use sequential algorithms for the parallel semantics. This is possible because the abstraction makes available a virtual time in which events in different program time-lines are sequentialized. The parallel semantics relate events in different time-lines via relating the sequentialized versions within the virtual time-line.