Showing papers presented at "Languages and Compilers for Parallel Computing in 2011"
Book Chapter•10.1007/978-3-642-36036-7_14•
Dynamic Task Parallelism with a GPU Work-Stealing Runtime System

Sanjay Chatterjee1, Max Grossman1, Alina Sbirlea1, Vivek Sarkar1•
Rice University1
8 Sep 2011
TL;DR: CUDA-enabled GPUs are well suited to data-parallel applications, but CUDA does not provide a viable interface for creating dynamic tasks and handling load balancing, so any such support must be orchestrated entirely by the programmer.
Abstract: NVIDIA’s Compute Unified Device Architecture (CUDA) has made GPUs accessible to mainstream programming. An abundance of simple computational cores and high memory bandwidth make GPUs ideal candidates for data-parallel applications. However, their potential for executing applications that combine task and data parallelism has not been explored in detail. CUDA does not provide a viable interface for creating dynamic tasks and handling load balancing issues. Any support for these has to be orchestrated entirely by the CUDA programmer today.

42 citations

Book Chapter•10.1007/978-3-642-36036-7_16•
Static Compilation Analysis for Host-Accelerator Communication Optimization

Mehdi Amini1, Fabien Coelho1, François Irigoin1, Ronan Keryell•
Mines ParisTech1
8 Sep 2011
TL;DR: This paper presents an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck.
Abstract: We present an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the PIPS/Par4All compiler. In the generated code, redundant communications due to data reuse between kernel executions are avoided. Instructions that initiate transfers are scheduled effectively at compile time. We present experimental results obtained with Polybench 2.0, some Rodinia benchmarks, and a real numerical simulation. We obtain an average speedup of 4 to 5 when compared to a naive parallelization using a modern GPU with Par4All, HMPP, and PGI, and 3.5 when compared to an OpenMP version using a 12-core multiprocessor.

37 citations
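
The two heuristics in the abstract above (send data to the accelerator as early as possible, bring it back as late as possible) can be illustrated with a toy transfer counter. This is only a sketch under stated assumptions, not the PIPS/Par4All pass: the kernel list, the array names, and both helper functions are hypothetical, and the host is assumed not to touch any array between kernel launches.

```python
# Toy model: each kernel reads some arrays and writes some arrays.
# All names are hypothetical; this is not the PIPS/Par4All implementation.
KERNELS = [
    {"reads": {"a", "b"}, "writes": {"c"}},
    {"reads": {"c"}, "writes": {"d"}},
    {"reads": {"a", "d"}, "writes": {"d"}},
]


def naive_transfers(kernels):
    """Copy every input in before each kernel and every output back after it."""
    return sum(len(k["reads"]) + len(k["writes"]) for k in kernels)


def lazy_transfers(kernels, live_out):
    """Copy an array in only when no device copy exists yet, and copy results
    back once, after the last kernel, for arrays the host still needs."""
    on_device = set()
    written = set()
    copies_in = 0
    for k in kernels:
        copies_in += len(k["reads"] - on_device)  # only the missing inputs
        on_device |= k["reads"] | k["writes"]
        written |= k["writes"]
    copies_out = len(written & live_out)          # delayed copy-back
    return copies_in + copies_out


naive = naive_transfers(KERNELS)
lazy = lazy_transfers(KERNELS, live_out={"d"})
print(naive, lazy)
```

In this toy setting the intermediate array `c` never travels back to the host and `a` is sent only once, which is the kind of redundant communication the abstract says the generated code avoids.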

Book Chapter•10.1007/978-3-642-36036-7_11•
Lock-Free Resizeable Concurrent Tries

Aleksandar Prokopec1, Phil Bagwell1, Martin Odersky1•
École Polytechnique Fédérale de Lausanne1
8 Sep 2011
TL;DR: This paper describes an implementation of a non-blocking concurrent hash trie based on single-word compare-and-swap instructions in a shared-memory system and shows that the implementation is linearizable and lock-free.
Abstract: This paper describes an implementation of a non-blocking concurrent hash trie based on single-word compare-and-swap instructions in a shared-memory system. Insert, lookup and remove operations modifying different parts of the hash trie can be run completely independently. Remove operations ensure that the unneeded memory is freed and that the trie is kept compact. A pseudocode for these operations is presented and a proof of correctness is given – we show that the implementation is linearizable and lock-free. Finally, benchmarks are presented that compare concurrent hash trie operations against the corresponding operations on other concurrent data structures.

26 citations

Book Chapter•10.1007/978-3-642-36036-7_6•
OpenCL as a Programming Model for GPU Clusters

Jung-Won Kim1, Sangmin Seo1, Jun Lee1, Jeongho Nah1, Gangwon Jo1, Jaejin Lee1 •
Seoul National University1
8 Sep 2011
TL;DR: This paper proposes and implements an OpenCL framework for GPU clusters, showing that the original OpenCL semantics naturally fit the GPU cluster environment and that the framework achieves both high performance and ease of programming.
Abstract: In this paper, we propose an OpenCL framework for GPU clusters. The target cluster architecture consists of a single host node and multiple compute nodes. They are connected by an interconnection network, such as Gigabit and InfiniBand switches. Each compute node consists of multiple GPUs. Each GPU becomes an OpenCL compute device. The host node executes the host program in an OpenCL application. Our OpenCL framework provides an illusion of a single system for the user. It allows the application to utilize GPUs in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. We show that the original OpenCL semantics naturally fits to the GPU cluster environment, and the framework achieves both high performance and ease of programming. We implement the OpenCL framework and evaluate its performance on a GPU cluster that consists of one host and eight compute nodes using six OpenCL benchmark applications.

16 citations

Book Chapter•10.1007/978-3-642-36036-7_19•
Detecting False Sharing in OpenMP Applications Using the DARWIN Framework

Besar Wicaksono1, Munara Tolubaeva1, Barbara Chapman1•
University of Houston1
8 Sep 2011
TL;DR: A dynamic framework is introduced to help application developers detect instances of false sharing as well as identify the data objects in an OpenMP code that cause the problem.
Abstract: Writing a parallel shared memory application that achieves good performance and scales well as the number of threads increases can be challenging. One of the reasons is that as threads proliferate, the contention among shared resources increases and this may cause performance degradation. In particular, multi-threaded applications can suffer from the false sharing problem, which can degrade the performance of an application significantly. The work in this paper focuses on detecting performance bottlenecks caused by false sharing in OpenMP applications. We introduce a dynamic framework to help application developers detect instances of false sharing as well as identify the data objects in an OpenMP code that cause the problem. The framework that we have developed leverages features of the OpenMP collector API to interact with the OpenMP compiler’s runtime library and utilizes the information from hardware counters. We demonstrate the usefulness of this framework on actual applications that exhibit poor scaling because of false sharing. To show the benefit of our technique, we manually modify the identified problem code by adjusting the alignment of the data that are causing false sharing; we then compare the performance with the original version.

13 citations
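
The alignment fix mentioned at the end of the abstract above can be pictured with a small layout calculation. This is a sketch of the data layout only, not of how the DARWIN framework detects false sharing: the 64-byte line size, the per-thread counter scenario, and the helper names are all assumptions made for illustration.

```python
CACHE_LINE = 64  # bytes; an assumed, typical cache line size


def lines_touched(num_threads, stride):
    """Cache line index of each thread's counter, assuming the array
    starts on a line boundary and thread t owns the element at t * stride."""
    return [(t * stride) // CACHE_LINE for t in range(num_threads)]


def shares_line(lines):
    """True if two or more threads map to the same cache line."""
    return len(set(lines)) < len(lines)


packed = lines_touched(4, stride=8)           # 8-byte counters, back to back
padded = lines_touched(4, stride=CACHE_LINE)  # one counter per cache line

print(packed, shares_line(packed))
print(padded, shares_line(padded))
```

With packed counters all four threads write to the same line, so each write can invalidate the other threads' cached copies; padding each counter to its own line removes that interaction at the cost of extra memory.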

Book Chapter•10.1007/978-3-642-36036-7_18•
Polytasks: A Compressed Task Representation for HPC Runtimes

Daniel Orozco1, Elkin Garcia1, Robert Pavel1, Rishi Khan, Guang R. Gao1 •
University of Delaware1
8 Sep 2011
TL;DR: The increased number of execution units in many-core processors is driving numerous paradigm changes in parallel systems, and previous techniques are being rendered obsolete unless they can also provide results efficiently.
Abstract: The increased number of execution units in many-core processors is driving numerous paradigm changes in parallel systems. Previous techniques that focused solely upon obtaining correct results are being rendered obsolete unless they can also provide results efficiently.

10 citations

Book Chapter•10.1007/978-3-642-36036-7_9•
Optimizing the Concurrent Execution of Locks and Transactions

Justin Gottschlich1, JaeWoong Chung1•
Intel1
8 Sep 2011
TL;DR: Two algorithms are presented that improve the state-of-the-art performance for TMs that support the concurrent execution of locks and transactions and demonstrate that an algorithm’s concurrent throughput potential does not always lead to realized performance gains.
Abstract: Transactional memory (TM) is a promising alternative to mutual exclusion. In spite of this, it may be unrealistic for TM programs to be devoid of locks due to their abundant use in legacy software systems. Consequently, for TMs to be practical they may need to manage the interaction of transactions and locks when they access the same shared-memory. This paper presents two algorithms, one coarse-grained and one fine-grained, that improve the state-of-the-art performance for TMs that support the concurrent execution of locks and transactions. We also discuss the programming language constructs that are necessary to implement such algorithms and present analyses that compare and contrast our approach with prior work. Our analyses demonstrate that, (i) in general, our proposed coarse- and fine-grained algorithms improve program concurrency but (ii) an algorithm’s concurrent throughput potential does not always lead to realized performance gains.

9 citations

Book Chapter•10.1007/978-3-642-36036-7_1•
Automatic Scaling of OpenMP Beyond Shared Memory

Okwan Kwon1, Fahed Jubair1, Seung-Jai Min2, Hansang Bae1, Rudolf Eigenmann1, Samuel P. Midkiff1 •
Purdue University1, Lawrence Berkeley National Laboratory2
8 Sep 2011
TL;DR: This paper presents a compiler/runtime system that translates OpenMP programs into message passing variants and executes them on clusters up to 64 processors and evaluates the translator on representative benchmarks of this class and compares their performance against hand-written MPI variants.
Abstract: OpenMP is an explicit parallel programming model that offers reasonable productivity. Its memory model assumes a shared address space, and hence the direct translation - as done by common OpenMP compilers - requires an underlying shared-memory architecture. Many lab machines include 10s of processors, built from commodity components, and thus include distributed address spaces. Despite many efforts to provide higher productivity for these platforms, the most common programming model uses message passing, which is substantially more tedious to program than shared-address-space models. This paper presents a compiler/runtime system that translates OpenMP programs into message passing variants and executes them on clusters up to 64 processors. We build on previous work that provided a proof of concept of such translation. The present paper describes compiler algorithms and runtime techniques that provide the automatic translation of a first class of OpenMP applications: those that exhibit regular write array subscripts and repetitive communication. We evaluate the translator on representative benchmarks of this class and compare their performance against hand-written MPI variants. In all but one case, our translated versions perform close to the hand-written variants.

8 citations

Book Chapter•10.1007/978-3-642-36036-7_3•
Evaluation of power consumption at execution of multiple automatically parallelized and power controlled media applications on the RP2 low-power multicore

Hiroki Mikami1, Shumpei Kitaki1, Masayoshi Mase1, Akihiro Hayashi1, Mamoru Shimaoka1, Keiji Kimura1, Masato Edahiro1, Hironori Kasahara1 •
Waseda University1
8 Sep 2011
TL;DR: Parallel processing and power reduction by OSCAR compiler are efficient for multiple application executions, and in execution of multiple light computational load applications, power consumption increases only 12% for one application.
Abstract: This paper evaluates the automatic power reduction scheme of the OSCAR automatic parallelizing compiler, which has power reduction control capability, when multiple media applications parallelized by the OSCAR compiler are executed simultaneously on RP2, an 8-core multicore processor developed by Renesas Electronics, Hitachi, and Waseda University. The OSCAR compiler enables hierarchical multigrain parallel processing and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating and power gating for each processor core using the OSCAR multi-platform API. The RP2 has eight SH4A processor cores, each of which has power control mechanisms such as DVFS, clock gating and power gating. First, multiple applications with relatively light computational load are executed simultaneously on the RP2. The average power consumption of eight power-controlled AAC encoder programs, each of which was executed on one processor, was reduced by 47% (to 1.01W) against one AAC encoder execution on one processor without power control (from 1.89W). Second, when multiple intermediate computational load applications are executed, the power consumption of an AAC encoder executed on four processors with the power reduction control was reduced by 57% (to 0.84W) against an AAC encoder execution on one processor (from 1.95W). The power consumption of one MPEG2 decoder on four processors with power reduction control was reduced by 49% (to 1.01W) against one MPEG2 decoder execution on one processor (from 1.99W). Finally, when a high computational load application program and an intermediate computational load application program are executed simultaneously, the consumed power is reduced by 21% by using twice the number of cores for each application. This paper confirms that parallel processing and power reduction by the OSCAR compiler are efficient for multiple application executions. In execution of multiple light computational load applications, power consumption increases only 12% for one application. With parallel processing applied to intermediate computational load applications, the power consumption of executing one application on one processor core (1.49W) is almost the same as the power consumption of two applications on eight processor cores (1.46W).

7 citations
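
The percentage reductions quoted in the abstract above follow directly from the stated wattages. The short check below recomputes them; the `reduction_percent` helper and the case labels are illustrative names, and only the wattage pairs come from the abstract.

```python
def reduction_percent(before_w, after_w):
    """Relative power reduction in percent, rounded to the nearest integer."""
    return round(100 * (before_w - after_w) / before_w)


# (power without the scheme in W, power with the scheme in W), from the abstract
cases = {
    "AAC encoder, light load": (1.89, 1.01),
    "AAC encoder on four processors": (1.95, 0.84),
    "MPEG2 decoder on four processors": (1.99, 1.01),
}

reductions = {name: reduction_percent(b, a) for name, (b, a) in cases.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}% reduction")
```

The three results match the 47%, 57% and 49% figures reported in the abstract.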

Book Chapter•10.1007/978-3-642-36036-7_4•
Double Inspection for Run-Time Loop Parallelization

Michael Philippsen1, Nikolai Tillmann2, Daniel Brinkers1•
University of Erlangen-Nuremberg1, Microsoft2
8 Sep 2011
TL;DR: The Inspector/Executor technique is well known for parallelizing loops with irregular access patterns that cannot be analyzed statically, but existing inspectors are often designed specifically for array codes and are generally not applicable in object-oriented just-in-time compilation.
Abstract: The Inspector/Executor is well-known for parallelizing loops with irregular access patterns that cannot be analyzed statically. The downsides of existing inspectors are that it is hard to amortize their high run-time overheads by actually executing the loop in parallel, that they can only be applied to loops with dependencies that do not change during their execution, and that they are often specifically designed for array codes and are in general not applicable in object-oriented just-in-time compilation.

5 citations

Book Chapter•10.1007/978-3-642-36036-7_7•
CellCilk: Extending Cilk for Heterogeneous Multicore Platforms

Tobias Werth1, Silvia Schreier2, Michael Philippsen1•
University of Erlangen-Nuremberg1, FernUniversität Hagen2
8 Sep 2011
TL;DR: CellCilk is presented, a prototype implementation of Cilk for heterogeneous multicores with a host/accelerator design, using the Cell BE in particular, and introduces a new keyword (spu_spawn) for task creation on the accelerator cores.
Abstract: The potential of heterogeneous multicores, like the Cell BE, can only be exploited if the host and the accelerator cores are used in parallel and if the specific features of the cores are considered. Parallel programming, especially when applied to irregular task-parallel problems, is challenging itself. However, heterogeneous multicores add to that complexity due to their memory hierarchy and specialized accelerators. As a solution for these issues we present CellCilk, a prototype implementation of Cilk for heterogeneous multicores with a host/accelerator design, using the Cell BE in particular. CellCilk introduces a new keyword (spu_spawn) for task creation on the accelerator cores. Task scheduling and load balancing are done by a novel dynamic cross-hierarchy work-stealing regime. Furthermore, the CellCilk runtime employs a garbage collection mechanism for distributed data structures that are created during scheduling. On benchmarks we achieve a good speedup and reasonable runtimes, even when compared to manually parallelized codes.
Book Chapter•10.1007/978-3-642-36036-7_12•
Fine-Grained Treatment to Synchronizations in GPU-to-CPU Translation

Ziyu Guo1, Xipeng Shen1•
College of William & Mary1
8 Sep 2011
TL;DR: This paper shows that careful dependence analysis may allow a fine-grained treatment of synchronizations and reveal redundant computation at the instruction-instance level; compared to existing translations, the new approach can yield integer-factor speedups.
Abstract: GPU-to-CPU translation may extend the execution of Graphics Processing Unit (GPU) programs to multi-/many-core CPUs, and hence enable cross-device task migration and promote whole-system synergy. This paper describes some of our findings on the treatment of GPU synchronizations during the translation process. We show that careful dependence analysis may allow a fine-grained treatment of synchronizations and reveal redundant computation at the instruction-instance level. Based on thread-level dependence graphs, we present a method to enable such fine-grained treatment automatically. Experiments demonstrate that, compared to existing translations, the new approach can yield integer-factor speedups.
Book Chapter•10.1007/978-3-642-36036-7_10•
A Study of the Usefulness of Producer/Consumer Synchronization

Hao Lin1, Hansang Bae1, Samuel P. Midkiff1, Rudolf Eigenmann1, Soohong P. Kim1 •
Purdue University1
8 Sep 2011
TL;DR: This paper compares the performance of hardware and software synchronization schemes to barrier synchronization, and shows that either hardware or software based producer/consumer synchronization can provide applications with superior performance.
Abstract: In the early 1980s, shared memory mini-super-computers had buses and memory whose speeds were relatively fast compared to processor speeds. This led to the widespread use of various producer/consumer (post/wait) synchronization schemes for enforcing data dependences within parallel doacross loops. The rise of the “killer micro”, instruction sets optimized for serial programs, and rapidly increasing processor clock rates driven by Moore’s law led to special purpose synchronization instructions being replaced by software barriers combined with loop fission (to allow the barriers to enforce dependences). One cost of this approach is poorer cache behavior because variables on which a dependence exists are now accessed in separate loops. With the advent of the multicore era, producer/consumer synchronization again appears plausible. In this paper we compare the performance of hardware and software synchronization schemes to barrier synchronization, and show that either hardware or software based producer/consumer synchronization can provide applications with superior performance.
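
The post/wait scheme described in the abstract above can be sketched in software with one event per loop iteration. This is a minimal illustration, not one of the schemes evaluated in the paper: the loop body, the `doacross` function, and the use of Python's `threading.Event` as the post/wait primitive are assumptions chosen for the example.

```python
import threading


def doacross(b):
    """Compute a[i] = a[i-1] + b[i]**2 with one thread per iteration.

    Iteration i waits until iteration i-1 has posted before reading a[i-1],
    so the loop-carried dependence is enforced without a barrier.
    """
    n = len(b)
    a = [0] * n
    posted = [threading.Event() for _ in range(n)]

    def body(i):
        independent = b[i] * b[i]   # work that needs no synchronization
        if i > 0:
            posted[i - 1].wait()    # wait: a[i-1] must be produced first
        a[i] = (a[i - 1] if i > 0 else 0) + independent
        posted[i].set()             # post: a[i] is now available

    threads = [threading.Thread(target=body, args=(i,)) for i in range(n)]
    for t in reversed(threads):     # start order does not affect the result
        t.start()
    for t in threads:
        t.join()
    return a


result = doacross([1, 2, 3, 4])
print(result)
```

Because each iteration synchronizes only with its predecessor, the independent part of every iteration can overlap with the others, and the dependent variable is produced and consumed within a single loop rather than in separate loops split by a barrier.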
Book Chapter•10.1007/978-3-642-36036-7_2•
A Methodology for Fine-Grained Parallelism in JavaScript Applications

Jeffrey Fifield1, Dirk Grunwald1•
University of Colorado Boulder1
8 Sep 2011
TL;DR: Sluice is described, a library based method for the specification of streaming constructs in JavaScript applications, and it is shown that by taking advantage of the streaming model, this methodology can result in significant performance improvements for compute intensive workloads.
Abstract: JavaScript has long been the dominant language for client-side web development. The size and complexity of client-side JavaScript programs continues to grow and now includes applications such as games, office suites, and image editing tools traditionally developed using high performance languages. More recently, developers have been expanding the use of JavaScript with standards and implementations for server-side JavaScript. These trends are driving a need for high performance JavaScript implementations. While the performance of JavaScript implementations is improving, support for creating parallel applications that can take advantage of now ubiquitous parallel hardware remains primitive. Pipeline, data, and task parallelism are ways of breaking a program into multiple units of work that can be executed concurrently by parallel hardware. These concepts are made explicit in the stream processing model of parallelization. Using the streaming model, an algorithm is divided into a set of small independent tasks called kernels that are linked together using first-in first-out data channels. The advantage of this approach is that it allows a compiler to effectively map computations to a variety of hardware while freeing programmers from the burden of synchronizing tasks or orchestrating communication between them. In this paper we describe Sluice, a library based method for the specification of streaming constructs in JavaScript applications. While the use of such a library makes concurrency explicit, it does not easily result in parallel execution. We show, however, that by taking advantage of the streaming model, we can dynamically re-compile Sluice programs to target a high performance, multi-threaded stream processing runtime layer. The stream processing layer executes computations in a different process and the offloaded tasks communicate with the original program using fast shared memory buffers. We show that this methodology can result in significant performance improvements for compute intensive workloads.
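
The streaming model described in the abstract above (kernels linked by first-in first-out channels) can be sketched as follows. This is not the Sluice API, which is a JavaScript library; the `kernel` and `run_pipeline` functions and the `DONE` sentinel are hypothetical names, and Python threads with `queue.Queue` stand in for the kernels and channels.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def kernel(fn, inbox, outbox):
    """Apply fn to every item from inbox and push the result to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate end-of-stream downstream
            return
        outbox.put(fn(item))


def run_pipeline(stages, items):
    """Connect the stage functions with FIFO channels and stream items through."""
    channels = [queue.Queue() for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=kernel, args=(fn, channels[i], channels[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for w in workers:
        w.start()
    for item in items:
        channels[0].put(item)
    channels[0].put(DONE)

    out = []
    while True:
        item = channels[-1].get()
        if item is DONE:
            break
        out.append(item)
    for w in workers:
        w.join()
    return out


output = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
print(output)
```

Each kernel only reads from its input channel and writes to its output channel, so the stage functions contain no explicit synchronization; the channels carry all communication, which is the property the abstract credits for letting a runtime remap the kernels onto parallel hardware.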
Book Chapter•10.1007/978-3-642-36036-7_8•
OPELL and PM: A Case Study on Porting Shared Memory Programming Models to Accelerators Architectures

Joseph Manzano1, Ge Gan1, Juergen Ributzka1, Sunil Shrestha1, Guang R. Gao1 •
University of Delaware1
8 Sep 2011
TL;DR: This paper concentrates on optimizing the partition manager by reducing the number of long latency transactions, which is necessary to reduce the overhead of the OPELL framework.
Abstract: Limits on applications and hardware technologies have put a stop to the frequency race during the 2000s. Designs now can be divided into homogeneous and heterogeneous ones. Homogeneous types are the easiest to use since most toolchains and system software do not need too much of a rewrite. On the other end of the spectrum, there are the type two heterogeneous designs. These designs offer tremendous computational raw power, but at the cost of hardware features that might be necessary or even essential for certain types of system software and programming languages. An example of this architectural design is the Cell processor which exhibits both a heavy core and a group of simple cores designed as a computational engine. Even though the Cell processor is very well known for its accomplishments, it is also well known for its low programmability. Among many efforts to increase its programmability, there is the Open OPELL project. This framework tries to port the OpenMP programming model to the Cell architecture. The OPELL framework is composed of four components: a single source toolchain, a very light SPU kernel, a software cache and a partition / code overlay manager. To reduce the overhead, each of these components can be further optimized. This paper concentrates on optimizing the partition manager by reducing the number of long latency transactions. The contributions of this work are as follows.
Book Chapter•10.1007/978-3-642-36036-7_17•
Scheduling Support for Communicating Parallel Tasks

Jörg Dümmler1, Thomas Rauber2, Gudula Rünger1•
Chemnitz University of Technology1, University of Bayreuth2
8 Sep 2011
TL;DR: This article considers the parallel programming model of communicating parallel tasks (CM-tasks), which allows both task-internal communication as well as communication between concurrently executed tasks at arbitrary points of their execution, and proposes a corresponding scheduling algorithm and describes how the scheduling is supported by a transformation tool.
Abstract: Task-based approaches are popular for the development of parallel programs for several reasons. They provide a decoupling of the parallel specification from the scheduling and mapping to the execution resources of a specific hardware platform, thus allowing a flexible and individual mapping. For platforms with a distributed address space, the use of parallel tasks, instead of sequential tasks, adds the additional advantage of a structuring of the program into communication domains that can help to reduce the overall communication overhead. In this article, we consider the parallel programming model of communicating parallel tasks (CM-tasks), which allows both task-internal communication as well as communication between concurrently executed tasks at arbitrary points of their execution. We propose a corresponding scheduling algorithm and describe how the scheduling is supported by a transformation tool. An experimental evaluation of several application programs shows that using the CM-task model may lead to significant performance improvements compared to other parallel execution schemes.
Book Chapter•10.1007/978-3-642-36036-7_15•
A Code Merging Optimization Technique for GPU

Ryan Taylor1, Xiaoming Li1•
University of Delaware1
8 Sep 2011
TL;DR: A GPU usually delivers the highest performance when it is fully utilized, that is, programs running on it are taking full advantage of all the GPU resources, so an ideal program will keep both the ALU units and the memory units busy for the duration of the runtime.
Abstract: A GPU usually delivers the highest performance when it is fully utilized, that is, programs running on it are taking full advantage of all the GPU resources. Two main types of resources on the GPU are the compute engine, i.e., the ALU units, and the data mover, i.e., the memory units. This means that an ideal program will keep both the ALU units and the memory units busy for the duration of the runtime. The vast majority of GPU applications, however, either utilize ALU units but leave memory units idle, which is called ALU bound, or use the memory units but idle ALUs, which is called memory bound, and rarely attempt to take full advantage of both at the same time.
Book Chapter•10.1007/978-3-642-36036-7_5•
A Hybrid Approach to Proving Memory Reference Monotonicity

Cosmin E. Oancea1, Lawrence Rauchwerger1•
Texas A&M University1
8 Sep 2011
TL;DR: An automated solution based on static construction of access summaries is presented, in which the reference non-linearity problem can be solved for a large number of reference patterns by extracting arbitrarily-shaped predicates that can validate the reference monotonicity property and thus prove loop independence.
Abstract: Array references indexed by non-linear expressions or subscript arrays represent a major obstacle to compiler analysis and to automatic parallelization. Most previous proposed solutions either enhance the static analysis repertoire to recognize more patterns, to infer array-value properties, and to refine the mathematical support, or apply expensive run time analysis of memory reference traces to disambiguate these accesses. This paper presents an automated solution based on static construction of access summaries, in which the reference non-linearity problem can be solved for a large number of reference patterns by extracting arbitrarily-shaped predicates that can (in)validate the reference monotonicity property and thus (dis)prove loop independence. Experiments on six benchmarks show that our general technique for dynamic validation of the monotonicity property can cover a large class of codes, incurs minimal run-time overhead and obtains good speedups.
Book Chapter•10.1007/978-3-642-36036-7_13•
A Mutable Hardware Abstraction to Replace Threads

Sean Halle1,2,3, Albert Cohen2•
Technical University of Berlin1, École Normale Supérieure2, University of California, Santa Cruz3
8 Sep 2011
TL;DR: An abstraction to alleviate the difficulty of programming with threads is proposed, which makes available a virtual time in which events in different program time-lines are sequentialized.
Abstract: We propose an abstraction to alleviate the difficulty of programming with threads. This abstraction is not directly usable by application programmers. Instead, application-visible behavior is defined through a semantical plugin, and invoked via a language or library that uses the plugin. The main benefit is that parallel language runtimes become simpler to implement, because they use sequential algorithms for the parallel semantics. This is possible because the abstraction makes available a virtual time in which events in different program time-lines are sequentialized. The parallel semantics relate events in different time-lines via relating the sequentialized versions within the virtual time-line.
