Conference
Languages and Compilers for Parallel Computing
About: Languages and Compilers for Parallel Computing is an academic conference. The conference publishes majorly in the area(s): Compiler & Optimizing compiler. Over the lifetime, 786 publications have been published by the conference receiving 14007 citations.
Papers published on a yearly basis
Papers
28 Nov 2008
TL;DR: A framework called MCUDA is described, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs and argues that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
Abstract: CUDA is a data parallel programming model that supports several key abstractions - thread blocks, hierarchical memory and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs. Our framework consists of a set of source-level compiler transformations and a runtime system for parallel execution. Preserving program semantics, the compiler transforms threaded SPMD functions into explicit loops, performs fission to eliminate barrier synchronizations, and converts scalar references to thread-local data to replicated vector references. We describe an implementation of this framework and demonstrate performance approaching that achievable from manually parallelized and optimized C code. With these results, we argue that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
240 citations
7 Aug 1991
TL;DR: It is shown how to estimate efficiently the number of distinct cache lines used by a given loop in a nest of loops, and this estimate can be used to guide program transformations such as loop interchange to achieve greater cache effectiveness.
Abstract: In this paper, we consider automatic analysis of a program's cache usage to achieve greater cache effectiveness. We show how to estimate efficiently the number of distinct cache lines used by a given loop in a nest of loops. Given this estimate of the number of cache lines needed, we can estimate the number of cache misses for a nest of loops. Our estimates can be used to guide program transformations such as loop interchange to achieve greater cache effectiveness. We present simulation results that show our estimates are reasonable for simple cases such as matrix multiply. We analyze the array sizes for which our estimates differ from our simulation results, and provide recommendations on how to handle such arrays in practice.
190 citations
2 Nov 2006
TL;DR: An unbalanced tree search benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing, and creates versions of UTS in two parallel languages, OpenMP and Unified Parallel C, using work stealing as the mechanism for reducing load imbalance.
Abstract: This paper presents an unbalanced tree search (UTS) benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing. We describe algorithms for building a variety of unbalanced search trees to simulate different forms of load imbalance. We created versions of UTS in two parallel languages, OpenMP and Unified Parallel C (UPC), using work stealing as the mechanism for reducing load imbalance. We benchmarked the performance of UTS on various parallel architectures, including shared-memory systems and PC clusters. We found it simple to implement UTS in both UPC and OpenMP, due to UPC's shared-memory abstractions. Results show that both UPC and OpenMP can support efficient dynamic load balancing on shared-memory architectures. However, UPC cannot alleviate the underlying communication costs of distributed-memory systems. Since dynamic load balancing requires intensive communication, performance portability remains difficult for applications such as UTS and performance degrades on PC clusters. By varying key work stealing parameters, we expose important tradeoffs between the granularity of load balance, the degree of parallelism, and communication costs.
185 citations
1 Aug 2001
TL;DR: This work presents results obtained using STAPL for a molecular dynamics code and a particle transport code, and presents functionality to allow the user to further optimize the code and achieve additional performance gains.
Abstract: The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ANSI C++ Standard Template Library (STL). It is sequentially consistent for functions with the same name, and executes on uni- or multi-processor systems that utilize shared or distributed memory. STAPL is implemented using simple parallel extensions of C++ that currently provide a SPMD model of parallelism, and supports nested parallelism. The library is intended to be general purpose, but emphasizes irregular programs to allow the exploitation of parallelism for applications which use dynamically linked data structures such as particle transport calculations, molecular dynamics, geometric modeling, and graph algorithms. STAPL provides several different algorithms for some library routines, and selects among them adaptively at runtime. STAPL can replace STL automatically by invoking a preprocessing translation phase. In the applications studied, the performance of translated code was within 5% of the results obtained using STAPL directly. STAPL also provides functionality to allow the user to further optimize the code and achieve additional performance gains. We present results obtained using STAPL for a molecular dynamics code and a particle transport code.
163 citations
12 Aug 1993
TL;DR: In this article, a technique for automatic array privatization is presented, which uses data flow analysis of array references to identify privatizable arrays intraprocedurally as well as interprocedural.
Abstract: Array privatization is one of the most effective transformations for the exploitation of parallelism. In this paper, we present a technique for automatic array privatization. Our algorithm uses data flow analysis of array references to identify privatizable arrays intraprocedurally as well as interprocedurally. It employs static and dynamic resolution to determine the last value of a lived private array. We compare the result of automatic array privatization with that of manual array privatization and identify directions for future improvement. To enhance the effectiveness of our algorithm, we develop a goal directly technique to analysis symbolic variables in the present of conditional statements, loops and index arrays.
161 citations
Performance Metrics
| Year | Papers |
|---|---|
| 2019 | 11 |
| 2018 | 13 |
| 2017 | 18 |
| 2016 | 22 |
| 2015 | 19 |
| 2014 | 25 |