Languages and Compilers for Parallel Computing

Conference Tools

Papers published on a yearly basis

Papers

Book Chapter•10.1007/978-3-540-89740-8_2•

MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

[...]

John A. Stratton¹, Sam S. Stone¹, Wen-mei W. Hwu¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

28 Nov 2008

TL;DR: Describes MCUDA, a framework that allows CUDA programs to be executed efficiently on shared-memory, multi-core CPUs, and argues that CUDA can be an effective data-parallel programming model for more than just GPU architectures.


Abstract: CUDA is a data parallel programming model that supports several key abstractions - thread blocks, hierarchical memory and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs. Our framework consists of a set of source-level compiler transformations and a runtime system for parallel execution. Preserving program semantics, the compiler transforms threaded SPMD functions into explicit loops, performs fission to eliminate barrier synchronizations, and converts scalar references to thread-local data to replicated vector references. We describe an implementation of this framework and demonstrate performance approaching that achievable from manually parallelized and optimized C code. With these results, we argue that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
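The three transformations named in the abstract (thread loops, fission at barriers, and replication of thread-local scalars) can be illustrated with a minimal sketch. This is not MCUDA's output or API: the function name `run_block_serialized`, the toy kernel body, and the use of Python lists in place of C arrays are all assumptions made for illustration.

```python
# Hypothetical illustration; none of these names come from MCUDA itself.

def run_block_serialized(inp):
    """Run one logical thread block on a single CPU thread.

    Conceptual CUDA-style kernel body, as seen by each thread `tid`:
        v = inp[tid] * 2                      # thread-local scalar
        shared[tid] = v
        barrier()
        out[tid] = v + shared[(tid + 1) % n]
    """
    n = len(inp)        # block size: number of logical threads
    shared = [0] * n    # per-block shared memory
    v = [0] * n         # thread-local scalar, replicated once per thread
    out = [0] * n

    # Thread loop 1: all statements before the barrier.
    for tid in range(n):
        v[tid] = inp[tid] * 2
        shared[tid] = v[tid]

    # Loop fission: the boundary between the two loops plays the role of
    # the barrier, since every logical thread has finished loop 1 here.

    # Thread loop 2: all statements after the barrier.
    for tid in range(n):
        out[tid] = v[tid] + shared[(tid + 1) % n]
    return out


print(run_block_serialized([1, 2, 3, 4]))  # prints [6, 10, 14, 10]
```

The scalar `v` has to become an array because its value is produced before the barrier and consumed after it, so one value per logical thread must survive across the two loops.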


240 citations

Book Chapter•10.1007/BFB0038674•

On Estimating and Enhancing Cache Effectiveness

[...]

Jeanne Ferrante, Vivek Sarkar¹, W. Thrash²•Institutions (2)

IBM¹, University of Washington²

7 Aug 1991

TL;DR: It is shown how to estimate efficiently the number of distinct cache lines used by a given loop in a nest of loops, and this estimate can be used to guide program transformations such as loop interchange to achieve greater cache effectiveness.


Abstract: In this paper, we consider automatic analysis of a program's cache usage to achieve greater cache effectiveness. We show how to estimate efficiently the number of distinct cache lines used by a given loop in a nest of loops. Given this estimate of the number of cache lines needed, we can estimate the number of cache misses for a nest of loops. Our estimates can be used to guide program transformations such as loop interchange to achieve greater cache effectiveness. We present simulation results that show our estimates are reasonable for simple cases such as matrix multiply. We analyze the array sizes for which our estimates differ from our simulation results, and provide recommendations on how to handle such arrays in practice.
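To make the quantity being estimated concrete, the sketch below counts distinct cache lines by brute-force enumeration for the two loop orders of a row-major array traversal. It is not the paper's estimation method, which is analytic. The function names, the 64 x 64 array, the 8-element cache line, and the assumption that the array starts on a line boundary are all hypothetical.

```python
# Hypothetical illustration: brute-force counting, not the paper's estimator.

def distinct_lines(element_indices, line_elems):
    """Number of distinct cache lines covering the given element indices."""
    return len({idx // line_elems for idx in element_indices})


def inner_loop_lines(n, line_elems, inner):
    """Lines touched by one execution of the inner loop over a row-major
    n x n array a[i][j], whose element (i, j) sits at index i * n + j."""
    if inner == "j":   # inner loop walks along a row (i fixed at 0)
        indices = [0 * n + j for j in range(n)]
    else:              # inner loop walks down a column (j fixed at 0)
        indices = [i * n + 0 for i in range(n)]
    return distinct_lines(indices, line_elems)


rows_inner = inner_loop_lines(64, 8, inner="j")
cols_inner = inner_loop_lines(64, 8, inner="i")
print(rows_inner, cols_inner)  # prints 8 64
```

Under these assumptions, interchanging the loops so that `j` is innermost cuts the lines touched per inner-loop execution from 64 to 8, which is the kind of comparison such an estimate is meant to guide.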


190 citations

Book Chapter•10.1007/978-3-540-72521-3_18•

UTS: an unbalanced tree search benchmark

[...]

Stephen L. Olivier¹, Jun Huan¹, Jinze Liu¹, Jan F. Prins¹, James Dinan², P. Sadayappan², Chau-Wen Tseng³•Institutions (3)

University of North Carolina at Chapel Hill¹, Ohio State University², University of Maryland, College Park³

2 Nov 2006

TL;DR: Presents an unbalanced tree search (UTS) benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing, with versions of UTS in two parallel languages, OpenMP and Unified Parallel C, that use work stealing as the mechanism for reducing load imbalance.


Abstract: This paper presents an unbalanced tree search (UTS) benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing. We describe algorithms for building a variety of unbalanced search trees to simulate different forms of load imbalance. We created versions of UTS in two parallel languages, OpenMP and Unified Parallel C (UPC), using work stealing as the mechanism for reducing load imbalance. We benchmarked the performance of UTS on various parallel architectures, including shared-memory systems and PC clusters. We found it simple to implement UTS in both UPC and OpenMP, due to UPC's shared-memory abstractions. Results show that both UPC and OpenMP can support efficient dynamic load balancing on shared-memory architectures. However, UPC cannot alleviate the underlying communication costs of distributed-memory systems. Since dynamic load balancing requires intensive communication, performance portability remains difficult for applications such as UTS and performance degrades on PC clusters. By varying key work stealing parameters, we expose important tradeoffs between the granularity of load balance, the degree of parallelism, and communication costs.


185 citations

Book Chapter•10.1007/3-540-35767-X_13•

STAPL: an adaptive, generic parallel C++ library

[...]

Ping An¹, Alin Jula¹, Silvius Rus¹, Steven Saunders¹, Tim Smith¹, Gabriel Tanase¹, Nathan Thomas¹, Nancy M. Amato¹, Lawrence Rauchwerger¹•Institutions (1)

Texas A&M University¹

1 Aug 2001

TL;DR: This work presents results obtained using STAPL for a molecular dynamics code and a particle transport code, and presents functionality to allow the user to further optimize the code and achieve additional performance gains.


Abstract: The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ANSI C++ Standard Template Library (STL). It is sequentially consistent for functions with the same name, and executes on uni- or multi-processor systems that utilize shared or distributed memory. STAPL is implemented using simple parallel extensions of C++ that currently provide a SPMD model of parallelism, and supports nested parallelism. The library is intended to be general purpose, but emphasizes irregular programs to allow the exploitation of parallelism for applications which use dynamically linked data structures such as particle transport calculations, molecular dynamics, geometric modeling, and graph algorithms. STAPL provides several different algorithms for some library routines, and selects among them adaptively at runtime. STAPL can replace STL automatically by invoking a preprocessing translation phase. In the applications studied, the performance of translated code was within 5% of the results obtained using STAPL directly. STAPL also provides functionality to allow the user to further optimize the code and achieve additional performance gains. We present results obtained using STAPL for a molecular dynamics code and a particle transport code.


163 citations

Book Chapter•10.1007/3-540-45403-9_8•

Automatic Array Privatization

[...]

Peng Tu¹, David Padua¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

12 Aug 1993

TL;DR: In this article, a technique for automatic array privatization is presented, which uses data flow analysis of array references to identify privatizable arrays intraprocedurally as well as interprocedurally.


Abstract: Array privatization is one of the most effective transformations for the exploitation of parallelism. In this paper, we present a technique for automatic array privatization. Our algorithm uses data flow analysis of array references to identify privatizable arrays intraprocedurally as well as interprocedurally. It employs static and dynamic resolution to determine the last value of a live private array. We compare the result of automatic array privatization with that of manual array privatization and identify directions for future improvement. To enhance the effectiveness of our algorithm, we develop a goal-directed technique to analyze symbolic variables in the presence of conditional statements, loops and index arrays.
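As a small illustration of what privatization buys, the sketch below shows a work array that every loop iteration fully writes before reading. Sharing it forces the iterations to run in order; giving each iteration a private copy lets them run concurrently. The example only shows the effect of the transformation, not the paper's data flow analysis, and the names `sequential`, `row_body`, and `parallel` are hypothetical.

```python
# Hypothetical illustration of the transformation's effect only.
from concurrent.futures import ThreadPoolExecutor


def sequential(a):
    """Original loop: one work array `tmp` shared by all iterations."""
    tmp = [0] * len(a[0])
    b = []
    for row in a:
        for j, x in enumerate(row):
            tmp[j] = x * 2              # each element is written ...
        b.append([t + 1 for t in tmp])  # ... before it is read
    return b


def row_body(row):
    """One iteration with `tmp` privatized: a fresh copy per iteration."""
    tmp = [0] * len(row)
    for j, x in enumerate(row):
        tmp[j] = x * 2
    return [t + 1 for t in tmp]


def parallel(a):
    """Iterations no longer share `tmp`, so they can run concurrently."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(row_body, a))


a = [[1, 2], [3, 4]]
print(sequential(a) == parallel(a))  # prints True
```

Because no value of `tmp` flows from one iteration to the next, the private copies do not change the result; when the array is still live after the loop, its last value must also be copied out, which is the case the abstract's static and dynamic resolution addresses.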


161 citations


Year	Papers
2019	11
2018	13
2017	18
2016	22
2015	19
2014	25

