Copperhead: compiling an embedded data parallel language

doi:10.1145/1941553.1941562

Proceedings Article•10.1145/1941553.1941562

Bryan Catanzaro, +2 more

- 12 Feb 2011

- Vol. 46, Iss: 8, pp 47-56

233

TL;DR: The language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code are discussed and the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations are introduced.

Abstract: Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code.
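The composition of data parallel primitives described in the abstract can be sketched in plain Python. This is an illustrative sketch only: it uses built-in `map` and `functools.reduce` rather than the Copperhead API, it runs sequentially, and the function names `axpy` and `spmv_csr` and the nested-list matrix layout are assumptions chosen for the example. The first function shows flat data parallelism (one independent operation per element); the second shows nested data parallelism (an outer map over rows, with an inner map and reduction inside each row).

```python
from functools import reduce
import operator


def axpy(a, x, y):
    """Flat data parallelism: each output element depends only on x[i] and y[i]."""
    return list(map(lambda xi, yi: a * xi + yi, x, y))


def spmv_csr(values, columns, x):
    """Nested data parallelism: sparse matrix-vector product.

    values[r] holds the nonzero entries of row r and columns[r] holds their
    column indices (a hypothetical nested-list layout used for illustration).
    """
    def row_dot(row_vals, row_cols):
        # Inner parallel step: multiply each nonzero by the matching x entry.
        products = map(lambda v, c: v * x[c], row_vals, row_cols)
        # Inner reduction: sum the products for this row.
        return reduce(operator.add, products, 0.0)

    # Outer parallel step: every row is processed independently.
    return list(map(row_dot, values, columns))


if __name__ == "__main__":
    print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
    print(spmv_csr([[1.0, 2.0], [3.0]], [[0, 2], [1]], [1.0, 2.0, 3.0]))
```

Because every `map` call applies a side-effect-free function to independent elements, a compiler is free to schedule those elements across parallel hardware threads; that independence is the property a data parallel language relies on.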

Citations

•Journal Article•10.1016/J.PARCO.2011.09.001

PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation

Andreas Klöckner, +5 more

- 01 Mar 2012

TL;DR: This article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems.

676

•Proceedings Article•10.1145/3079856.3080256

Plasticine: A Reconfigurable Architecture For Parallel Patterns

Raghu Prabhakar, +8 more

- 24 Jun 2017

TL;DR: This work designs Plasticine, a new spatially reconfigurable architecture that efficiently executes applications composed of parallel patterns and provides an improvement of up to 76.9× in performance-per-Watt over a conventional FPGA across a wide range of dense and sparse applications.

287

Proceedings Article•10.1145/1926354.1926358

Accelerating Haskell array codes with multicore GPUs

Manuel M. T. Chakravarty, +4 more

- 23 Jan 2011

TL;DR: This paper proposes a domain-specific high-level language of array computations that captures appropriate idioms in the form of collective array operations in Haskell and embeds this purely functional array language in Haskell with an online code generator for NVIDIA's CUDA GPGPU programming environment.

283

Proceedings Article•10.1109/PACT.2011.15

A Heterogeneous Parallel Framework for Domain-Specific Languages

Kevin J. Brown, +6 more

- 10 Oct 2011

TL;DR: This paper presents the Delite Compiler Framework and Runtime, a new end-to-end system for building, compiling, and executing DSL applications on parallel heterogeneous hardware, along with results comparing the performance of several machine learning applications written in OptiML.

221

Journal Article•10.1145/2584665

Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages

Arvind K. Sujeeth, +6 more

- 01 Apr 2014

- ACM Transactions in Embedded Computing S...

TL;DR: An overview of the Delite compiler framework and the DSLs developed with it is presented, and all of them are shown to achieve performance competitive with or exceeding that of C++ code.

181

References

Proceedings Article•10.1145/1401132.1401152

Scalable parallel programming with CUDA

John R. Nickolls, +3 more

- 11 Aug 2008

TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.

2.3K

Journal Article•10.1162/089976601300014493

Improvements to Platt's SMO Algorithm for SVM Classifier Design

S. Sathiya Keerthi, +3 more

- 01 Mar 2001

- Neural Computation

TL;DR: Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO that perform significantly faster than the original SMO on all benchmark data sets tried.

2K

•Book

CUDA by Example: An Introduction to General-Purpose GPU Programming

Jason Sanders, +1 more

- 19 Jul 2010

TL;DR: CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology and details the techniques and trade-offs associated with each key CUDA feature.

1.7K

•Journal Article•10.1145/1365490.1365500

Scalable Parallel Programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?

John R. Nickolls, +3 more

- 01 Mar 2008

- ACM Queue

TL;DR: In this article, the authors present a framework to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism on manycore GPUs with widely varying numbers of cores.

1.4K

Journal Article•10.1145/7902.7903

Data parallel algorithms

W. Daniel Hillis, +1 more

- 01 Dec 1986

- Communications of The ACM

TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

1K