Proceedings Article, DOI: 10.1145/1941553.1941562
Copperhead: compiling an embedded data parallel language
Bryan Catanzaro, Michael Garland, Kurt Keutzer
- 12 Feb 2011
- Vol. 46, Iss: 8, pp 47-56
TL;DR: The language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code are discussed and the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations are introduced.
Abstract: Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code.
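To make the idea of composing data parallel primitives concrete, here is a minimal sketch in plain Python. It uses only Python built-ins (`map`, `reduce`, list comprehensions) and is not Copperhead's actual API. The function names `axpy`, `dot`, and `spmv`, and the row layout of `(column_index, value)` pairs, are assumptions chosen for illustration. The first two functions show flat data parallelism. The third shows the nested form, where an outer map over rows contains an inner map and reduce.

```python
from functools import reduce
import operator


def axpy(a, x, y):
    # Flat data parallelism: each output element depends only on the
    # matching elements of x and y, so every index is independent.
    return list(map(lambda xi, yi: a * xi + yi, x, y))


def dot(x, y):
    # Composition of two primitives: an elementwise map (multiply)
    # followed by a reduction (sum).
    return reduce(operator.add, map(operator.mul, x, y), 0)


def spmv(rows, x):
    # Nested data parallelism: the outer comprehension maps over rows,
    # and each row runs its own inner map + reduce.
    # `rows` is a hypothetical sparse layout: one list per matrix row,
    # holding (column_index, value) pairs for the nonzero entries.
    return [
        reduce(operator.add, (value * x[col] for col, value in row), 0)
        for row in rows
    ]


if __name__ == "__main__":
    print(axpy(2, [1, 2, 3], [10, 20, 30]))
    print(dot([1, 2, 3], [4, 5, 6]))
    print(spmv([[(0, 1), (2, 2)], [(1, 3)]], [1, 2, 3]))
```

Because none of these functions mutate shared state, each map and reduce could in principle be scheduled in parallel. That property is what a data parallel compiler relies on when lowering such code to a GPU.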
Citations
PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation
Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih
- 01 Mar 2012
TL;DR: This article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems.
676
Plasticine: A Reconfigurable Architecture For Parallel Patterns
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matthew Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, Kunle Olukotun
- 24 Jun 2017
TL;DR: This work designs Plasticine, a new spatially reconfigurable architecture that efficiently executes applications composed of parallel patterns and improves performance-per-Watt by up to 76.9× over a conventional FPGA across a wide range of dense and sparse applications.
287
Accelerating Haskell array codes with multicore GPUs
Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, Vinod Grover
- 23 Jan 2011
TL;DR: This paper proposes a domain-specific high-level language of array computations that captures appropriate idioms in the form of collective array operations in Haskell and embeds this purely functional array language in Haskell with an online code generator for NVIDIA's CUDA GPGPU programming environment.
283
A Heterogeneous Parallel Framework for Domain-Specific Languages
Kevin J. Brown, Arvind K. Sujeeth, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, Kunle Olukotun
- 10 Oct 2011
TL;DR: Presents the Delite Compiler Framework and Runtime, a new end-to-end system for building, compiling, and executing DSL applications on parallel heterogeneous hardware, along with results comparing the performance of several machine learning applications written in OptiML.
Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages
Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, Kunle Olukotun
TL;DR: Presents an overview of the Delite compiler framework and the DSLs developed with it, and shows that they all achieve performance competitive with or exceeding C++ code.
References
Scalable parallel programming with CUDA
John R. Nickolls, Ian Buck, Michael Garland, Kevin Skadron
- 11 Aug 2008
TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
Improvements to Platt's SMO Algorithm for SVM Classifier Design
TL;DR: Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO that perform significantly faster than the original SMO on all benchmark data sets tried.
•Book
CUDA by Example: An Introduction to General-Purpose GPU Programming
Jason Sanders, Edward Kandrot
- 19 Jul 2010
TL;DR: CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology and details the techniques and trade-offs associated with each key CUDA feature.
1.7K
Scalable Parallel Programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?
TL;DR: In this article, the authors present a framework to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism on manycore GPUs with widely varying numbers of cores.
1.4K
Data parallel algorithms
W. Daniel Hillis, Guy L. Steele
TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.
1K