Book Chapter
DOI: 10.1007/978-3-540-89740-8_1
CUDA-Lite: Reducing GPU Programming Complexity
Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, Wen-mei W. Hwu
- 28 Nov 2008
- pp 1-15
TL;DR: CUDA-lite, an enhancement to CUDA, is presented, along with preliminary results indicating that auto-generated code can achieve performance comparable to hand-written code.
Abstract: The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance. Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer. We believe that this task can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations and show preliminary results that indicate auto-generated code can have performance comparable to hand coding.
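The burden the abstract describes, choosing a memory and hand-coding the transfers into it, can be illustrated with a small CPU-side emulation. This is only a sketch under stated assumptions: it is not CUDA-lite output and not CUDA code, and the names `GlobalMemory`, `stencil_naive`, and `stencil_tiled` are hypothetical. A plain Python list stands in for on-chip shared memory, and a read counter stands in for off-chip traffic. The counts it produces say nothing about the 2-17x figure quoted above; they only show why staging data on chip reduces global-memory accesses.

```python
class GlobalMemory:
    """Hypothetical stand-in for off-chip global memory; counts every read."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def __len__(self):
        return len(self._data)

    def load(self, i):
        self.reads += 1
        return self._data[i]


def stencil_naive(gmem):
    """3-point sum; every operand is fetched directly from global memory."""
    n = len(gmem)
    out = [0] * n
    for i in range(n):
        for j in (i - 1, i, i + 1):
            if 0 <= j < n:
                out[i] += gmem.load(j)
    return out


def stencil_tiled(gmem, tile=4):
    """Same stencil, but each tile plus a one-element halo is first copied
    into a local buffer (the analogue of shared memory) and reused from there."""
    n = len(gmem)
    out = [0] * n
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo, hi = max(start - 1, 0), min(end + 1, n)
        # Staging step: one global read per element of the tile and its halo.
        shared = [gmem.load(j) for j in range(lo, hi)]
        for i in range(start, end):
            for j in (i - 1, i, i + 1):
                if 0 <= j < n:
                    out[i] += shared[j - lo]
    return out


data = list(range(16))
g_naive, g_tiled = GlobalMemory(data), GlobalMemory(data)
naive = stencil_naive(g_naive)
tiled = stencil_tiled(g_tiled, tile=4)
print(naive == tiled, g_naive.reads, g_tiled.reads)
```

Both versions compute the same result, but the tiled version issues fewer global reads because neighbouring outputs share staged operands. The staging loop and the index arithmetic around `lo` and `hi` are the kind of bookkeeping the abstract argues a tool could generate from programmer annotations.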
Citations
Auto-tuning a high-level language targeted to GPU codes
Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos
- 13 May 2012
TL;DR: This work performs auto-tuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and shows results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision.
Polyhedral parallel code generation for CUDA
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, Francky Catthoor
- 20 Jan 2013
TL;DR: A novel source-to-source compiler called PPCG is presented, which introduces a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency according to the constraints of modern GPUs.
A GPGPU compiler for memory optimization and parallelism management
Yi Yang, Ping Xiang, Jingfei Kong, Huiyang Zhou
- 05 Jun 2010
TL;DR: This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU), which addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism.
Accelerating SQL database operations on a GPU with CUDA
Peter Bakkum, Kevin Skadron
- 14 Mar 2010
TL;DR: This paper implements a subset of the SQLite command processor directly on the GPU, reducing the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
Seyong Lee, Rudolf Eigenmann
- 13 Nov 2010
TL;DR: This paper develops a fully automatic compilation and user-assisted tuning system supporting OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations.
References
Proceedings Article
A Survey of General-Purpose Computation on Graphics Hardware.
John D. Owens, David Luebke, Naga K. Govindaraju, Mark J. Harris, Jens Krüger, Aaron Lefohn, Timothy John Purcell
- 01 Jan 2005
TL;DR: The techniques used in mapping general-purpose computation to graphics hardware will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques.
A data locality optimizing algorithm
Michael Wolf, Monica S. Lam
- 01 May 1991
TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, Wen-mei W. Hwu
- 20 Feb 2008
TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.
NVIDIA CUDA software and GPU parallel computing architecture
David B. Kirk
- 21 Oct 2007
TL;DR: This talk will describe NVIDIA's massively multithreaded computing architecture and CUDA software for GPU computing, a scalable, highly parallel architecture that delivers high throughput for data-intensive processing.
Efficient utilization of scratch-pad memory in embedded processor applications
Preeti Ranjan Panda, Nikil Dutt, Alexandru Nicolau
- 17 Mar 1997
TL;DR: This work presents a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning the application's scalar and array variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications.