CUDA-Lite: Reducing GPU Programming Complexity

doi:10.1007/978-3-540-89740-8_1

Book Chapter•10.1007/978-3-540-89740-8_1

Sain-Zee Ueng, +3 more

- 28 Nov 2008

- pp 1-15

269

TL;DR: CUDA-lite, an enhancement to CUDA, is presented, along with preliminary results indicating that auto-generated code can achieve performance comparable to hand-coded versions.

Citations

•Proceedings Article•10.1109/INPAR.2012.6339595

Auto-tuning a high-level language targeted to GPU codes

Scott Grauer-Gray, +4 more

- 13 May 2012

TL;DR: This work performs auto-tuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and shows results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision.

510

•Journal Article•10.1145/2400682.2400713

Polyhedral parallel code generation for CUDA

Sven Verdoolaege, +5 more

- 20 Jan 2013

TL;DR: A novel source-to-source compiler called PPCG is presented, which introduces a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency according to the constraints of modern GPUs.

431

•Proceedings Article•10.1145/1806596.1806606

A GPGPU compiler for memory optimization and parallelism management

Yi Yang, +3 more

- 05 Jun 2010

TL;DR: This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU), which addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism.

348

•Proceedings Article•10.1145/1735688.1735706

Accelerating SQL database operations on a GPU with CUDA

Peter Bakkum, +1 more

- 14 Mar 2010

TL;DR: This paper implements a subset of the SQLite command processor directly on the GPU, reducing the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.

307

•Proceedings Article•10.1109/SC.2010.36

OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Seyong Lee, +1 more

- 13 Nov 2010

TL;DR: This paper has developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations.

271

References

•Proceedings Article

A Survey of General-Purpose Computation on Graphics Hardware.

John D. Owens, +6 more

- 01 Jan 2005

TL;DR: The techniques used in mapping general-purpose computation to graphics hardware will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques.

2.1K

•Proceedings Article•10.1145/113445.113449

A data locality optimizing algorithm

Michael Wolf, +1 more

- 01 May 1991

TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.

1.4K

•Proceedings Article•10.1145/1345206.1345220

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Shane Ryoo, +5 more

- 20 Feb 2008

TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.

1K

•Proceedings Article•10.1145/1296907.1296909

NVIDIA CUDA software and GPU parallel computing architecture

David B. Kirk

- 21 Oct 2007

TL;DR: This talk will describe NVIDIA's massively multithreaded computing architecture and CUDA software for GPU computing, a scalable, highly parallel architecture that delivers high throughput for data-intensive processing.

476

•Proceedings Article•10.5555/787260.787762

Efficient utilization of scratch-pad memory in embedded processor applications

Preeti Ranjan Panda, +2 more

- 17 Mar 1997

TL;DR: This work presents a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning the application's scalar and array variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications.

307