Proceedings Article, DOI: 10.1145/1995896.1995936
Automating GPU computing in MATLAB
Chun-Yu Shei, Pushkar Ratnalikar, Arun Chauhan
- 31 May 2011
- pp 245-254
TL;DR: This work presents a fully automatic source-level compilation technique to exploit a given GPU library for MATLAB, enabling coarse-grained heterogeneous parallelism across CPU and GPU.
Abstract: MATLAB is a popular software platform for scientific and engineering software writers. It offers a high level of abstraction for fundamental mathematical operations and extensive highly optimized domain-specific libraries for several scientific and engineering disciplines. With the recent availability of GPU libraries for MATLAB, it has become possible to easily exploit GPGPUs as coprocessors. However, this requires changing the code by carefully declaring variables that would live on the GPU, breaking the simplicity of the MATLAB programming model. We present a fully automatic source-level compilation technique to exploit a given GPU library for MATLAB, enabling coarse-grained heterogeneous parallelism across CPU and GPU. Our approach is based on empirically characterizing the library's functions, in order to build a comparative model of their performance on the CPU and GPU, which is then used along with a data communication cost model to maximize parallelism by selectively offloading some computation on the GPU. We achieve this by phrasing the problem as a binary integer linear programming problem aimed at minimizing CPU-GPU data movement, and using a hierarchical approach to keep the computational complexity in check. We have implemented our approach in a source-level MATLAB compiler, and present experimental results on a set of MATLAB kernels and applications using the GPUmat library. We show speedups of up to 7 times when the GPU is harnessed, compared to a standalone 8-core CPU.
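The placement idea in the abstract can be illustrated with a toy binary decision problem. The sketch below is not the authors' formulation: the operation names, timings, transfer costs, and the combined compute-plus-transfer objective are all hypothetical, and a brute-force search stands in for a real integer linear programming solver. Each operation gets a binary variable (0 = CPU, 1 = GPU), and a transfer cost is paid whenever a producer and its consumer sit on different devices.

```python
from itertools import product

# Hypothetical per-operation times in milliseconds: name -> (cpu_time, gpu_time).
# These numbers are illustrative only, not measurements from the paper.
OP_COSTS = {
    "rand": (4.0, 1.0),
    "matmul": (50.0, 6.0),
    "reduce": (0.5, 12.0),
}

# Hypothetical data dependences: (producer, consumer, transfer_cost_ms).
EDGES = [
    ("rand", "matmul", 8.0),
    ("matmul", "reduce", 8.0),
]


def total_cost(placement):
    """Compute time plus transfer time for a 0/1 placement (0 = CPU, 1 = GPU)."""
    compute = sum(OP_COSTS[op][placement[op]] for op in OP_COSTS)
    # A transfer is paid only when an edge crosses the CPU/GPU boundary.
    transfer = sum(c for src, dst, c in EDGES if placement[src] != placement[dst])
    return compute + transfer


def best_placement():
    """Enumerate every binary assignment and return the cheapest one."""
    ops = list(OP_COSTS)
    best = None
    for bits in product((0, 1), repeat=len(ops)):
        placement = dict(zip(ops, bits))
        cost = total_cost(placement)
        if best is None or cost < best[1]:
            best = (placement, cost)
    return best


if __name__ == "__main__":
    placement, cost = best_placement()
    print(placement, cost)
```

Exhaustive enumeration grows as 2^n in the number of operations, which is one reason a hierarchical decomposition such as the one the abstract mentions would matter at realistic program sizes.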
Citations
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors
Ashwin Prasad, Jayvant Anantpur, Ramaswamy Govindarajan
- 04 Jun 2011
TL;DR: This work presents the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs for synergistic execution on heterogeneous processors, and proposes a set of compiler optimizations tailored for MATLAB.
SemCache: Semantics-Aware Caching for Efficient GPU Offloading
Nabeel Al-Saber, Milind Kulkarni
- 01 Jun 2016
TL;DR: SemCache is introduced, a semantics-aware GPU cache that automatically manages CPU-GPU communication and dynamically optimizes communication by eliminating redundant transfers using caching.
Fast GPU-Based Seismogram Simulation From Microseismic Events in Marine Environments Using Heterogeneous Velocity Models
TL;DR: A novel approach is presented for fast generation of synthetic seismograms from microseismic events in heterogeneous marine velocity models, using a Fourier-domain pseudo-spectral method that can be parallelized on graphics processing unit (GPU) cards.
Automatic generation of parallel C code for stencil applications written in MATLAB
Johannes Spazier, Steffen Christgau, Bettina Schnor
- 02 Jun 2016
TL;DR: This paper presents the first compiler that generates native MPI code from MATLAB source, showing significant performance improvements, and reports performance results of an automatic translation from a MATLAB subset into efficient parallelized C code for different architectures: multicores, compute clusters, and GPGPUs.
Time-stepping methods for the simulation of the self-assembly of nano-crystals in Matlab on a GPU
Maciek D. Korzec, T. Ahnert
TL;DR: A time-adaptive SBDF1/SBDF1-2-step method is presented that yields convincing results reflecting the change in timescales during topological changes of the nanostructures.
References
Static scheduling algorithms for allocating directed task graphs to multiprocessors
Yu-Kwong Kwok, Ishfaq Ahmad
TL;DR: A taxonomy that classifies 27 scheduling algorithms and their functionalities into different categories is proposed, with each algorithm explained through an easy-to-understand description followed by an illustrative example to demonstrate its operation.
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, Wen-mei W. Hwu
- 20 Feb 2008
TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.
An Updated Set of Basic Linear Algebra Subprograms (BLAS)
Susan Blackford, James Demmel, Jack Dongarra, Iain S. Duff, Sven Hammarling, Greg Henry, Michael A. Heroux, Linda Kaufman, Andrew Lumsdaine, A. Petitet, Roldan Pozo, Karin A Remington, Clint Whaley
TL;DR: In this paper, the authors present a list of the companies that have contributed to the development of the Numerical Algorithms Group (NALG), including Intel, Sandia National Laboratories, and IBM.
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
Seyong Lee, Rudolf Eigenmann
- 13 Nov 2010
TL;DR: This paper has developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations.
Automatic C-to-CUDA code generation for affine programs
Muthu Baskaran, J. Ramanujam, P. Sadayappan
- 20 Mar 2010
TL;DR: An automatic code transformation system that generates parallel CUDA code from input sequential C code for regular (affine) programs; the generated code performs quite close to hand-optimized CUDA code and considerably better than the benchmarks' performance on a multicore CPU.