An integer programming framework for optimizing shared memory use on GPUs

doi:10.1145/1854273.1854348

Proceedings Article•10.1145/1854273.1854348

Wenjing Ma, +1 more

- 11 Sep 2010

- pp 553-554

34

TL;DR: This paper presents a global (intraprocedural) framework that can model structured control flow and is not restricted to a single loop nest. The framework outperforms a recently published heuristic method, and the loop transformations it suggests also improve performance for many applications.

Abstract: General-purpose computing using GPUs is becoming increasingly popular because of GPUs' extremely favorable performance/price ratio. Like standard processors, GPUs have a memory hierarchy, which must be carefully optimized to achieve efficient execution. Specifically, modern NVIDIA GPUs have a very small programmable cache, referred to as shared memory, accesses to which are nearly 100 to 150 times faster than accesses to the regular device memory. An automatically generated or handwritten CUDA program can explicitly control which variables and array sections are allocated in shared memory at any point during execution. This, however, leads to a difficult optimization problem. In this paper, we formulate and solve the shared memory allocation problem as an integer linear programming problem. We present a global (intraprocedural) framework which can model structured control flow and is not restricted to a single loop nest. We consider allocation of scalars, arrays, and array sections in shared memory. We also briefly show how our framework can suggest useful loop transformations to further improve performance. Our experiments using several non-scientific applications show that our integer programming framework outperforms a recently published heuristic method, and that our loop transformations also improve performance for many applications.
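
To make the idea of allocation as an integer program concrete, the following is a minimal sketch, not the formulation from the paper. It assumes a single program point, a hypothetical list of candidates (name, size in bytes, estimated benefit), and an assumed 16 KB shared memory capacity. Each candidate gets a 0-1 decision variable, and the toy instance is solved by exhaustive enumeration rather than by an ILP solver. The function name, the candidate names, and all numbers are illustrative assumptions.

```python
from itertools import product


def allocate_shared_memory(candidates, capacity_bytes):
    """Toy 0-1 program: choose which candidates to place in shared memory.

    candidates: list of (name, size_bytes, benefit) tuples (all hypothetical).
    Maximizes total benefit subject to total size <= capacity_bytes.
    Uses brute-force enumeration, which is only viable for tiny inputs.
    """
    best_benefit, best_choice = 0, ()
    # Each x is one assignment of the 0-1 decision variables.
    for x in product((0, 1), repeat=len(candidates)):
        size = sum(xi * c[1] for xi, c in zip(x, candidates))
        if size > capacity_bytes:
            continue  # violates the capacity constraint
        benefit = sum(xi * c[2] for xi, c in zip(x, candidates))
        if benefit > best_benefit:
            best_benefit, best_choice = benefit, x
    chosen = [c[0] for xi, c in zip(best_choice, candidates) if xi]
    return chosen, best_benefit


# Hypothetical candidates: (name, size in bytes, estimated benefit).
candidates = [
    ("A_tile", 8192, 900),
    ("B_tile", 8192, 700),
    ("scalar_s", 4, 50),
    ("C_section", 12288, 1000),
]
chosen, benefit = allocate_shared_memory(candidates, 16 * 1024)
print(chosen, benefit)  # ['A_tile', 'B_tile'] 1600
```

A greedy choice by benefit would pick C_section first and end up with a total of 1050, while the exact search finds 1600. This only illustrates why an exact formulation can beat a heuristic on such instances. The framework described in the abstract additionally models control flow, multiple program points, and array sections, none of which appear in this sketch.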

Citations

Proceedings Article•10.1145/2597652.2597685

Unified on-chip memory allocation for SIMT architecture

Ari B. Hayes, +1 more

- 10 Jun 2014

TL;DR: Overall, it is discovered that it is possible to automatically determine an on-chip memory resource allocation that maximizes concurrency while ensuring good single-thread performance at compile-time.

41

Proceedings Article•10.1145/2804302.2804313

Type-safe runtime code generation: accelerate to LLVM

Trevor L. McDonell, +3 more

- 30 Aug 2015

TL;DR: This paper discusses the compilation pipeline of Accelerate, a high-performance array language targeting both multicore CPUs and GPUs, where it is able to preserve types from the source language down to a low-level register language in SSA form, and creates a new type-safe interface to the industrial-strength LLVM compiler infrastructure.

19

Journal Article•10.1016/J.PARCO.2017.11.003

Benchmarking the GPU memory at the warp level

Minquan Fang, +5 more

- 01 Jan 2018

TL;DR: This work characterizes GPU memories at the warp level, summarizes optimization guidelines for the different types of memory, and builds an optimization framework for GPU memories.

19

Journal Article•10.1016/J.COMPCHEMENG.2017.04.024

Combined model predictive control and scheduling with dominant time constant compensation

Logan D. R. Beal, +4 more

- 02 Sep 2017

- Computers & Chemical Engineering

TL;DR: The proposed methods are time-scaling of the linear dynamics based on throughput rates and grade-based objectives for product scheduling based on a mathematical program with complementarity constraints to both control and optimize a product grade schedule.

17

Journal Article•10.1109/TC.2016.2604372

Optimizing Data Placement on GPU Memory: A Portable Approach

Guoyang Chen, +3 more

- 01 Mar 2017

- IEEE Transactions on Computers

TL;DR: This article provides a comprehensive description of this method, and presents several extensions that significantly improve the scalability of PORPLE, which include a novel algorithm design for efficiently searching for the best data placements, the use of active profiling for reducing the online-profiling overhead, and a systematic examination of a path-based performance model.

15

Related Papers (5)

Rodinia: A benchmark suite for heterogeneous computing

A GPGPU compiler for memory optimization and parallelism management

On-the-fly elimination of dynamic irregularities for GPU computing

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Analyzing CUDA workloads using a detailed GPU simulator