Optimizing Dynamic Programming on Graphics Processing Units via Adaptive Thread-Level Parallelism

doi:10.1109/ICPADS.2011.92

Proceedings Article•10.1109/ICPADS.2011.92

Chao-Chin Wu, +3 more

- 07 Dec 2011

- pp 96-103

23

TL;DR: This paper presents the GPU acceleration of an important category of DP problems called nonserial polyadic dynamic programming (NPDP), and proposes a methodology that can adaptively adjust the thread-level parallelism in mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages.

Abstract: Dynamic programming (DP) is an important computational method for solving a wide variety of discrete optimization problems such as scheduling, string editing, packaging, and inventory management. In general, DP is classified into four categories based on the characteristics of the optimization equation. Because applications that are classified in the same category of DP have similar program behavior, the research community has sought to propose general solutions for parallelizing each category of DP. However, most existing studies focus on running DP on CPU-based parallel systems rather than on accelerating DP algorithms on the graphics processing unit (GPU). This paper presents the GPU acceleration of an important category of DP problems called nonserial polyadic dynamic programming (NPDP). In NPDP applications, the degree of parallelism varies significantly in different stages of computation, making it difficult to fully utilize the compute power of hundreds of processing cores in a GPU. To address this challenge, we propose a methodology that can adaptively adjust the thread-level parallelism in mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages. We realize our approach in a real-world NPDP application -- the optimal matrix parenthesization problem. Experimental results demonstrate that our method can achieve a speedup of 13.40 over the previously published GPU algorithm.
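
To make the varying degree of parallelism concrete, the following is a minimal sequential sketch of the optimal matrix parenthesization recurrence, filled one diagonal (stage) at a time. It is not the paper's GPU implementation; the function name `matrix_chain_by_diagonal` and the `profile` list are illustrative assumptions. The profile records, for each stage, how many table cells are mutually independent and how many split points each cell must examine.

```python
from typing import List, Tuple


def matrix_chain_by_diagonal(dims: List[int]) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (minimum scalar multiplications, per-stage parallelism profile).

    dims has n + 1 entries; matrix i has shape dims[i] x dims[i + 1].
    Each profile entry is (independent cells in the stage, splits per cell).
    """
    n = len(dims) - 1
    if n < 1:
        raise ValueError("dims must describe at least one matrix")

    # cost[i][j] = cheapest way to multiply matrices i..j (inclusive).
    cost = [[0] * n for _ in range(n)]
    profile: List[Tuple[int, int]] = []

    for d in range(1, n):  # stage d handles chains of d + 1 matrices
        # Cells on one diagonal depend only on earlier diagonals, so a GPU
        # could compute them concurrently; here they are computed in a loop.
        for i in range(n - d):
            j = i + d
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)  # d candidate split points per cell
            )
        profile.append((n - d, d))

    return cost[0][n - 1], profile
```

For example, `matrix_chain_by_diagonal([10, 30, 5, 60])` returns `(4500, [(2, 1), (1, 2)])`. The number of independent cells shrinks from stage to stage while the work per cell grows, which is why a fixed one-thread-per-cell mapping leaves most cores idle in late stages.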

Citations

•Journal Article•10.1145/3570638

Optimization Techniques for GPU Programming

Pieter Hijma, +4 more

- 14 Nov 2022

- ACM Computing Surveys

TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes the optimizations from different perspectives, showing that the various optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.

54

Journal Article•10.1109/isca.2016.59

Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit

Myung Kuk Yoon, +4 more

- 01 Jun 2016

TL;DR: The paper proposes a Virtual Thread (VT) architecture that maximizes thread-level parallelism beyond the GPU scheduling limit by assigning Cooperative Thread Arrays (CTAs) up to the capacity limit and minimizing logic complexity.

31

•Proceedings Article•10.1145/2660193.2660241

Staged parser combinators for efficient data processing

Manohar Jonnalagedda, +4 more

- 15 Oct 2014

TL;DR: Staging, a form of runtime code generation, is used to dissociate input parsing from parser composition and to eliminate intermediate data structures and computations associated with parser composition at staging time.

30

Proceedings Article•10.1109/CANDAR.2013.20

Sequential Memory Access on the Unified Memory Machine with Application to the Dynamic Programming

Koji Nakano

- 04 Dec 2013

TL;DR: This paper shows that the dynamic programming to solve the optimal polygon triangulation problem can be implemented in the UMM using sequential memory access, and proves that any implementation of the dynamic programming needs Ω(n^3/w + n^3 l/p + nl) time units.

17

•Dissertation•10.15480/882.1184

Mapping dynamic programming algorithms on graphics processing units

Muhammad Kashif Hanif

- 01 Jan 2014

TL;DR: The parallelism exhibited by the compute-intensive tasks is studied, a parallel solution based on the matrix-matrix product method is devised for the GPU, and the opportunity to use the optimized BLAS library provided by CUDA is explored.

8

References

Journal Article•10.1016/0022-2836(81)90087-5

Identification of common molecular subsequences.

Temple F. Smith, +1 more

- 25 Mar 1981

- Journal of Molecular Biology

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

11.3K

The Landscape of Parallel Computing Research: A View from Berkeley

Krste Asanovic, +10 more

- 18 Dec 2006

TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: (1) the overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems; (2) the target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.

2.4K

Journal Article•10.1109/MM.2010.41

The GPU Computing Era

John R. Nickolls, +1 more

- 01 Mar 2010

- IEEE Micro

TL;DR: Describes the rapid evolution of GPU architectures from graphics processors to massively parallel many-core multiprocessors, recent developments in GPU computing architectures, and how the enthusiastic adoption of CPU+GPU coprocessing is accelerating parallel applications.

1K

Parallel Prefix Sum (Scan) with CUDA

Mark J. Harris

- 01 Jan 2011

788

•Proceedings Article•10.1109/IPDPS.2009.5160931

An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases

Lukasz Ligowski, +1 more

- 23 May 2009

TL;DR: This work presents an efficient implementation of the Smith Waterman algorithm for sequence alignment on the Nvidia GPU, reaching more than 70% of theoretical hardware performance.

180