Proceedings Article, DOI: 10.1109/ICPADS.2011.92
Optimizing Dynamic Programming on Graphics Processing Units via Adaptive Thread-Level Parallelism
Chao-Chin Wu, Jenn-Yang Ke, Heshan Lin, Wu-chun Feng
- 07 Dec 2011
- pp 96-103
23
TL;DR: This paper presents the GPU acceleration of an important category of DP problems called nonserial polyadic dynamic programming (NPDP), and proposes a methodology that adaptively adjusts the thread-level parallelism used to map an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages.
Abstract: Dynamic programming (DP) is an important computational method for solving a wide variety of discrete optimization problems such as scheduling, string editing, packaging, and inventory management. In general, DP is classified into four categories based on the characteristics of the optimization equation. Because applications that are classified in the same category of DP have similar program behavior, the research community has sought to propose general solutions for parallelizing each category of DP. However, most existing studies focus on running DP on CPU-based parallel systems rather than on accelerating DP algorithms on the graphics processing unit (GPU). This paper presents the GPU acceleration of an important category of DP problems called nonserial polyadic dynamic programming (NPDP). In NPDP applications, the degree of parallelism varies significantly in different stages of computation, making it difficult to fully utilize the compute power of hundreds of processing cores in a GPU. To address this challenge, we propose a methodology that can adaptively adjust the thread-level parallelism in mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages. We realize our approach in a real-world NPDP application -- the optimal matrix parenthesization problem. Experimental results demonstrate that our method can achieve a speedup of 13.40 over the previously published GPU algorithm.
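The varying degree of parallelism mentioned in the abstract can be illustrated with a minimal sequential sketch of the optimal matrix parenthesization recurrence, processed one diagonal at a time. This is not the paper's GPU method. The function name `matrix_chain_by_diagonal` and the `profile` list are hypothetical and introduced only for illustration. Cells on the same diagonal depend only on earlier diagonals, so they are the units a parallel mapping could compute concurrently.

```python
def matrix_chain_by_diagonal(dims):
    """Minimum scalar multiplications for A_0 * ... * A_{n-1},
    where A_i has shape dims[i] x dims[i + 1].

    Returns (optimal cost, profile). Each profile entry is
    (diagonal index, independent cells, split points per cell).
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    profile = []
    for d in range(1, n):
        # Every cell (i, i + d) on diagonal d reads only cells from
        # diagonals 0..d-1, so the n - d cells here are independent.
        for i in range(n - d):
            j = i + d
            # Try each split point k: (A_i..A_k) * (A_{k+1}..A_j).
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
        profile.append((d, n - d, d))
    return cost[0][n - 1], profile


best, profile = matrix_chain_by_diagonal([10, 30, 5, 60])
print(best)     # 4500
print(profile)  # [(1, 2, 1), (2, 1, 2)]
```

In the profile, the number of independent cells shrinks from `n - 1` to 1 while the number of split points per cell grows from 1 to `n - 1`. Early diagonals therefore offer many small tasks and late diagonals offer few large ones, which is the imbalance an adaptive choice of threads per cell would have to absorb.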
Citations
Optimization Techniques for GPU Programming
TL;DR: This survey discusses optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the various optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.
54
Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit
Myung Kuk Yoon, Keun‐Soo Kim, Phil Lee, Won Woo Ro, Murali Annavaram
- 01 Jun 2016
TL;DR: The paper proposes a Virtual Thread (VT) architecture that maximizes thread-level parallelism beyond the GPU scheduling limit by assigning Cooperative Thread Arrays (CTAs) up to the capacity limit and minimizing logic complexity.
31
Staged parser combinators for efficient data processing
Manohar Jonnalagedda, Thierry Coppey, Sandro Stucki, Tiark Rompf, Martin Odersky
- 15 Oct 2014
TL;DR: Staging, a form of runtime code generation, is used to dissociate input parsing from parser composition and to eliminate the intermediate data structures and computations associated with parser composition at staging time.
Sequential Memory Access on the Unified Memory Machine with Application to the Dynamic Programming
Koji Nakano
- 04 Dec 2013
TL;DR: This paper shows that dynamic programming for the optimal polygon triangulation problem can be implemented on the UMM using sequential memory access, and proves that any implementation of this dynamic programming needs $\Omega(n^3/w + n^3 l/p + nl)$ time units.
17
Mapping dynamic programming algorithms on graphics processing units
Muhammad Kashif Hanif
- 01 Jan 2014
TL;DR: The parallelism exhibited by compute-intensive tasks is studied, a parallel solution based on the matrix-matrix product method is devised for the GPU, and the opportunity to use the optimized BLAS library provided by CUDA is explored.
References
Identification of common molecular subsequences.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
11.3K
The Landscape of Parallel Computing Research: A View from Berkeley
Krste Asanovic, Ras Bodik, Bryan Catanzaro, Joseph Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Plishker, John Shalf, Samuel Williams, Katherine Yelick
- 18 Dec 2006
TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: (1) the overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems; (2) the target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
The GPU Computing Era
TL;DR: This article describes the rapid evolution of GPU architectures, from graphics processors to massively parallel many-core multiprocessors; recent developments in GPU computing architectures; and how the enthusiastic adoption of CPU+GPU coprocessing is accelerating parallel applications.
1K
Parallel Prefix Sum (Scan) with CUDA
Mark J. Harris
- 01 Jan 2011
788
An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases
Lukasz Ligowski, Witold R. Rudnicki
- 23 May 2009
TL;DR: This work presents an efficient implementation of the Smith Waterman algorithm for sequence alignment on the Nvidia GPU, reaching more than 70% of theoretical hardware performance.