Optimizing Dynamic Programming on Graphics Processing Units Via Data Reuse and Data Prefetch with Inter-Block Barrier Synchronization

doi:10.1109/ICPADS.2012.17

Proceedings Article•10.1109/ICPADS.2012.17

Chao-Chin Wu, +2 more

- 17 Dec 2012

- pp 45-52

16

TL;DR: This work proposed a methodology that adaptively adjusts the thread-level parallelism when mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages, and applies it to the optimal matrix parenthesization problem.

Abstract: Our previous study focused on accelerating an important category of DP problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). In NPDP applications, the degree of parallelism varies significantly across stages of computation, making it difficult to fully utilize the compute power of the hundreds of processing cores in a GPU. To address this challenge, we proposed a methodology that adaptively adjusts the thread-level parallelism when mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages. This work aims at further improving the performance of NPDP problems. Subproblems and data are tiled so that small data regions fit into shared memory and the buffered data can be reused for each tile of subproblems, thus reducing the amount of global memory access. However, we found that invoking the same kernel many times, as required to enforce data consistency across stages, makes it impossible to reuse the tiled data in shared memory once the kernel is invoked again. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly once, with the restriction that the maximum number of blocks equals the total number of streaming multiprocessors. In addition to data reuse, invoking the kernel only once also enables us to prefetch data into shared memory across inter-block synchronization points, which improves performance more than data reuse does. We realize our approach in a real-world NPDP application, the optimal matrix parenthesization problem. Experimental results demonstrate that invoking a kernel only once cannot guarantee performance improvement unless we also reuse and prefetch data across barrier synchronization points.
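
To make the abstract's point about varying parallelism concrete, here is a minimal sequential sketch of the optimal matrix parenthesization recurrence, filled one stage (chain length) at a time. It is not the paper's GPU kernel: the function name `matrix_chain_by_diagonals` and the `cells_per_stage` bookkeeping are illustrative assumptions. Cells within one stage are independent of each other, while each stage depends on results from earlier stages, so the boundary between iterations of the outer loop is where a GPU mapping would need either a new kernel launch or a barrier across blocks.

```python
def matrix_chain_by_diagonals(dims):
    """Minimum scalar multiplications for the chain A_0..A_{n-1},
    where A_i has shape dims[i] x dims[i+1].

    Returns (optimal_cost, cells_per_stage). cells_per_stage[s] is the
    number of independent table cells computed in stage s (hypothetical
    bookkeeping, added only to show how parallelism shrinks).
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    cells_per_stage = []

    # Outer loop: one stage per chain length. All earlier stages must be
    # complete before a stage starts (the synchronization point).
    for length in range(2, n + 1):
        stage_cells = n - length + 1
        cells_per_stage.append(stage_cells)

        # Inner loop: cells on the same diagonal do not depend on each other,
        # so a parallel version could compute them concurrently.
        for i in range(stage_cells):
            j = i + length - 1
            # Each cell combines pairs of results from several earlier stages.
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )

    return cost[0][n - 1], cells_per_stage
```

For a chain of six matrices the stages contain 5, 4, 3, 2, and 1 cells, while the work per cell grows with the chain length. This is the uneven parallelism the abstract describes.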

Citations

High Performance Computing

emontmej

- 29 Jun 2018

TL;DR: The key elements of the Core Program will be described including the construction of a UK e-Science Grid and the need to develop a data architecture for the Grid that will allow federated access to relational databases as well as flat files.

323

Journal Article•10.1145/3570638

Optimization Techniques for GPU Programming

Pieter Hijma, +4 more

- 14 Nov 2022

- ACM Computing Surveys

TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes the optimizations from different perspectives, showing that the various optimizations are highly interrelated and explaining the need for techniques such as auto-tuning.

54

Patent

Method and apparatus of route guidance

Iljya Kalai, +4 more

- 07 Dec 2011

TL;DR: In this paper, a system and a method for route guidance on a user device are described, where each map partition may contain road geometries, road names, road network topology, or any other information needed to provide turn-by-turn navigation or driving directions within the partition.

50

High Performance Computing

Martin Morgan, +2 more

- 01 Jan 2010

TL;DR: The megafly network, despite using fewer virtual channels for deadlock avoidance than a dragonfly, performs as well as a fully connected 1-D dragonfly network and QoS can completely mitigate application performance variability while causing minimal slowdown to the background network traffic.

14

Journal Article•10.1287/IJOC.2016.0693

Using GPU Computing for Solving the Two-Dimensional Guillotine Cutting Problem

Marco Antonio Boschetti, +2 more

- 28 Jun 2016

- Informs Journal on Computing

TL;DR: This paper investigates the application of GPU computing to the two-dimensional guillotine cutting problem, solved by dynamic programming, and shows the effectiveness of the dynamic programming approach based on GPU computing for this problem.

11

References

Journal Article•10.1016/0022-2836(81)90087-5

Identification of common molecular subsequences.

Temple F. Smith, +1 more

- 25 Mar 1981

- Journal of Molecular Biology

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

11.3K

The Landscape of Parallel Computing Research: A View from Berkeley

Krste Asanovic, +10 more

- 18 Dec 2006

TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: the overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems; the target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.

2.4K

Journal Article•10.1109/MM.2010.41

The GPU Computing Era

John R. Nickolls, +1 more

- 01 Mar 2010

- IEEE Micro

TL;DR: The rapid evolution of GPU architectures, from graphics processors to massively parallel many-core multiprocessors, recent developments in GPU computing architectures, and how the enthusiastic adoption of CPU+GPU coprocessing is accelerating parallel applications are described.

1K

Proceedings Article•10.1109/IPDPS.2009.5160931

An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases

Lukasz Ligowski, +1 more

- 23 May 2009

TL;DR: This work presents an efficient implementation of the Smith Waterman algorithm for sequence alignment on the Nvidia GPU, reaching more than 70% of theoretical hardware performance.

180

Journal Article•10.1093/BIOINFORMATICS/15.6.440

Fast evaluation of internal loops in RNA secondary structure prediction.

Rune B. Lyngsø, +2 more

- 01 Jun 1999

- Bioinformatics

TL;DR: A new method to evaluate internal loops utilizing currently used energy rules based on free energy rules for stacking and loop formation is presented, which reduces the time complexity of this part of the structure prediction from O(n^4) to O(n^3), thus reducing the overall complexity to O(n^3).

164