Proceedings Article, DOI: 10.1109/ICPADS.2012.17
Optimizing Dynamic Programming on Graphics Processing Units Via Data Reuse and Data Prefetch with Inter-Block Barrier Synchronization
Chao-Chin Wu,Kai-Cheng Wei,Ting-Hong Lin +2 more
- 17 Dec 2012
- pp 45-52
TL;DR: This work proposes a methodology that adaptively adjusts thread-level parallelism when mapping an NPDP problem onto the GPU, providing sufficient and steady parallelism across compute stages, and applies it to the optimal matrix parenthesization problem.
Abstract: Our previous study focused on accelerating an important category of DP problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). In NPDP applications, the degree of parallelism varies significantly across stages of computation, making it difficult to fully utilize the compute power of the hundreds of processing cores in a GPU. To address this challenge, we proposed a methodology that adaptively adjusts the thread-level parallelism when mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across compute stages. This work aims to further improve the performance of NPDP problems. Subproblems and data are tiled so that small data regions fit into shared memory and the buffered data can be reused for each tile of subproblems, thus reducing the number of global memory accesses. However, we found that invoking the same kernel many times, which is required to enforce data consistency across stages, makes it impossible to reuse the tiled data in shared memory once the kernel is invoked again. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly once, with the restriction that the maximum number of blocks equals the total number of streaming multiprocessors. In addition to data reuse, invoking the kernel only once also enables us to prefetch data into shared memory across inter-block synchronization points, which improves performance more than data reuse does. We realize our approach in a real-world NPDP application: the optimal matrix parenthesization problem. Experimental results demonstrate that invoking a kernel only once cannot guarantee a performance improvement unless we also reuse and prefetch data across barrier synchronization points.
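To make the varying degree of parallelism concrete, here is a minimal sequential sketch of the optimal matrix parenthesization recurrence, processed one diagonal at a time. It is a plain CPU illustration in Python, not the authors' GPU implementation; the function name `matrix_chain_by_diagonals` and the returned `cells_per_stage` list are assumptions introduced only for this example. Cells on the same diagonal depend only on earlier diagonals, so they could be computed in parallel, while a barrier would be needed between diagonals.

```python
from typing import List, Tuple


def matrix_chain_by_diagonals(dims: List[int]) -> Tuple[int, List[int]]:
    """Return (minimum scalar multiplications, independent cells per stage).

    Matrix i has shape dims[i] x dims[i + 1]. Hypothetical helper for
    illustration only; it is not taken from the paper.
    """
    n = len(dims) - 1  # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]
    cells_per_stage: List[int] = []

    # One stage per diagonal: chains of length 2, then 3, ..., then n.
    for length in range(2, n + 1):
        stage_cells = 0
        # Cells within a stage are mutually independent; on a GPU they
        # could be assigned to different threads or blocks.
        for i in range(n - length + 1):
            j = i + length - 1
            # Nonserial polyadic: cell (i, j) reads many cells from
            # several earlier stages, not just the previous one.
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
            stage_cells += 1
        # A stage boundary is where a barrier (or a kernel relaunch)
        # would be required before the next diagonal can start.
        cells_per_stage.append(stage_cells)

    best = cost[0][n - 1] if n > 0 else 0
    return best, cells_per_stage


if __name__ == "__main__":
    best, stages = matrix_chain_by_diagonals([30, 35, 15, 5, 10, 20, 25])
    print(best, stages)
```

For a chain of n matrices the stages contain n-1, n-2, ..., 1 independent cells, so the available parallelism shrinks steadily while the work per cell grows. This is the imbalance the abstract refers to.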