Journal Article, DOI: 10.1109/isca.2016.59
Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit
Myung Kuk Yoon, Keun‐Soo Kim, Phil Lee, Won Woo Ro, Murali Annavaram
- 01 Jun 2016
TL;DR: The paper proposes a Virtual Thread (VT) architecture that maximizes thread-level parallelism beyond the GPU scheduling limit by assigning Cooperative Thread Arrays (CTAs) up to the capacity limit and minimizing logic complexity.
Abstract: Modern GPUs require tens of thousands of concurrent threads to fully utilize their massive amount of processing resources. However, thread concurrency in GPUs can be diminished either by a shortage of thread scheduling structures (the scheduling limit), such as available program counters and single-instruction multiple-thread (SIMT) stacks, or by a shortage of on-chip memory (the capacity limit), such as the register file and shared memory. Our evaluations show that, in practice, concurrency in many general-purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing the scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture, which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit while ignoring the scheduling limit. To reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long-latency stall, the active CTA is context switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit, which obviates the need to save and restore large amounts of CTA state. Thus, VT significantly reduces the performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
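To make the two limits and the active/inactive swap concrete, here is a minimal Python sketch. It is an illustration only, not the paper's hardware design: the resource figures in `SMConfig`, the `Kernel` parameters, and all class and function names are hypothetical. It also simplifies "the next ready CTA" to the head of a FIFO queue, assuming that CTA's stall has already resolved.

```python
from collections import deque
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp


@dataclass(frozen=True)
class SMConfig:
    # Hypothetical per-SM resources (assumed values, not from the paper).
    max_ctas: int = 8         # CTA scheduling slots
    max_warps: int = 48       # warp contexts (PCs, SIMT stacks)
    registers: int = 32768    # register file entries
    shared_mem: int = 49152   # shared memory in bytes


@dataclass(frozen=True)
class Kernel:
    warps_per_cta: int
    regs_per_thread: int
    smem_per_cta: int  # bytes


def scheduling_limit(sm: SMConfig, k: Kernel) -> int:
    """CTAs the scheduling structures can track at once."""
    return min(sm.max_ctas, sm.max_warps // k.warps_per_cta)


def capacity_limit(sm: SMConfig, k: Kernel) -> int:
    """CTAs whose registers and shared memory fit on chip."""
    regs_per_cta = k.regs_per_thread * WARP_SIZE * k.warps_per_cta
    by_regs = sm.registers // regs_per_cta
    by_smem = sm.shared_mem // k.smem_per_cta if k.smem_per_cta else by_regs
    return min(by_regs, by_smem)


class VirtualThreadSM:
    """Keeps `resident` CTAs on chip, with only `active_slots` schedulable."""

    def __init__(self, resident: int, active_slots: int):
        ids = list(range(resident))
        self.active = ids[:active_slots]
        self.inactive = deque(ids[active_slots:])

    def on_all_warps_stalled(self, cta: int) -> int:
        """Swap out `cta`; return the CTA now occupying its active slot."""
        if not self.inactive:
            return cta  # nothing to swap in, the CTA stays active
        slot = self.active.index(cta)
        incoming = self.inactive.popleft()
        # Only scheduling state changes hands here: registers and shared
        # memory of both CTAs remain resident, so nothing is spilled.
        self.active[slot] = incoming
        self.inactive.append(cta)
        return incoming


sm = SMConfig()
kernel = Kernel(warps_per_cta=2, regs_per_thread=16, smem_per_cta=2048)
resident = capacity_limit(sm, kernel)                # 24 CTAs fit on chip
active = min(scheduling_limit(sm, kernel), resident) # 8 can be scheduled
vt = VirtualThreadSM(resident, active)
```

With these assumed numbers, a conventional scheduler would keep only 8 CTAs on the SM, while the sketch keeps 24 resident and rotates them through the 8 active slots, for example via `vt.on_all_warps_stalled(0)`.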