Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit

doi:10.1109/isca.2016.59

Journal Article•10.1109/isca.2016.59

Myung Kuk Yoon, +4 more

- 01 Jun 2016

31

TL;DR: The paper proposes a Virtual Thread (VT) architecture that raises thread-level parallelism beyond the GPU scheduling limit by assigning Cooperative Thread Arrays (CTAs) up to the capacity limit while keeping the number of active CTAs within the scheduling limit, so logic complexity does not grow.

Abstract: Modern GPUs require tens of thousands of concurrent threads to fully utilize their massive processing resources. However, thread concurrency in GPUs can be diminished either by a shortage of thread scheduling structures (the scheduling limit), such as available program counters and single-instruction, multiple-thread (SIMT) stacks, or by a shortage of on-chip memory (the capacity limit), such as register file and shared memory. Our evaluations show that, in practice, concurrency in many general-purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture that assigns Cooperative Thread Arrays (CTAs) up to the capacity limit while ignoring the scheduling limit. To reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long-latency stall, the active CTA is context-switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit, which obviates the need to save and restore large amounts of CTA state. Thus, VT significantly reduces the performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
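
The distinction between the two limits, and the active/inactive split the abstract describes, can be made concrete with a small model. The Python sketch below is illustrative only: the resource figures in `SMConfig`, the `Kernel` parameters, and the function names are assumptions chosen for the example, not values or interfaces from the paper. It computes how many CTAs fit under each limit, then swaps out an active CTA whose warps are all stalled in favor of a ready inactive one. Nothing is copied during the swap, since every resident CTA already holds its registers and shared memory.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class SMConfig:
    # Hypothetical per-SM resources; illustrative values, not from the paper.
    max_warps: int = 48             # scheduling structures (PCs, SIMT stacks)
    max_ctas: int = 8               # CTA slots the scheduler can track
    registers: int = 32768          # register file entries
    shared_mem_bytes: int = 49152   # shared memory capacity


@dataclass(frozen=True)
class Kernel:
    threads_per_cta: int
    regs_per_thread: int
    shared_mem_per_cta: int
    warp_size: int = 32

    @property
    def warps_per_cta(self) -> int:
        # Ceiling division: a partial warp still occupies a warp slot.
        return -(-self.threads_per_cta // self.warp_size)


def scheduling_limit(sm: SMConfig, k: Kernel) -> int:
    """CTAs allowed by scheduling structures (CTA slots and warp slots)."""
    return min(sm.max_ctas, sm.max_warps // k.warps_per_cta)


def capacity_limit(sm: SMConfig, k: Kernel) -> int:
    """CTAs allowed by on-chip memory (register file and shared memory)."""
    by_regs = sm.registers // (k.threads_per_cta * k.regs_per_thread)
    if k.shared_mem_per_cta == 0:
        return by_regs
    return min(by_regs, sm.shared_mem_bytes // k.shared_mem_per_cta)


def vt_partition(sm: SMConfig, k: Kernel) -> tuple[int, int]:
    """Return (active, inactive) CTA counts when CTAs fill the capacity limit."""
    resident = capacity_limit(sm, k)
    active = min(resident, scheduling_limit(sm, k))
    return active, resident - active


def swap_stalled(active, inactive, fully_stalled):
    """Replace each fully stalled active CTA with a ready inactive CTA.

    Only bookkeeping changes: all resident CTAs keep their registers and
    shared memory, so no state is saved or restored in this model.
    """
    ready = deque(c for c in inactive if c not in fully_stalled)
    parked = [c for c in inactive if c in fully_stalled]
    new_active = []
    for cta in active:
        if cta in fully_stalled and ready:
            new_active.append(ready.popleft())
            parked.append(cta)
        else:
            new_active.append(cta)
    return new_active, list(ready) + parked


if __name__ == "__main__":
    sm = SMConfig()
    k = Kernel(threads_per_cta=64, regs_per_thread=16, shared_mem_per_cta=0)
    print("scheduling limit:", scheduling_limit(sm, k))
    print("capacity limit:", capacity_limit(sm, k))
    print("active, inactive:", vt_partition(sm, k))
```

With these assumed numbers, the scheduling structures admit far fewer CTAs than the register file could hold, which is the situation the abstract reports as common in practice. The real architecture tracks readiness and stalls in hardware; the set-based `fully_stalled` argument here is a simplification.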

Citations

Journal Article•10.1016/J.JPDC.2018.11.012

A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity

Mahmoud Khairy, +3 more

- 01 May 2019

- Journal of Parallel and Distributed Comp...

TL;DR: The survey covers GPUs from two perspectives: architectural advances that improve performance and programmability, and advances that enhance CPU–GPU integration in heterogeneous systems.

24

Journal Article•10.1109/LCA.2018.2889042

Improving GPU Multitasking Efficiency Using Dynamic Resource Sharing

Jiho Kim, +4 more

- 01 Jan 2019

- IEEE Computer Architecture Letters

TL;DR: Experiments show that combining multiple sub-resource borrowing techniques improves total throughput over the baseline spatial multitasking GPU by up to 26 percent, and by 9.5 percent on average.

22

Journal Article•10.1002/SPE.2670

Survey on the run‐time systems of enterprise application integration platforms focusing on performance

Daniela L. Freire, +3 more

- 01 Mar 2019

- Software - Practice and Experience

TL;DR: The paper evaluates nine open‐source integration platforms that represent the state of the art, support the integration patterns, and follow the pipes‐and‐filters architectural style.

21

Proceedings Article•10.1109/HPCA.2018.00041

WIR: Warp Instruction Reuse to Minimize Repeated Computations in GPUs

Keunsoo Kim, +1 more

- 27 Mar 2018

TL;DR: This paper proposes warp instruction reuse to allow such repeated warp instructions to reuse previous computation results instead of actually executing the instructions, and proposes warp register reuse which allows identical warp register values to share a single physical register through register renaming.

21

Proceedings Article•10.1109/MICRO.2018.00037

FineReg: fine-grained register file management for augmenting GPU throughput

Yunho Oh, +3 more

- 20 Oct 2018

TL;DR: This paper proposes a novel GPU architecture called FineReg that improves overall throughput by running more concurrent CTAs, which it achieves by reducing the effective size of per-CTA registers.

15
