DaCache: Memory Divergence-Aware GPU Cache Management

Proceedings Article•10.1145/2751205.2751239

Bin Wang, +3 more

- 08 Jun 2015

- pp 89-98

33

TL;DR: This paper proposes Divergence-Aware Cache (DaCache) management, which orchestrates L1D cache management and warp scheduling together for GPGPUs. DaCache achieves a 40.4% performance improvement over the baseline GPU and outperforms two state-of-the-art thrashing-resistant techniques, RRIP and DIP, by 40% and 24.9%, respectively.

Abstract: The lock-step execution model of GPU requires a warp to have the data blocks for all its threads before execution. However, there is a lack of salient cache mechanisms that can recognize the need of managing GPU cache blocks at the warp level for increasing the number of warps ready for execution. In addition, warp scheduling is very important for GPU-specific cache management to reduce both intra- and inter-warp conflicts and maximize data locality. In this paper, we propose a Divergence-Aware Cache (DaCache) management that can orchestrate L1D cache management and warp scheduling together for GPGPUs. In DaCache, the insertion position of an incoming data block depends on the fetching warp's scheduling priority. Blocks of warps with lower priorities are inserted closer to the LRU position of the LRU-chain so that they have shorter lifetime in cache. This fine-grained insertion policy is extended to prioritize coherent loads over divergent loads so that coherent loads are less vulnerable to both inter- and intra-warp thrashing. DaCache also adopts a constrained replacement policy with L1D bypassing to sustain a good supply of Fully Cached Warps (FCW), along with a dynamic mechanism to adjust FCW during runtime. Our experiments demonstrate that DaCache achieves 40.4% performance improvement over the baseline GPU and outperforms two state-of-the-art thrashing-resistant techniques RRIP and DIP by 40% and 24.9%, respectively.
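The priority-dependent insertion described in the abstract can be illustrated with a minimal sketch of a single cache set. This is not the authors' implementation: the class name `CacheSet`, the linear mapping from warp rank to insertion position, the rule that pushes divergent loads into the LRU half of the chain, and promotion to MRU on a hit are all assumptions made for illustration. The constrained replacement policy, L1D bypassing, and the dynamic FCW adjustment are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class CacheSet:
    """One set of an L1D cache; chain[0] is MRU, chain[-1] is LRU."""
    ways: int
    chain: list = field(default_factory=list)

    def insert_position(self, warp_rank, num_warps, divergent):
        # Assumed mapping: rank 0 (highest scheduling priority) inserts at MRU,
        # lower-priority warps insert proportionally closer to LRU.
        pos = (warp_rank * self.ways) // num_warps
        if divergent:
            # Assumed rule: divergent loads never insert in the MRU half,
            # so coherent loads are favoured.
            pos = max(pos, self.ways // 2)
        return min(pos, self.ways - 1)

    def access(self, tag, warp_rank, num_warps, divergent=False):
        """Return True on a hit, False on a miss (the block is then inserted)."""
        if tag in self.chain:
            # Assumed promotion policy: move the block to MRU on a hit.
            self.chain.remove(tag)
            self.chain.insert(0, tag)
            return True
        if len(self.chain) == self.ways:
            self.chain.pop()  # evict the block at the LRU position
        pos = min(self.insert_position(warp_rank, num_warps, divergent),
                  len(self.chain))
        self.chain.insert(pos, tag)
        return False


if __name__ == "__main__":
    s = CacheSet(ways=4)
    s.access("A", warp_rank=0, num_warps=4)                  # high priority
    s.access("B", warp_rank=3, num_warps=4)                  # low priority
    s.access("C", warp_rank=1, num_warps=4)
    s.access("D", warp_rank=0, num_warps=4, divergent=True)  # divergent load
    print(s.chain)
```

In this sketch a block fetched by a low-priority warp, or by a divergent load, sits nearer the LRU end and is therefore evicted sooner, which is the behaviour the abstract attributes to the fine-grained insertion policy.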

Citations

Proceedings Article•10.1109/ISCA.2018.00074

The locality descriptor: a holistic cross-layer abstraction to express data locality in GPUs

Nandita Vijaykumar, +4 more

- 02 Jun 2018

TL;DR: The Locality Descriptor is proposed, a cross-layer abstraction to explicitly express and exploit data locality in GPUs that improves performance by 26.6% on average when exploiting reuse-based locality in the cache hierarchy, and by 53.7% when exploiting NUMA locality in a NUMA memory system.

71

Proceedings Article•10.1109/ISCA.2018.00025

Scheduling page table walks for irregular GPU applications

Seunghee Shin, +6 more

- 02 Jun 2018

TL;DR: This work discovers that the order of servicing GPU's address translation requests plays a key role in determining the amount of translation overhead experienced by an application, and shows that better forward progress is achieved by prioritizing translation requests from the instructions that require less work to service their address translation needs.

66

Proceedings Article•10.1145/2967938.2967947

OAWS: Memory Occlusion Aware Warp Scheduling

Bin Wang, +2 more

- 11 Sep 2016

TL;DR: This work has designed memory Occlusion Aware Warp Scheduling (OAWS) that can dynamically predict the demand of MSHR entries of divergent memory instructions, and maximize the number of concurrent warps such that their aggregate MSHR consumptions are within the MSHR capacity.

27

Journal Article•10.1016/J.JPDC.2018.11.012

A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity

Mahmoud Khairy, +3 more

- 01 May 2019

- Journal of Parallel and Distributed Comp...

TL;DR: A survey about GPUs from two perspectives is provided: architectural advances to improve performance and programmability and advances to enhance CPU–GPU integration in heterogeneous systems.

24

Proceedings Article•10.1145/3404397.3404400

CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs

Jiya Su, +6 more

- 17 Aug 2020

TL;DR: CapelliniSpTRSV is proposed, a thread-level synchronization-free SpTRSV algorithm that can achieve very good performance on the most popular sparse matrix storage, compressed sparse row (CSR) format, and thus users do not need to conduct format conversion.

15

References

Proceedings Article•10.1109/IISWC.2009.5306797

Rodinia: A benchmark suite for heterogeneous computing

Shuai Che, +6 more

- 04 Oct 2009

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

3.2K

Journal Article•10.1147/SJ.52.0078

A study of replacement algorithms for a virtual-storage computer

Laszlo A. Belady

- 01 Jun 1966

- IBM Systems Journal

TL;DR: One of the basic limitations of a digital computer is the size of its available memory; an approach that permits the programmer to use a sufficiently large address range can accomplish this objective, assuming that means are provided for automatic execution of the memory-overlay functions.

1.8K

Proceedings Article•10.1109/ISPASS.2009.4919648

Analyzing CUDA workloads using a detailed GPU simulator

Ali Bakhoda, +4 more

- 26 Apr 2009

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.

1.8K

Journal Article•10.1109/MM.2008.31

NVIDIA Tesla: A Unified Graphics and Computing Architecture

Erik Lindholm, +3 more

- 01 Mar 2008

- IEEE Micro

TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.

1.6K

Proceedings Article•10.1145/1815961.1815971

High performance cache replacement using re-reference interval prediction (RRIP)

Aamer Jaleel, +3 more

- 19 Jun 2010

TL;DR: This paper proposes Static RRIP, which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant; both require only 2 bits per cache block and integrate easily into existing LRU approximations found in modern processors.

829

Related Papers (5)

Rodinia: A benchmark suite for heterogeneous computing

Analyzing CUDA workloads using a detailed GPU simulator

Cache-Conscious Wavefront Scheduling

Divergence-aware warp scheduling

A quantitative study of irregular programs on GPUs