Proceedings Article. DOI: 10.1145/2751205.2751239
DaCache: Memory Divergence-Aware GPU Cache Management
Bin Wang,Weikuan Yu,Xian-He Sun,Xinning Wang +3 more
- 08 Jun 2015
- pp 89-98
33
TL;DR: This paper proposes Divergence-Aware Cache (DaCache) management, which orchestrates L1D cache management and warp scheduling together for GPGPUs. DaCache achieves a 40.4% performance improvement over the baseline GPU and outperforms two state-of-the-art thrashing-resistant techniques, RRIP and DIP, by 40% and 24.9%, respectively.
Abstract: The lock-step execution model of GPUs requires a warp to have the data blocks for all its threads before execution. However, there is a lack of cache mechanisms that recognize the need to manage GPU cache blocks at the warp level in order to increase the number of warps ready for execution. In addition, warp scheduling is very important for GPU-specific cache management to reduce both intra- and inter-warp conflicts and maximize data locality. In this paper, we propose a Divergence-Aware Cache (DaCache) management that can orchestrate L1D cache management and warp scheduling together for GPGPUs. In DaCache, the insertion position of an incoming data block depends on the fetching warp's scheduling priority. Blocks of warps with lower priorities are inserted closer to the LRU position of the LRU chain so that they have a shorter lifetime in the cache. This fine-grained insertion policy is extended to prioritize coherent loads over divergent loads so that coherent loads are less vulnerable to both inter- and intra-warp thrashing. DaCache also adopts a constrained replacement policy with L1D bypassing to sustain a good supply of Fully Cached Warps (FCW), along with a dynamic mechanism to adjust FCW during runtime. Our experiments demonstrate that DaCache achieves a 40.4% performance improvement over the baseline GPU and outperforms two state-of-the-art thrashing-resistant techniques, RRIP and DIP, by 40% and 24.9%, respectively.
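The priority-dependent insertion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `PriorityInsertionSet`, the linear mapping from priority to chain position, and the promote-to-MRU behavior on a hit are all assumptions made for illustration. It models a single cache set and omits coherent/divergent load handling, bypassing, and the FCW mechanism.

```python
class PriorityInsertionSet:
    """One cache set whose LRU chain is a list: index 0 is MRU, the last index is LRU."""

    def __init__(self, ways, num_priorities):
        self.ways = ways
        self.num_priorities = num_priorities
        self.chain = []  # block tags, MRU first

    def insertion_index(self, priority):
        # Assumption: priority 0 is the highest scheduling priority and maps to
        # the MRU end; the lowest priority maps to the LRU end. The linear
        # mapping is illustrative, not the paper's exact function.
        if self.num_priorities <= 1:
            return 0
        return (priority * (self.ways - 1)) // (self.num_priorities - 1)

    def access(self, tag, priority):
        """Return True on a hit, False on a miss (the block is then inserted)."""
        if tag in self.chain:
            # Hit: promote to MRU as in plain LRU (an assumption of this sketch).
            self.chain.remove(tag)
            self.chain.insert(0, tag)
            return True
        if len(self.chain) == self.ways:
            self.chain.pop()  # evict the block at the LRU position
        # Clamp so that a partially filled set still accepts the block.
        index = min(self.insertion_index(priority), len(self.chain))
        self.chain.insert(index, tag)
        return False


if __name__ == "__main__":
    cache_set = PriorityInsertionSet(ways=4, num_priorities=4)
    for tag, priority in [("A", 0), ("B", 3), ("C", 0), ("D", 3), ("E", 3)]:
        cache_set.access(tag, priority)
    print(cache_set.chain)
```

In this toy run, blocks fetched by the lowest-priority warp land at the LRU end and evict one another, while blocks fetched by the highest-priority warp stay near the MRU end, which is the shorter-lifetime effect the abstract describes.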
Citations
The locality descriptor: a holistic cross-layer abstraction to express data locality in GPUs
Nandita Vijaykumar,Eiman Ebrahimi,Kevin Hsieh,Phillip B. Gibbons,Onur Mutlu +4 more
- 02 Jun 2018
TL;DR: The Locality Descriptor is proposed, a cross-layer abstraction to explicitly express and exploit data locality in GPUs that improves performance by 26.6% on average when exploiting reuse-based locality in the cache hierarchy, and by 53.7% when exploiting NUMA locality in a NUMA memory system.
71
Scheduling page table walks for irregular GPU applications
Seunghee Shin,Guilherme Cox,Mark Oskin,Gabriel H. Loh,Yan Solihin,Abhishek Bhattacharjee,Arkaprava Basu +6 more
- 02 Jun 2018
TL;DR: This work discovers that the order of servicing GPU's address translation requests plays a key role in determining the amount of translation overhead experienced by an application, and shows that better forward progress is achieved by prioritizing translation requests from the instructions that require less work to service their address translation needs.
66
OAWS: Memory Occlusion Aware Warp Scheduling
Bin Wang,Yue Zhu,Weikuan Yu +2 more
- 11 Sep 2016
TL;DR: This work has designed memory Occlusion Aware Warp Scheduling (OAWS) that can dynamically predict the demand of MSHR entries of divergent memory instructions, and maximize the number of concurrent warps such that their aggregate MSHR consumptions are within the MSHR capacity.
27
A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity
TL;DR: A survey about GPUs from two perspectives is provided: architectural advances to improve performance and programmability and advances to enhance CPU–GPU integration in heterogeneous systems.
24
CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs
Jiya Su,Feng Zhang,Weifeng Liu,Bingsheng He,Ruofan Wu,Xiaoyong Du,Rujia Wang +6 more
- 17 Aug 2020
TL;DR: CapelliniSpTRSV is proposed, a thread-level synchronization-free SpTRSV algorithm that can achieve very good performance on the most popular sparse matrix storage, compressed sparse row (CSR) format, and thus users do not need to conduct format conversion.
15
References
Rodinia: A benchmark suite for heterogeneous computing
Shuai Che,Michael Boyer,Jiayuan Meng,David Tarjan,Jeremy W. Sheaffer,Sang-Ha Lee,Kevin Skadron +6 more
- 04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
A study of replacement algorithms for a virtual-storage computer
TL;DR: One of the basic limitations of a digital computer is the size of its available memory; an approach that permits the programmer to use a sufficiently large address range can accomplish this objective, assuming that means are provided for automatic execution of the memory-overlay functions.
Analyzing CUDA workloads using a detailed GPU simulator
Ali Bakhoda,George L. Yuan,Wilson W. L. Fung,Henry Wong,Tor M. Aamodt +4 more
- 26 Apr 2009
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
NVIDIA Tesla: A Unified Graphics and Computing Architecture
TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.
High performance cache replacement using re-reference interval prediction (RRIP)
Aamer Jaleel,Kevin B. Theobald,Simon C. Steely,Joel Emer +3 more
- 19 Jun 2010
TL;DR: This paper proposes Static RRIP, which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant; both require only 2 bits per cache block and integrate easily into existing LRU approximations found in modern processors.
829
Related Papers (5)
Timothy G. Rogers,Mike O'Connor,Tor M. Aamodt +2 more
- 01 Dec 2012
Timothy G. Rogers,Mike O'Connor,Tor M. Aamodt +2 more
- 07 Dec 2013
Martin Burtscher,Rupesh Nasre,Keshav Pingali +2 more
- 04 Nov 2012