Arbitrary Modulus Indexing
Jeff Diamond,Donald S. Fussell,Stephen W. Keckler +2 more
- 13 Dec 2014
- pp 140-152
TL;DR: A new scheme called Arbitrary Modulus Indexing (AMI) is introduced that can be implemented efficiently for all moduli, matching or improving the efficiency of the best existing schemes for primes while allowing great flexibility in choosing a modulus to optimize cost/performance trade-offs.
Abstract: Modern high-performance processors require memory systems that can provide access to data at a rate that is well matched to the processor's computation rate. Common to such systems is the organization of memory into local high-speed memory banks that can be accessed in parallel. Associative lookup of values is made efficient through indexing instead of associative memories. These techniques lose effectiveness when data locations are not mapped uniformly to the banks or cache locations, leading to bottlenecks that arise from excess demand on a subset of locations. Address mapping is most easily performed by indexing the banks using a mod 2^N indexing scheme, but such schemes interact poorly with the memory access patterns of many computations, making resource conflicts a significant memory system bottleneck. Previous work has assumed that prime moduli are the best choices to alleviate conflicts and has concentrated on finding efficient implementations for them. In this paper, we introduce a new scheme called Arbitrary Modulus Indexing (AMI) that can be implemented efficiently for all moduli, matching or improving the efficiency of the best existing schemes for primes while allowing great flexibility in choosing a modulus to optimize cost/performance trade-offs. We also demonstrate that, for a memory-intensive workload on a modern replay-style GPU architecture, prime moduli are not in general the best choices for memory bank and cache set mappings. Applying AMI to a set of memory-intensive benchmarks eliminates 98% of bank and set conflicts, resulting in an average speedup of 24% over an aggressive baseline system and a 64% average reduction in memory system replays at reasonable implementation cost.
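The conflict problem the abstract describes can be illustrated with a small sketch. It is not the AMI hardware scheme from the paper; it only counts how many accesses of a strided pattern land on the same bank under plain modulo indexing. The function names, the stride of 32, and the bank counts 32, 31, and 30 are assumptions chosen for illustration.

```python
from collections import Counter


def bank_of(address, num_banks):
    """Map a word address to a bank with plain modulo indexing (illustrative only)."""
    return address % num_banks


def worst_bank_load(stride, num_accesses, num_banks):
    """Return the largest number of accesses that land on any single bank."""
    loads = Counter(bank_of(i * stride, num_banks) for i in range(num_accesses))
    return max(loads.values())


if __name__ == "__main__":
    # Hypothetical pattern: 32 parallel accesses with a stride of 32 words.
    for num_banks in (32, 31, 30):
        load = worst_bank_load(stride=32, num_accesses=32, num_banks=num_banks)
        print(f"{num_banks} banks, stride 32: worst bank load = {load}")
```

With a power-of-two modulus of 32, all 32 accesses fall on one bank. With 31 banks the worst bank receives 2 accesses, and with the non-prime modulus 30 it receives 3. This toy pattern only shows that the modulus choice changes the conflict count; it does not reproduce the paper's finding about which moduli are best for real workloads.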
Citations
A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity
TL;DR: A survey about GPUs from two perspectives is provided: architectural advances to improve performance and programmability and advances to enhance CPU–GPU integration in heterogeneous systems.
24
(FPL 2015) Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies
Hsin-Jung Yang,Kermin Fleming,Felix Winterstein,Michael Adler,Joel Emer +4 more
- 01 Sep 2015
TL;DR: An initial exploration of methods for automating the construction of application-specific memory hierarchies is performed, and it is demonstrated, by examining both hand-assembled and HLS-compiled benchmarks, that the application-optimized memory system can improve pre-existing application runtime by 25% on average.
14
Quantifying the performance and energy efficiency of advanced cache indexing for GPGPU computing
Kyu Yeun Kim,Woongki Baek +1 more
TL;DR: The experimental results show that the ACI schemes are promising in that they continue to provide significant performance gains even when additional indexing latency occurs due to the hardware complexity and the baseline cache is enhanced with high associativity or large capacity.
13
Dynamic Associativity Management in Tiled CMPs by Runtime Adaptation of Fellow Sets
TL;DR: The proposed technique, called FS-DAM, dynamically creates fellow-groups based on the current set loads, ensuring that the heavily used sets are evenly distributed among all the fellow-groups, which increases the utilization of the cache and hence improves performance.
8
Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs
Gert-Jan van den Braak,Juan Gómez-Luna,José María González-Linares,Henk Corporaal,Nicolás Guil +4 more
TL;DR: This paper explores the use of configurable bit-vector and bitwise XOR-based hash functions to evenly distribute memory addresses of the access patterns over the memory banks, reducing the number of bank conflicts.
7