Arbitrary Modulus Indexing

doi:10.1109/MICRO.2014.13

Open Access•Proceedings Article•10.1109/MICRO.2014.13

Jeff Diamond, +2 more

- 13 Dec 2014

- pp 140-152

22

TL;DR: A new scheme called Arbitrary Modulus Indexing (AMI) is introduced that can be implemented efficiently for all moduli, matching or improving the efficiency of the best existing schemes for primes while allowing great flexibility in choosing a modulus to optimize cost/performance trade-offs.

Abstract: Modern high-performance processors require memory systems that can provide access to data at a rate that is well matched to the processor's computation rate. Common to such systems is the organization of memory into local high-speed memory banks that can be accessed in parallel. Associative lookup of values is made efficient through indexing instead of associative memories. These techniques lose effectiveness when data locations are not mapped uniformly to the banks or cache locations, leading to bottlenecks that arise from excess demand on a subset of locations. Address mapping is most easily performed by indexing the banks using a mod 2^N indexing scheme, but such schemes interact poorly with the memory access patterns of many computations, making resource conflicts a significant memory system bottleneck. Previous work has assumed that prime moduli are the best choices to alleviate conflicts and has concentrated on finding efficient implementations for them. In this paper, we introduce a new scheme called Arbitrary Modulus Indexing (AMI) that can be implemented efficiently for all moduli, matching or improving the efficiency of the best existing schemes for primes while allowing great flexibility in choosing a modulus to optimize cost/performance trade-offs. We also demonstrate that, for a memory-intensive workload on a modern replay-style GPU architecture, prime moduli are not in general the best choices for memory bank and cache set mappings. Applying AMI to a set of memory-intensive benchmarks eliminates 98% of bank and set conflicts, resulting in an average speedup of 24% over an aggressive baseline system and a 64% average reduction in memory system replays at reasonable implementation cost.
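
The conflict problem described in the abstract can be illustrated with a small sketch. This is not the AMI hardware scheme from the paper; it is a toy software model that uses Python's ordinary `%` operator, and the function name `max_bank_load`, the stride of 32, and the bank counts 31, 32, and 33 are all assumptions chosen for illustration. It counts how many of 32 strided accesses land in the most heavily loaded bank under different moduli.

```python
from collections import Counter


def max_bank_load(num_banks, stride, num_accesses=32):
    """Return the number of accesses that fall in the busiest bank.

    Hypothetical model: access i touches word address i * stride, and the
    bank index is simply address mod num_banks.
    """
    banks = Counter((i * stride) % num_banks for i in range(num_accesses))
    return max(banks.values())


# A stride of 32 words, e.g. walking down a column of a 32-wide row-major array.
pow2 = max_bank_load(num_banks=32, stride=32)   # power-of-two modulus
prime = max_bank_load(num_banks=31, stride=32)  # prime modulus
odd = max_bank_load(num_banks=33, stride=32)    # composite, non-power-of-two modulus

print(pow2, prime, odd)
```

In this toy case the power-of-two modulus sends every access to the same bank, while both the prime modulus 31 and the composite modulus 33 spread the accesses almost evenly. That is only a single access pattern, so it says nothing about which modulus is best in general; it merely shows why the choice of modulus matters.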

Citations

Journal Article•10.1016/J.JPDC.2018.11.012

A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity

Mahmoud Khairy, +3 more

- 01 May 2019

- Journal of Parallel and Distributed Comp...

TL;DR: A survey about GPUs from two perspectives is provided: architectural advances to improve performance and programmability and advances to enhance CPU–GPU integration in heterogeneous systems.

24

Journal Article•10.1145/3009971

(FPL 2015) Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies

Hsin-Jung Yang, +4 more

- 01 Sep 2015

TL;DR: An initial exploration of methods for automating the construction of application-specific memory hierarchies is performed, and it is demonstrated, by examining both hand-assembled and HLS-compiled benchmarks, that the application-optimized memory system can improve pre-existing application runtime by 25% on average.

14

Journal Article•10.1016/J.MICPRO.2016.01.003

Quantifying the performance and energy efficiency of advanced cache indexing for GPGPU computing

Kyu Yeun Kim, +1 more

- 01 Jun 2016

- Microprocessors and Microsystems

TL;DR: The experimental results show that the ACI schemes are promising in that they continue to provide significant performance gains even when additional indexing latency occurs due to the hardware complexity and the baseline cache is enhanced with high associativity or large capacity.

13

Journal Article•10.1109/TPDS.2017.2657512

Dynamic Associativity Management in Tiled CMPs by Runtime Adaptation of Fellow Sets

Shirshendu Das, +1 more

- 01 Aug 2017

- IEEE Transactions on Parallel and Distri...

TL;DR: The proposed technique, called FS-DAM, dynamically creates fellow-groups based on the current set loads, ensuring that the heavily used sets are evenly distributed among all the fellow-groups, which increases the utilization of the cache and hence improves performance.

8

Journal Article•10.1109/TC.2015.2479595

Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs

Gert-Jan van den Braak, +4 more

- 01 Jul 2016

- IEEE Transactions on Computers

TL;DR: This paper explores the use of configurable bit-vector and bitwise XOR-based hash functions to evenly distribute memory addresses of the access patterns over the memory banks, reducing the number of bank conflicts.

7

References

Proceedings Article•10.1109/IISWC.2009.5306797

Rodinia: A benchmark suite for heterogeneous computing

Shuai Che, +6 more

- 04 Oct 2009

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

3.2K

Proceedings Article•10.1109/ISPASS.2009.4919648

Analyzing CUDA workloads using a detailed GPU simulator

Ali Bakhoda, +4 more

- 26 Apr 2009

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.

1.8K

Proceedings Article•10.1145/2485922.2485964

GPUWattch: enabling energy optimizations in GPGPUs

Jingwen Leng, +6 more

- 23 Jun 2013

TL;DR: This work proposes a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements, and accurately tracks the power consumption trend over time.

621

Proceedings Article•10.1109/ISPASS.2010.5452013

Demystifying GPU microarchitecture through microbenchmarking

Henry Wong, +3 more

- 28 Mar 2010

TL;DR: This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU, exposing undocumented features that impact program performance and correctness.

540

Proceedings Article•10.1145/2000064.2000093

Energy-efficient mechanisms for managing thread context in throughput processors

Mark Gebhart, +6 more

- 04 Jun 2011

TL;DR: Two complementary techniques for reducing energy on massively-threaded processors such as GPUs are presented and it is shown that on average, across a variety of real world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59% respectively.

320