Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems

doi:10.1145/3007787.3001159

Journal Article10.1145/3007787.3001159

Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems

Kevin Hsieh, +7 more

- 18 Jun 2016

- Vol. 44, Iss: 3, pp 204-216

288

TL;DR: Extensive evaluations across a variety of modern memory-intensive GPU workloads show that TOM significantly improves performance compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.

Abstract: Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. An unsolved key challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer such that any application can transparently benefit from near-data processing capabilities in the logic layer.Our paper develops two new mechanisms to address this key challenge. First, a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. Second, a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code, and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping.Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.

Chat with Paper

AI Agents for this Paper

Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps

Citations

Proceedings Article•10.1145/2749469.2750386

A scalable processing-in-memory accelerator for parallel graph processing

Junwhan Ahn, +4 more

- 13 Jun 2015

TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.

...read moreread less

937

Proceedings Article•10.1145/3123939.3124544

Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology

Vivek Seshadri, +9 more

- 14 Oct 2017

TL;DR: Ambit is proposed, an Accelerator-in-Memory for bulk bitwise operations that largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area).

...read moreread less

614

Proceedings Article•10.1145/3173162.3173177

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand, +10 more

- 19 Mar 2018

TL;DR: This work comprehensively analyzes the energy and performance impact of data movement for several widely-used Google consumer workloads, and finds that processing-in-memory (PIM) can significantly reduceData movement for all of these workloads by performing part of the computation close to memory.

...read moreread less

368

Proceedings Article•10.1109/HPCA.2017.54

GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks

Lifeng Nai, +5 more

- 01 Feb 2017

TL;DR: GraphPIM is presented, a full-stack solution for graph computing that achieves higher performance using PIM functionality and an extension to PIM operations that can further bring performance benefits for more graph applications.

...read moreread less

318

Proceedings Article•10.1109/ICCD.2016.7753257

Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation

Kevin Hsieh, +6 more

- 01 Oct 2016

TL;DR: The In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal and addresses the key challenges of how to achieve high parallelism in the presence of serial accesses in pointer chasing, and how to effectively perform virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU's memory management unit.

...read moreread less

272

...

Expand

References

•Book

Principles and Practices of Interconnection Networks

William J. Dally, +1 more

- 01 Jan 2004

TL;DR: This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.

...read moreread less

3.6K

•Proceedings Article•10.1109/IISWC.2009.5306797

Rodinia: A benchmark suite for heterogeneous computing

Shuai Che, +6 more

- 04 Oct 2009

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

...read moreread less

3.2K

•Proceedings Article•10.1109/ISPASS.2009.4919648

Analyzing CUDA workloads using a detailed GPU simulator

Ali Bakhoda, +4 more

- 26 Apr 2009

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.

...read moreread less

1.8K

•Proceedings Article•10.1145/339647.339668

Memory access scheduling

Scott Rixner, +4 more

- 01 May 2000

TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.

...read moreread less

1.1K

Proceedings Article•10.1145/2749469.2750386

A scalable processing-in-memory accelerator for parallel graph processing

Junwhan Ahn, +4 more

- 13 Jun 2015

TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.

...read moreread less

937