Journal Article, DOI: 10.1145/3007787.3001159
Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems
Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, Stephen W. Keckler
- 18 Jun 2016
- Vol. 44, Iss: 3, pp 204-216
288
TL;DR: Extensive evaluations across a variety of modern memory-intensive GPU workloads show that TOM significantly improves performance compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
Abstract: Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. An unsolved key challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer such that any application can transparently benefit from near-data processing capabilities in the logic layer.
Our paper develops two new mechanisms to address this key challenge. First, a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. Second, a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code, and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping.
Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
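As an illustration of the kind of cost-benefit analysis the abstract mentions, the sketch below estimates whether offloading a code block would reduce off-chip traffic. It is a simplified illustration, not the paper's actual model: the `CandidateBlock` fields, the fixed byte sizes, and the rule "offload when estimated saved traffic exceeds register-transfer cost" are all assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed sizes for illustration only; real hardware differs.
BYTES_PER_ACCESS = 4   # bytes moved per load/store, per thread
BYTES_PER_REG = 4      # bytes per live-in/live-out register


@dataclass
class CandidateBlock:
    """Hypothetical summary of a code block the compiler might offload."""
    name: str
    loads: int            # load instructions in the block
    stores: int           # store instructions in the block
    live_in_regs: int     # registers that must be sent to the memory stack
    live_out_regs: int    # registers that must be sent back to the main GPU
    trip_count: int = 1   # assumed number of loop iterations


def offload_benefit(block: CandidateBlock) -> int:
    """Estimated bytes of off-chip traffic saved by offloading (may be negative)."""
    # Traffic avoided: memory accesses that no longer cross the off-chip link.
    saved = (block.loads + block.stores) * block.trip_count * BYTES_PER_ACCESS
    # Traffic added: register state shipped to and from the logic-layer GPU.
    cost = (block.live_in_regs + block.live_out_regs) * BYTES_PER_REG
    return saved - cost


def select_offload_candidates(blocks: list[CandidateBlock]) -> list[str]:
    """Return the names of blocks whose estimated benefit is positive."""
    return [b.name for b in blocks if offload_benefit(b) > 0]


blocks = [
    CandidateBlock("loop", loads=2, stores=1, live_in_regs=3,
                   live_out_regs=1, trip_count=100),
    CandidateBlock("scalar", loads=1, stores=0, live_in_regs=4,
                   live_out_regs=2),
]
print(select_offload_candidates(blocks))
```

Under these assumed numbers, the memory-heavy loop is selected while the short block with many live registers is not, because shipping its register state would cost more traffic than its single load saves.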
Citations
A scalable processing-in-memory accelerator for parallel graph processing
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, Kiyoung Choi
- 13 Jun 2015
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology
Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie S. Kim, Michael Kozuch, Onur Mutlu, Phillip B. Gibbons, Todd C. Mowry
- 14 Oct 2017
TL;DR: This work proposes Ambit, an Accelerator-in-Memory for bulk bitwise operations that largely exploits existing DRAM structure and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area).
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks
Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Dae Hyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu
- 19 Mar 2018
TL;DR: This work comprehensively analyzes the energy and performance impact of data movement for several widely-used Google consumer workloads, and finds that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads by performing part of the computation close to memory.
368
GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks
Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, Hyesoon Kim
- 01 Feb 2017
TL;DR: GraphPIM is presented, a full-stack solution for graph computing that achieves higher performance using PIM functionality and an extension to PIM operations that can further bring performance benefits for more graph applications.
318
Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation
Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu
- 01 Oct 2016
TL;DR: This work proposes the In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal. It addresses two key challenges: how to achieve high parallelism in the presence of serial accesses in pointer chasing, and how to effectively perform virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU's memory management unit.
272
References
•Book
Principles and Practices of Interconnection Networks
William J. Dally, Brian Towles
- 01 Jan 2004
TL;DR: This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
3.6K
Rodinia: A benchmark suite for heterogeneous computing
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, Kevin Skadron
- 04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Analyzing CUDA workloads using a detailed GPU simulator
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt
- 26 Apr 2009
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Memory access scheduling
Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, John D. Owens
- 01 May 2000
TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.
A scalable processing-in-memory accelerator for parallel graph processing
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, Kiyoung Choi
- 13 Jun 2015
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.