Proceedings Article • DOI: 10.1109/CLUSTER.2017.42
Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems
Yingchao Huang, Dong Li
- 01 Sep 2017
- pp 166-177
14
TL;DR: This paper introduces performance modeling techniques to predict the performance of various data placements on GPU, along with a series of techniques to model the critical performance factors that cause performance variation across data placements.
Abstract: A heterogeneous memory system (HMS) consists of multiple memory components with different properties. GPU is a representative architecture with HMS. It is challenging to decide optimal placement of data objects on HMS because of the large exploration space and complicated memory hierarchy on HMS. In this paper, we introduce performance modeling techniques to predict performance of various data placements on GPU. In essence, our models quantify and capture implicit performance correlation between different data placements. Given the memory access information and performance of a sample data placement, our models predict performance for other data placements based on the quantified correlation. We reveal critical performance factors that cause performance variation across data placements. Those factors include instruction replay, addressing mode, hardware queuing delay of memory requests, off-chip memory access latency, and caching effects. Those factors, which are often not sufficiently considered in the existing performance models, can significantly impact modeling accuracy. We introduce a series of techniques to model those factors. We extensively evaluate our models with a variety of benchmarks with various data placements. Our models are able to quantitatively predict the benefit or performance loss of data placements.
Citations
Is Data Placement Optimization Still Relevant on Newer GPUs
Abdullah Shahneous Bari, Larisa Stoltzfus, Pei-Hung Lin, Chunhua Liao, Murali Emani, Barbara Chapman
- 01 Nov 2018
TL;DR: A set of experiments is designed to explore the relevance of data placement optimizations on several generations of NVIDIA GPUs, including Kepler, Maxwell, Pascal, and Volta, and show that newer generations of GPUs are less sensitive to data placement optimization compared to older ones.
12
Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness
Zheng Xie, Jie Liu, Jiajia Li, Dong Li
- 25 Feb 2023
TL;DR: In this article, a load-balance-aware page management system named Merchandiser is proposed to solve the problem of load imbalance among tasks in task-parallel HPC applications.
6
Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading
TL;DR: OAO is presented, a compiler-based approach to automatically translate shared-memory OpenMP data-parallel programs to run on heterogeneous multicores through OpenMP offloading directives, allowing programmers to continue using a single-source-based programming language that they are familiar with while benefiting from the heterogeneous performance.
5
XUnified: A Framework for Guiding Optimal Use of GPU Unified Memory
01 Jan 2022
TL;DR: XUnified is an advice controller that combines offline training with online adaptation to guide the optimal use of unified memory and discrete memory for various applications at run-time.
4
3D photonics as enabling technology for deep 3D DRAM stacking
Sebastian Werner, Pouya Fotouhi, Xian Xiao, Marjan Fariborz, S. J. Ben Yoo, George Michelogiannakis, Dilip Vasudevan
- 30 Sep 2019
TL;DR: This paper proposes a hierarchical approach to stacking 3D DRAM to tens of layers by using sub-stacks that are optically interconnected to a memory interface on the processor die, and shows that photonics could be a key enabler for deep 3D DRAM, offering at least 2× interconnect area savings compared to TSVs for the same bandwidth, with comparable performance and less power.
References
Analyzing CUDA workloads using a detailed GPU simulator
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt
- 26 Apr 2009
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Scalable high performance main memory system using phase-change memory technology
Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, Jude A. Rivers
- 20 Jun 2009
TL;DR: This paper analyzes a PCM-based hybrid main memory system using an architecture-level model of PCM and proposes simple organizational and management solutions for the hybrid memory that reduce the write traffic to PCM, boosting its lifetime from 3 years to 9.7 years.
1.5K
• Journal Article
Modern Information Retrieval : A Brief Overview
TL;DR: This article is a brief overview of the key advances in the field of Information Retrieval, and a description of where the state of the art stands in the field.
Memory access scheduling
Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, John D. Owens
- 01 May 2000
TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, Wen-mei W. Hwu
- 20 Feb 2008
TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.