Journal Article • DOI: 10.1016/J.VLSI.2018.12.006
Efficient automatic parallelization of a single GPU program for a multiple GPU system
TL;DR: This work explores hardware support for efficiently parallelizing a single-GPU program for execution on multiple GPUs, and proposes a data-location-aware thread block scheduler that assigns each thread block to the GPU holding most of its input data.
About: This article is published in Integration. The article was published on 01 May 2019. The article focuses on the topics: Scheduling (computing) & Automatic parallelization.
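The data-location-aware scheduling idea in the TL;DR can be illustrated with a small host-side sketch. This is not the paper's hardware mechanism: the names `pick_gpu`, `schedule`, `page_owner`, and `blocks` are hypothetical, and the policy shown (count the input pages each GPU holds, choose the largest count, break ties toward the lowest GPU id) is an assumed simplification that ignores load balancing.

```python
from collections import Counter


def pick_gpu(block_pages, page_owner, num_gpus):
    """Return the GPU that holds the most input pages of one thread block.

    block_pages: page ids the block reads (hypothetical input).
    page_owner:  dict mapping page id -> id of the GPU holding that page.
    Ties are broken toward the lowest GPU id (an assumption of this sketch).
    """
    counts = Counter(page_owner[p] for p in block_pages if p in page_owner)
    return max(range(num_gpus), key=lambda g: (counts[g], -g))


def schedule(blocks, page_owner, num_gpus):
    """Map each thread block id to a GPU using the locality heuristic."""
    return {bid: pick_gpu(pages, page_owner, num_gpus)
            for bid, pages in blocks.items()}


# Toy layout: pages 0-3 live on GPU 0, pages 4-7 live on GPU 1.
page_owner = {p: (0 if p < 4 else 1) for p in range(8)}
blocks = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7], 3: [2, 5]}
placement = schedule(blocks, page_owner, num_gpus=2)
print(placement)  # {0: 0, 1: 1, 2: 1, 3: 0}
```

In the toy layout, block 1 reads one page from GPU 0 and two from GPU 1, so it is placed on GPU 1; block 3 reads one page from each GPU and falls back to GPU 0 through the tie-break. A realistic scheduler would also have to weigh how evenly the blocks are spread across GPUs.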
Citations
PMLES: a hybrid OpenMP CUDA source code for LES of turbulent flows
TL;DR: This work presents PMLES, a new OpenMP CUDA Fortran solver for complex turbulent flows at high Reynolds numbers and large computational domains (about 1 × 10^8 cells), using a single GPU card.
•Posted Content
EPSR++: An Open Source Empirical Potential Structure Refinement Neutron Data Analysis Framework Supporting Parallel Across Computer Cluster Nodes and GPU Hardware Acceleration
Changli Ma, He Cheng, Taisen Zuo, Guisheng Jiao, Zehua Han
- 12 Apr 2019
TL;DR: An open source framework, EPSR++, is introduced that can be parallelized across nodes within a computer cluster and supports GPU acceleration. The framework is written in object-oriented C++, so users can conveniently define custom simulation boxes, atoms, molecules, and random motion patterns for their analysis.
NeuDATool: An open source neutron data analysis tools, supporting GPU hardware acceleration, and across-computer cluster nodes parallel
TL;DR: An open source framework, NeuDATool, is proposed and tested. It is written in the object-oriented language C++, can be parallelized across nodes within a computer cluster, and supports GPU acceleration.
References
Rodinia: A benchmark suite for heterogeneous computing
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, Kevin Skadron
- 04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Analyzing CUDA workloads using a detailed GPU simulator
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt
- 26 Apr 2009
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
CACTI 6.0: A Tool to Model Large Caches
Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi
- 01 Jan 2009
TL;DR: This report details the analytical model assumed for the newly added modules along with their validation analysis of CACTI 6.0, a significantly enhanced version of the tool that primarily focuses on interconnect design for large caches.
Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing
John A. Stratton, Christopher I. Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, Wen-mei W. Hwu
- 01 Jan 2012
TL;DR: By including versions of varying levels of optimization of the same fundamental algorithm, the Parboil benchmarks present opportunities to demonstrate tools and architectures that help programmers get the most out of their parallel hardware.
Operating system support for improving data locality on CC-NUMA compute servers
Ben Verghese, Scott W. Devine, Anoop Gupta, Mendel Rosenblum
- 01 Sep 1996
TL;DR: The experiments show that dynamic page migration and replication can substantially increase application performance, by as much as 30%, and reduce contention for resources in the NUMA memory system.