Efficient automatic parallelization of a single GPU program for a multiple GPU system

doi:10.1016/J.VLSI.2018.12.006

•Journal Article•10.1016/J.VLSI.2018.12.006

Matam Kiran Kumar, +2 more

- 01 May 2019

- Integration

- Vol. 66, pp 35-43

3

TL;DR: This work explores hardware support for efficiently parallelizing single-GPU code for execution on multiple GPUs, and proposes a data-location-aware thread block scheduler that places each thread block on the GPU holding most of its input data.
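
The scheduling idea in the summary can be illustrated with a small software sketch. This is not the paper's implementation, which relies on hardware support. The function `pick_gpu`, the `page_owner` map, and the sample data are all hypothetical, and the tie-breaking rule (lowest GPU id wins) is an assumption made only for this example.

```python
from collections import Counter


def pick_gpu(input_pages, page_owner, num_gpus):
    """Return the id of the GPU that holds the most of a block's input pages.

    input_pages: page ids read by one thread block (hypothetical input).
    page_owner:  dict mapping page id -> GPU id that currently holds it.
    """
    counts = Counter(page_owner[p] for p in input_pages if p in page_owner)
    # Assumption: ties, and blocks with no known pages, go to the lowest GPU id.
    return max(range(num_gpus), key=lambda g: (counts[g], -g))


# Hypothetical placement: pages 0-1 live on GPU 0, pages 2-4 on GPU 1.
page_owner = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
blocks = {"tb0": [0, 1, 2], "tb1": [2, 3, 4], "tb2": [1, 4]}

schedule = {tb: pick_gpu(pages, page_owner, 2) for tb, pages in blocks.items()}
print(schedule)  # {'tb0': 0, 'tb1': 1, 'tb2': 0}
```

Here `tb0` reads two pages from GPU 0 and one from GPU 1, so it is placed on GPU 0; `tb2` is a tie and falls back to the lowest id.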

Citations

•Journal Article•10.36884/JAFM.13.04.30698

PMLES: a hybrid OpenMP CUDA source code for LES of turbulent flows

Jean Monteiro de Pinho, +1 more

- 01 Jul 2020

- Journal of Applied Fluid Mechanics

TL;DR: PMLES is presented, a new OpenMP CUDA Fortran solver for complex turbulent flows at high Reynolds numbers and large computational domains (about 1 × 10^8 cells), using a single GPU card.

3

•Posted Content

EPSR++: An Open Source Empirical Potential Structure Refinement Neutron Data Analysis Framework Supporting Parallel Across Computer Cluster Nodes and GPU Hardware Acceleration

Changli Ma, +4 more

- 12 Apr 2019

TL;DR: An open source framework, EPSR++, is introduced that can be parallelized across the nodes of a computer cluster and supports GPU acceleration. The framework is written in object-oriented C++, so users can conveniently define custom simulation boxes, atoms, molecules, and random motion patterns for their analysis.

1

•Journal Article•10.1063/1674-0068/CJCP2005077

NeuDATool: An open source neutron data analysis tools, supporting GPU hardware acceleration, and across-computer cluster nodes parallel

Changli Ma, +5 more

- 27 Dec 2020

- Chinese Journal of Chemical Physics

TL;DR: NeuDATool, an open source framework written in the object-oriented language C++, is proposed and tested; it can be parallelized across the nodes of a computer cluster and supports GPU acceleration.

References

•Proceedings Article•10.1109/IISWC.2009.5306797

Rodinia: A benchmark suite for heterogeneous computing

Shuai Che, +6 more

- 04 Oct 2009

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

3.2K

•Proceedings Article•10.1109/ISPASS.2009.4919648

Analyzing CUDA workloads using a detailed GPU simulator

Ali Bakhoda, +4 more

- 26 Apr 2009

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.

1.8K

CACTI 6.0: A Tool to Model Large Caches

Naveen Muralimanohar, +2 more

- 01 Jan 2009

TL;DR: This report details the analytical models assumed for the modules newly added in CACTI 6.0, along with their validation. CACTI 6.0 is a significantly enhanced version of the tool that focuses primarily on interconnect design for large caches.

1K

Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing

John A. Stratton, +7 more

- 01 Jan 2012

TL;DR: By including versions of varying levels of optimization of the same fundamental algorithm, the Parboil benchmarks present opportunities to demonstrate tools and architectures that help programmers get the most out of their parallel hardware.

830

•Proceedings Article•10.1145/237090.237205

Operating system support for improving data locality on CC-NUMA compute servers

Ben Verghese, +3 more

- 01 Sep 1996

TL;DR: The experiments show that dynamic page migration and replication can substantially increase application performance, by as much as 30%, and reduce contention for resources in the NUMA memory system.

300