High Performance Stencil Code Algorithms for GPGPUs

doi:10.1016/J.PROCS.2011.04.221

Open Access • Journal Article • 10.1016/J.PROCS.2011.04.221

Andreas Schäfer, +1 more

- 01 Jan 2011

- Vol. 4, pp 2027-2036

63

TL;DR: This paper presents the first successful application of temporal blocking for 3D stencils on GPGPUs, exceeding previous results by a considerable margin, and is also the first paper to study stencil codes on Fermi.

Abstract: In this paper we investigate how stencil computations can be implemented on state-of-the-art general-purpose graphics processing units (GPGPUs). Stencil codes lie at the core of many numerical solvers and physical simulation codes and are therefore of particular interest to scientific computing research. GPGPUs have recently gained a lot of attention because of their superior floating-point performance and memory bandwidth. Nevertheless, memory-bound stencil codes in particular have proven challenging for GPGPUs, yielding lower speedups than expected. We chose the Jacobi method as a standard benchmark to evaluate a set of algorithms on NVIDIA's latest Fermi chipset. One of our fastest algorithms is a parallel wavefront update. It exploits the enlarged on-chip shared memory to perform two time step updates per sweep. To the best of our knowledge, it represents the first successful application of temporal blocking for 3D stencils on GPGPUs, and it exceeds previous results by a considerable margin. This is also the first paper to study stencil codes on Fermi.
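The idea of performing two time step updates per sweep can be illustrated with a minimal CPU-side sketch in Python. It is not the authors' GPU implementation: the 6-neighbour averaging form of the Jacobi update, the fixed boundary cells, and all function names (`update_plane`, `jacobi_step`, `jacobi_two_steps_wavefront`) are assumptions made for illustration. The rolling three-plane window stands in for on-chip shared memory: each source plane is read once, and the second time step is computed from intermediate planes that never leave the window.

```python
def update_plane(below, mid, above):
    """One Jacobi update of plane `mid` from its two z-neighbours.

    Assumed stencil: average of the 6 face neighbours. Edge cells are copied.
    """
    ny, nx = len(mid), len(mid[0])
    out = [row[:] for row in mid]
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            out[y][x] = (below[y][x] + above[y][x]
                         + mid[y - 1][x] + mid[y + 1][x]
                         + mid[y][x - 1] + mid[y][x + 1]) / 6.0
    return out


def jacobi_step(grid):
    """Reference: one full time step over a grid indexed as grid[z][y][x]."""
    nz = len(grid)
    result = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        result[z] = update_plane(grid[z - 1], grid[z], grid[z + 1])
    return result


def jacobi_two_steps_wavefront(grid):
    """Two time steps in a single sweep along z (temporal blocking sketch).

    `window` holds at most three step-1 planes; it plays the role of the
    fast on-chip buffer in this illustration.
    """
    nz = len(grid)
    result = [[row[:] for row in plane] for plane in grid]
    if nz == 0:
        return result
    # Step-1 version of the bottom boundary plane is the plane itself.
    window = [[row[:] for row in grid[0]]]
    for z in range(1, nz):
        if z < nz - 1:
            step1 = update_plane(grid[z - 1], grid[z], grid[z + 1])
        else:
            step1 = [row[:] for row in grid[z]]  # fixed top boundary plane
        window.append(step1)
        if len(window) == 3:
            # Step-1 planes z-2, z-1, z are available: emit step-2 plane z-1.
            result[z - 1] = update_plane(window[0], window[1], window[2])
            window.pop(0)
    return result


# Usage: a single hot cell in the centre of a 5x5x5 grid.
n = 5
demo = [[[0.0] * n for _ in range(n)] for _ in range(n)]
demo[2][2][2] = 6.0
fused = jacobi_two_steps_wavefront(demo)
```

The fused sweep performs the same arithmetic in the same order as two calls to `jacobi_step`, so the results agree exactly. What changes is the access pattern: the full grid is traversed once instead of twice, which is the property that matters for memory-bound codes.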

Citations

Journal Article•10.1002/NLA.1808

Improving the arithmetic intensity of multigrid with the help of polynomial smoothers

Pieter Ghysels, +4 more

- 01 Mar 2012

- Numerical Linear Algebra With Applicatio...

TL;DR: The authors propose a polynomial Chebyshev smoother, implemented using cache-aware tiling, to increase the arithmetic intensity of a multigrid V-cycle, and observe optimal performance for higher degrees of the smoother.

39

Proceedings Article•10.1109/CLUSTER.2012.72

Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments

John Jenkins, +4 more

- 24 Sep 2012

TL;DR: This work presents an MPI data type-processing system capable of efficiently processing arbitrary data types directly on the GPU, and presents a means for converting conventional data type representations into a GPU-amenable format.

36

Proceedings Article•10.1109/ICCD.2016.7753287

Tuning Stencil codes in OpenCL for FPGAs

Qi Jia, +1 more

- 07 Sep 2016

TL;DR: This paper explores OpenCL code optimizations for stencil computations on FPGAs in both the Single-Task and NDRange modes and proposes tuning processes that can achieve up to two orders of magnitude performance improvement over the naïve kernels.

36

Journal Article•10.1142/S0218126614300025

A SURVEY OF TECHNIQUES FOR MANAGING AND LEVERAGING CACHES IN GPUs

Sparsh Mittal

- 18 Jun 2014

- Journal of Circuits, Systems, and Comput...

TL;DR: This paper surveys several architectural and system-level techniques proposed for managing and leveraging GPU caches, with the aim of motivating even better techniques for exploiting the full potential of caches in future GPUs.

35

Journal Article•10.1002/CPE.3417

Adaptation of fluid model EULAG to graphics processing unit architecture

Krzysztof Rojek, +9 more

- 25 Mar 2015

- Concurrency and Computation: Practice an...

TL;DR: The proposed method provides a comprehensive analysis of the resource consumption of the EULAG algorithm, including data transfers between host and global memory and between global and shared memory, as well as GPU occupancy, and shows a promising increase in computational efficiency.

26




Related Papers (5)

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Roofline: an insightful visual performance model for multicore architectures

Cache oblivious stencil computations

Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers