Efficient 3D stencil computations using CUDA

doi:10.1016/J.PARCO.2013.08.002

Journal Article

Marcin Krotkiewski, +1 more

- 01 Oct 2013

- Vol. 39, Iss: 10, pp 533-548

32

TL;DR: In the presented implementation of 7-point and 27-point stencils, the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and the method of reading the data is close to optimal in terms of memory bandwidth usage.

Abstract: We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs. A new method of reading the data from the global memory to the shared memory of thread blocks is developed. The method avoids conditional statements and requires only two coalesced instructions to load the tile data with the halo (ghost zone). Additional optimizations include storing only one XY tile of data at a time in the shared memory to lower shared memory requirements, common subexpression elimination to reduce the number of instructions, and software prefetching to overlap arithmetic and memory instructions and enhance latency hiding. The efficiency of our implementation is analyzed using a simple stencil memory footprint model that takes into account the actual halo overhead due to the minimum memory transaction size on the GPUs. Through experiments we demonstrate that in our implementation the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and that our method of reading the data is close to optimal in terms of memory bandwidth usage. Detailed performance analysis for single precision stencil computations, and performance results for single and double precision arithmetic on two Tesla cards are presented. Our stencil implementations are more efficient than any other implementation described in the literature to date. On Tesla C2050 with single and double precision arithmetic, our 7-point stencil achieves an average throughput of 12.3 and 6.5 Gpts/s, respectively (98 GFLOP/s and 52 GFLOP/s, respectively). The symmetric 27-point stencil sustains a throughput of 10.9 and 5.8 Gpts/s, respectively.
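To make the quantities in the abstract concrete, here is a minimal CPU reference sketch in Python. It is not the authors' CUDA kernel. It assumes a symmetric 7-point stencil of the form c0 * center + c1 * (sum of the six face neighbors), and the names `stencil7`, `gpts_to_gflops` and `FLOPS_PER_POINT` are hypothetical. The count of 8 floating-point operations per grid point is an assumption inferred from the reported ratio of 98 GFLOP/s to 12.3 Gpts/s.

```python
def stencil7(u, c0, c1):
    """Apply a symmetric 7-point stencil to the interior of a 3D grid.

    u is a nested list indexed as u[z][y][x]. Boundary points are left at 0.0.
    """
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                # Six face neighbors: 5 additions
                neighbors = (u[z][y][x - 1] + u[z][y][x + 1]
                             + u[z][y - 1][x] + u[z][y + 1][x]
                             + u[z - 1][y][x] + u[z + 1][y][x])
                # 2 multiplications + 1 addition, 8 flops per point in total
                out[z][y][x] = c0 * u[z][y][x] + c1 * neighbors
    return out


# Assumption: 8 flops per updated point, as counted in stencil7 above.
FLOPS_PER_POINT = 8


def gpts_to_gflops(gpts_per_s, flops_per_point=FLOPS_PER_POINT):
    """Convert throughput in Gpts/s (billions of grid points per second) to GFLOP/s."""
    return gpts_per_s * flops_per_point


if __name__ == "__main__":
    # Constant field with Laplacian-like weights: the interior result is zero.
    grid = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
    print(stencil7(grid, -6.0, 1.0)[1][1][1])
    # Reported 7-point throughputs on Tesla C2050 (single and double precision).
    print(gpts_to_gflops(12.3), gpts_to_gflops(6.5))
```

Under the 8 flops per point assumption, 12.3 and 6.5 Gpts/s correspond to roughly 98 and 52 GFLOP/s, which matches the figures quoted in the abstract.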

Citations

Journal Article•10.1145/3570638

Optimization Techniques for GPU Programming

Pieter Hijma, +4 more

- 14 Nov 2022

- ACM Computing Surveys

TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes the optimizations from different perspectives, showing that the various optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.

54

Journal Article•10.1002/2017GL074293

Pore Fluid Extraction by Reactive Solitary Waves in 3‐D

Samuel Omlin, +2 more

- 28 Sep 2017

- Geophysical Research Letters

TL;DR: In this paper, the authors present results from high-resolution simulations using Graphic Processing Unit parallel processing with a model that includes both viscous (de)compaction and reaction-induced porosity changes.

38

Journal Article•10.1109/TPDS.2014.2384047

Implementation of the DWT in a GPU through a Register-based Strategy

Pablo Enfedaque, +2 more

- 01 Dec 2015

- IEEE Transactions on Parallel and Distri...

TL;DR: Experimental results indicate that the proposed method is at least four times faster than the best GPU implementation of the DWT found in the literature, and theoretical analysis coincides with experimental tests in showing that the execution times achieved by the proposed implementation are close to the GPU's performance limits.

36

Journal Article•10.1002/CPE.3417

Adaptation of fluid model EULAG to graphics processing unit architecture

Krzysztof Rojek, +9 more

- 25 Mar 2015

- Concurrency and Computation: Practice an...

TL;DR: A method is proposed, which ensures a comprehensive analysis of the resource consumption of the EULAG algorithm, including data transfers between host and global memory, global and shared memories, as well as GPU occupancy, which shows a promising increase in terms of computational efficiency.

26

Proceedings Article•10.1145/3126908.3126920

Tessellating stencils

Liang Yuan, +3 more

- 12 Nov 2017

TL;DR: This paper proposes a novel two-level tessellation scheme for stencil computations that performs up to 12% better than the existing highly concurrent schemes for the 3d27p stencil.

25

References

Proceedings Article•10.1145/1654059.1654078

Implementing sparse matrix-vector multiplication on throughput-oriented processors

Nathan Bell, +1 more

- 14 Nov 2009

TL;DR: This work explores SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes, including structured grid and unstructured mesh matrices.

1K

Proceedings Article•10.1145/1198555.1198795

Linear algebra operators for GPU implementation of numerical algorithms

Jens Krüger, +1 more

- 01 Jul 2003

TL;DR: This work proposes a stream model for arithmetic operations on vectors and matrices that exploits the intrinsic parallelism and efficient communication on modern GPUs and introduces a framework for the implementation of linear algebra operators on programmable graphics processors (GPUs), thus providing the building blocks for the design of more complex numerical algorithms.

762

Proceedings Article•10.1109/SC.2008.5214359

Benchmarking GPUs to tune dense linear algebra

Volkov, +1 more

- 01 Jan 2008

TL;DR: It is argued that modern GPUs should be viewed as multithreaded multicore vector units; the presented algorithms exploit blocking similarly to vector computers, and exploit the heterogeneity of the system by computing on both GPU and CPU.

711

Journal Article•10.1016/J.PARCO.2009.12.005

Towards dense linear algebra for hybrid GPU accelerated manycore systems

Stanimire Tomov, +2 more

- 01 Jun 2010

TL;DR: This work motivates the need for new algorithms that split the computation so as to fully exploit the power of each hybrid component, and envisions a DLA library similar to LAPACK but for hybrid manycore/GPU systems.

500

Proceedings Article•10.1109/IPDPS.2011.70

PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures

Matthias Christen, +2 more

- 16 May 2011

TL;DR: This work presents a code generation and auto-tuning framework for stencil computations targeted at multi- and many core processors, such as multicore CPUs and graphics processing units, which makes it possible to generate compute kernels from a specification of the stencil operation and a parallelization and optimization strategy, and leverages the auto tuning methodology to optimize strategy-dependent parameters for the given hardware architecture.

387