Journal Article, DOI: 10.1016/J.PARCO.2013.08.002
Efficient 3D stencil computations using CUDA
Marcin Krotkiewski, Marcin Dabrowski
- 01 Oct 2013
- Vol. 39, Iss: 10, pp 533-548
32
TL;DR: It is demonstrated that in the implementation the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and that the method of reading the data is close to optimal in terms of memory bandwidth usage.
Abstract: We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs. A new method of reading the data from the global memory to the shared memory of thread blocks is developed. The method avoids conditional statements and requires only two coalesced instructions to load the tile data with the halo (ghost zone). Additional optimizations include storing only one XY tile of data at a time in the shared memory to lower shared memory requirements, common subexpression elimination to reduce the number of instructions, and software prefetching to overlap arithmetic and memory instructions and enhance latency hiding. The efficiency of our implementation is analyzed using a simple stencil memory footprint model that takes into account the actual halo overhead due to the minimum memory transaction size on the GPUs. Through experiments we demonstrate that in our implementation the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and that our method of reading the data is close to optimal in terms of memory bandwidth usage. Detailed performance analysis for single precision stencil computations, and performance results for single and double precision arithmetic on two Tesla cards are presented. Our stencil implementations are more efficient than any other implementation described in the literature to date. On Tesla C2050 with single and double precision arithmetic our 7-point stencil achieves an average throughput of 12.3 and 6.5 Gpts/s, respectively (98 GFLOP/s and 52 GFLOP/s, respectively). The symmetric 27-point stencil sustains a throughput of 10.9 and 5.8 Gpts/s, respectively.
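To make the abstract's figures concrete, here is a minimal CPU reference sketch in plain Python. It is not the authors' CUDA kernel. It assumes a two-coefficient 7-point stencil (`c0` for the center, `c1` for the six face neighbors), which costs 8 floating-point operations per point; that count is inferred from the reported ratio of 98 GFLOP/s to 12.3 Gpts/s and is not stated in the abstract. The function `halo_read_ratio` is a hypothetical, simplified footprint estimate: it assumes 4-byte elements, a 32-byte minimum memory transaction, and no cache reuse, so it gives an upper bound that the abstract says is largely avoided in practice. All function names and the tile sizes used with them are illustrative.

```python
# Reference sketch only: names, coefficients, and tile sizes are illustrative assumptions.

# 5 adds (neighbor sum) + 1 mul (c1) + 1 mul (c0) + 1 add = 8 flops per point.
# Assumption inferred from the abstract's 98 GFLOP/s at 12.3 Gpts/s.
FLOPS_PER_POINT_7PT = 8


def stencil_7pt(u, c0, c1):
    """Apply out = c0*center + c1*(sum of 6 face neighbors) to interior points.

    u is a nested list indexed as u[z][y][x]; boundary points are copied unchanged.
    """
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    out = [[row[:] for row in plane] for plane in u]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbors = (u[z][y][x - 1] + u[z][y][x + 1]
                             + u[z][y - 1][x] + u[z][y + 1][x]
                             + u[z - 1][y][x] + u[z + 1][y][x])
                out[z][y][x] = c0 * u[z][y][x] + c1 * neighbors
    return out


def gflops(gpts_per_s, flops_per_point=FLOPS_PER_POINT_7PT):
    """Convert grid-point throughput (Gpts/s) to arithmetic throughput (GFLOP/s)."""
    return gpts_per_s * flops_per_point


def halo_read_ratio(bx, by, elem_bytes=4, min_transaction_bytes=32):
    """Elements fetched for one XY tile divided by its bx*by interior elements.

    Simplified model (assumption): top and bottom halo rows are read as full
    coalesced rows, while each left or right halo element costs one whole
    minimum-size transaction. Corner points are not needed by a 7-point stencil.
    """
    elems_per_transaction = min_transaction_bytes // elem_bytes
    interior = bx * by
    top_bottom = 2 * bx
    left_right = 2 * by * elems_per_transaction
    return (interior + top_bottom + left_right) / interior


if __name__ == "__main__":
    grid = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
    print(stencil_7pt(grid, -6.0, 1.0)[1][1][1])  # discrete Laplacian of a constant field
    print(gflops(12.3), gflops(6.5))
    print(halo_read_ratio(32, 16))
```

Comparing `halo_read_ratio(bx, by)` with the same call using `min_transaction_bytes=4` separates the geometric halo cost from the extra cost imposed by the transaction size, which is the distinction the abstract's footprint model draws.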
Citations
Optimization Techniques for GPU Programming
TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.
54
Pore Fluid Extraction by Reactive Solitary Waves in 3‐D
TL;DR: In this paper, the authors present results from high-resolution simulations using Graphics Processing Unit (GPU) parallel processing with a model that includes both viscous (de)compaction and reaction-induced porosity changes.
38
Implementation of the DWT in a GPU through a Register-based Strategy
TL;DR: Experimental results indicate that the proposed method is at least four times faster than the best GPU implementation of the DWT found in the literature, and theoretical analysis agrees with experimental tests in showing that the execution times achieved by the proposed implementation are close to the GPU's performance limits.
Adaptation of fluid model EULAG to graphics processing unit architecture
Krzysztof Rojek, Milosz Ciznicki, Bogdan Rosa, Piotr Kopta, Michal Kulczewski, Krzysztof Kurowski, Zbigniew P. Piotrowski, Lukasz Szustak, Damian K. Wójcik, Roman Wyrzykowski
TL;DR: A method is proposed that ensures a comprehensive analysis of the resource consumption of the EULAG algorithm, including data transfers between host and global memory and between global and shared memories, as well as GPU occupancy, and that shows a promising increase in computational efficiency.
26
Tessellating stencils
Liang Yuan, Yunquan Zhang, Guo Peng, Shan Huang
- 12 Nov 2017
TL;DR: This paper proposes a novel two-level tessellation scheme for stencil computations that performs up to 12% better than the existing highly concurrent schemes for the 3d27p stencil.
25
References
Implementing sparse matrix-vector multiplication on throughput-oriented processors
Nathan Bell, Michael Garland
- 14 Nov 2009
TL;DR: This work explores SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes, including structured grid and unstructured mesh matrices.
Linear algebra operators for GPU implementation of numerical algorithms
Jens Krüger, Rüdiger Westermann
- 01 Jul 2003
TL;DR: This work proposes a stream model for arithmetic operations on vectors and matrices that exploits the intrinsic parallelism and efficient communication on modern GPUs and introduces a framework for the implementation of linear algebra operators on programmable graphics processors (GPUs), thus providing the building blocks for the design of more complex numerical algorithms.
762
Benchmarking GPUs to tune dense linear algebra
TL;DR: It is argued that modern GPUs should be viewed as multithreaded multicore vector units; the work exploits blocking, as on vector computers, and the heterogeneity of the system by computing on both the GPU and the CPU.
Towards dense linear algebra for hybrid GPU accelerated manycore systems
Stanimire Tomov, Jack Dongarra, Marc Baboulin
- 01 Jun 2010
TL;DR: The need for new algorithms that would split the computation in a way that would fully exploit the power that each of the hybrid components possesses is motivated, and the need for a DLA library similar to LAPACK but for hybrid manycore/GPU systems is envisioned.
500
PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures
Matthias Christen, Olaf Schenk, Helmar Burkhart
- 16 May 2011
TL;DR: This work presents a code generation and auto-tuning framework for stencil computations targeted at multi- and many-core processors, such as multicore CPUs and graphics processing units. The framework generates compute kernels from a specification of the stencil operation and a parallelization and optimization strategy, and uses auto-tuning to optimize strategy-dependent parameters for the given hardware architecture.
387