High Performance Stencil Code Algorithms for GPGPUs
Andreas Schäfer, Dietmar Fey
- 01 Jan 2011
- Vol. 4, pp 2027-2036
TL;DR: To the best of the authors' knowledge, this paper presents the first successful application of temporal blocking for 3D stencils on GPGPUs, exceeding previous results by a considerable margin. It is also the first paper to study stencil codes on Fermi.
Abstract: In this paper we investigate how stencil computations can be implemented on state-of-the-art general purpose graphics processing units (GPGPUs). Stencil codes lie at the core of many numerical solvers and physical simulation codes and are therefore of particular interest to scientific computing research. GPGPUs have recently gained a lot of attention because of their superior floating point performance and memory bandwidth. Nevertheless, memory-bound stencil codes in particular have proven challenging for GPGPUs, yielding lower-than-expected speedups. We chose the Jacobi method as a standard benchmark to evaluate a set of algorithms on NVIDIA's latest Fermi chipset. One of our fastest algorithms is a parallel wavefront update. It exploits the enlarged on-chip shared memory to perform two time step updates per sweep. To the best of our knowledge, it represents the first successful application of temporal blocking for 3D stencils on GPGPUs and thereby exceeds previous results by a considerable margin. This is also the first paper to study stencil codes on Fermi.
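The idea of performing two time step updates per sweep can be illustrated with a small CPU-side sketch. This is not the authors' GPU kernel: it is a plain Python model, assuming a 7-point Jacobi stencil with fixed boundary cells, in which a short sliding window of intermediate planes stands in for the on-chip shared memory. All function names (`make_grid`, `jacobi_plane`, `jacobi_step`, `two_steps_wavefront`) are hypothetical and chosen only for this illustration.

```python
def make_grid(nz, ny, nx):
    """Build a deterministic 3D test grid as nested lists, indexed [z][y][x]."""
    return [[[float((7 * x + 3 * y + 5 * z) % 11) for x in range(nx)]
             for y in range(ny)] for z in range(nz)]


def jacobi_plane(below, mid, above):
    """Update one z-plane with a 7-point Jacobi stencil (average of 6 neighbors).

    Cells on the plane's edge are boundary cells and are copied unchanged.
    """
    ny, nx = len(mid), len(mid[0])
    out = [row[:] for row in mid]
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            out[y][x] = (below[y][x] + above[y][x] +
                         mid[y - 1][x] + mid[y + 1][x] +
                         mid[y][x - 1] + mid[y][x + 1]) / 6.0
    return out


def jacobi_step(grid):
    """Reference: one full sweep over the grid, advancing a single time step."""
    nz = len(grid)
    new = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        new[z] = jacobi_plane(grid[z - 1], grid[z], grid[z + 1])
    return new


def two_steps_wavefront(grid):
    """Advance two time steps in ONE sweep along z (needs at least 3 planes).

    As soon as plane z of time step 1 is available, plane z-1 of time step 2
    has all of its inputs. Only three intermediate planes are alive at once;
    on a GPU this small window is what would sit in fast on-chip memory.
    """
    nz = len(grid)
    result = [None] * nz
    result[0] = [row[:] for row in grid[0]]            # fixed boundary plane
    result[nz - 1] = [row[:] for row in grid[nz - 1]]  # fixed boundary plane
    window = {0: grid[0]}  # time-step-1 planes currently kept
    for z in range(1, nz):
        if z < nz - 1:
            window[z] = jacobi_plane(grid[z - 1], grid[z], grid[z + 1])
        else:
            window[z] = grid[z]  # boundary plane does not change
        if z >= 2:
            result[z - 1] = jacobi_plane(window[z - 2], window[z - 1], window[z])
            del window[z - 2]  # no longer needed, free the slot
    return result


if __name__ == "__main__":
    g = make_grid(6, 5, 7)
    fused = two_steps_wavefront(g)
    reference = jacobi_step(jacobi_step(g))
    print("fused sweep matches two separate sweeps:", fused == reference)
```

The fused sweep reads the input grid once and writes the result once, whereas two separate sweeps read and write it twice. For a memory-bound stencil, that reduction in traffic to main memory is the motivation behind temporal blocking.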
Citations
Improving the arithmetic intensity of multigrid with the help of polynomial smoothers
TL;DR: A polynomial Chebyshev smoother is proposed, which is implemented using cache‐aware tiling, to increase the arithmetic intensity of a multigrid V‐cycle and observes optimal performance for higher degrees of the smoother.
39
Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments
John Jenkins, James Dinan, Pavan Balaji, Nagiza F. Samatova, Rajeev Thakur
- 24 Sep 2012
TL;DR: This work presents an MPI data type-processing system capable of efficiently processing arbitrary data types directly on the GPU, and presents a means for converting conventional data type representations into a GPU-amenable format.
Tuning Stencil codes in OpenCL for FPGAs
Qi Jia, Huiyang Zhou
- 07 Sep 2016
TL;DR: This paper explores OpenCL code optimizations for stencil computations on FPGAs in both the Single-Task and NDRange modes and proposes tuning processes that can achieve up to two orders of magnitude performance improvement over the naïve kernels.
A SURVEY OF TECHNIQUES FOR MANAGING AND LEVERAGING CACHES IN GPUs
TL;DR: In this paper, several architectural and system-level techniques proposed for managing and leveraging GPU caches are surveyed and motivated to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.
Adaptation of fluid model EULAG to graphics processing unit architecture
Krzysztof Rojek, Milosz Ciznicki, Bogdan Rosa, Piotr Kopta, Michal Kulczewski, Krzysztof Kurowski, Zbigniew P. Piotrowski, Lukasz Szustak, Damian K. Wójcik, Roman Wyrzykowski
TL;DR: A method is proposed, which ensures a comprehensive analysis of the resource consumption of the EULAG algorithm, including data transfers between host and global memory, global and shared memories, as well as GPU occupancy, which shows a promising increase in terms of computational efficiency.
26
References
•Book
Theory of Self-Reproducing Automata
John von Neumann, Arthur W. Burks
- 01 Jan 1966
5.7K
Theory of self-reproducing automata: John von Neumann (edited by A.W. Burks). University of Illinois Press, Urbana, 1966. xiii + 388pp., $10.00
3.1K
Roofline: an insightful visual performance model for multicore architectures
TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
Anthony Nguyen, Nadathur Satish, Jatin Chhugani, Changkyu Kim, Pradeep Dubey
- 13 Nov 2010
TL;DR: A novel 3.
331