TL;DR: This work presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
Abstract: Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values. We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.
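The tension between locality and redundant recomputation described in this abstract can be illustrated with a toy two-stage 1-D pipeline. The sketch below is plain Python, not Halide; the function names (`stage1`, `breadth_first`, `fused`, `tiled`) and the 3-tap clamped stencil are assumptions chosen for illustration only. Each strategy returns the output, the number of first-stage evaluations, and the peak number of intermediate values held at once.

```python
def stage1(inp, x):
    """First stencil stage: 3-tap sum with clamped borders (illustrative choice)."""
    n = len(inp)
    return inp[max(x - 1, 0)] + inp[x] + inp[min(x + 1, n - 1)]


def breadth_first(inp):
    """Compute all of stage 1, store it, then compute stage 2.
    No recomputation, but the whole intermediate is live at once."""
    n = len(inp)
    f = [stage1(inp, x) for x in range(n)]
    out = [f[max(x - 1, 0)] + f[x] + f[min(x + 1, n - 1)] for x in range(n)]
    return out, n, n


def fused(inp):
    """Recompute stage 1 at every use inside stage 2.
    Only 3 intermediates are live, but each value is computed about 3 times."""
    n = len(inp)
    out, evals = [], 0
    for x in range(n):
        taps = [stage1(inp, c) for c in (max(x - 1, 0), x, min(x + 1, n - 1))]
        evals += len(taps)
        out.append(sum(taps))
    return out, evals, 3


def tiled(inp, tile):
    """Compute stage 1 per output tile plus a one-element halo on each side.
    Recomputation happens only at tile boundaries."""
    n = len(inp)
    out, evals, peak = [0] * n, 0, 0
    for start in range(0, n, tile):
        stop = min(start + tile, n)
        lo, hi = max(start - 1, 0), min(stop + 1, n)  # halo region
        f = {x: stage1(inp, x) for x in range(lo, hi)}
        evals += hi - lo
        peak = max(peak, hi - lo)
        for x in range(start, stop):
            out[x] = f[max(x - 1, 0)] + f[x] + f[min(x + 1, n - 1)]
    return out, evals, peak
```

For 16 input samples and a tile size of 4, the three strategies produce identical output with 16, 48, and 22 first-stage evaluations and peak intermediate sizes of 16, 3, and 6 respectively, which is the kind of tradeoff a schedule would select between.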
TL;DR: LIKWID is a set of command-line utilities that addresses four key problems: probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter metrics, and toggling hardware prefetchers.
Abstract: Exploiting the performance of today's processors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command-line utilities that addresses four key problems: Probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter metrics, and toggling hardware prefetchers. An API for using the performance counting features from user code is also included. We clearly state the differences to the widely used PAPI interface. To demonstrate the capabilities of the tool set we show the influence of thread pinning on performance using the well-known OpenMP STREAM triad benchmark, and use the affinity and hardware counter tools to study the performance of a stencil code specifically optimized to utilize shared caches on multicore chips.
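As a rough illustration of what enforcing affinity around a STREAM-style triad involves, the sketch below pins the current process to a single allowed CPU and runs the triad kernel `a[i] = b[i] + s * c[i]`. It does not use LIKWID or its API: it relies on Python's `os.sched_setaffinity`, which is only available on some platforms (notably Linux), and the helper names `triad` and `run_pinned` are hypothetical.

```python
import os


def triad(b, c, scalar):
    """STREAM-style triad kernel: a[i] = b[i] + scalar * c[i]."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def run_pinned(fn, *args):
    """Run fn(*args) with the process pinned to one allowed CPU, if supported.

    Returns (result, cpu); cpu is None when pinning was not possible.
    The original affinity mask is restored afterwards.
    """
    if not hasattr(os, "sched_getaffinity"):
        return fn(*args), None          # platform without affinity support
    original = os.sched_getaffinity(0)  # CPUs this process may run on
    cpu = min(original)
    try:
        os.sched_setaffinity(0, {cpu})
    except OSError:
        return fn(*args), None          # pinning refused by the OS
    try:
        return fn(*args), cpu
    finally:
        os.sched_setaffinity(0, original)
```

A call such as `run_pinned(triad, b, c, 3.0)` can then be timed with and without pinning; a pure-Python loop will not show the bandwidth effects the abstract reports for OpenMP code, so this only demonstrates the mechanism.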
TL;DR: This work presents a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. It generates compute kernels from a specification of the stencil operation and a parallelization and optimization strategy, and uses auto-tuning to optimize strategy-dependent parameters for the given hardware architecture.
Abstract: Stencil calculations comprise an important class of kernels in many scientific computing applications ranging from simple PDE solvers to constituent kernels in multigrid methods as well as image processing applications. In such types of solvers, stencil kernels are often the dominant part of the computation, and an efficient parallel implementation of the kernel is therefore crucial in order to reduce the time to solution. However, in the current complex hardware microarchitectures, meticulous architecture-specific tuning is required to elicit the machine's full compute power. We present Patus, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. It makes it possible to generate compute kernels from a specification of the stencil operation and a parallelization and optimization strategy, and leverages the auto-tuning methodology to optimize strategy-dependent parameters for the given hardware architecture.
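The separation this abstract describes (a stencil specification, a traversal strategy, and a tuner for the strategy's parameters) can be sketched in a few lines. This is not the Patus framework or its input language: the tap list `STENCIL`, the blocked traversal `apply_blocked`, and the exhaustive timing search `autotune` are assumptions made for illustration.

```python
import time

# Stencil specification: (dy, dx, weight) taps of a 5-point weighted average.
STENCIL = [(0, 0, 0.5), (-1, 0, 0.125), (1, 0, 0.125), (0, -1, 0.125), (0, 1, 0.125)]


def apply_blocked(grid, stencil, block):
    """Strategy: update interior points in block x block tiles.
    Border cells are copied unchanged."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for by in range(1, h - 1, block):
        for bx in range(1, w - 1, block):
            for y in range(by, min(by + block, h - 1)):
                for x in range(bx, min(bx + block, w - 1)):
                    out[y][x] = sum(wt * grid[y + dy][x + dx] for dy, dx, wt in stencil)
    return out


def autotune(grid, stencil, candidates, repeats=3):
    """Time every candidate block size and return (best_block, best_seconds)."""
    best_block, best_time = None, float("inf")
    for block in candidates:
        elapsed = float("inf")
        for _ in range(repeats):  # keep the fastest of several runs
            t0 = time.perf_counter()
            apply_blocked(grid, stencil, block)
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block, best_time
```

Every block size yields the same numerical result, so the tuner is free to pick whichever runs fastest on the machine at hand. A real auto-tuner would typically replace the exhaustive loop with a smarter search over a much larger parameter space.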
TL;DR: This work shows the influence of thread pinning on performance using the well-known OpenMP STREAM triad benchmark, and uses the affinity and hardware counter tools to study the performance of a stencil code specifically optimized to utilize shared caches on multicore chips.
Abstract: Exploiting the performance of today's processors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command-line utilities that addresses four key problems: Probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter metrics, and toggling hardware prefetchers. An API for using the performance counting features from user code is also included. We clearly state the differences to the widely used PAPI interface. To demonstrate the capabilities of the tool set we show the influence of thread pinning on performance using the well-known OpenMP STREAM triad benchmark, and use the affinity and hardware counter tools to study the performance of a stencil code specifically optimized to utilize shared caches on multicore chips.
TL;DR: In this article, the authors compare different finite-difference schemes for two-dimensional (2D) acoustic frequency-domain forward modelling based on staggered-grid stencils.
Abstract: We compare different finite-difference schemes for two-dimensional (2-D) acoustic frequency-domain forward modelling. The schemes are based on staggered-grid stencils of various accuracy and grid rotation strategies to discretize the derivatives of the wave equation. A combination of two staggered-grid stencils on the classical Cartesian coordinate system and the 45° rotated grid is the basis of the so-called mixed-grid stencil. This method is compared with a parsimonious staggered-grid method based on a fourth-order approximation of the first derivative operator. Averaging of the mass acceleration can be incorporated in the two stencils. Sponge-like perfectly matched layer absorbing boundary conditions are also examined for each stencil and shown to be effective.
The deduced numerical stencils are examined for both the wavelength content and azimuthal variation. The accuracy of the fourth-order staggered-grid stencil is slightly superior in terms of phase velocity dispersion to that of the mixed-grid stencil when averaging of the mass acceleration term is applied to the staggered-grid stencil.
For fourth-order derivative approximations, the classical staggered-grid geometry leads to a stencil that incorporates 13 grid nodes. The mixed-grid approach combines only nine grid nodes. In both cases, wavefield solutions are computed using a direct matrix solver based on an optimized multifrontal method. For this 2-D geometry, the staggered-grid strategy is significantly less efficient in terms of memory and CPU time requirements because of the enlarged bandwidth of the impedance matrix and increased number of coefficients in the discrete stencil.
Therefore, the mixed-grid approach should be suggested as the routine scheme for 2-D acoustic wave propagation modelling in the frequency domain.
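The effect of stencil shape on the impedance matrix can be checked with a small counting sketch. It assumes, as an illustration not stated in the abstract, that the 13-node stencil is an axis-aligned star reaching three nodes along each arm (what composing two fourth-order first-derivative operators would give) and that the 9-node mixed-grid stencil occupies the compact 3 x 3 neighbourhood. Unknowns are numbered row by row, and the function names are hypothetical.

```python
def star_offsets(radius):
    """Centre node plus `radius` nodes along each of the four axis-aligned arms."""
    offsets = [(0, 0)]
    for r in range(1, radius + 1):
        offsets += [(r, 0), (-r, 0), (0, r), (0, -r)]
    return offsets


def box_offsets():
    """Compact 3 x 3 neighbourhood (9 nodes)."""
    return [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def matrix_profile(offsets, nx, ny):
    """Return (nonzero count, half-bandwidth) of the nx*ny-unknown matrix
    obtained with row-major numbering; taps outside the grid are dropped."""
    nonzeros, half_bandwidth = 0, 0
    for y in range(ny):
        for x in range(nx):
            row = y * nx + x
            for dy, dx in offsets:
                yy, xx = y + dy, x + dx
                if 0 <= yy < ny and 0 <= xx < nx:
                    nonzeros += 1
                    half_bandwidth = max(half_bandwidth, abs(yy * nx + xx - row))
    return nonzeros, half_bandwidth
```

On a 10 x 10 grid, `matrix_profile(star_offsets(3), 10, 10)` gives 1060 nonzeros and a half-bandwidth of 30, while `matrix_profile(box_offsets(), 10, 10)` gives 784 nonzeros and a half-bandwidth of 11. Under these assumptions the wider star roughly triples the bandwidth, consistent with the higher memory and CPU cost the abstract attributes to the staggered-grid stencil when a direct solver is used.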