SIMD

Papers published on a yearly basis

Papers

Journal Article•10.1109/JPROC.2004.840301

The Design and Implementation of FFTW3

Matteo Frigo¹, Steven G. Johnson²

IBM¹, Massachusetts Institute of Technology²

24 Jan 2005

TL;DR: It is shown that adapting to the hardware can yield an implementation of the discrete Fourier transform that is competitive with hand-optimized libraries, and the software structure that makes the current FFTW3 version flexible and adaptive is described.

Abstract: FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.
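
For reference, the DFT named in the abstract above can be written directly from its definition. The sketch below is a plain O(n²) evaluation, not FFTW's algorithm or API. The function name `naive_dft` is hypothetical, and the forward sign convention (negative exponent) is an assumption.

```python
import cmath


def naive_dft(x):
    """Evaluate the DFT definition directly in O(n^2) time.

    Assumption: forward transform with a negative exponent,
    X[k] = sum_j x[j] * exp(-2*pi*i*j*k/n). This is NOT how FFTW
    computes the transform; it only states what is being computed.
    """
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    # A unit impulse transforms to a flat spectrum.
    print(naive_dft([1, 0, 0, 0]))
```

A direct evaluation like this is mainly useful as a correctness baseline when checking a faster implementation on small inputs.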

5,976 citations

Journal Article•10.1109/TC.1972.5009071

Some Computer Organizations and Their Effectiveness

Michael J. Flynn¹

Johns Hopkins University¹

01 Sep 1972-IEEE Transactions on Computers

TL;DR: A hierarchical model of computer organizations is developed, based on a tree model using request/service type resources as nodes; simplified queuing models indicate that multiprocessor saturation develops when the fraction of task time spent locked out (L/E) approaches 1/n, where n is the number of processors.

Abstract: A hierarchical model of computer organizations is developed, based on a tree model using request/service type resources as nodes. Two aspects of the model are distinguished: logical and physical. General parallel- or multiple-stream organizations are examined as to type and effectiveness, especially regarding intrinsic logical difficulties. The overlapped simplex processor (SISD) is limited by data dependencies. Branching has a particularly degenerative effect. The parallel processors [single-instruction stream-multiple-data stream (SIMD)] are analyzed. In particular, a nesting type explanation is offered for Minsky's conjecture: the performance of a parallel processor increases as log M instead of M (the number of data stream processors). Multiprocessors (MIMD) are subjected to a saturation syndrome based on general communications lockout. Simplified queuing models indicate that saturation develops when the fraction of task time spent locked out (L/E) approaches 1/n, where n is the number of processors. Resources sharing in multiprocessors can be used to avoid several other classic organizational problems.
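
The two quantitative statements in this abstract can be tabulated with a rough numeric sketch. It is not taken from the paper: the function names are hypothetical, and the base-2 logarithm is an assumption, since the abstract says only "log M".

```python
import math


def minsky_speedup(m):
    """Performance predicted by Minsky's conjecture for m data-stream processors.

    Assumption: base-2 logarithm; the abstract does not state the base.
    """
    return math.log2(m)


def saturation_lockout_fraction(n):
    """Lockout fraction L/E near which an n-processor system saturates (1/n)."""
    return 1.0 / n


if __name__ == "__main__":
    for m in (2, 8, 64, 1024):
        # Compare the conjectured log M growth with ideal linear growth M.
        efficiency = minsky_speedup(m) / m
        print(m, minsky_speedup(m), round(efficiency, 4),
              saturation_lockout_fraction(m))
```

Under these assumptions, the conjectured performance per processor falls as M grows, and the lockout fraction at which saturation sets in shrinks in proportion to 1/n.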

2,148 citations

Journal Article•10.1109/MM.2008.31

NVIDIA Tesla: A Unified Graphics and Computing Architecture

Erik Lindholm¹, John R. Nickolls¹, S. Oberman¹, John S. Montrym¹

Nvidia¹

01 Mar 2008-IEEE Micro

TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.

Abstract: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

1,682 citations

Journal Article•10.1145/3007787.3001177

Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks

Yu-Hsin Chen¹, Joel Emer², Vivienne Sze¹

Massachusetts Institute of Technology¹, Nvidia²

18 Jun 2016

TL;DR: A novel dataflow, called row-stationary (RS), is presented that minimizes data movement energy consumption on a spatial architecture, adapts to different CNN shape configurations, and reduces all types of data movement by maximally utilizing the processing engine (PE) local storage, direct inter-PE communication, and spatial parallelism.

Abstract: Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy. In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.

1,332 citations

Proceedings Article•10.1145/2491956.2462176

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Jonathan Ragan-Kelley¹, Connelly Barnes², Andrew Adams¹, Sylvain Paris², Frédo Durand¹, Saman Amarasinghe¹

Massachusetts Institute of Technology¹, Adobe Systems²

16 Jun 2013

TL;DR: A systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule are presented.

Abstract: Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values. We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.
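
The tension between recomputation and intermediate storage described in this abstract can be seen in a toy two-stage stencil. The sketch below is plain Python, not Halide code and not the Halide API. The 1-D 3-tap blur and every function name are hypothetical choices made for illustration.

```python
def blur3(a, i):
    """Average of the three consecutive samples a[i], a[i+1], a[i+2]."""
    return (a[i] + a[i + 1] + a[i + 2]) / 3.0


def two_stage_inline(inp):
    """Recompute stage 1 for every stage-2 tap.

    No intermediate buffer is kept, but shared values are recomputed.
    Returns (output, stage-1 evaluations, buffered stage-1 values).
    """
    evals = 0
    out = []
    for i in range(len(inp) - 4):
        taps = []
        for d in range(3):
            taps.append(blur3(inp, i + d))
            evals += 1
        out.append(blur3(taps, 0))
    return out, evals, 0


def two_stage_materialized(inp):
    """Compute all of stage 1 first, store it, then run stage 2.

    No redundant work, but the whole intermediate signal is buffered.
    Returns (output, stage-1 evaluations, buffered stage-1 values).
    """
    stage1 = [blur3(inp, i) for i in range(len(inp) - 2)]
    out = [blur3(stage1, i) for i in range(len(stage1) - 2)]
    return out, len(stage1), len(stage1)


if __name__ == "__main__":
    signal = [float(v) for v in range(10)]
    print(two_stage_inline(signal)[1:])        # stage-1 work vs. storage
    print(two_stage_materialized(signal)[1:])
```

Both variants produce the same output; they differ only in how much stage-1 work is repeated and how much intermediate data is held, which is the kind of choice a schedule makes explicit.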

1,262 citations


Year	Papers
2025	37
2024	60
2023	95
2022	233
2021	125
2020	181

Related Topics (5)

Performance Metrics