TL;DR: This paper shows how SWAR operations missing from a given microprocessor can be implemented using either the existing SWAR hardware or even conventional 32-bit integer instructions, and briefly introduces a few new challenges that SWAR poses for compiler optimization.
Abstract: Although SIMD (Single Instruction stream Multiple Data stream) parallel computers have existed for decades, it is only in the past few years that a new version of SIMD has evolved: SIMD Within A Register (SWAR). Unlike other styles of SIMD hardware, SWAR models are tuned to be integrated within conventional microprocessors, using their existing memory reference and instruction handling mechanisms, with the primary goal of improving the speed of specific multimedia operations.
Because the SWAR implementations for various microprocessors vary widely and each is missing instructions for some SWAR operations that are needed to support a more general, portable, high-level SIMD execution model, this paper focuses on how these missing operations can be implemented using either the existing SWAR hardware or even conventional 32-bit integer instructions. In addition, SWAR offers a few new challenges for compiler optimization, and these are briefly introduced.
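As an illustration of implementing a SWAR operation with conventional 32-bit integer instructions, the following is a minimal sketch of a partitioned add over four unsigned 8-bit fields packed into one 32-bit word. The function name `swar_add8` and the choice of 8-bit fields are assumptions made for this example; the abstract does not specify them, and this is a widely known masking technique rather than necessarily the paper's own formulation.

```c
#include <stdint.h>

/* Hypothetical helper (name assumed for illustration): adds four unsigned
 * 8-bit fields packed in x and y, each field wrapping modulo 256, using
 * only ordinary 32-bit integer operations. */
static uint32_t swar_add8(uint32_t x, uint32_t y)
{
    const uint32_t high = 0x80808080u;   /* top bit of every 8-bit field */

    /* Add the low 7 bits of each field. With the top bits cleared, a carry
     * can reach bit 7 of its own field but cannot cross into the next one. */
    uint32_t low_sum = (x & ~high) + (y & ~high);

    /* Fold in the original top bits with XOR, which adds them without
     * producing a carry out of the field. */
    return low_sum ^ ((x ^ y) & high);
}
```

For example, `swar_add8(0x01FF7F10u, 0x0101017Fu)` yields `0x0200808F`: the field holding `0xFF + 0x01` wraps to `0x00` without disturbing its neighbor.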
TL;DR: This thesis will define a general-purpose SWAR (SIMD Within A Register) programming model that will be implemented for multiple target architectures: initially as compatible libraries, then as optimizing compilers accepting a simple high-level parallel language.
Abstract: Recent extensions to microprocessor instruction sets are intended to speed up multimedia algorithms by allowing SIMD parallel processing over multiple data fields within each processor register. These extensions, while effectively supporting hand-coding of some multimedia tasks, do not directly support a high-level parallel programming model. Unfortunately, the extensions vary widely across different processor families, making portability difficult to achieve. Even within one set of extensions, each operation is supported only for certain field widths, and the widths supported are different for different operations. This thesis will define a general-purpose SWAR (SIMD Within A Register) programming model. This model will be implemented for multiple target architectures: initially as compatible libraries, then as optimizing compilers accepting a simple high-level parallel language. The new SWAR libraries and compiler technology should enable a much wider range of applications to achieve speed-up through SIMD execution using COTS microprocessors.
TL;DR: A set of simple SWAR instruction set extensions is proposed to support parallel bit stream algorithms; the extensions are shown to significantly reduce instruction count in the core algorithms, often providing a 3X or better improvement.
Abstract: Parallel bit stream algorithms exploit the SWAR (SIMD within a register) capabilities of commodity processors in high-performance text processing applications such as UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression matching. Direct architectural support for these algorithms in future SWAR instruction sets could further increase performance and simplify the programming task. A set of simple SWAR instruction set extensions is proposed for this purpose, based on the principle of systematic support for inductive doubling as an algorithmic technique. These extensions are shown to significantly reduce instruction count in core parallel bit stream algorithms, often providing a 3X or better improvement. The extensions are also shown to be useful for SWAR programming in other application areas, including providing a systematic treatment for horizontal operations. An implementation model for these extensions involves relatively simple circuitry added to the operand fetch components in a pipelined processor.
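To make the idea of inductive doubling concrete, here is a minimal sketch of a well-known instance: a population count that sums adjacent fields whose width doubles at each step (1, 2, 4, 8, then 16 bits). It uses only conventional 32-bit integer instructions and does not model the proposed extensions; the function name `popcount_doubling` is an assumption made for this example.

```c
#include <stdint.h>

/* Hypothetical helper (name assumed for illustration): counts the set bits
 * of x by inductive doubling. Each step adds pairs of adjacent fields, so
 * field width doubles from 1 bit up to the full 32-bit word. */
static uint32_t popcount_doubling(uint32_t x)
{
    x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);  /* 16 two-bit counts  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 8 four-bit counts  */
    x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);  /* 4 eight-bit counts */
    x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);  /* 2 sixteen-bit counts */
    x = (x & 0x0000FFFFu) + (x >> 16);                 /* one 32-bit count   */
    return x;
}
```

Each doubling step here costs two masks, a shift, and an add; that kind of per-step overhead is plausibly what direct architectural support for inductive doubling would aim to reduce.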
TL;DR: A more formal description of the SWARC language is provided, the organization of the current version of the Scc compiler is described, and the implementation of optimizations within this framework is discussed.
Abstract: Last year, we discussed the issues surrounding the development of languages and compilers for a general, portable, high-level SIMD Within A Register (SWAR) execution model. In a first effort to provide such a language and a framework for further research on this form of parallel processing, we proposed the vector-based language SWARC, and an experimental module compiler for this language, called Scc, which targeted IA32+MMX-based architectures.
Since that time, we have worked to expand the types of targets that Scc supports and to include optimizations based on both vector processing and enhanced hardware support for SWAR. This paper provides a more formal description of the SWARC language, describes the organization of the current version of the Scc compiler, and discusses the implementation of optimizations within this framework.
TL;DR: An average overall speedup of nearly four times is achieved (compared to an unoptimized standard implementation that uses conventional serial processing), and the approach maintains the rendering quality of a standard serial implementation of the Marching Cubes algorithm.