About: SSE3 is a research topic. Over the lifetime, 34 publications have been published within this topic receiving 934 citations. The topic is also known as: SSE3 & SSE 3.
TL;DR: The streaming SIMD extensions (SSE) provides a rich set of instructions to meet the requirements of demanding multimedia and Internet applications and makes a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.
Abstract: This paper describes the streaming SIMD extensions (SSE) provides a rich set of instructions to meet the requirements of demanding multimedia and Internet applications. In implementing the SSE, the Pentium III developers made a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.
TL;DR: The MediaBreeze architecture is proposed, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.) and provides a better performance than a 16-way processor with current SIMD extensions.
Abstract: Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that the attributes of current general-purpose processors enhanced with SIMD extensions do not match very well with the access patterns and loop structures of media programs. We find that 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions necessary to feed the SIMD execution units rather than true/useful computations, resulting in the underutilization of SIMD execution units (only 1 to 12 percent of the peak SIMD execution units' throughput is achieved). Contrary to focusing on exploiting more data-level parallelism (DLP), we focus on the instructions that support the SIMD computations and exploit both fine and coarse-grained instruction level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture that uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze provides a better performance than a 16-way processor with current SIMD extensions. In the case of application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration is achieved at a 10 percent increase in area required by MMX and SSE extensions (0.3 percent increase in overall chip area) and 1 percent of total processor power consumption.
TL;DR: Out-of-order execution and branch prediction are observed to be extremely important to exploit such parallelism in media applications on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors.
Abstract: This paper aims to provide a quantitative understanding of the performance of DSP and multimedia applications on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors. We evaluate the performance of the VLIW paradigm using Texas Instruments Inc.'s TMS320C62xx processor and the SIMD paradigm using Intel's Pentium II processor (with MMX) on a set of DSP and media benchmarks. Tradeoffs in superscalar performance are evaluated with a combination of measurements on Pentium II and simulation experiments on the SimpleScalar simulator. Our benchmark suite includes kernels (filtering, autocorrelation, and dot product) and applications (audio effects, G.711 speech coding, and speech compression). Optimized assembly libraries and compiler intrinsics were used to create the SIMD and VLIW code. We used the hardware performance counters on the Pentium II and the stand-alone simulator for the C62xx to obtain the execution cycle counts. In comparison to non-SIMD Pentium II performance, the SIMD version exhibits a speedup ranging from 1.0 to 5.5 while the speedup of the VLIW version ranges from 0.63 to 9.0. The benchmarks are seen to contain large amounts of available parallelism, however, most of it is inter-iteration parallelism. Out-of-order execution and branch prediction are observed to be extremely important to exploit such parallelism in media applications.
TL;DR: The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.
Abstract: Recently a GPU has acquired programmability to perform general purpose computation fast by running ten thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on NVIDIA CUDA architecture. The experimental results on GeForce 8800GTX show that the proposed algorithm runs maximum 15.69 (resp., 32.88) times faster than the sgemv routine in NVIDIA's BIAS library CUBLAS 1.1 (resp., Intel Math Kernel Library 9.1 on one-core of 2.0 GHz Intel Xeon E5335 CPU with SSE3 SIMD instructions) for matrices with order 16 to 12800. The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.
TL;DR: In this paper, an efficient parallelisation of the Smith-Waterman sequence alignment algorithm using parallel processing in the form of SIMD (Single-Instruction, Multiple-Data) technology is presented.
Abstract: Sequence alignment and sequence database similarity searching are among the most important and challenging task in bio informatics, and are used for several purposes, including protein function prediction. An efficient parallelisation of the Smith-Waterman sequence alignment algorithm using parallel processing in the form of SIMD (Single-Instruction, Multiple-Data) technology is presented. The method has been implementation using the MMX (MultiMedia eXtensions) and SSE (Streaming SIMD Extensions) technology that is embedded in Intel's latest microprocessors, but the method can also be implemented using similar technology existing in other modern microprocessors. Near eight-fold speed-up relative to the fastest previously an optimised eight-way parallel processing approach achieved know non-parallel Smith-Waterman implementation on the same hardware. A speed of about 200 million cell updates per second has been obtained on a single Intel Pentium III 500MHz microprocessor.