SSE3

Papers

Implementing streaming SIMD extensions on the Pentium III processor

S.K. Raman¹, Vladimir Pentkovski, J. Keshava•Institutions (1)

01 Jul 2000-IEEE Micro

TL;DR: The streaming SIMD extensions (SSE) provide a rich set of instructions to meet the requirements of demanding multimedia and Internet applications; in implementing them, the Pentium III developers made a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.

Abstract: This paper describes the streaming SIMD extensions (SSE), which provide a rich set of instructions to meet the requirements of demanding multimedia and Internet applications. In implementing the SSE, the Pentium III developers made a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.

209 citations

Journal Article•10.1109/TC.2003.1223637•

Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements

D. Talla¹, Lizy K. John², Doug Burger²•Institutions (2)

Texas Instruments¹, University of Texas at Austin²

01 Aug 2003-IEEE Transactions on Computers

TL;DR: The MediaBreeze architecture is proposed, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.) and provides a better performance than a 16-way processor with current SIMD extensions.

Abstract: Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that the attributes of current general-purpose processors enhanced with SIMD extensions do not match very well with the access patterns and loop structures of media programs. We find that 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions necessary to feed the SIMD execution units rather than true/useful computations, resulting in the underutilization of SIMD execution units (only 1 to 12 percent of the peak SIMD execution units' throughput is achieved). Contrary to focusing on exploiting more data-level parallelism (DLP), we focus on the instructions that support the SIMD computations and exploit both fine and coarse-grained instruction level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture that uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze provides a better performance than a 16-way processor with current SIMD extensions. In the case of application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration is achieved at a 10 percent increase in area required by MMX and SSE extensions (0.3 percent increase in overall chip area) and 1 percent of total processor power consumption.

121 citations

Proceedings Article•10.1109/ICCD.2000.878283•

Evaluating signal processing and multimedia applications on SIMD, VLIW and superscalar architectures

Deependra Talla, Lizy K. John, V. Lapinskii¹, Brian L. Evans•Institutions (1)

University of Texas at Austin¹

1 Jan 2000

TL;DR: DSP and media benchmarks evaluated on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors contain large amounts of parallelism, mostly inter-iteration; out-of-order execution and branch prediction are observed to be extremely important to exploit it.

Abstract: This paper aims to provide a quantitative understanding of the performance of DSP and multimedia applications on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors. We evaluate the performance of the VLIW paradigm using Texas Instruments Inc.'s TMS320C62xx processor and the SIMD paradigm using Intel's Pentium II processor (with MMX) on a set of DSP and media benchmarks. Tradeoffs in superscalar performance are evaluated with a combination of measurements on Pentium II and simulation experiments on the SimpleScalar simulator. Our benchmark suite includes kernels (filtering, autocorrelation, and dot product) and applications (audio effects, G.711 speech coding, and speech compression). Optimized assembly libraries and compiler intrinsics were used to create the SIMD and VLIW code. We used the hardware performance counters on the Pentium II and the stand-alone simulator for the C62xx to obtain the execution cycle counts. In comparison to non-SIMD Pentium II performance, the SIMD version exhibits a speedup ranging from 1.0 to 5.5, while the speedup of the VLIW version ranges from 0.63 to 9.0. The benchmarks are seen to contain large amounts of available parallelism; however, most of it is inter-iteration parallelism. Out-of-order execution and branch prediction are observed to be extremely important to exploit such parallelism in media applications.

76 citations

Proceedings Article•10.1109/IPDPS.2008.4536350•

Faster matrix-vector multiplication on GeForce 8800GTX

Noriyuki Fujimoto¹•Institutions (1)

Osaka University¹

14 Apr 2008

TL;DR: The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.

Abstract: Recently, GPUs have acquired the programmability to perform general-purpose computation quickly by running tens of thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture. The experimental results on a GeForce 8800GTX show that the proposed algorithm runs up to 15.69 (resp., 32.88) times faster than the sgemv routine in NVIDIA's BLAS library CUBLAS 1.1 (resp., Intel Math Kernel Library 9.1 on one core of a 2.0 GHz Intel Xeon E5335 CPU with SSE3 SIMD instructions) for matrices of order 16 to 12800. The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.

70 citations
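
To illustrate why a fast matrix-vector product matters for the Jacobi method mentioned in the abstract above, here is a minimal scalar sketch in Python. It is not the paper's CUDA algorithm; the function names `matvec` and `jacobi`, the fixed iteration count, and the example system are assumptions made for illustration. Each Jacobi sweep is dominated by one dense matrix-vector product, which is the operation the paper accelerates.

```python
def matvec(A, x):
    """Dense matrix-vector product y = A x (the sgemv-style kernel)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def jacobi(A, b, iterations=50):
    """Approximate the solution of A x = b with Jacobi iteration.

    Assumes A is square with a nonzero diagonal and that the iteration
    converges (for example, A is strictly diagonally dominant).
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        Ax = matvec(A, x)  # one dense matrix-vector product per sweep
        # x_i <- x_i + (b_i - (A x)_i) / a_ii, which is algebraically the
        # same as the textbook update (b_i - sum_{j != i} a_ij x_j) / a_ii.
        x = [x[i] + (b[i] - Ax[i]) / A[i][i] for i in range(n)]
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0], [2.0, 5.0]]  # hypothetical diagonally dominant system
    b = [9.0, 12.0]
    print(jacobi(A, b))
```

On a GPU, only `matvec` would be offloaded in this sketch, so the vectors `x` and `Ax` would move between CPU and GPU on every sweep; this is the transfer cost the abstract says is included in the reported performance.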

Patent•

Determination of optimal local sequence alignment similarity score

Torbjørn Rognes

27 Sep 2001

TL;DR: An efficient parallelisation of the Smith-Waterman sequence alignment algorithm using parallel processing in the form of SIMD (Single-Instruction, Multiple-Data) technology is presented.

Abstract: Sequence alignment and sequence database similarity searching are among the most important and challenging tasks in bioinformatics, and are used for several purposes, including protein function prediction. An efficient parallelisation of the Smith-Waterman sequence alignment algorithm using parallel processing in the form of SIMD (Single-Instruction, Multiple-Data) technology is presented. The method has been implemented using the MMX (MultiMedia eXtensions) and SSE (Streaming SIMD Extensions) technology that is embedded in Intel's latest microprocessors, but the method can also be implemented using similar technology existing in other modern microprocessors. Using an optimised eight-way parallel processing approach, a near eight-fold speed-up was achieved relative to the fastest previously known non-parallel Smith-Waterman implementation on the same hardware. A speed of about 200 million cell updates per second has been obtained on a single Intel Pentium III 500MHz microprocessor.

55 citations
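
For readers unfamiliar with the "cell updates" counted in the abstract above, the following is a minimal scalar sketch of the Smith-Waterman local alignment score in Python. It is not the patented SIMD method: it processes one cell at a time, uses a simple linear gap penalty, and the function name and scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for illustration. The SIMD approach described in the abstract computes several such cells per instruction.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scalar reference version with a linear gap penalty; only two rows
    of the dynamic-programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # One "cell update": take the best of starting fresh (0),
            # extending the diagonal, or opening a gap in either sequence.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTYY", "ZZACGTWW"))
```

The inner loop body is the unit of work behind the "cell updates per second" figure; the dependence of each cell on its left, upper, and diagonal neighbours is what makes vectorising the algorithm non-trivial.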

Year	Papers
2017	1
2016	2
2015	1
2014	5
2013	1
2012	3
