SSE2

Papers published on a yearly basis

Papers

Journal Article•10.1109/40.526924•

MMX technology extension to the Intel architecture

Alexander D. Peleg¹, Uri Weiser¹•Institutions (1)

Intel¹

01 Aug 1996-IEEE Micro

TL;DR: MMX technology extends the Intel architecture to improve the performance of multimedia, communications, and other numeric-intensive applications by introducing data types and instructions to the IA that exploit the parallelism in these applications.

Abstract: Designed to accelerate multimedia and communications software, MMX technology extends the Intel architecture (IA) to improve the performance of multimedia, communications, and other numeric-intensive applications by introducing data types and instructions that exploit the parallelism in these applications. It uses a SIMD (single-instruction, multiple-data) technique to exploit the parallelism inherent in many algorithms, yielding full-application performance 1.5 to 2 times that of the same applications run on the same processor without MMX. The extension also maintains full compatibility with existing IA microprocessors, operating systems, and applications while providing new instructions and data types that applications can use to achieve a higher level of performance on the host CPU.
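As a rough illustration of the packed-data idea described in the abstract, the sketch below adds two byte arrays with unsigned saturation, 16 bytes per instruction. It is not taken from the paper: it uses the SSE2 128-bit form of the operation (`_mm_adds_epu8`) rather than the original 64-bit MMX registers, the helper name `add_saturate_u8` is hypothetical, and it falls back to a scalar loop when the compiler does not define `__SSE2__`.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] = min(a[i] + b[i], 255) for n bytes. */
void add_saturate_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Process 16 packed bytes per iteration with unaligned loads and stores. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Scalar tail, and the full fallback when SSE2 is not available. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Saturation is the typical choice for pixel and audio samples, where wrapping around on overflow would produce visible or audible artifacts.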

580 citations

Proceedings Article•10.1145/2503210.2503232•

Performance evaluation of Intel® transactional synchronization extensions for high-performance computing

Richard M. Yoo¹, Christopher J. Hughes¹, Konrad K. Lai¹, Ravi Rajwar¹•Institutions (1)

Intel¹

17 Nov 2013

TL;DR: The first hardware implementation of Intel TSX is evaluated using a set of high-performance computing (HPC) workloads, and it is demonstrated that applying Intel TSX to these workloads can provide significant performance improvements.

Abstract: Intel has recently introduced Intel® Transactional Synchronization Extensions (Intel® TSX) in the Intel 4th Generation Core™ Processors. With Intel TSX, a processor can dynamically determine whether threads need to serialize through lock-protected critical sections. In this paper, we evaluate the first hardware implementation of Intel TSX using a set of high-performance computing (HPC) workloads, and demonstrate that applying Intel TSX to these workloads can provide significant performance improvements. On a set of real-world HPC workloads, applying Intel TSX provides an average speedup of 1.41x. When applied to a parallel user-level TCP/IP stack, Intel TSX provides 1.31x average bandwidth improvement on network intensive applications. We also demonstrate the ease with which we were able to apply Intel TSX to the various workloads.

290 citations

Journal Article•10.1109/40.865866•

Implementing streaming SIMD extensions on the Pentium III processor

S.K. Raman¹, Vladimir Pentkovski, J. Keshava•Institutions (1)

Intel¹

01 Jul 2000-IEEE Micro

TL;DR: The streaming SIMD extensions (SSE) provide a rich set of instructions to meet the requirements of demanding multimedia and Internet applications; in implementing them, the Pentium III developers made a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.

Abstract: The streaming SIMD extensions (SSE) provide a rich set of instructions to meet the requirements of demanding multimedia and Internet applications. In implementing the SSE, the Pentium III developers made a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.

209 citations

Journal Article•10.1023/A:1014230429447•

Automatic Intra-Register Vectorization for the Intel® Architecture

Aart J. C. Bik¹, Milind B. Girkar¹, Paul M. Grey¹, Xinmin Tian¹•Institutions (1)

Intel¹

01 Apr 2002-International Journal of Parallel Programming

TL;DR: A detailed overview of the automatic vectorization methods used by the high-performance Intel® C++/Fortran compiler is provided, together with an experimental validation of their effectiveness.

Abstract: Recent extensions to the Intel® Architecture feature the SIMD technique to enhance the performance of computational intensive applications that perform the same operation on different elements in a data set. To date, much of the code that exploits these extensions has been hand-coded. The task of the programmer is substantially simplified, however, if a compiler does this exploitation automatically. The high-performance Intel® C++/Fortran compiler supports automatic translation of serial loops into code that uses the SIMD extensions to the Intel® Architecture. This paper provides a detailed overview of the automatic vectorization methods used by this compiler together with an experimental validation of their effectiveness.
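For readers unfamiliar with the kind of serial loop the abstract refers to, here is a minimal sketch of a loop shape that vectorizing compilers can typically translate into SIMD code. The function name `saxpy` and its signature are illustrative assumptions, not taken from the paper, and whether any particular compiler actually vectorizes it depends on its version and flags.

```c
#include <stddef.h>

/* Hypothetical example: y[i] = alpha * x[i] + y[i] for i in [0, n).
 * Each iteration is independent of the others, and the restrict
 * qualifiers promise the compiler that x and y do not overlap, so
 * several iterations can be executed by one SIMD instruction. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

Without `restrict`, a compiler generally has to either prove that the arrays do not alias or emit a run-time overlap check before it can use the vector path.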

169 citations

Journal Article•10.1186/1756-0500-1-107•

SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2

Adam M. Szalkowski¹,², Christian Ledergerber², Philipp Krähenbühl², Christophe Dessimoz¹,²•Institutions (2)

Swiss Institute of Bioinformatics¹, ETH Zurich²

29 Oct 2008-BMC Research Notes

TL;DR: Benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4.

Abstract: Background: We present swps3, a vectorized implementation of the Smith-Waterman local alignment algorithm optimized for both the Cell/BE and x86 architectures. The paper describes swps3 and compares its performance with several other implementations. Findings: Our benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4: on a Playstation 3, it achieves up to 8.0 billion cell-updates per second (GCUPS). Using the SSE2 instruction set, a quad-core Intel Pentium can reach 15.7 GCUPS. We also show that swps3 on this CPU is faster than a recent GPU implementation. Finally, we note that under some circumstances, alignments are computed at roughly the same speed as BLAST, a heuristic method. Conclusion: The Cell/BE can be a powerful platform to align biological sequences. Besides, the performance gap between exact and heuristic methods has almost disappeared, especially for long protein sequences.
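To make the "cell update" unit behind the GCUPS figures concrete, the following is a minimal scalar sketch of Smith-Waterman scoring. It is not swps3's method: it assumes a simple linear gap penalty and fixed match/mismatch scores instead of a substitution matrix, it is not vectorized, and the function name `sw_score` is hypothetical. Each pass through the inner loop is one cell update.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical scalar reference: best local alignment score of a and b.
 * match is added for equal characters, mismatch (usually negative) for
 * unequal ones, and gap is a positive penalty subtracted per gap position.
 * Returns -1 if memory allocation fails. */
int sw_score(const char *a, const char *b, int match, int mismatch, int gap)
{
    size_t la = strlen(a), lb = strlen(b);
    int *prev = calloc(lb + 1, sizeof *prev);  /* row i-1 of the DP matrix */
    int *curr = calloc(lb + 1, sizeof *curr);  /* row i of the DP matrix */
    int best = 0;

    if (!prev || !curr) {
        free(prev);
        free(curr);
        return -1;
    }
    for (size_t i = 1; i <= la; ++i) {
        curr[0] = 0;
        for (size_t j = 1; j <= lb; ++j) {
            /* One cell update: best of restart (0), diagonal, up, left. */
            int h = 0;
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int up = prev[j] - gap;
            int left = curr[j - 1] - gap;
            if (diag > h) h = diag;
            if (up > h) h = up;
            if (left > h) h = left;
            curr[j] = h;
            if (h > best) best = h;
        }
        /* The finished row becomes the previous row for the next pass. */
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }
    free(prev);
    free(curr);
    return best;
}
```

For two sequences of lengths m and n this performs m × n cell updates, so a rate of 15.7 GCUPS corresponds to about 15.7 × 10⁹ such inner-loop steps per second.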

133 citations

Year	Papers
2018	2
2017	7
2016	5
2015	11
2014	10
2013	11
