About: SSE2 is a research topic. Over the lifetime, 176 publications have been published within this topic receiving 5048 citations. The topic is also known as: SSE2 & SSE 2.
TL;DR: MMX technology extends the Intel architecture to improve the performance of multimedia, communications, and other numeric-intensive applications by introducing data types and instructions to the IA that exploit the parallelism in these applications.
Abstract: Designed to accelerate multimedia and communications software, MMX technology improves performance by introducing data types and instructions to the IA that exploit the parallelism in these applications. MMX technology extends the Intel architecture (IA) to improve the performance of multimedia, communications, and other numeric-intensive applications. It uses a SIMD (single-instruction, multiple-data) technique to exploit the parallelism inherent in many algorithms, producing full application performance of 1.5 to 2 times faster than the same applications run on the same processor without MMX. The extension also maintains full compatibility with existing IA microprocessors, operating systems, and applications while providing new instructions and data types that applications can use to achieve a higher level of performance on the host CPU.
TL;DR: The first hardware implementation of Intel TSX is evaluated using a set of high-performance computing (HPC) workloads, and it is demonstrated that applying IntelTSX to these workloads can provide significant performance improvements.
Abstract: Intel has recently introduced Intel® Transactional Synchronization Extensions (Intel® TSX) in the Intel 4th Generation Core™ Processors. With Intel TSX, a processor can dynamically determine whether threads need to serialize through lock-protected critical sections. In this paper, we evaluate the first hardware implementation of Intel TSX using a set of high-performance computing (HPC) workloads, and demonstrate that applying Intel TSX to these workloads can provide significant performance improvements. On a set of real-world HPC workloads, applying Intel TSX provides an average speedup of 1.41x. When applied to a parallel user-level TCP/IP stack, Intel TSX provides 1.31x average bandwidth improvement on network intensive applications. We also demonstrate the ease with which we were able to apply Intel TSX to the various workloads.
TL;DR: The streaming SIMD extensions (SSE) provides a rich set of instructions to meet the requirements of demanding multimedia and Internet applications and makes a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.
Abstract: This paper describes the streaming SIMD extensions (SSE) provides a rich set of instructions to meet the requirements of demanding multimedia and Internet applications. In implementing the SSE, the Pentium III developers made a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.
TL;DR: A detailed overview of the automatic vectorization methods used by the high-performance Intel® C++/Fortran compiler together with an experimental validation of their effectiveness are provided.
Abstract: Recent extensions to the Intel® Architecture feature the SIMD technique to enhance the performance of computational intensive applications that perform the same operation on different elements in a data set. To date, much of the code that exploits these extensions has been hand-coded. The task of the programmer is substantially simplified, however, if a compiler does this exploitation automatically. The high-performance Intel® C++/Fortran compiler supports automatic translation of serial loops into code that uses the SIMD extensions to the Intel® Architecture. This paper provides a detailed overview of the automatic vectorization methods used by this compiler together with an experimental validation of their effectiveness.
TL;DR: benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4.
Abstract: Background: We present swps3, a vectorized implementation of the Smith-Waterman local alignment algorithm optimized for both the Cell/BE and ×86 architectures. The paper describes swps3 and compares its performances with several other implementations. Findings: Our benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4: on a Playstation 3, it achieves up to 8.0 billion cell-updates per second (GCUPS). Using the SSE2 instruction set, a quad-core Intel Pentium can reach 15.7 GCUPS. We also show that swps3 on this CPU is faster than a recent GPU implementation. Finally, we note that under some circumstances, alignments are computed at roughly the same speed as BLAST, a heuristic method. Conclusion: The Cell/BE can be a powerful platform to align biological sequences. Besides, the performance gap between exact and heuristic methods has almost disappeared, especially for long protein sequences.