TL;DR: A parallel compiler for the Brook streaming language with aggressive data and computation transformations that effectively reduces memory footprint, exploits parallelism, and circumvents phase-ordering issues is developed.
Abstract: Multicore processors are about to become prevalent in the PC world. Meanwhile, over 90% of the computing cycles are estimated to be consumed by streaming media applications (Rixner et al., 1998). Although stream programming exposes parallelism naturally, we found that achieving high performance on multiprocessors is challenging. Therefore, we develop a parallel compiler for the Brook streaming language with aggressive data and computation transformations. First, we formulate fifteen Brook stream operators in terms of systems of inequalities. Our compiler optimizes the modeled operators to improve memory footprint and performance. Second, the stream computation including both kernels and operators is mapped to the affine partitioning model by modeling each kernel as an implicit loop nest over stream elements. Note that our general abstraction is not limited to Brook. Our modeling and transformations yield high performance on uniprocessors as well. The geometric mean of speedups is 4.7 on ten streaming applications on a Xeon. On multiprocessors, we show that exploiting the standard intra-kernel data parallelism is inferior to our general modeling. The former yields a speedup of 1.5 for ten applications on a 4-way Xeon, while the latter achieves a speedup of 6.4 over the same baseline. We show that our compiler effectively reduces memory footprint, exploits parallelism, and circumvents phase-ordering issues.
TL;DR: This article consists of a collection of slides from the author's conference presentation on Intel's Core product line's microarchitecture, a new foundation for Intel architecture-based mobile, desktop, and server processors that incorporates advanced innovations which optimize performance over a range of market segments.
Abstract: This article consists of a collection of slides from the author's conference presentation on Intel's Core product line's microarchitecture, a new foundation for Intel architecture-based mobile, desktop, and server processors that incorporates advanced innovations which optimize performance over a range of market segments. Some of the specific topics discussed include: the special features and system specifications of Intel Core; memory management and prefetching capabilities; system performance and flexibility; multithreading capabilities; and a summary of key features and processing facilities.
TL;DR: This paper evaluates and characterizes the most time-consuming parts in the execution path of the memory registration function using the Read Time Stamp Counter (RDTSC) instruction, and presents results using Linux hugepage support to shorten the time of registering a memory region.
Abstract: To leverage high-speed interconnects like InfiniBand, it is important to minimize the communication overhead. The most significant overhead is the registration of communication memory.
In this paper, we present our analysis of the memory registration process inside the Mellanox InfiniBand driver and possible ways out of this bottleneck. We evaluate and characterize the most time-consuming parts in the execution path of the memory registration function using the Read Time Stamp Counter (RDTSC) instruction. We present measurements on AMD Opteron and Intel Xeon systems with different types of Host Channel Adapters for PCI-X and PCI-Express. Finally, we conclude with first results using Linux hugepage support to shorten the time of registering a memory region.
TL;DR: The results show that the MPICH implementation greatly outperforms the PVM HMMER implementation, and the SSE2 implementation also lends greater computational power at no cost to the user.
Abstract: Due to the ever-increasing size of sequence databases, it has become clear that faster techniques must be employed to effectively perform biological sequence analysis in a reasonable amount of time. Exploiting the inherent parallelism between sequences is a common strategy. In this paper we enhance both the fine-grained and coarse-grained parallelism within the HMMER sequence analysis suite. Our strategies are complementary to one another and, where necessary, can be used as drop-in replacements to the strategies already provided within HMMER. We use conventional processors (Intel Pentium IV Xeon) as well as the freely available MPICH parallel programming environment. Our results show that the MPICH implementation greatly outperforms the PVM HMMER implementation, and our SSE2 implementation also lends greater computational power at no cost to the user.
TL;DR: An improvement to a limited resource Montgomery multiplier design, the MWR2MM algorithm proposed by Tenca and Koc, which is suitable for implementation on large FPGAs and can be scaled to utilize available FPGA multipliers, CLB logic and frequencies of operation.
Abstract: The RSA algorithm is the standard for public-key cryptography today, with Montgomery multiplication the most common mechanism of implementation because it performs modular reduction with bitwise shifts in place of division. Several Montgomery designs have been proposed for ASIC and FPGA implementation based on limited resource availability to satisfy the computational burden. FPGAs are now available that have large configurable logic resources in addition to dedicated high-speed ALU logic for operations such as multiplication. This paper describes an improvement to a limited resource Montgomery multiplier design, the MWR2MM algorithm proposed by Tenca and Koc, which is suitable for implementation on large FPGAs. The design can be scaled to utilize available FPGA multipliers, CLB logic and frequencies of operation. Implementation and design choices are discussed for an RSA core based on this design, and a comparison against the OpenSSL open source cryptographic library is given. Our results show a 1024-bit RSA core on a 100 MHz Virtex2 Pro 100 FPGA platform to be 3.13x faster than an equivalent software implementation on a 2.8 GHz Intel Xeon PC workstation.
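The shift-instead-of-divide idea behind Montgomery multiplication can be illustrated with a minimal sketch. This is the plain bit-serial radix-2 form, not the multiple-word MWR2MM design or the improved architecture the paper describes; the function names and the parameter k (bit width chosen so that the odd modulus n is below 2^k) are assumptions made for illustration.

```python
def montgomery_multiply(a, b, n, k):
    """Return a * b * 2^(-k) mod n using only additions and right shifts.

    Assumes n is odd, n < 2^k, and 0 <= a, b < n.
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(k):
        if (a >> i) & 1:   # add b when bit i of a is set
            s += b
        if s & 1:          # make s even so the halving below is exact
            s += n
        s >>= 1            # divide by 2 with a shift, no trial division
    if s >= n:             # single final correction keeps the result below n
        s -= n
    return s


def to_montgomery(x, n, k):
    """Map x to its Montgomery form x * 2^k mod n (one-time setup cost)."""
    return (x << k) % n


def from_montgomery(x_bar, n, k):
    """Map a Montgomery-form value back to the ordinary residue."""
    return montgomery_multiply(x_bar, 1, n, k)


if __name__ == "__main__":
    n, k = 97, 7
    a_bar = to_montgomery(45, n, k)
    b_bar = to_montgomery(76, n, k)
    product = from_montgomery(montgomery_multiply(a_bar, b_bar, n, k), n, k)
    print(product == (45 * 76) % n)
```

The conversions into and out of Montgomery form still use an ordinary modulo, which is acceptable because an RSA exponentiation performs them once and then chains many Montgomery multiplications.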
TL;DR: A real-time simulator for large power networks, based on commercial-off-the-shelf products and the RT-LAB platform developed by Opal-RT Technologies Inc., is presented, and the latest advances in hardware-in-the-loop simulation are discussed.
Abstract: This paper presents a real-time simulator for large power networks based on commercial-off-the-shelf products and the RT-LAB platform developed by Opal-RT Technologies Inc. This platform uses Pentium-, Xeon-, or Opteron-based PCs (multi-CPUs and/or dual-core configurations) or even Xilinx FPGA cards for computational engines, and InfiniBand communication fabric for fast inter-PC communication. The real-time PCs run under well-known operating systems QNX or RedHawk Linux, while the main control interface is either Simulink software from MathWorks or LabVIEW software from National Instruments. The paper demonstrates the real-time simulation of a complete, single-pole, 12-pulse, HVDC system on a 2.2 GHz, dual-CPU, dual-core, Opteron PC with an under 15-microsecond time step. Also demonstrated are real-time simulations of complex power system devices like SVCs, STATCOMs, and more general power systems like the Kundur network. The paper also discusses the latest advances in hardware-in-the-loop simulation, including directly programming devices like a PMSM drive in an FPGA card. This feature is enabled in RT-LAB with the Xilinx System Generator, a Simulink blockset. Such FPGA targeting diminishes further the leap between prototype and production-type controller systems because an FPGA card can implement rapid control functions along with fast protection systems, like IGBT-current protection, of a real controller.
TL;DR: The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers.
Abstract: The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.
TL;DR: The lattice Boltzmann method (LBM) performance on commodity “off-the-shelf” clusters with Intel Xeon processors, tailored HPC systems, and a NEC SX8 vector system is discussed.
Abstract: The chapter discusses the lattice Boltzmann method (LBM) performance on commodity “off-the-shelf” clusters with Intel Xeon processors, tailored HPC systems, and a NEC SX8 vector system. The chapter describes the main architectural differences and comments on single processor performance as well as optimization strategies. The parallel performance of a large scale simulation running on up to 2000 processors, providing 2 TFlop/s of sustained performance, is evaluated and presented in the chapter. In the past decade, the LBM has been established as an alternative for the numerical simulation of incompressible flows. One major reason for the success of LBM is the simplicity of its core algorithm, which allows both easy adaptation to complex application scenarios and extension to additional physical or chemical effects. Because LBM is a direct method, the use of extensive computer resources is often mandatory. Thus, LBM has attracted a lot of attention in the high-performance computing (HPC) community. An important feature of many LBM codes is that the core algorithm can be reduced to a few manageable subroutines, facilitating deep performance analysis followed by precise code and data layout optimization.
TL;DR: The experiences of the DIII-D programming staff in adapting Linux based Intel computing hardware for use in real-time data acquisition and feedback control systems are described.
TL;DR: This paper demonstrates the real-time simulation of a complete single-pole 12-pulse HVDC system on a dual-CPU, dual-core 2.2 GHz Opteron PC with a time step under 15 microseconds, and discusses the latest advances in hardware-in-the-loop simulation, such as the possibility to program controllers or devices, like a PMSM drive, from within Simulink directly in an FPGA card.
Abstract: This paper presents a real-time simulator for large power networks based on Commercial-Off-The-Shelf technologies, all embedded in the RT-LAB real-time simulation platform. This platform uses Pentium-, Xeon-, or Opteron-based PCs (multi-CPUs and/or dual-core configurations) or even Xilinx FPGA cards for computational engines and InfiniBand communication fabric for fast inter-PC communication. The real-time PCs run under well-known operating systems QNX or RedHawk Linux while the main user control interface is either Simulink or LabVIEW. The paper demonstrates the real-time simulation of a complete single-pole 12-pulse HVDC system on a dual-CPU, dual-core 2.2 GHz Opteron PC with a time step under 15 microseconds. It also demonstrates the real-time simulation of complex power system devices like SVC, STATCOM and more general power systems like the Kundur network. The paper also discusses the latest advances in Hardware-In-the-Loop simulation, such as the possibility to program controllers or devices, like a PMSM drive, from within Simulink directly in an FPGA card. This feature is enabled in RT-LAB by the use of Xilinx System Generator, a Simulink blockset. This FPGA targeting diminishes further the leap between prototype and production-type controller systems because the FPGA can implement rapid control functions along with fast protection systems, like IGBT-current protection, of a real controller.
TL;DR: A full hardware implementation of the K-means clustering algorithm has been designed and implemented in reconfigurable hardware that clusters 512k documents rapidly and outperforms the equivalent software by a factor of 328.
Abstract: High-performance document clustering systems enable similar documents to be automatically organized into groups. In the past, the large amount of computational time needed to cluster documents prevented practical use of such systems with a large number of documents. A full hardware implementation of the K-means clustering algorithm has been designed and implemented in reconfigurable hardware that clusters 512k documents rapidly. This implementation uses four parallel cosine distance metrics to cluster document vectors that each have 4000 dimensions. The synthesized hardware runs on the Field Programmable Port Extender (FPX) platform at a clock rate of 80 MHz. Although the clock rate on the Xilinx VirtexE 2000 is slower than a CPU, the implementation runs 26 times faster than an algorithmically equivalent software algorithm running on an Intel 3.60 GHz Xeon. The same architecture was used to synthesize a faster and larger design for the Xilinx Virtex4 LX200. This larger implementation can contain up to 25 parallel cosine distance metrics. The implementation synthesized with a clock rate of 250 MHz and outperforms the equivalent software by a factor of 328.
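For readers unfamiliar with K-means under a cosine distance, the following is a minimal software sketch of the assign-then-update loop. It is not the hardware design from the paper; the function names, the fixed iteration count, and the choice to leave an empty cluster's centroid unchanged are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); zero vectors get distance 1."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def kmeans_cosine(docs, centroids, iterations=10):
    """Cluster document vectors around the given initial centroids."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each document joins its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda j: cosine_distance(doc, centroids[j]))
            for doc in docs
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


if __name__ == "__main__":
    docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.8, 0.2]]
    labels, _ = kmeans_cosine(docs, [[1, 0, 0], [0, 1, 0]])
    print(labels)
```

The distance evaluations inside the assignment step are independent of one another, which is the part a design with several parallel cosine distance units can compute concurrently.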
TL;DR: This article shows how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels, and presents runtime functions for the modeling of two-phase realizations that can predict the execution time both for communication operations in isolation and in the context of application programs.
Abstract: MPI collective communication operations to distribute or gather data are used for many parallel applications from scientific computing, but they may lead to scalability problems since their execution times increase with the number of participating processors. In this article, we show how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on a dual Xeon cluster and a Beowulf cluster with single-processor nodes. On a Cray T3E, too, a significant performance improvement can be obtained by a careful selection of the processor structure. The use of these optimized communication operations can reduce the execution time of data parallel implementations of complex application programs significantly without requiring any other change of the computation and communication structure. We present runtime functions for the modeling of two-phase realizations and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs.
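One way to picture a two-phase broadcast over an orthogonal (grid-like) processor structure is the small simulation below. It is a sketch under stated assumptions, not the article's implementation: ranks are assumed to be numbered row by row in a rows x cols grid, the function name is hypothetical, and the plain loops stand in for whatever broadcast algorithm an MPI library would use inside each group.

```python
def two_phase_broadcast(value, root, rows, cols):
    """Simulate a broadcast over rows * cols ranks arranged as a grid.

    Returns the data held by every rank and the list of
    (phase, sender, receiver) messages that were exchanged.
    """
    received = {root: value}
    messages = []
    root_col = root % cols

    # Phase 1: the root reaches one leader per row, namely the rank
    # that sits in the root's own column.
    for r in range(rows):
        leader = r * cols + root_col
        if leader != root:
            received[leader] = received[root]
            messages.append((1, root, leader))

    # Phase 2: every leader forwards the data within its own row.
    # The rows are disjoint groups, so they could proceed concurrently.
    for r in range(rows):
        leader = r * cols + root_col
        for c in range(cols):
            rank = r * cols + c
            if rank != leader:
                received[rank] = received[leader]
                messages.append((2, leader, rank))

    return received, messages


if __name__ == "__main__":
    data, log = two_phase_broadcast("payload", root=5, rows=3, cols=4)
    print(len(data), len(log))
```

The total number of messages is the same as in a flat broadcast; the potential gain lies in the second phase, where the row groups work independently, and in choosing a grid shape that suits the machine.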
TL;DR: The implementation and evaluation of a stochastic simulation algorithm (SSA) called "first reaction method" on an FPGA-based biochemical simulator that achieves high throughput by consecutively throwing data into deeply-pipelined floating point arithmetic units and distributing multiple simulators for parallel execution.
Abstract: Stochastic simulation of biochemical systems has become one of the major approaches to studying life processes as systems, yet running the simulation is a computational challenge due to its vast calculation cost. This paper shows the implementation and evaluation of a stochastic simulation algorithm (SSA) called "First Reaction Method" on an FPGA-based biochemical simulator. It achieves high throughput by (1) consecutively throwing data into deeply-pipelined floating point arithmetic units, and (2) distributing multiple simulators for parallel execution. As the result of evaluation on an FPGA-based simulation platform called ReC-SiP2, the simulator outperforms execution on Xeon 2.80 GHz by approximately 80 times, even with large-scale biochemical systems.
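As a software illustration of the First Reaction Method, the sketch below draws a tentative exponential waiting time for every reaction and fires the one with the smallest time. It is not the pipelined FPGA design; the data layout (a propensity function paired with a dictionary of state changes) and the function names are assumptions made for illustration.

```python
import random


def first_reaction_step(state, reactions, rng):
    """Pick the next reaction; return (waiting_time, new_state) or None."""
    best = None
    for propensity, change in reactions:
        a = propensity(state)
        if a <= 0:
            continue  # this reaction cannot fire in the current state
        tau = rng.expovariate(a)  # tentative firing time for this reaction
        if best is None or tau < best[0]:
            best = (tau, change)
    if best is None:
        return None  # nothing can fire any more
    tau, change = best
    new_state = {s: state[s] + change.get(s, 0) for s in state}
    return tau, new_state


def simulate(state, reactions, t_end, seed=0):
    """Run the First Reaction Method until t_end or until no reaction can fire."""
    rng = random.Random(seed)
    t = 0.0
    state = dict(state)
    while True:
        step = first_reaction_step(state, reactions, rng)
        if step is None:
            break
        tau, new_state = step
        if t + tau > t_end:
            break
        t += tau
        state = new_state
    return t, state


if __name__ == "__main__":
    # Hypothetical system: A -> B with rate constant 0.5 per molecule of A.
    reactions = [(lambda s: 0.5 * s["A"], {"A": -1, "B": 1})]
    print(simulate({"A": 10, "B": 0}, reactions, t_end=5.0))
```

Each reaction's tentative time needs only its own propensity and one random number, so the per-reaction work maps naturally onto pipelined arithmetic units, and independent runs with different seeds can execute in parallel.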
TL;DR: Combination of vectorization and the block six-step FFT algorithm is shown to effectively improve performance and the performance results for one-dimensional FFTs on dual-core Intel Xeon processors are reported.
Abstract: In the present paper, an implementation of a parallel one-dimensional fast Fourier transform (FFT) using Streaming SIMD Extensions 3 (SSE3) instructions on dual-core processors is proposed. Combination of vectorization and the block six-step FFT algorithm is shown to effectively improve performance. The performance results for one-dimensional FFTs on dual-core Intel Xeon processors are reported. We successfully achieved performance of approximately 2006 MFLOPS on a dual-core Intel Xeon PC (2.8 GHz, two CPUs, four cores) and approximately 3492 MFLOPS on a dual-core Intel Xeon 5150 PC (2.66 GHz, two CPUs, four cores) for a 2^20-point FFT.
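The six-step FFT decomposes a transform of length N = n1 * n2 into short transforms separated by transposes and a twiddle-factor multiplication. The sketch below shows only that structure: it omits the cache blocking, SSE3 vectorization and multi-core parallelism the paper is about, and it uses a naive DFT for the short transforms purely to stay self-contained. All names are assumptions made for illustration.

```python
import cmath


def dft(v):
    """Naive O(n^2) discrete Fourier transform, used for the short transforms."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def six_step_fft(x, n1, n2):
    """Compute the length n1 * n2 DFT of x using the six-step decomposition."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # View x as an n2 x n1 matrix stored row by row.
    rows = [x[j2 * n1:(j2 + 1) * n1] for j2 in range(n2)]
    a = transpose(rows)                    # step 1: transpose to n1 x n2
    a = [dft(row) for row in a]            # step 2: n1 transforms of length n2
    a = [                                  # step 3: twiddle factors w_N^(j1*k2)
        [a[j1][k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n) for k2 in range(n2)]
        for j1 in range(n1)
    ]
    b = transpose(a)                       # step 4: transpose to n2 x n1
    b = [dft(row) for row in b]            # step 5: n2 transforms of length n1
    c = transpose(b)                       # step 6: transpose to n1 x n2
    return [value for row in c for value in row]


if __name__ == "__main__":
    signal = [complex(i % 5, -i) for i in range(12)]
    fast = six_step_fft(signal, 3, 4)
    slow = dft(signal)
    print(max(abs(p - q) for p, q in zip(fast, slow)) < 1e-9)
```

Steps 2 and 5 consist of independent short transforms over rows, which is where vector instructions and multiple cores can be applied, while the transposes keep each short transform working on contiguous data.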
TL;DR: This paper describes a 95 W dual-core 64-bit Xeon® MP processor implemented in a 65 nm 8 metal layer process that supports the Intel® Extended Memory 64 Technology and the Hyper-Threading Technology.
Abstract: This paper describes a 95 W dual-core 64-bit Xeon® MP processor implemented in a 65 nm 8 metal layer process. Each processor core has a unified 1 MB L2 cache and supports the Intel® Extended Memory 64 Technology and the Hyper-Threading Technology. The shared L3 cache has extensive RAS features including the Intel® Cache Safe Technology and Error Correction Codes (ECC). The processor is designed and optimized to operate at a 95 W thermal design power envelope at the target product frequency. The front-side bus operates at 667 MT/s or 800 MT/s in a 3 load topology that is compatible with existing platforms.
TL;DR: The results show many limitations of these networks including the memory contention within a node as the number of communicating processors increased and the limitations of the network interface for communication between multiple processors of different nodes.
Abstract: We study the performance of high-speed interconnects using a set of communication micro-benchmarks. The goal is to identify certain limiting factors and bottlenecks with these interconnects. Our micro-benchmarks are based on dense communication patterns with different communicating partners and varying degrees of these partners. We tested our micro-benchmarks on five platforms: an IBM system of 68-node 16-way Power3, interconnected by an SP switch2; another IBM system of 264-node 4-way Power PC 604e, interconnected by an SP switch; a Compaq cluster of 128-node 4-way ES40/EV67 processor, interconnected by a Quadrics interconnect; an Intel cluster of 16-node dual-CPU Xeon, interconnected by a Quadrics interconnect; and a cluster of 22-node Sun Ultra Sparc, interconnected by an Ethernet network. Our results show many limitations of these networks including the memory contention within a node as the number of communicating processors increased and the limitations of the network interface for communication between multiple processors of different nodes.
TL;DR: Between 2001 and 2005, the Swathbuckler wide-swath SAR real-time image formation multinational project evolved a system architecture to continually process 40 km strips into high resolution (<1 m) imagery, while the rapid advance of COTS technology drove the supercomputer cost down from over $1 M to under $100 K.
Abstract: Between 2001 and 2005, the Swathbuckler wide-swath SAR real-time image formation multinational project evolved a system architecture to continually process 40 km strips into high resolution (<1 m) imagery. The rapid advance of COTS memory, I/O, and processor technology drove the supercomputer cost down from over $1 M to under $100 K. This paper discusses the key technology improvements driving the affordable solution. In particular, the 8 gigabytes of memory attached to standard dual Xeon server nodes arranged in a standard cluster greatly simplified the previously daunting task of SAR image formation.
TL;DR: This study proposes asynchronous MPI, a simple and effective parallel programming model for SMP clusters, to reimplement the High Performance Linpack benchmark, and can achieve significant improvements in performance with a minimal programming effort.
Abstract: This study proposes asynchronous MPI, a simple and effective parallel programming model for SMP clusters, to reimplement the High Performance Linpack benchmark. The proposed model forces processors of an SMP node to work in different phases, thereby avoiding unnecessary communication and computation bottlenecks. As a result, we can achieve significant improvements in performance with a minimal programming effort. In comparison with a de facto flat MPI solution, our algorithm can yield a 20.6% performance improvement for a 16-node cluster of Xeon dual-processor SMPs.
TL;DR: In this article, the authors describe the parallelization of an efficient algorithm for balanced truncation that allows the reduction of models with state-space dimension up to O(10^5) by solving two large-scale sparse Lyapunov equations via a coupled LR-ADI iteration with (super-)linear convergence.
Abstract: We describe the parallelization of an efficient algorithm for balanced truncation that allows the reduction of models with state-space dimension up to O(10^5). The major computational task in this approach is the solution of two large-scale sparse Lyapunov equations, performed via a coupled LR-ADI iteration with (super-)linear convergence. Experimental results on a cluster of Intel Xeon processors illustrate the efficacy of our parallel model reduction algorithm.
TL;DR: The design and integration methods successfully used in a multi-threaded dual core 65nm Xeon® Processor are described.
Abstract: The success of building a complex multi-billion transistor processor is very dependent on robust and silicon proven design and integration methods. The complexity of the 65nm process and striving for best in class performance with an aggressive time to market schedule put a heavy emphasis on innovative design and integration methods to enable working silicon. In this paper, we describe the design and integration methods successfully used in a multi-threaded dual core 65nm Xeon® Processor. Index Terms: Design Methods; Integration; processor; Xeon®.
TL;DR: This paper describes the software architecture that boots and manages these clusters, and shows the ways that this architecture has been used to greatly improve the operation of the nodes, with particular emphasis on improvements in boot-time performance, scalability, and reliability.
Abstract: In the last year LANL has constructed a 1408-node AMD Opteron cluster, a 1024-node Intel P4 Xeon cluster, a 256-node AMD Opteron cluster and two 128-node Intel P4 Xeon clusters. Each of these clusters is controlled by one front-end node, and each cluster needs only one disk in the front-end node for production operations. In this paper we describe the software architecture that boots and manages these clusters. This software architecture represents a clean break from the way that clusters have been set up for the last 14 years. We show the ways that this architecture has been used to greatly improve the operation of the nodes, with particular emphasis on improvements in boot-time performance, scalability, and reliability.
TL;DR: This paper proposes the FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Pyramidal Instruction Set Architecture (PISA), which comprises high-level instructions implemented as FPGA functions of coarse-grain SIMD tasks to facilitate ease of program development, code portability and high performance.
Abstract: FPGAs (Field-Programmable Gate Arrays) have been widely used as coprocessors to boost the performance of data-intensive applications [1], [2]. However, there are several challenges to further boost FPGA performance: the communication overhead between the host workstation and the FPGAs can be substantial; large-scale applications cannot fit in a single FPGA because of its limited capacity; mapping an application algorithm to FPGAs still remains a daunting job in configurable system design. To circumvent these problems, we propose in this paper the FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Pyramidal Instruction Set Architecture (PISA). PISA comprises high-level instructions implemented as FPGA functions of coarse-grain SIMD (Single-Instruction, Multiple-Data) tasks to facilitate ease of program development, code portability across different H-SIMD implementations and high performance. We assume a multi-FPGA board where each FPGA is configured as a separate SIMD machine. Multiple FPGA chips can work in unison at a higher SIMD level, if needed, controlled by the host. Additionally, by using a memory switching scheme and the high-level PISA to partition applications into coarse-grain tasks, host-FPGA communication overheads can be hidden. We enlist the two-dimensional Fast Fourier Transform (2D FFT) to test the effectiveness of H-SIMD. The test results show sustained high performance for this problem. The H-SIMD machine even outperforms a Xeon processor for this problem.
TL;DR: A comparative study of Human, Mouse, and Arabidopsis genes to determine if mRNA secondary structure increases with more complex organisms and single-value means of each transcriptome were calculated as a simple average of all transcripts.
Abstract: We performed a comparative study of Human, Mouse, and Arabidopsis genes to determine if mRNA secondary structure increases with more complex organisms. Calculating the secondary structures of a large number of genes in a transcriptome has a high degree of parallelism, and is suitable to implement on multiple computers. The analysis in this investigation was done on a cluster of eighteen (18) Intel x86-based Windows 2000 workstations. These were linked together as a Network of Workstations (NOW) configuration. A Dell PowerEdge 4600 Intel Xeon Windows 2000 dual processor server controlled the NOW workstations and was used for file storage. Single-value means of each transcriptome were calculated as a simple average of all transcripts. This computational research effort is of biomedical interest to provide computational tools to analyze and characterize proteins and mRNAs.
TL;DR: Results show that a critical part of the TRT LUT-Hough algorithm implemented in VHDL runs on the FPGA co-processor ~4 times faster than on the more or less modern CPU and the whole algorithm runs ~2 times faster.
Abstract: In the scope of this thesis, one of the possible approaches to accelerating tracking algorithms using hybrid FPGA/CPU systems has been investigated. The TRT LUT-Hough algorithm, one of the tracking algorithms for the ATLAS Level2 trigger, is selected for this purpose. It is a Look-Up Table (LUT) based Hough transform algorithm for the Transition Radiation Tracker (TRT). The algorithm was created with B-physics tasks in mind: fast search for low-pT tracks in the entire TRT volume. Such a full subdetector scan requires a lot of computational power. A hybrid implementation of the algorithm (in which the most time-consuming part of the algorithm is accelerated by an FPGA co-processor and all other parts run on a general purpose CPU) is integrated in the same software framework as a C++ implementation for comparison. Identical physical results are obtained for both the CPU and the hybrid implementations. Timing measurement results show that the critical part, implemented in VHDL, runs on the FPGA co-processor ~4 times faster than on the more or less modern CPU (Intel Xeon 2.4 GHz) and the whole algorithm runs ~2 times faster.
TL;DR: The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of six leading supercomputers.
Abstract: The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of six leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, NEC SX-8 and IBM Blue Gene/L. These six systems also use six different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, NEC IXS and IBM Blue Gene/L Torus). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on five of these systems.
TL;DR: A multi-threaded dual-core Xeon® MP processor with 16 MB of L3 cache and operating at a top frequency of 3.4 GHz has been developed using a non-traditional SOC design methodology on a 65 nm process technology.
Abstract: A multi-threaded dual-core Xeon® MP processor with 16 MB of L3 cache and operating at a top frequency of 3.4 GHz has been developed using a non-traditional SOC design methodology on a 65 nm process technology. The design methodology embodied highly controlled, customized, and high impact changes to the underlying pre-existing processor cores resulting in performance and functionality that approaches a fully custom design while maintaining high re-use of the existing processor core. This paper presents the key design methodologies and the challenges.
TL;DR: A collection of slides from the author's conference presentation on the Tulsa Processor from Intel discuss the special features of the Tulsa processor, applications for its use, processing capabilities, and targeted markets for its deployment.
Abstract: This article consists of a collection of slides from the author's conference presentation on the Tulsa Processor from Intel. Some of the specific topics discussed include: the special features of the Tulsa processor; applications for its use; processing capabilities; targeted markets for its deployment; paths to multi-core designs; options for multiple core processing; the Tulsa engineering experience based on product implementation and use; system architecture; and tested performance output results.
TL;DR: It is shown that a combination of the vectorization and block three-dimensional FFT algorithm improves performance effectively and successfully achieved performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.
Abstract: In this paper, we propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on clusters of PCs. We vectorized FFT kernels using Intel's Streaming SIMD Extensions 2 (SSE2) instructions. We show that a combination of the vectorization and block three-dimensional FFT algorithm improves performance effectively. Performance results of three-dimensional FFTs on a dual Xeon 2.8 GHz PC SMP cluster are reported. We successfully achieved performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.
TL;DR: A new algorithm is proposed that is effective for objects that are shared among threads but are not contended for in SMP environments and can remove the overhead of the serialization between lock and other non-lock operations and avoid the latency of complex atomic operations.
Abstract: We propose a new algorithm that is effective for objects that are shared among threads but are not contended for in SMP environments. We can remove the overhead of the serialization between lock and other non-lock operations and avoid the latency of complex atomic operations in most cases. We established the safety of the algorithm by using a software tool called Spin. The experimental results from our benchmarking on an SMP machine using Intel Xeon processors revealed that our algorithm could significantly improve efficiency by 80% on average compared to using complex atomic instructions.