Top 38 papers published on the topic of Xeon in 2006

Showing papers on "Xeon" published in 2006

Proceedings Article•10.1109/CGO.2006.13•

Data and Computation Transformations for Brook Streaming Applications on Multiprocessors

Shih-Wei Liao¹, Zhaohui Du¹, Gansha Wu¹, Guei-Yuan Lueh¹•Institutions (1)

26 Mar 2006

TL;DR: A parallel compiler for the Brook streaming language with aggressive data and computation transformations that effectively reduces memory footprint, exploits parallelism, and circumvents phase-ordering issues is developed.

Abstract: Multicore processors are about to become prevalent in the PC world. Meanwhile, over 90% of the computing cycles are estimated to be consumed by streaming media applications (Rixner et al., 1998). Although stream programming exposes parallelism naturally, we found that achieving high performance on multiprocessors is challenging. Therefore, we develop a parallel compiler for the Brook streaming language with aggressive data and computation transformations. First, we formulate fifteen Brook stream operators in terms of systems of inequalities. Our compiler optimizes the modeled operators to improve memory footprint and performance. Second, the stream computation including both kernels and operators is mapped to the affine partitioning model by modeling each kernel as an implicit loop nest over stream elements. Note that our general abstraction is not limited to Brook. Our modeling and transformations yield high performance on uniprocessors as well. The geometric mean of speedups is 4.7 on ten streaming applications on a Xeon. On multiprocessors, we show that exploiting the standard intra-kernel data parallelism is inferior to our general modeling. The former yields a speedup of 1.5 for ten applications on a 4-way Xeon, while the latter achieves a speedup of 6.4 over the same baseline. We show that our compiler effectively reduces memory footprint, exploits parallelism, and circumvents phase-ordering issues.

114 citations

Proceedings Article•10.1109/HOTCHIPS.2006.7477876•

Inside Intel® Core microarchitecture

Jack Doweck¹•Institutions (1)

Intel¹

1 Aug 2006

TL;DR: This article consists of a collection of slides from the author's conference presentation on Intel's Core product line's microarchitecture, a new foundation for Intel architecture-based mobile, desktop, and server processors that incorporates advanced innovations which optimize performance over a range of market segments.

Abstract: This article consists of a collection of slides from the author's conference presentation on Intel's Core product line's microarchitecture, a new foundation for Intel architecture-based mobile, desktop, and server processors that incorporates advanced innovations which optimize performance over a range of market segments. Some of the specific topics discussed include: the special features and system specifications of Intel Core; memory management and prefetching capabilities; system performance and flexibility; multithreading capabilities; and a summary of key features and processing facilities.

101 citations

Book Chapter•10.1007/11823285_13•

Analysis of the memory registration process in the mellanox infiniband software stack

Frank Mietke¹, Robert Rex¹, Robert Baumgartl¹, Torsten Mehlan¹, Torsten Hoefler¹, Wolfgang Rehm¹•Institutions (1)

Chemnitz University of Technology¹

28 Aug 2006

TL;DR: This paper evaluates and characterizes the most time-consuming parts in the execution path of the memory registration function using the Read Time Stamp Counter (RDTSC) instruction, and presents results using Linux hugepage support to shorten the time of registering a memory region.

Abstract: To leverage high speed interconnects like InfiniBand it is important to minimize the communication overhead. The most interfering overhead is the registration of communication memory. In this paper, we present our analysis of the memory registration process inside the Mellanox InfiniBand driver and possible ways out of this bottleneck. We evaluate and characterize the most time consuming parts in the execution path of the memory registration function using the Read Time Stamp Counter (RDTSC) instruction. We present measurements on AMD Opteron and Intel Xeon systems with different types of Host Channel Adapters for PCI-X and PCI-Express. Finally, we conclude with first results using Linux hugepage support to shorten the time of registering a memory region.

52 citations

Proceedings Article•10.1109/AINA.2006.68•

Accelerating the HMMER sequence analysis suite using conventional processors

John Paul Walters¹, B. Qudah¹, Vipin Chaudhary¹•Institutions (1)

Wayne State University¹

18 Apr 2006

TL;DR: The results show that the MPICH implementation greatly outperforms the PVM HMMER implementation, and the SSE2 implementation also lends greater computational power at no cost to the user.

Abstract: Due to the ever-increasing size of sequence databases it has become clear that faster techniques must be employed to effectively perform biological sequence analysis in a reasonable amount of time. Exploiting the inherent parallelism between sequences is a common strategy. In this paper we enhance both the fine-grained and coarse-grained parallelism within the HMMER sequence analysis suite. Our strategies are complementary to one another and, where necessary, can be used as drop-in replacements to the strategies already provided within HMMER. We use conventional processors (Intel Pentium IV Xeon) as well as the freely available MPICH parallel programming environment. Our results show that the MPICH implementation greatly outperforms the PVM HMMER implementation, and our SSE2 implementation also lends greater computational power at no cost to the user.

40 citations

Proceedings Article•10.1109/FPL.2006.311207•

A Scalable Architecture for RSA Cryptography on Large FPGAs

E. Michalski¹, Duncan A. Buell¹•Institutions (1)

University of South Carolina¹

1 Aug 2006

TL;DR: An improvement to a limited resource Montgomery multiplier design, the MWR2MM algorithm proposed by Tenca and Koc, which is suitable for implementation on large FPGAs and can be scaled to utilize available FPGA multipliers, CLB logic and frequencies of operation.

Abstract: The RSA algorithm is the standard for public-key cryptography today, with Montgomery multiplication the most common mechanism of implementation due to modulo operations using a bitwise shift in place of a division operation. Several Montgomery designs have been proposed for ASIC and FPGA implementation based on limited resource availability to satisfy the computational burden. FPGAs are now available that have large configurable logic resources in addition to dedicated high-speed ALU logic for operations such as multiplication. This paper describes an improvement to a limited resource Montgomery multiplier design, the MWR2MM algorithm proposed by Tenca and Koc, which is suitable for implementation on large FPGAs. The design can be scaled to utilize available FPGA multipliers, CLB logic and frequencies of operation. Implementation and design choices are discussed for an RSA core based on this design, and a comparison against the OpenSSL open source cryptographic library is given. Our results show a 1024-bit RSA core on a 100MHz Virtex2 Pro 100 FPGA platform to be 3.13x faster than an equivalent software implementation on a 2.8 GHz Intel Xeon PC workstation.
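
As a rough software illustration of the Montgomery multiplication mentioned in this abstract, the sketch below shows the basic single-pass reduction, in which the division by the modulus is replaced by a mask and a shift. It is not the MWR2MM hardware design described in the paper: the word-serial, radix-2 scheduling is omitted, and the names `montgomery_setup`, `mont_mul`, and `mod_exp` are hypothetical. It assumes Python 3.8 or later for the modular inverse via `pow(n, -1, r)`.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n with R = 2**k > n."""
    if n % 2 == 0 or n >= (1 << k):
        raise ValueError("n must be odd and smaller than 2**k")
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod R)
    r2 = (r * r) % n                 # R^2 mod n, used to enter Montgomery form
    return n_prime, r2


def mont_mul(a, b, n, k, n_prime):
    """Return a * b * R^-1 mod n; the reduction uses only a mask and a shift."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n' mod R
    u = (t + m * n) >> k                # t + m*n is divisible by R, so this is exact
    return u - n if u >= n else u


def mod_exp(base, exp, n, k):
    """Square-and-multiply exponentiation built on mont_mul."""
    n_prime, r2 = montgomery_setup(n, k)
    x = mont_mul(base % n, r2, n, k, n_prime)   # base in Montgomery form
    acc = mont_mul(1, r2, n, k, n_prime)        # 1 in Montgomery form
    while exp:
        if exp & 1:
            acc = mont_mul(acc, x, n, k, n_prime)
        x = mont_mul(x, x, n, k, n_prime)
        exp >>= 1
    return mont_mul(acc, 1, n, k, n_prime)      # leave Montgomery form
```

For an RSA-sized operand one would pick `k` equal to the modulus width (for example 1024); the result of `mod_exp(base, exp, n, k)` should then agree with Python's built-in `pow(base, exp, n)`.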

25 citations

Proceedings Article•10.1109/IECON.2006.347675•

InfiniBand-Based Real-Time Simulation of HVDC, STATCOM, and SVC Devices with Commercial-Off-The-Shelf PCs and FPGAs

Christian Dufour, Simon Abourida, Jean Belanger, Vincent Lapointe

1 Nov 2006

TL;DR: A real-time simulator for large power networks, based on commercial-off-the-shelf products and the RT-LAB platform developed by Opal-RT Technologies Inc., is presented, and the latest advances in hardware-in-the-loop simulation are discussed.

Abstract: This paper presents a real-time simulator for large power networks based on commercial-off-the-shelf products and the RT-LAB platform developed by Opal-RT Technologies Inc. This platform uses Pentium-, Xeon-, or Opteron-based PCs (multi-CPUs and/or dual-core configurations) or even Xilinx FPGA cards for computational engines, and InfiniBand communication fabric for fast inter-PC communication. The real-time PCs run under well-known operating systems QNX or RedHawk Linux, while the main control interface is either Simulink software from MathWorks or LabVIEW software from National Instruments. The paper demonstrates the real-time simulation of a complete, single-pole, 12-pulse, HVDC system on a 2.2 GHz, dual-CPU, dual-core, Opteron PC with an under 15-microsecond time step. Also demonstrated are real-time simulations of complex power system devices like SVCs, STATCOMs, and more general power systems like the Kundur network. The paper also discusses the latest advances in hardware-in-the-loop simulation, including directly programming devices like a PMSM drive in an FPGA card. This feature is enabled in RT-LAB with the Xilinx System Generator, a Simulink blockset. Such FPGA targeting diminishes further the leap between prototype and production-type controller systems because an FPGA card can implement rapid control functions along with fast protection systems, like IGBT-current protection, of a real controller.

23 citations

Proceedings Article•10.5555/1898699.1898849•

Performance evaluation of supercomputers using HPCC and IMB benchmarks

Subhash Saini¹, Robert Ciotti¹, Brian T. N. Gunney², Thomas E. Spelce², Alice Koniges², Don Dossa², Panagiotis Adamidis³, Rolf Rabenseifner³, Sunil R. Tiyyagura³, Matthias S. Mueller, Rod Fatoohi⁴•Institutions (4)

Ames Research Center¹, Lawrence Livermore National Laboratory², University of Stuttgart³, San Jose State University⁴

25 Apr 2006

TL;DR: The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers.

Abstract: The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks are run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.

21 citations

Book Chapter•10.1016/B978-044452206-1/50005-7•

Towards Optimal Performance for Lattice Boltzmann Applications on Terascale Computers

Gerhard Wellein, Peter Lammers, Georg Hager, S. Donath, Thomas Zeiser

1 Jan 2006

TL;DR: The lattice Boltzmann method (LBM) performance on commodity “off-the-shelf” clusters with Intel Xeon processors, tailored HPC systems, and a NEC SX8 vector system is discussed.

Abstract: Publisher Summary: The chapter discusses the lattice Boltzmann method (LBM) performance on commodity “off-the-shelf” clusters with Intel Xeon processors, tailored HPC systems, and a NEC SX8 vector system. The chapter describes the main architectural differences and comments on single processor performance as well as optimization strategies. The parallel performance of a large-scale simulation running on up to 2000 processors, providing 2 TFlop/s of sustained performance, is evaluated and presented in the chapter. In the past decade, the LBM has been established as an alternative for the numerical simulation of incompressible flows. One major reason for the success of LBM is the simplicity of its core algorithm, which allows both easy adaptation to complex application scenarios and extension to additional physical or chemical effects. Because LBM is a direct method, the use of extensive computer resources is often mandatory. Thus, LBM has attracted a lot of attention in the high-performance computing (HPC) community. An important feature of many LBM codes is that the core algorithm can be reduced to a few manageable subroutines, facilitating deep performance analysis followed by precise code and data layout optimization.

19 citations

Journal Article•10.1016/J.FUSENGDES.2006.04.008•

Real-time data acquisition and feedback control using Linux Intel computers

B.G. Penaflor¹, J.R. Ferron¹, D.A. Piglowski¹, Robert D. Johnson¹, M.L. Walker¹•Institutions (1)

General Atomics¹

01 Jul 2006-Fusion Engineering and Design

TL;DR: The experiences of the DIII-D programming staff in adapting Linux based Intel computing hardware for use in real-time data acquisition and feedback control systems are described.

18 citations

Proceedings Article•10.1109/ISIE.2006.295884•

InfiniBand-Based Real-Time Simulation of HVDC, STATCOM and SVC Devices with Custom-Of-The-Shelf PCs and FPGAs

Christian Dufour, Simon Abourida, Jean Belanger

9 Jul 2006

TL;DR: This paper demonstrates the real-time simulation of a complete single-pole 12-pulse HVDC system on a dual-CPU, dual-core 2.2 GHz Opteron PC with a time step under 15 microseconds, and discusses the latest advances in hardware-in-the-loop simulation, such as the possibility of programming some controllers or devices, like a PMSM drive, directly in an FPGA card from within Simulink.

Abstract: This paper presents a real-time simulator for large power networks based on Custom-Of-The-Shelf technologies, all embedded in the RT-LAB real-time simulation platform. This platform uses Pentium-, Xeon-, or Opteron-based PCs (multi-CPUs and/or dual-core configurations) or even Xilinx FPGA cards for computational engines and InfiniBand communication fabric for fast inter-PC communication. The real-time PCs run under well-known operating systems QNX or RedHawk Linux, while the main user control interface is either Simulink or LabVIEW. The paper demonstrates the real-time simulation of a complete single-pole 12-pulse HVDC system on a dual-CPU, dual-core 2.2 GHz Opteron PC with a time step under 15 microseconds. It also demonstrates the real-time simulation of complex power system devices like SVC, STATCOM and more general power systems like the Kundur network. The paper also discusses the latest advances in Hardware-In-the-Loop simulation, such as the possibility of programming some controllers or devices, like a PMSM drive, directly in an FPGA card from within Simulink. This feature is enabled in RT-LAB by the use of Xilinx System Generator, a Simulink blockset. This FPGA targeting diminishes further the leap between prototype and production-type controller systems because the FPGA can implement rapid control functions along with fast protection systems, like IGBT-current protection, of a real controller.

18 citations

Proceedings Article•10.1109/FPL.2006.311245•

High Speed Document Clustering in Reconfigurable Hardware

G. Covington¹, Charles L.G. Comstock¹, Andrew Levine¹, John W. Lockwood¹, Young H. Cho¹•Institutions (1)

University of Washington¹

1 Aug 2006

TL;DR: A full hardware implementation of the K-means clustering algorithm has been designed and implemented in reconfigurable hardware that clusters 512k documents rapidly and outperforms the equivalent software by a factor of 328.

Abstract: High-performance document clustering systems enable similar documents to be automatically organized into groups. In the past, the large amount of computational time needed to cluster documents prevented practical use of such systems with a large number of documents. A full hardware implementation of the K-means clustering algorithm has been designed and implemented in reconfigurable hardware that clusters 512k documents rapidly. This implementation uses four parallel cosine distance metrics to cluster document vectors that each have 4000 dimensions. The synthesized hardware runs on the Field Programmable Port Extender (FPX) platform at a clock rate of 80 MHz. Although the clock rate on the Xilinx VirtexE 2000 is slower than that of a CPU, the implementation runs 26 times faster than an algorithmically equivalent software implementation running on an Intel 3.60 GHz Xeon. The same architecture was used to synthesize a faster and larger design for the Xilinx Virtex4 LX200. This larger implementation can contain up to 25 parallel cosine distance metrics. The implementation synthesized with a clock rate of 250 MHz and outperforms the equivalent software by a factor of 328.
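
To make the algorithm named in this abstract concrete, here is a minimal software sketch of K-means using a cosine distance between document vectors. It is only an illustration under stated assumptions, not the hardware design or the software baseline from the paper: the function names `cosine_distance` and `kmeans_cosine`, the fixed iteration count, and the handling of empty clusters are all hypothetical choices.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle) between u and v; a zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def kmeans_cosine(docs, centroids, iterations=10):
    """Alternate between assigning documents to centroids and recomputing centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assignment step: each document goes to the centroid with the smallest
        # cosine distance. These evaluations are independent of one another,
        # which is what makes them candidates for parallel hardware units.
        labels = [
            min(range(len(centroids)), key=lambda j: cosine_distance(d, centroids[j]))
            for d in docs
        ]
        # Update step: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        for j in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

A call such as `kmeans_cosine(vectors, initial_centroids)` returns one cluster label per document together with the final centroids; the vectors in the paper have 4000 dimensions, but the sketch works for any fixed length.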

Journal Article•10.1007/S10586-006-9740-9•

Optimizing MPI collective communication by orthogonal structures

Matthias Kühnemann¹, Thomas Rauber², Gudula Rünger¹•Institutions (2)

Chemnitz University of Technology¹, University of Bayreuth²

01 Jul 2006-Cluster Computing

TL;DR: This article shows how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels, and presents runtime functions for the modeling of two-phase realizations that can predict the execution time both for communication operations in isolation and in the context of application programs.

Abstract: MPI collective communication operations to distribute or gather data are used for many parallel applications from scientific computing, but they may lead to scalability problems since their execution times increase with the number of participating processors. In this article, we show how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on a dual Xeon cluster and a Beowulf cluster with single-processor nodes. But also on a Cray T3E a significant performance improvement can be obtained by a careful selection of the processor structure. The use of these optimized communication operations can reduce the execution time of data parallel implementations of complex application programs significantly without requiring any other change of the computation and communication structure. We present runtime functions for the modeling of two-phase realizations and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs.

Proceedings Article•10.1109/FPL.2006.311218•

An FPGA Implementation of High Throughput Stochastic Simulator for Large-Scale Biochemical Systems

Masato Yoshimi¹, Yasunori Osana¹, Y. Iwaoka¹, Yuri Nishikawa¹, Toshinori Kojima¹, Akira Funahashi, Noriko Hiroi, Yuichiro Shibata², Naoki Iwanaga², H. Kitano, Hideharu Amano¹•Institutions (2)

Keio University¹, Nagasaki University²

1 Dec 2006

TL;DR: The implementation and evaluation of a stochastic simulation algorithm (SSA) called "first reaction method" on an FPGA-based biochemical simulator that achieves high throughput by consecutively throwing data into deeply-pipelined floating point arithmetic units and distributing multiple simulators for parallel execution.

Abstract: Stochastic simulation of biochemical systems has become one of the major approaches to studying life processes as a system, yet running the simulation is a computational challenge because of its vast calculation cost. This paper shows the implementation and evaluation of a stochastic simulation algorithm (SSA) called "First Reaction Method" on an FPGA-based biochemical simulator. It achieves high throughput by (1) consecutively throwing data into deeply-pipelined floating point arithmetic units, and (2) distributing multiple simulators for parallel execution. In an evaluation on an FPGA-based simulation platform called ReC-SiP2, the simulator outperforms execution on a Xeon 2.80 GHz by approximately 80 times, even with large-scale biochemical systems.
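
For readers unfamiliar with the First Reaction Method named in this abstract, the following sketch shows the textbook software form of the algorithm: at every step a tentative firing time is drawn for each reaction from an exponential distribution whose rate is that reaction's propensity, and the reaction with the earliest time fires. This is a plain sequential illustration, not the pipelined FPGA design from the paper; the function name `first_reaction_method` and the representation of reactions as `(propensity_fn, change)` pairs are assumptions made for the example.

```python
import math
import random


def first_reaction_method(state, reactions, t_end, rng=None):
    """Simulate a reaction system until t_end or until no reaction can fire.

    state:     dict mapping species name -> molecule count
    reactions: list of (propensity_fn, change) pairs, where propensity_fn(state)
               returns a non-negative rate and change maps species -> integer delta
    Returns the time of the last fired reaction and the final state.
    """
    rng = rng or random.Random()
    state = dict(state)
    t = 0.0
    while True:
        best_tau, best_change = math.inf, None
        for propensity_fn, change in reactions:
            a = propensity_fn(state)
            if a <= 0.0:
                continue  # this reaction cannot fire in the current state
            tau = rng.expovariate(a)  # tentative firing time for this reaction
            if tau < best_tau:
                best_tau, best_change = tau, change
        if best_change is None or t + best_tau > t_end:
            return t, state
        t += best_tau
        for species, delta in best_change.items():
            state[species] += delta
```

For example, a single conversion reaction A -> B with rate constant 0.5 could be written as `[(lambda s: 0.5 * s["A"], {"A": -1, "B": 1})]`. Because every reaction's tentative time is computed independently at each step, the inner loop is the part that lends itself to pipelined arithmetic units.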

Book Chapter•10.1007/978-3-540-75755-9_135•

An implementation of parallel 1-D FFT using SSE3 instructions on dual-core processors

Daisuke Takahashi¹•Institutions (1)

University of Tsukuba¹

18 Jun 2006

TL;DR: Combination of vectorization and the block six-step FFT algorithm is shown to effectively improve performance and the performance results for one-dimensional FFTs on dual-core Intel Xeon processors are reported.

Abstract: In the present paper, an implementation of a parallel one-dimensional fast Fourier transform (FFT) using Streaming SIMD Extensions 3 (SSE3) instructions on dual-core processors is proposed. Combination of vectorization and the block six-step FFT algorithm is shown to effectively improve performance. The performance results for one-dimensional FFTs on dual-core Intel Xeon processors are reported. We successfully achieved performance of approximately 2006 MFLOPS on a dual-core Intel Xeon PC (2.8 GHz, two CPUs, four cores) and approximately 3492 MFLOPS on a dual-core Intel Xeon 5150 PC (2.66 GHz, two CPUs, four cores) for a 2^20-point FFT.

Proceedings Article•10.1109/ASSCC.2006.357840•

A 65nm 95W Dual-Core Multi-Threaded Xeon® Processor with L3 Cache

Simon M. Tam¹, Stefan Rusu¹, J. Chang¹, Sujal Vora¹, B. Cherkauer¹, David J. Ayers¹•Institutions (1)

Intel¹

1 Nov 2006

TL;DR: This paper describes a 95 W dual-core 64-bit Xeon® MP processor implemented in a 65 nm 8 metal layer process that supports the Intel® Extended Memory 64 Technology and the Hyper-Threading Technology.

Abstract: This paper describes a 95 W dual-core 64-bit Xeon® MP processor implemented in a 65 nm 8 metal layer process. Each processor core has a unified 1 MB L2 cache and supports the Intel® Extended Memory 64 Technology and the Hyper-Threading Technology. The shared L3 cache has extensive RAS features including the Intel® Cache Safe Technology and Error Correction Codes (ECC). The processor is designed and optimized to operate at a 95 W thermal design power envelope at the target product frequency. The front-side bus operates at 667 MT/s or 800 MT/s in a 3 load topology that is compatible with existing platforms.

Journal Article•10.1016/J.PARCO.2006.09.007•

Performance evaluation of high-speed interconnects using dense communication patterns

Rod Fatoohi¹, Ken Kardys¹, Sumy Koshy¹, Soundarya Sivaramakrishnan¹, Jeffrey S. Vetter²•Institutions (2)

San Jose State University¹, Oak Ridge National Laboratory²

1 Dec 2006

TL;DR: The results show many limitations of these networks including the memory contention within a node as the number of communicating processors increased and the limitations of the network interface for communication between multiple processors of different nodes.

Abstract: We study the performance of high-speed interconnects using a set of communication micro-benchmarks. The goal is to identify certain limiting factors and bottlenecks with these interconnects. Our micro-benchmarks are based on dense communication patterns with different communicating partners and varying degrees of these partners. We tested our micro-benchmarks on five platforms: an IBM system of 68-node 16-way Power3, interconnected by a SP switch2; another IBM system of 264-node 4-way Power PC 604e, interconnected by an SP switch; a Compaq cluster of 128-node 4-way ES40/EV67 processor, interconnected by a Quadrics interconnect; an Intel cluster of 16-node dual-CPU Xeon, interconnected by a Quadrics interconnect; and a cluster of 22-node Sun Ultra Sparc, interconnected by an Ethernet network. Our results show many limitations of these networks including the memory contention within a node as the number of communicating processors increased and the limitations of the network interface for communication between multiple processors of different nodes.

Proceedings Article•10.1109/RADAR.2006.1631841•

Swathbuckler: wide swath SAR system architecture

R.W. Linderman¹•Institutions (1)

Air Force Research Laboratory¹

24 Apr 2006

TL;DR: Between 2001 and 2005, the Swathbuckler wide-swath SAR real-time image formation multinational project evolved a system architecture to continually process 40 km strips into high resolution (<1 m) imagery that drove the supercomputer cost down from over $1 M to under $100 K.

Abstract: Between 2001 and 2005, the Swathbuckler wide-swath SAR real-time image formation multinational project evolved a system architecture to continually process 40 km strips into high resolution (<1 m) imagery. The rapid advance of COTS memory, I/O, and processor technology drove the supercomputer cost down from over $1 M to under $100 K. This paper discusses the key technology improvements driving the affordable solution. In particular, the 8 gigabytes of memory attached to standard dual Xeon server nodes arranged in a standard cluster greatly simplified the previously daunting task of SAR image formation.

Journal Article•10.2197/IPSJDC.2.598•

Improving Linpack Performance on SMP Clusters with Asynchronous MPI Programming

Ta Quoc Viet¹, Tsutomu Yoshinaga¹•Institutions (1)

University of Electro-Communications¹

15 Sep 2006-Ipsj Digital Courier

TL;DR: This study proposes asynchronous MPI, a simple and effective parallel programming model for SMP clusters, to reimplement the High Performance Linpack benchmark, and can achieve significant improvements in performance with a minimal programming effort.

Abstract: This study proposes asynchronous MPI, a simple and effective parallel programming model for SMP clusters, to reimplement the High Performance Linpack benchmark. The proposed model forces processors of an SMP node to work in different phases, thereby avoiding unnecessary communication and computation bottlenecks. As a result, we can achieve significant improvements in performance with a minimal programming effort. In comparison with a de facto flat MPI solution, our algorithm can yield a 20.6% performance improvement for a 16-node cluster of Xeon dual-processor SMPs.

Journal Article•

Parallel algorithms for balanced truncation model reduction of sparse systems

José M. Badía, Peter Benner, Rafael Mayo, Enrique S. Quintana-Ortí

01 Jan 2006-Lecture Notes in Computer Science

TL;DR: In this article, the authors describe the parallelization of an efficient algorithm for balanced truncation that allows the reduction of models with state-space dimension up to O(10^5) by solving two large-scale sparse Lyapunov equations via a coupled LR-ADI iteration with (super-)linear convergence.

Abstract: We describe the parallelization of an efficient algorithm for balanced truncation that allows the reduction of models with state-space dimension up to O(10^5). The major computational task in this approach is the solution of two large-scale sparse Lyapunov equations, performed via a coupled LR-ADI iteration with (super-)linear convergence. Experimental results on a cluster of Intel Xeon processors illustrate the efficacy of our parallel model reduction algorithm.

Proceedings Article•10.1145/1233501.1233626•

Design and integration methods for a multi-threaded dual core 65nm Xeon® processor

Raj Varada¹, Mysore Sriram¹, Kris Chou¹, James Guzzo¹•Institutions (1)

Intel¹

5 Nov 2006

TL;DR: The design and integration methods successfully used in a multi-threaded dual core 65nm Xeon® Processor are described.

Abstract: The success of building a complex multi-billion transistor processor is very dependent on robust and silicon-proven design and integration methods. The complexity of the 65nm process and striving for best-in-class performance with an aggressive time-to-market schedule put a heavy emphasis on innovative design and integration methods to enable working silicon. In this paper, we describe the design and integration methods successfully used in a multi-threaded dual core 65nm Xeon® Processor. Index Terms: Design Methods; Integration; processor; Xeon®.

Journal Article•10.1007/S11227-006-7956-3•

How to build a fast and reliable 1024 node cluster with only one disk

Erik Hendriks¹, Ronald G. Minnich²•Institutions (2)

Google¹, Los Alamos National Laboratory²

01 May 2006-The Journal of Supercomputing

TL;DR: This paper describes the software architecture that boots and manages these clusters, and shows the ways that this architecture has been used to greatly improve the operation of the nodes, with particular emphasis on improvements in boot-time performance, scalability, and reliability.

Abstract: In the last year LANL has constructed a 1408-node AMD Opteron cluster, a 1024-node Intel P4 Xeon cluster, a 256-node AMD Opteron cluster and two 128-node Intel P4 Xeon clusters. Each of these clusters is controlled by one front-end node, and each cluster needs only one disk in the front-end node for production operations. In this paper we describe the software architecture that boots and manages these clusters. This software architecture represents a clean break from the way that clusters have been set up for the last 14 years. We show the ways that this architecture has been used to greatly improve the operation of the nodes, with particular emphasis on improvements in boot-time performance, scalability, and reliability.

Journal Article•10.1093/IETISY/E89-D.2.639•

A Coarse-Grain Hierarchical Technique for 2-Dimensional FFT on Configurable Parallel Computers*

*This work was supported in part by the US Department of Energy under grant DE-FG02-03CH11171.

Xizhen Xu¹, Sotirios G. Ziavras¹•Institutions (1)

New Jersey Institute of Technology¹

01 Feb 2006-IEICE Transactions on Information and Systems

TL;DR: This paper proposes the FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Pyramidal Instruction Set Architecture (PISA), which comprises high-level instructions implemented as FPGA functions of coarse-grain SIMD tasks to facilitate ease of program development, code portability and high performance.

Abstract: FPGAs (Field-Programmable Gate Arrays) have been widely used as coprocessors to boost the performance of data-intensive applications [1], [2]. However, there are several challenges to further boost FPGA performance: the communication overhead between the host workstation and the FPGAs can be substantial; large-scale applications cannot fit in a single FPGA because of its limited capacity; mapping an application algorithm to FPGAs still remains a daunting job in configurable system design. To circumvent these problems, we propose in this paper the FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Pyramidal Instruction Set Architecture (PISA). PISA comprises high-level instructions implemented as FPGA functions of coarse-grain SIMD (Single-Instruction, Multiple-Data) tasks to facilitate ease of program development, code portability across different H-SIMD implementations and high performance. We assume a multi-FPGA board where each FPGA is configured as a separate SIMD machine. Multiple FPGA chips can work in unison at a higher SIMD level, if needed, controlled by the host. Additionally, by using a memory switching scheme and the high-level PISA to partition applications into coarse-grain tasks, host-FPGA communication overheads can be hidden. We enlist the two-dimensional Fast Fourier Transform (2D FFT) to test the effectiveness of H-SIMD. The test results show sustained high performance for this problem. The H-SIMD machine even outperforms a Xeon processor for this problem.

Proceedings Article•10.1109/GRC.2006.1635885•

Whole transcriptome mRNA secondary structure analysis using distributed computation

J. Yoo, D. Digby, A. Davis, W. Seffens

10 May 2006

TL;DR: A comparative study of Human, Mouse, and Arabidopsis genes to determine whether mRNA secondary structure increases in more complex organisms; single-value means of each transcriptome were calculated as a simple average of all transcripts.

Abstract: We performed a comparative study of Human, Mouse, and Arabidopsis genes to determine whether mRNA secondary structure increases in more complex organisms. Calculating the secondary structures of a large number of genes in a transcriptome is highly parallel and well suited to implementation on multiple computers. The analysis in this investigation was done on a cluster of eighteen (18) Intel x86-based Windows 2000 workstations, linked together in a Network of Workstations (NOW) configuration. A Dell PowerEdge 4600 Intel Xeon Windows 2000 dual-processor server controlled the NOW workstations and was used for file storage. Single-value means of each transcriptome were calculated as a simple average of all transcripts. This computational research effort is of biomedical interest because it provides computational tools to analyze and characterize proteins and mRNAs.

Dissertation•

Using FPGA Co-processors for Improving the execution Speed of Pattern Recognition Algorithms in ATLAS LVL2 Trigger

Andrei Khomich

1 Jan 2006

TL;DR: Results show that a critical part of the TRT LUT-Hough algorithm implemented in VHDL runs on the FPGA co-processor ~4 times faster than on the more or less modern CPU and the whole algorithm runs ~2 times faster.

Abstract: This thesis investigates one possible approach to accelerating tracking algorithms using hybrid FPGA/CPU systems. The TRT LUT-Hough algorithm, one of the tracking algorithms for the ATLAS Level-2 trigger, is selected for this purpose. It is a Look-Up Table (LUT) based Hough transform algorithm for the Transition Radiation Tracker (TRT). The algorithm was created with B-physics tasks in mind: a fast search for low-pT tracks in the entire TRT volume. Such a full subdetector scan requires a lot of computational power. A hybrid implementation of the algorithm (in which the most time-consuming part is accelerated by an FPGA co-processor and all other parts run on a general-purpose CPU) is integrated in the same software framework as a C++ implementation for comparison. Identical physical results are obtained for both the CPU and the hybrid implementations. Timing measurements show that a critical part of the algorithm, implemented in VHDL, runs on the FPGA co-processor ~4 times faster than on the more or less modern CPU (Intel Xeon 2.4 GHz) and the whole algorithm runs ~2 times faster.
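
To make the idea of a look-up-table Hough transform concrete, here is a minimal sketch on a toy detector. It is not the TRT geometry or the VHDL design from the thesis: the straight-line track model, the integer hit grid, and the names `build_lut` and `find_tracks` are all assumptions chosen for illustration. The point is only that, once the table is built, histogramming a hit requires table look-ups and counter increments rather than arithmetic on track parameters.

```python
from collections import Counter


def build_lut(width, height, slopes, intercepts):
    """Precompute, for every detector cell, the (slope, intercept) bins of all
    candidate lines that pass through it. Toy straight-line model (assumption)."""
    lut = {(x, y): [] for x in range(width) for y in range(height)}
    for slope_idx, m in enumerate(slopes):
        for intercept_idx, b in enumerate(intercepts):
            for x in range(width):
                y = round(m * x + b)
                if 0 <= y < height:
                    lut[(x, y)].append((slope_idx, intercept_idx))
    return lut


def find_tracks(hits, lut, threshold):
    """Histogram hits into parameter bins using table look-ups only, then
    return the bins whose vote count reaches the threshold."""
    votes = Counter()
    for hit in hits:
        votes.update(lut.get(hit, ()))
    return sorted(bin_ for bin_, count in votes.items() if count >= threshold)
```

For example, with `slopes = [0.0, 1.0, -1.0]` and intercepts `0..7` on an 8×8 grid, seven hits along `y = x + 1` plus one noise hit at `(3, 0)` yield the single candidate bin `(1, 1)` at a threshold of 6.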

Performance Comparison of Cray X1 and Cray Opteron Cluster with Other Leading Platforms using HPCC and IMB Benchmarks

Subhash Saini, Rolf Rabenseifner, Brian T. N. Gunney, Thomas E. Spelce, Alice Koniges, Don Dossa, Panagiotis Adamidis, Robert Ciotti, Sunil R. Tiyyagura, Matthias A. Müller, Rod Fatoohi, Moffett Field, One Washington Square

1 Jan 2006

Abstract: The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of six leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, NEC SX-8 and IBM Blue Gene/L. These six systems also use six different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, NEC IXS and IBM Blue Gene/L Torus). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on five of these systems.

Proceedings Article•10.1109/SOCC.2006.283884•

SOC Design Challenges in a Multi-threaded 65nm Dual Core Xeon® MP Processor

R. Varada¹, S. Tarn¹, J. Benoit¹, K. Chou¹•Institutions (1)

Intel¹

1 Sep 2006

TL;DR: A multi-threaded dual-core Xeon® MP processor with 16 MB of L3 cache and operating at a top frequency of 3.4 GHz has been developed using a non-traditional SOC design methodology on a 65 nm process technology.

Abstract: A multi-threaded dual-core Xeon® MP processor with 16 MB of L3 cache and operating at a top frequency of 3.4 GHz has been developed using a non-traditional SOC design methodology on a 65 nm process technology. The design methodology embodied highly controlled, customized, and high-impact changes to the underlying pre-existing processor cores, resulting in performance and functionality that approach those of a fully custom design while maintaining high re-use of the existing processor core. This paper presents the key design methodologies and the challenges.

Proceedings Article•10.1109/HOTCHIPS.2006.7477873•

The Tulsa processor: A dual core large shared-cache Intel® Xeon processor 7000 sequence for the MP server market segment

Jeffrey D. Gilbert¹, Stephen H. Hunt¹, Daniel Gunadi¹, Ganapati Srinivas¹•Institutions (1)

Intel¹

1 Aug 2006

TL;DR: A collection of slides from the author's conference presentation on the Tulsa Processor from Intel discuss the special features of the Tulsa processor, applications for its use, processing capabilities, and targeted markets for its deployment.

Abstract: This article consists of a collection of slides from the author's conference presentation on the Tulsa Processor from Intel. Some of the specific topics discussed include: the special features of the Tulsa processor; applications for its use; processing capabilities; targeted markets for its deployment; paths to multi-core designs; options for multiple core processing; the Tulsa engineering experience based on product implementation and use; system architecture; and tested performance output results.

Ensuring Fast Implementations of Symmetric Ciphers on the Intel Pentium 4 and Beyond

Ed Dawson, Matt Henricksen

1 Jan 2006

Journal Article•

An implementation of parallel 3-D FFT using short vector SIMD instructions on clusters of PCs

Daisuke Takahashi, Taisuke Boku, Mitsuhisa Sato

01 Jan 2006-Lecture Notes in Computer Science

TL;DR: It is shown that combining vectorization with a block three-dimensional FFT algorithm improves performance effectively, achieving over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.

Abstract: In this paper, we propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on clusters of PCs. We vectorized FFT kernels using Intel's Streaming SIMD Extensions 2 (SSE2) instructions. We show that a combination of the vectorization and block three-dimensional FFT algorithm improves performance effectively. Performance results of three-dimensional FFTs on a dual Xeon 2.8 GHz PC SMP cluster are reported. We successfully achieved performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.

Journal Article•10.2197/IPSJDC.2.759•

Efficient Lock Algorithm for Shared Objects in SMP Environments

Takeshi Ogasawara¹, Takeshi Ogasawara², Hideaki Komatsu², Toshio Nakatani²•Institutions (2)

University of Tokyo¹, IBM²

15 Dec 2006-IPSJ Digital Courier

TL;DR: A new algorithm is proposed that is effective for objects that are shared among threads but are not contended for in SMP environments and can remove the overhead of the serialization between lock and other non-lock operations and avoid the latency of complex atomic operations.

Abstract: We propose a new algorithm that is effective for objects that are shared among threads but are not contended for in SMP environments. In most cases, it removes the overhead of serialization between lock and other non-lock operations and avoids the latency of complex atomic operations. We established the safety of the algorithm by using a software tool called Spin. The experimental results from our benchmarking on an SMP machine using Intel Xeon processors revealed that our algorithm could improve efficiency by 80% on average compared to using complex atomic instructions.
