Top 38 papers published on the topic of Xeon in 2005

Proceedings Article•10.1145/1065944.1065947•

Compiler techniques for high performance sequentially consistent java programs

Zehra Sura¹, Xing Fang², Chi-Leung Wong³, Samuel P. Midkiff², Jaejin Lee⁴, David Padua³•Institutions (4)

IBM¹, Purdue University², University of Illinois at Urbana–Champaign³, Seoul National University⁴

15 Jun 2005

TL;DR: This paper describes co-operating escape, thread structure, and delay set analyses that enable high performance for sequentially consistent programs.

Abstract: The rise of Java, C#, and other explicitly parallel languages has increased the importance of compiling for different software memory models. This paper describes co-operating escape, thread structure, and delay set analyses that enable high performance for sequentially consistent programs. We compare the performance of a set of Java programs compiled for sequential consistency (SC) with the performance of the same programs compiled for weak consistency. For SC, we observe a slowdown of 10% on average for an architecture based on the Intel Xeon processor, and 26% on average for an architecture based on the IBM Power3.

106 citations

Journal Article•10.1109/TVLSI.2005.857179•

Customizable elliptic curve cryptosystems

Ray C. C. Cheung¹, N.J. Telle¹, Wayne Luk¹, Peter Y. K. Cheung¹•Institutions (1)

Imperial College London¹

01 Sep 2005-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.

Abstract: This paper presents a method for producing hardware designs for elliptic curve cryptography (ECC) systems over the finite field GF(2^m), using the optimal normal basis for the representation of numbers. Our field multiplier design is based on a parallel architecture containing multiple m-bit serial multipliers; by changing the number of such serial multipliers, designers can obtain implementations with different tradeoffs in speed, size and level of security. A design generator has been developed which can automatically produce a customised ECC hardware design that meets user-defined requirements. To facilitate performance characterization, we have developed a parametric model for estimating the number of cycles for our generic ECC architecture. The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.

95 citations

Journal Article•10.1021/CI050126L•

Parallelizing a Molecular Dynamics Algorithm on a Multiprocessor Workstation Using OpenMP

Konstantin B. Tarmyshov¹, Florian Müller-Plathe¹•Institutions (1)

International University Bremen¹

11 Oct 2005-Journal of Chemical Information and Modeling

TL;DR: The atomistic molecular dynamics program YASP has been parallelized for shared-memory computer architectures; most of the sequential FORTRAN code was kept; parallel constructs were inserted as compiler directives using the OpenMP standard.

Abstract: The atomistic molecular dynamics program YASP has been parallelized for shared-memory computer architectures. Parallelization was restricted to the most CPU-time-consuming parts: neighbor-list construction, calculation of nonbonded, angle and dihedral forces, and constraints. Most of the sequential FORTRAN code was kept; parallel constructs were inserted as compiler directives using the OpenMP standard. Only in the case of the neighbor list did the data structure have to be changed. The parallel code achieves a useful speedup over the sequential version for systems of several thousand atoms and above. On an IBM Regatta p690+, the throughput increases with the number of processors up to a maximum of 12–16 processors depending on the characteristics of the simulated systems. On dual-processor Xeon systems, the speedup is about 1.7.
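
Editorial illustration: the dual-processor speedup quoted above can be converted into an implied parallel fraction if one assumes Amdahl's law applies, which the abstract itself does not claim. The sketch is in Python rather than the paper's FORTRAN/OpenMP, and the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Amdahl's law: S = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def amdahl_parallel_fraction(speedup: float, processors: int) -> float:
    """Invert Amdahl's law to recover p from a measured speedup S on n processors."""
    if processors < 2:
        raise ValueError("need at least two processors")
    if not 1.0 <= speedup <= processors:
        raise ValueError("speedup must lie between 1 and the processor count")
    # From 1/S = 1 - p * (n - 1) / n, solve for p.
    return (1.0 - 1.0 / speedup) * processors / (processors - 1)


# Speedup of about 1.7 on a dual-processor system, as quoted in the abstract.
p = amdahl_parallel_fraction(1.7, 2)
print(f"implied parallel fraction: {p:.3f}")
```

Under that assumption, roughly 82% of the run time would sit in the parallelized parts. Memory-bandwidth contention and OpenMP overhead are not modelled, so this is only a rough reading of the number.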

78 citations

Proceedings Article•10.1145/1065944.1065976•

Adaptive execution techniques for SMT multiprocessor architectures

Changhee Jung¹, Daeseob Lim², Jaejin Lee³, SangYong Han³•Institutions (3)

Electronics and Telecommunications Research Institute¹, University of California, San Diego², Seoul National University³

15 Jun 2005

TL;DR: Adaptive execution techniques to find the optimal execution mode for SMT multiprocessor architectures are presented; the generated code is, on average, about 2% and 18% faster than the original code executed on 4 and 8 logical processors, respectively.

Abstract: In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize performance in an SMT multiprocessor, finding the optimal number of threads is important. This paper presents adaptive execution techniques to find the optimal execution mode for SMT multiprocessor architectures. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the application. Using 10 standard numerical applications and running them with our techniques on an Intel 4-processor Hyper-Threading Xeon SMP with 8 logical processors, our code is, on average, about 2% and 18% faster than the original code executed on 4 and 8 logical processors, respectively.
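
Editorial illustration: a minimal sketch of choosing a thread count from run-time feedback, in the spirit of the abstract but not the authors' compiler-generated code. It simply times one trial per candidate thread count and keeps the fastest. The names `run_chunked` and `pick_thread_count` are hypothetical, and Python threads are used only to show the selection logic; they do not reproduce SMT behaviour.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def run_chunked(body: Callable[[int], None], n_iters: int, n_threads: int) -> None:
    """Run body(i) for every i in range(n_iters), interleaved across n_threads workers."""
    def work(start: int) -> None:
        for i in range(start, n_iters, n_threads):
            body(i)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, range(n_threads)))  # list() forces completion


def pick_thread_count(
    run: Callable[[int], None],
    candidates: Sequence[int],
    clock: Callable[[], float] = time.perf_counter,
) -> int:
    """Time one trial of run(n) per candidate n and return the fastest n."""
    if not candidates:
        raise ValueError("no candidate thread counts given")
    best_n, best_t = candidates[0], float("inf")
    for n in candidates:
        start = clock()
        run(n)
        elapsed = clock() - start
        if elapsed < best_t:  # strict '<' keeps the earlier candidate on ties
            best_n, best_t = n, elapsed
    return best_n


if __name__ == "__main__":
    def body(i: int) -> None:
        sum(range(1000))  # stand-in for one loop iteration

    best = pick_thread_count(lambda n: run_chunked(body, 2000, n), [1, 2, 4, 8])
    print("selected thread count:", best)
```

A real implementation would repeat the measurement per parallel loop and amortize the trial runs over later invocations; this sketch measures once and makes no attempt to do either.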

73 citations

Proceedings Article•10.1145/1064212.1064254•

Evaluating the impact of simultaneous multithreading on network servers using real hardware

Yaoping Ruan¹, Vivek S. Pai¹, Erich M. Nahum², John M. Tracey²•Institutions (2)

Princeton University¹, IBM²

6 Jun 2005

TL;DR: It is found that while the general trends for microarchitectural behavior agree with real hardware, differences in sizing assumptions and performance models yield much more optimistic benefits for SMT than the authors observe.

Abstract: This paper examines the performance of simultaneous multithreading (SMT) for network servers using actual hardware, multiple network server applications, and several workloads. Using three versions of the Intel Xeon processor with Hyper-Threading, we perform macroscopic analysis as well as microarchitectural measurements to understand the origins of the performance bottlenecks for SMT processors in these environments. The results of our evaluation suggest that the current SMT support in the Xeon is application and workload sensitive, and may not yield significant benefits for network servers. In general, we find that enabling SMT on real hardware usually produces only slight performance gains, and can sometimes lead to performance loss. In the uniprocessor case, previous studies appear to have neglected the OS overhead in switching from a uniprocessor kernel to an SMT-enabled kernel. The performance loss associated with such support is comparable to the gains provided by SMT. In the 2-way multiprocessor case, the higher number of memory references from SMT often causes the memory system to become the bottleneck, offsetting any processor utilization gains. This effect is compounded by the growing gap between processor speeds and memory latency. In trying to understand the large gains shown by simulation studies, we find that while the general trends for microarchitectural behavior agree with real hardware, differences in sizing assumptions and performance models yield much more optimistic benefits for SMT than we observe.

33 citations

Proceedings Article•10.1145/1095921.1095982•

Performance evaluation of packet capturing systems for high-speed networks

Fabian Schneider¹, Jörg Wallerich¹•Institutions (1)

Technische Universität München¹

24 Oct 2005

TL;DR: A methodology for evaluating different systems with respect to their maximum capture rate is described and preliminary results of comparing Intel Xeon against AMD Opteron based systems running either Linux or FreeBSD are presented.

Abstract: Using commodity systems for capturing packets in Gigabit environments is a challenging task. This applies especially to a full capture of packet headers along with their data. Today's commodity PC systems come in different flavors in terms of processor hardware as well as in terms of operating systems. In this paper we describe a methodology for evaluating different systems with respect to their maximum capture rate and present our preliminary results of comparing Intel Xeon against AMD Opteron based systems running either Linux or FreeBSD.

29 citations

Journal Article•

Hardware acceleration of mutual information-based 3D image registration

Carlos R. Castro-Pareja, Raj Shekhar

01 Jan 2005-Journal of Imaging Science and Technology

TL;DR: A hardware architecture is presented that can be used to accelerate a number of linear and elastic image registration algorithms that use mutual information as an image similarity measure and achieved speedups of 30 for linear registration and 100 for elastic registration against a 3.2 GHz Pentium III Xeon workstation.

Abstract: Real-time image registration is potentially an enabling technology for the effective and efficient use of many image-guided diagnostic and treatment procedures relying on multimodality image fusion or serial image comparison. Mutual information is currently the best-known image similarity measure for intensity-based multimodality image registration. The calculation of mutual information is memory intensive and does not benefit from cache-based memory architectures in standard software implementations, i.e., the calculation incurs a large number of cache misses. Previous attempts to perform image registration in real time focused on parallel supercomputer implementations, which achieved significant speedups using large, expensive supercomputers that are impractical for clinical deployment. We present a hardware architecture that can be used to accelerate a number of linear and elastic image registration algorithms that use mutual information as an image similarity measure. A proof-of-concept implementation of the architecture achieved speedups of 30 for linear registration and 100 for elastic registration against a 3.2 GHz Pentium III Xeon workstation. Further speedup can be achieved by using several modules in parallel.

27 citations

10.1109/DOD_UGC.2005.16•

CAP Phase II Simulations for the Air Force HEL-JTO Project: Atmospheric Turbulence Simulations on NAVO's 3000-Processor IBM P4+ and ARL's 2000-Processor Intel Xeon EM64T Cluster

J. Werne, T. Lund, D. Fritts

27 Jun 2005

TL;DR: The relative performance of Kraken and the new 2000-processor Intel XEON EM64T cluster (Jvn) at ARL is discussed, pointing out differences which amount primarily to inter-node network performance.

Abstract: Shear turbulence induced by the Kelvin-Helmholtz (KH) instability in a stratified fluid is simulated in support of the Air Force High-Energy-Laser Joint Tactical Office (HEL-JTO) project using the new 3000-processor NAVO IBM Power 4+ system (Kraken). The results are used to 1) compare with and improve a dynamic LES method we have developed, and 2) provide the high-resolution simulation component of a new subgrid-scale (SGS) model we have developed for optical turbulence forecasting. We also discuss the relative performance of Kraken and the new 2000-processor Intel XEON EM64T cluster (Jvn) at ARL, pointing out differences which amount primarily to inter-node network performance. We suggest "% of wall time spent communicating" as a standard DoD benchmark criterion for very large distributed systems, as other more conventional criteria do not necessarily represent this important system and algorithm metric.

22 citations

Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments

Jason Cope¹, Michael Oberg¹, Henry M. Tufo, Matthew Woitaszek¹•Institutions (1)

University of Colorado Boulder¹

1 Jan 2005

TL;DR: This paper deploys PVFS2, GPFS, Lustre, and TerraFS for shared deployment across multiple Linux clusters running with different hardware architectures and operating systems and shows that all of the parallel filesystems outperform a legacy NFS system but with different levels of complexity.

Abstract: In this paper, we examine parallel filesystems for shared deployment across multiple Linux clusters running with different hardware architectures and operating systems. Specifically, we deploy PVFS2, GPFS, Lustre, and TerraFS in our test environment containing Intel Xeon, Intel x86-64, and IBM PPC970 systems. We comment on the feature sets of each filesystem, describe our implementation and configuration experiences, and present initial performance benchmark results. Our analysis shows that all of the parallel filesystems outperform a legacy NFS system but with different levels of complexity. Each of the filesystems demonstrates the best performance under certain conditions. Three of the systems – GPFS, Lustre and TerraFS – depend on specific kernel versions that increase administrative complexity and can reduce interoperability.

18 citations

Proceedings Article•10.1109/ISPAN.2005.73•

Reactive spin-locks: a self-tuning approach

Phuong Hoai Ha¹, Marina Papatriantafilou¹, Philippas Tsigas¹•Institutions (1)

Chalmers University of Technology¹

7 Dec 2005

TL;DR: A new reactive spin-lock algorithm that is completely self-tuning, which means no experimentally tuned parameter nor probability distribution of inputs are needed, and is built on a competitive online algorithm.

Abstract: Reactive spin-lock algorithms that can automatically adapt to contention variation on the lock have received great attention in the field of multiprocessor synchronization, since they can help applications achieve good performance in all possible contention conditions. However, in existing reactive spin-locks the reaction relies on (i) some fixed experimentally tuned thresholds, which may get frequently inappropriate in dynamic environments like multiprogramming/multiprocessor systems, or (ii) known probability distributions of inputs. This paper presents a new reactive spin-lock algorithm that is completely self-tuning, which means no experimentally tuned parameter nor probability distribution of inputs are needed. The new spin-lock is built on a competitive online algorithm. Our experiments, which use the Spark98 kernels and the SPLASH-2 applications as application benchmarks, on a multiprocessor machine SGI Origin2000 and on an Intel Xeon workstation show that the new self-tuning spin-lock helps applications with different characteristics achieve good performance in a wide range of contention levels.

17 citations

Book Chapter•10.1007/11549468_93•

Parallel order reduction via balanced truncation for optimal cooling of steel profiles

José M. Badía¹, Peter Benner², Rafael Mayo¹, Enrique S. Quintana-Ortí¹, Gregorio Quintana-Ortí¹, Jens Saak²•Institutions (2)

James I University¹, Chemnitz University of Technology²

30 Aug 2005

TL;DR: In this article, the authors employ two efficient parallel approaches to reduce a model arising from a semi-discretization of a controlled heat transfer process for optimal cooling of a steel profile.

Abstract: We employ two efficient parallel approaches to reduce a model arising from a semi-discretization of a controlled heat transfer process for optimal cooling of a steel profile. Both algorithms are based on balanced truncation but differ in the numerical method that is used to solve two dual generalized Lyapunov equations, which is the major computational task. Experimental results on a cluster of Intel Xeon processors compare the efficacy of the parallel model reduction algorithms.

Proceedings Article•10.1109/FPT.2005.1568590•

The design of scalable stochastic biochemical simulator on FPGA

Masato Yoshimi¹, Yasunori Osana¹, Y. Iwaoka¹, Akira Funahashi, Noriko Hiroi, Yuichiro Shibata², Naoki Iwanaga², H. Kitano, Hideharu Amano•Institutions (2)

Keio University¹, Nagasaki University²

1 Dec 2005

TL;DR: The implementation of a stochastic biochemical simulation algorithm called Next Reaction Method for Virtex-II PRO is shown, and the FPGA-based simulator outperforms the software implementation on Xeon 2.40 GHz by 17.1 times.

Abstract: Biochemical simulations including whole-cell models require high performance computing systems. Reconfigurable systems are expected to be an alternative solution for conventional methods by PC clusters or vector computers. This paper shows the implementation of a stochastic biochemical simulation algorithm called Next Reaction Method for Virtex-II PRO. As the result of benchmarking with a small reaction system, the FPGA-based simulator outperforms the software implementation on Xeon 2.40 GHz by 17.1 times.

Journal Article•

RAxML-OMP : An efficient program for phylogenetic inference on SMPs

Alexandros Stamatakis, Michael Ott, Thomas Ludwig

01 Jan 2005-Lecture Notes in Computer Science

TL;DR: RAxML-OMP is an efficient OpenMP-parallelization for Symmetric Multi-Processing machines (SMPs) based on the sequential program RAxML-V (Randomized Axelerated Maximum Likelihood).

Abstract: Inference of phylogenetic trees comprising hundreds or even thousands of organisms based on the Maximum Likelihood (ML) method is computationally extremely intensive. In order to accelerate computations we implemented RAxML-OMP, an efficient OpenMP-parallelization for Symmetric Multi-Processing machines (SMPs) based on the sequential program RAxML-V (Randomized Axelerated Maximum Likelihood). RAxML-V is a program for inference of evolutionary trees based upon the ML method and incorporates several advanced search algorithms like fast hill-climbing and simulated annealing. We assess performance of RAxML-OMP on the widely used Intel Xeon, Intel Itanium, and AMD Opteron architectures. RAxML-OMP scales particularly well on the AMD Opteron architecture and achieves even super-linear speedups for large datasets (with a length > 5,000 base pairs) due to improved cache-efficiency and data locality. RAxML-OMP is freely available as open source code.

Proceedings Article•10.1109/PCCC.2005.1460525•

Performance characterization of iSCSI processing in a server platform

H.M. Khosravi, Abhijeet Joglekar, Ravi Iyer

7 Apr 2005

TL;DR: This work focuses on analyzing the underlying architectural characteristics of iSCSI packet processing and quantifying its compute/memory requirements, and does a detailed analysis of the architectural characteristics in terms of path length, cycles spent per instruction, cache misses at all levels and branch mispredictions.

Abstract: The iSCSI protocol is a key building block for enabling IP-based network storage. High performance iSCSI implementations that can support multi-gigabit storage traffic throughput at low latencies are important in facilitating the widespread deployment of this technology. Motivated by this, our work presented in this paper focuses on analyzing the underlying architectural characteristics of iSCSI packet processing and quantifying its compute/memory requirements. Our analysis and characterization methodology is based on in-depth measurement experiments of iSCSI packet processing performance on Intel® Xeon™ processor, running the Red Hat Linux operating system. Our measurement data shows the achievable throughput and consumed CPU utilization at different disk I/O sizes. We also study the overhead of integrity checks on iSCSI performance by enabling CRC computation. To understand the source of the iSCSI processing costs, we then do a detailed analysis of the architectural characteristics in terms of path length, cycles spent per instruction, cache misses at all levels and branch mispredictions.

Large Scale Simulations in Nanostructures with NEMO3-D on Linux Clusters

Marek Korkusinski, Faisal Saied, Haiying Xu, Seungwon Lee, Mohamed Sayeed, Sebastien Goasguen, Gerhard Klimeck

1 Jan 2005

TL;DR: The main goal of this paper is to present new capabilities that have been added to a simulation code, NEMO3-D, to make it one of the premier simulation tools for design and analysis of realistically-sized nanoelectronic devices, and therefore to make it a valid tool for the computational nanotechnology community.

Abstract: The rapid progress in nanofabrication technologies has led to the emergence of new classes of nanodevices, in which the quantum nature of charge carriers dominates the device properties and performance. The device sizes have already reached the level of hundreds down to even tens of nanometers, where the atomistic granularity of constituent materials cannot be neglected. This has led to new challenges in Computational Nanotechnology. The main goal of this paper is to present new capabilities that have been added to a simulation code, NEMO3-D, to make it one of the premier simulation tools for design and analysis of realistically-sized nanoelectronic devices, and therefore to make it a valid tool for the computational nanotechnology community. Memory requirements for strain and electronic structure calculations are described. Computational performance experiments are conducted on several cluster (Intel Xeon, Apple G4) and shared-memory (IBM Regatta) architectures. The simulation of electronic structure in a 21-million system is demonstrated which corresponds to a complex Hermitian matrix of order 4 x 10.

Proceedings Article•10.1109/IISWC.2005.1526002•

A multi-level comparative performance characterization of SPECjbb2005 versus SPECjbb2000

Ricardo Morin¹, A. Kumar¹, E. Ilyina¹•Institutions (1)

Intel¹

7 Nov 2005

TL;DR: A comparative characterization of these two workloads based on detailed measurements on an Intel® Xeon™ processor-based commercial server is presented and recommendations to JVM developers and hardware designers are provided.

Abstract: SPEC has released SPECjbb2005, a new server-side Java benchmark which supersedes SPECjbb2000. SPECjbb2005 is a substantial update to SPECjbb2000, intended to make the workload more representative based on current Java development practices. SPECjbb2000 has been in existence for about five years and it has been a valuable tool for optimizing the performance of commercial JVMs as well as supporting research activities. Since SPECjbb2005 replaces SPECjbb2000, it is important to understand the key differences between the two, as well as implications for JVM and hardware designers. In this paper, we present a comparative characterization of these two workloads based on detailed measurements on an Intel® Xeon™ processor-based commercial server. First, we describe key functional changes introduced in SPECjbb2005. Using low-intrusion application profiling tools we compare application execution profiles. Through JVM monitoring tools, we compare JVM behavior including JIT optimization and garbage collection. Using operating system monitoring tools we compare key system level metrics including CPU utilization. With the aid of processor performance monitoring events, we compare key architectural characteristics such as cache miss rates, memory/bus utilization, and branch behavior. Finally, we summarize key findings, provide recommendations to JVM developers and hardware designers, and suggest areas for future work.

Proceedings Article•10.1109/RECONFIG.2005.10•

An FPGA-based coprocessor for the SPHINX speech recognition system: early experiences

Guillermo Marcus, Juan Arturo Nolazco-Flores¹•Institutions (1)

Monterrey Institute of Technology and Higher Education¹

28 Sep 2005

TL;DR: An FPGA-based hardware coprocessor for the SPHINX Speech Recognition System implements a critical part of the Baum-Welch Algorithm to assist in the Gaussian probability calculations, currently with a peak performance of 264 MFlops.

Abstract: An FPGA-based hardware coprocessor for the SPHINX Speech Recognition System is presented. The coprocessor operates at 66 MHz and implements a critical part of the Baum-Welch Algorithm to assist in the Gaussian probability calculations, currently with a peak performance of 264 MFlops. Results are presented in comparison with a Xeon 2.66 GHz computer and a similar ASIC project, together with guidelines for future development.

Proceedings Article•10.1109/ICPPW.2005.71•

Performance evaluation of high-speed interconnects using dense communication patterns

Rod Fatoohi¹, Ken Kardys¹, Sumy Koshy¹, Soundarya Sivaramakrishnan¹, Jeffrey S. Vetter²•Institutions (2)

San Jose State University¹, Oak Ridge National Laboratory²

14 Jun 2005

TL;DR: The results show many limitations of these networks including the memory contention within a node as the number of communicating processors increased and the limitations of the network interface for communication between multiple processors of different nodes.

Abstract: We study the performance of high-speed interconnects using a set of communication microbenchmarks. The goal is to identify certain limiting factors and bottlenecks with these interconnects. Our microbenchmarks are based on dense communication patterns with different communicating partners and varying degrees of these partners. We tested our microbenchmarks on five platforms: an IBM system of 68-node 16-way Power3, interconnected by a SP switch2; another IBM system of 264-node 4-way Power PC 604e, interconnected by a SP switch; a Compaq cluster of 128-node 4-way ES40/EV67 processor, interconnected by a Quadrics interconnect; an Intel cluster of 16-node dual-CPU Xeon, interconnected by a Quadrics interconnect; and a cluster of 22-node Sun Ultra Sparc, interconnected by an Ethernet network. Our results show many limitations of these networks including the memory contention within a node as the number of communicating processors increased and the limitations of the network interface for communication between multiple processors of different nodes.

Posted Content•

The PACS-CS Project

Sinya Aoki, K. I. Ishikawa, T. Ishikawa, N. Ishizuka, Kazuyuki Kanaya, Yoshinobu Kuramashi, Masanori Okawa, Kiyoshi Sasaki, Y. Taniguchi, N. Tsutsui, Akira Ukawa, T. Yoshié

02 Oct 2005-arXiv: High Energy Physics - Lattice

TL;DR: A large-scale cluster system with a peak speed of 14.3Tflops for lattice QCD at the Center for Computational Sciences, University of Tsukuba, as a successor to the current 0.6Tflops CP-PACS computer is described.

Abstract: We describe our plan to develop a large-scale cluster system with a peak speed of 14.3Tflops for lattice QCD at the Center for Computational Sciences, University of Tsukuba, as a successor to the current 0.6Tflops CP-PACS computer. The system consists of 2560 nodes connected by a 16x16x10 three-dimensional hyper crossbar network. Each node has a single low-voltage 2.8GHz Xeon processor and 2GBytes of memory with 6.4GBytes/sec bandwidth, and 160 GBytes of disk in RAID1 mode. The network link in each of the three directions is made of dual Gigabit Ethernet with the peak throughput of 250MByte/sec. Hence each node has an aggregate network bandwidth of 750MByte/sec. The system will run under Linux and SCore, and an extension of the PM driver is developed for the network. The system will be developed jointly with Hitachi Limited. The installation is scheduled in the first quarter of Japanese Fiscal 2006 (April-June 2006) and the start of operation is expected in July 2006.
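
Editorial illustration: the figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The node count, clock rate, and link throughput come from the abstract; the value of 2 floating-point operations per cycle per processor is an assumption made here and is not stated in the source. Variable names are hypothetical.

```python
# Figures taken from the abstract.
nodes = 16 * 16 * 10            # 16x16x10 hyper crossbar network
clock_ghz = 2.8                 # one 2.8 GHz Xeon per node
links_per_node = 3              # one link per network direction
gigabit_ports_per_link = 2      # "dual Gigabit Ethernet"

# ASSUMPTION (not in the abstract): 2 floating-point operations per cycle.
flops_per_cycle = 2

peak_tflops = nodes * clock_ghz * flops_per_cycle / 1000.0

# 1 Gbit/s = 1000 Mbit/s = 125 MByte/s per port.
link_mbyte_per_s = gigabit_ports_per_link * 1000 / 8
node_mbyte_per_s = links_per_node * link_mbyte_per_s

print(f"nodes: {nodes}")
print(f"peak: {peak_tflops:.1f} Tflops")
print(f"per-link: {link_mbyte_per_s:.0f} MByte/s, per-node: {node_mbyte_per_s:.0f} MByte/s")
```

With that assumed per-cycle rate, the result matches the quoted 14.3Tflops peak, and the link arithmetic reproduces the 250MByte/sec and 750MByte/sec figures.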

Proceedings Article•10.1109/IPDPS.2005.191•

Efficient direct user level sockets for an Intel® Xeon™ processor based TCP on-load engine

Vikram A. Saletore¹, P.M. Stillwell¹, J.A. Wiegert¹, P. Cayton¹, J. Gray¹, Greg J. Regnier¹•Institutions (1)

Intel¹

4 Apr 2005

TL;DR: Development continues on the embedded transport acceleration software prototype, which uses one of the Intel® Xeon™ processors in a multi-processor server as a packet processing engine that is closely tied to the server's core CPU and memory complex.

Abstract: Intel Labs has continued development of the embedded transport acceleration (ETA) software prototype that uses one of the Intel® Xeon™ processors in a multi-processor server as a packet processing engine (PPE) that is closely tied to the server's core CPU and memory complex. We have further developed the prototype to provide support for user-level, asynchronous interface for sockets. The direct user socket interface (DUSI) allows user-level applications to interface directly to the PPE using familiar socket commands and semantics. The prototype runs in an asymmetric multiprocessing mode, in that the PPE does not run as a general computing resource for the host operating system. We describe the prototype software architecture, the DUSI application interface, and detail our measurement and analysis of some micro-benchmarks. In particular, we measure throughput for transactions and end-to-end latency as the key metrics for the analysis.

Proceedings Article•10.1117/12.585019•

Real-time implementation of a multiresolution motion-compensating temporal filter on general-purpose hardware

Alexandra Groth¹, Kai Eck¹•Institutions (1)

Philips¹

25 Feb 2005-electronic imaging

TL;DR: A data-driven algorithmic structure on a standard PC was developed for a block-based motion compensated temporal filtering in real time and the cache capacity miss rate is reduced to less than 0.8%.

Abstract: A data-driven algorithmic structure on a standard PC was developed for a block-based motion compensated temporal filtering in real time. The major time limiting factor of the algorithm was identified as the irregular memory access mainly caused by the layered multi-resolution representation of the input frames. As a result, data is transferred from main memory to cache multiple times leading to memory-dominated critical paths in execution. In order to improve the cache utilization, the computations have been rearranged to process the complete signal on the cached subset of data. The input frames are now divided into super-lines, which are subsets of data containing the relevant information to calculate one line of motion vectors and to filter the corresponding image lines. Only when a set of data is no longer used, neither for motion vector analysis nor for filtering the images themselves, is it replaced by data of different layers or lines. Due to these data-driven techniques the cache capacity miss rate is reduced to less than 0.8%. As a result, images are processed at a rate of more than 44 fps on a standard PC (Intel dual-processor Xeon, 1.8 GHz), as opposed to 1 fps in the standard implementation.

Proceedings Article•10.1109/NSSMIC.2005.1596788•

Impact of a high-performance communication network on cluster-based parallel iterative reconstruction

Judson Jones¹, W.F. Jones¹, J. Everman¹, Vladimir Y. Panin¹, Christian Michel¹, Frank Kehren¹, Jun Bao¹, J. Young¹, Michael E. Casey¹•Institutions (1)

Siemens¹

1 Jan 2005

TL;DR: In every case PCI-X Infiniband provided significantly superior reconstruction performance at all cluster sizes, and in all cases but one extended the range of cluster sizes over which the impact of improving communication performance was observed.

Abstract: Assuming balanced computation, the scalability of an iterative reconstruction algorithm on a cluster computer is determined by the tradeoff between shorter computation times and longer communication times on larger clusters. To investigate the impact of improving communication performance, we assembled a cluster equipped with dual communication networks: Gigabit Ethernet and PCI-X Infiniband. The cluster consisted of 8 compute nodes and one fileserver node; each node had dual 3.6 GHz Intel Xeon processors and the two dual-ported communication networks. For image synchronization alone, PCI-X Infiniband was 3.8 times faster than Gigabit Ethernet. We benchmarked two parallel OSEM-3D reconstruction algorithms representing a range of image and sinogram sizes, for both a brain-only scanner (HRRT) and a whole-body scanner (HIREZ). In every case PCI-X Infiniband provided significantly superior reconstruction performance at all cluster sizes, and in all cases but one extended the range of cluster sizes over which we observed performance improvements.

Proceedings Article•10.1109/ICPPW.2005.66•

Parallel module network learning on distributed memory multiprocessors

Long Liu¹, Wei Hu², Chunrong Lai², Hongshan Jiang¹, Wenguang Chen¹, Weimin Zheng¹, Yimin Zhang²•Institutions (2)

Tsinghua University¹, Intel²

14 Jun 2005

TL;DR: A parallel implementation of the module network learning algorithm, based on the message-passing model, is proposed; it groups computations by modules and then distributes them cyclically.

Abstract: As an extension of the Bayesian network, the module network is used in situations where there are many variables but only a small set of data available. However, learning this network is still time-consuming. In this paper, we propose a less time-consuming parallel implementation of the module network learning algorithm, based on the message-passing model. To solve the load-imbalance problem introduced by either result caching or intrinsic computation, we propose a grouping strategy that groups computations by modules and then distributes them cyclically. The algorithm was tested on eight 4-way Intel Xeon multiprocessors, where a speedup of 29.26 on 32 processors was observed. The result shows that our algorithm is effective.
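
The grouping strategy described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(module_id, task)` representation, and the round-robin rule `i % num_procs` are assumptions chosen only to show one way "group by module, then distribute cyclically" might look. The last helper computes the parallel efficiency implied by the reported speedup of 29.26 on 32 processors.

```python
from collections import defaultdict


def group_by_module(tasks):
    """Group (module_id, task) pairs so each module's work stays together."""
    groups = defaultdict(list)
    for module_id, task in tasks:
        groups[module_id].append(task)
    # Sort by module id so every process would derive the same ordering.
    return [groups[m] for m in sorted(groups)]


def distribute_cyclically(groups, num_procs):
    """Assign group i to process i % num_procs (round-robin)."""
    assignment = [[] for _ in range(num_procs)]
    for i, group in enumerate(groups):
        assignment[i % num_procs].append(group)
    return assignment


def parallel_efficiency(speedup, num_procs):
    """Efficiency is the speedup divided by the number of processors."""
    return speedup / num_procs


# Hypothetical toy workload: six tasks spread over four modules.
tasks = [(0, "a"), (1, "b"), (0, "c"), (2, "d"), (3, "e"), (1, "f")]
groups = group_by_module(tasks)
assignment = distribute_cyclically(groups, num_procs=2)
efficiency = parallel_efficiency(29.26, 32)
```

With the reported figures, the efficiency works out to roughly 91%.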

Hot Chips and Hot Interconnects for High End Computing Systems

Subhash Saini¹•Institutions (1)

Ames Research Center¹

1 Jan 2005

TL;DR: The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks, and a discussion of general trends in the field of high performance computing.

Abstract: I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in the IBM SP 3 and IBM SP 4 systems; 3. The Intel Itanium and Xeon, used in the SGI Altix systems and clusters respectively; 4. IBM System-on-a-Chip used in IBM BlueGene/L; 5. HP Alpha EV68 processor used in DOE ASCI Q cluster; 6. SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in NEC SX-6/7; 8. Power 4+ processor, which is used in Hitachi SR11000; 9. NEC proprietary processor, which is used in Earth Simulator. The IBM POWER5 and Red Storm Computing Systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on the latest NAS Parallel Benchmark results (MPI, OpenMP/HPF, and hybrid MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing (quantum computing, DNA computing, cellular engineering, and neural networks).

Proceedings Article•10.1109/HOTCHIPS.2005.7476602•

Intel 8×× series and Paxville Xeon-MP microprocessors

Jonathan Douglas¹•Institutions (1)

Intel¹

1 Aug 2005

Proceedings Article•

Parallel History Matching and Associated Forecast at the Center for Interactive Smart Oilfield Technologies.

Ken-ichi Nomura¹, Rajiv K. Kalia¹, Aiichiro Nakano¹, Priya Vashishta¹, Jorge L. Landa²•Institutions (2)

University of Southern California¹, Chevron Corporation²

1 Jan 2005

TL;DR: A Monte Carlo method, combined with proxies to avoid excessive data processing, is employed to identify reservoir simulation models that best match the oilfield production history and these models are used to forecast future productions with uncertainty estimates of unprecedented precision.

Abstract: We have developed a parallel and distributed computing framework to solve an inverse problem, which involves massive data sets and is of great importance to the petroleum industry. A Monte Carlo method, combined with proxies to avoid excessive data processing, is employed to identify reservoir simulation models that best match the oilfield production history. Subsequently, the selected models are used to forecast future production with uncertainty estimates. The parallelization framework combines: (1) message passing for tightly coupled intra-simulation decomposition; and (2) scheduler/Grid remote procedure calls for model parameter sweeps. A preliminary numerical test has included 3,159 simulations on a 256-processor Intel Xeon cluster at the USC-CACS. The results provide uncertainty estimates of unprecedented precision.

Performance of Two-Way Opteron and Xeon Processor-Based Servers for Scientific and Technical Applications

Douglas M. Pase, James Stephens

1 Jan 2005

TL;DR: This paper compares the performance of two servers typical of those used to build Linux clusters, both two-way servers based on 64-bit versions of x86 processors, and describes the architecture and performance of each.

Abstract: The performance of Linux clusters used for High-Performance Computation (HPC) applications is affected by the performance of three important components of server architecture: the Arithmetic Logic Unit (ALU) or processor core, the memory, and the high-speed network used to interconnect the cluster servers or nodes. The behavior of these three subsystems is in turn affected by the choice of processor used in the server. In this paper we compare the performance of two servers that are typical of those used to build Linux clusters. Both are two-way servers based on 64-bit versions of x86 processors. The servers are each packaged in a 1U (1.75 inch high) rack-mounted chassis. The first server we describe is the IBM® eServer™ 326, based on the AMD Opteron™ processor. The second is the IBM eServer xSeries™ 336, based on the Intel® Xeon processor with Extended Memory 64 Technology (EM64T) enabled. Both are powerful servers designed and optimized to be used as the building blocks of a Linux cluster that may be as small as a few nodes or as large as several thousand nodes. We describe the architecture and performance of each server. We use results from the popular SPEC® CPU2000 and Linpack benchmarks to present different aspects of the performance of the processor core. We use results from the STREAM benchmark to present memory performance. Finally, we discuss how characteristics of the I/O slots affect the interconnect performance, whether the choice is Gigabit Ethernet, Myrinet, InfiniBand, or some other interconnect.

Proceedings Article•10.1061/40754(183)99•

Multi-objective optimization for the construction of large-scale infrastructure systems

Amr Kandil¹, Khaled El-Rayes¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

1 Aug 2005

TL;DR: An advanced multi-objective genetic algorithm is implemented using the message passing interface (MPI) standard to distribute genetic algorithm computations over a cluster of 50 Intel Xeon processors to enable the optimization of large-scale infrastructure projects.

Abstract: This paper presents the development of an advanced Information Technology Framework for Optimizing Construction Utilization of Resources in Civil Infrastructure Systems, named IT-FOCUS. The main objectives of this framework are to: (1) develop robust optimization models for minimizing construction cost and duration while maximizing quality; and (2) formulate scalable methodologies for solving large-scale construction optimization problems. To this end, the present framework is implemented using an advanced multi-objective genetic algorithm that is capable of generating optimal trade-offs between construction cost, duration, and quality. To enable the optimization of large-scale infrastructure projects, the algorithm is parallelized using the manager-worker paradigm of parallel and distributed computing. The parallel implementation of the algorithm utilizes the message passing interface (MPI) standard to distribute genetic algorithm computations over a cluster of 50 Intel Xeon processors. A number of large-scale construction projects, with sizes ranging from 180 to 720 activities, were run on the 50-processor cluster to assess the computational requirements for optimizing real-life construction projects. The results of this evaluation highlight the significant computational savings that can be achieved by the implemented parallel computing framework.
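
The manager-worker paradigm mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it substitutes a standard-library thread pool for MPI, and the `evaluate` objectives (cost, duration, negated quality) are hypothetical toy formulas, not the IT-FOCUS models. The manager hands candidate solutions to workers for evaluation, collects the scores, and keeps the non-dominated (Pareto-optimal) candidates.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(schedule):
    """Hypothetical objectives to minimize: (cost, duration, -quality)."""
    cost = sum(schedule)
    duration = max(schedule)
    quality = min(schedule)
    return (cost, duration, -quality)


def dominates(a, b):
    """True if score a is no worse than b everywhere and better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def manager(population, num_workers=4):
    """Farm evaluations out to workers, then return the Pareto front."""
    # Stand-in for MPI: a thread pool plays the role of the worker processes.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        scores = list(pool.map(evaluate, population))
    return [
        individual
        for individual, score in zip(population, scores)
        if not any(dominates(other, score) for other in scores)
    ]


# Hypothetical population of three candidate schedules.
population = [(3, 5, 2), (4, 4, 4), (6, 6, 1)]
front = manager(population)
```

In an MPI setting the `pool.map` call would be replaced by explicit sends of candidates to worker ranks and receives of their scores; the selection logic on the manager would stay the same.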

Journal Article•10.1177/1094342005056133•

Performance Analysis of a Linux PC Cluster Using a Direct Numerical Simulation of Fluid Turbulence Code

Chun-Ho Liu¹, Chat-Ming Woo¹, Dennis Y.C. Leung¹•Institutions (1)

University of Hong Kong¹

1 Nov 2005

TL;DR: The parallel performance of an Intel Xeon-based Linux PC cluster was tested using a finite element code for direct numerical simulation (DNS) of incompressible fluid turbulence; the results indicate that Linux PC clusters are affordable platforms, compared with more expensive supercomputers, for large-scale scientific computing in fluid turbulence research.

Abstract: In this paper, we have tested the parallel performance of an Intel Xeon-based Linux PC cluster using a finite element code for direct numerical simulation (DNS) of incompressible fluid turbulence. The parallel performance of the PC cluster, which used up to 64 processors running at 2.8 GHz, was evaluated by comparing three scales of DNS trial runs consisting of 3.3, 5.8, and 10.1 million elements. Subroutines of different natures were contrasted to investigate the scalability of the DNS code. For DNS calculations of sufficiently large scale, the subroutines showed reasonable parallel efficiency. Doubling the number of processors reduced the CPU time by about 40%. Of particular interest was the CPU time required by the two subroutines handling interprocessor communication, which remained fairly constant within the range of processors tested. PC Linux clusters are thus affordable platforms, compared with more expensive supercomputers, to conduct large-scale scientific computing for fluid turbulence research.
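
The scaling figure quoted in this abstract (about a 40% CPU-time reduction per doubling of processors) can be converted into a speedup and a relative efficiency with simple arithmetic. The Python sketch below shows the conversion; the function names are illustrative, and the only input taken from the abstract is the 40% figure.

```python
def speedup_per_doubling(time_reduction):
    """Speedup from doubling processors, given the fractional time saved."""
    if not 0.0 <= time_reduction < 1.0:
        raise ValueError("time_reduction must be in [0, 1)")
    # New time is (1 - reduction) of the old time, so speedup is its inverse.
    return 1.0 / (1.0 - time_reduction)


def efficiency_per_doubling(time_reduction):
    """Relative efficiency: achieved speedup over the ideal factor of 2."""
    return speedup_per_doubling(time_reduction) / 2.0


speedup = speedup_per_doubling(0.40)
efficiency = efficiency_per_doubling(0.40)
```

A 40% reduction therefore corresponds to a speedup of about 1.67 per doubling, or roughly 83% of the ideal factor of two.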

Journal Article•10.1016/J.NUCLPHYSBPS.2004.11.362•

Domain Wall Fermion Inverter on Pentium 4

Andrew Pochinsky¹•Institutions (1)

Massachusetts Institute of Technology¹

1 Mar 2005

TL;DR: A highly optimized domain wall fermion inverter developed as part of the SciDAC lattice initiative achieves high cache reuse and performance in excess of 2 GFlops for problem sizes that do not fit in L2 cache, on a GigE cluster with 2.66 GHz Xeon processors.

Abstract: A highly optimized domain wall fermion inverter has been developed as part of the SciDAC lattice initiative. By designing the code to minimize memory bus traffic, it achieves high cache reuse and performance in excess of 2 GFlops for problem sizes that do not fit in L2 cache, on a GigE cluster with 2.66 GHz Xeon processors. The code uses the SciDAC QMP communication library.
