Top 37 papers published in the topic of Xeon in 2004

Journal Article•10.1109/MM.2004.1268989•

ETA: experience with an Intel Xeon processor as a packet processing engine

Greg J. Regnier¹, Dave B. Minturn¹, Gary L. McAlpine¹, Vikram A. Saletore¹, Annie Foong¹•Institutions (1)

Intel¹

01 Jan 2004-IEEE Micro

TL;DR: In this article, the authors use the term packet processing engine (PPE) as a generic term for the computing and memory resources necessary for communication-centric processing, and show that software partitioning can significantly increase the overall communication performance of a standard multiprocessor server.

Abstract: Server-based networks have well-documented performance limitations. These limitations outline a major goal of Intel's embedded transport acceleration (ETA) project, the ability to deliver high-performance server communication and I/O over standard Ethernet and transmission control protocol/Internet protocol (TCP/IP) networks. By developing this capability, Intel hopes to take advantage of the large knowledge base and ubiquity of these standard technologies. With the advent of 10 gigabit Ethernet, these standards promise to provide the bandwidth required of the most demanding server applications. We use the term packet processing engine (PPE) as a generic term for the computing and memory resources necessary for communication-centric processing. Such PPEs have certain desirable attributes; the ETA project focuses on developing PPEs with such attributes, which include scalability, extensibility, and programmability. General-purpose processors, such as the Intel Xeon in our prototype, are extensible and programmable by definition. Our results show that software partitioning can significantly increase the overall communication performance of a standard multiprocessor server. Specifically, partitioning the packet processing onto a dedicated set of compute resources allows for optimizations that are otherwise impossible when time sharing the same compute resources with the operating system and applications.

97 citations

Proceedings Article•10.1109/IPDPS.2004.1302990•

Towards efficient multi-level threading of H.264 encoder on Intel hyper-threading architectures

Yen-Kuang Chen¹, Xinmin Tian¹, Steven Ge¹, Milind B. Girkar¹•Institutions (1)

Intel¹

26 Apr 2004

TL;DR: Two efficient methods for multilevel data partitioning are described that can improve the performance of the multithreaded H.264 encoder using the OpenMP programming model, which allows us to leverage the advanced compiler technologies in the Intel® C++ compiler for Intel hyper-threading architectures.

Abstract: Summary form only given. Exploiting thread-level parallelism is a promising way to improve the performance of multimedia applications that are running on multithreading general-purpose processors. We describe the work in developing our threaded H.264 encoder. We parallelize the H.264 encoder using the OpenMP programming model, which allows us to leverage the advanced compiler technologies in the Intel® C++ compiler for Intel hyper-threading architectures. After we present our design considerations in the parallelization process, we describe two efficient methods for multilevel data partitioning, which can improve the performance of our multithreaded H.264 encoder. Furthermore, we exploit different options in the OpenMP programming. While one implementation that uses the task queuing model is slightly slower than the other implementation, it is easier to be read than the other one. The results have shown good speedups ranging from 3.74x to 4.53x over the well-optimized sequential code performance on a system of 4 Intel Xeon™ processors with hyper-threading technology.

76 citations

Journal Article•

Monte Carlo radiative heat transfer simulation on a reconfigurable computer

Maya Gokhale¹, Janette R. Frigo¹, Christine Ahrens¹, Justin L. Tripp¹, Ron Minnich¹•Institutions (1)

Los Alamos National Laboratory¹

01 Jan 2004-Lecture Notes in Computer Science

TL;DR: This work ports a supercomputer application benchmark onto Xilinx Virtex II and Virtex II Pro FPGAs, compares performance with three Pentium IV Xeon microprocessors, and shows that the application-specific pipeline, with 12 multiply, 10 add/subtract, one divide, and two compare modules of single precision floating point data type, achieves a speedup.

Abstract: Recently, the appearance of very large (3 - 10M gate) FPGAs with embedded arithmetic units has opened the door to the possibility of floating point computation on these devices. While previous researchers have described peak performance or kernel matrix operations, there is as yet relatively little experience with mapping an application-specific floating point loop onto FPGAs. In this work, we port a supercomputer application benchmark onto Xilinx Virtex II and Virtex II Pro FPGAs and compare performance with three Pentium IV Xeon microprocessors. Our results show that this application-specific pipeline, with 12 multiply, 10 add/subtract, one divide, and two compare modules of single precision floating point data type, shows speed up of 10.37×. We analyze the trade-offs between hardware and software to characterize the algorithms that will perform well on current and future FPGA architectures.

35 citations

Performance Analysis and Validation of the Intel Pentium 4 Processor on 90nm Technology

Ronak Singhal, Evan R. Cohn, John G. Holm

1 Jan 2004

TL;DR: The process for detecting post-silicon performance issues, developing associated optimizations, and communicating these to application engineers, compiler teams, and microprocessor architects for appropriate action is described.

Abstract: In addition to the considerable effort spent on functional validation of Intel processors, a separate parallel activity is conducted to verify that processor performance meets or exceeds specifications. In this paper, we discuss both the pre-silicon and post-silicon performance validation processes carried out on the 90nm version of the Intel Pentium 4 processor. For pre-silicon performance work, we describe how a detailed performance simulator is used to ensure that the processor specification meets the product’s performance targets and also that the implementation matches the defined specification. Additionally, we describe how we project the performance of the Pentium 4 processor on key applications and benchmarks. Once silicon arrives, the second phase of performance verification work starts. We describe the process for detecting post-silicon performance issues, developing associated optimizations, and communicating these to application engineers, compiler teams, and microprocessor architects for appropriate action. We also discuss the tools used to gather performance metrics and characterize application performance.

31 citations

Book Chapter•10.1007/978-3-540-30117-2_12•

Monte Carlo Radiative Heat Transfer Simulation on a Reconfigurable Computer

Maya Gokhale¹, Janette R. Frigo¹, Christine Ahrens¹, Justin L. Tripp¹, Ron Minnich¹•Institutions (1)

Los Alamos National Laboratory¹

30 Aug 2004

TL;DR: In this paper, the authors ported a supercomputer application benchmark onto Xilinx Virtex II and Virtex II Pro FPGAs and compared performance with three Pentium IV Xeon microprocessors.

Abstract: Recently, the appearance of very large (3 – 10M gate) FPGAs with embedded arithmetic units has opened the door to the possibility of floating point computation on these devices. While previous researchers have described peak performance or kernel matrix operations, there is as yet relatively little experience with mapping an application-specific floating point loop onto FPGAs. In this work, we port a supercomputer application benchmark onto Xilinx Virtex II and Virtex II Pro FPGAs and compare performance with three Pentium IV Xeon microprocessors. Our results show that this application-specific pipeline, with 12 multiply, 10 add/subtract, one divide, and two compare modules of single precision floating point data type, shows speed up of 10.37×. We analyze the trade-offs between hardware and software to characterize the algorithms that will perform well on current and future FPGA architectures.

31 citations

Proceedings Article•10.1109/IJCNN.2004.1381073•

Self-organizing map hardware accelerator system and its application to realtime image enlargement

Hakaru Tamukoh¹, T. Aso, Keiichi Horio, Takeshi Yamakawa•Institutions (1)

Kyushu Institute of Technology¹

25 Jul 2004

TL;DR: A new fast learning algorithm for SOM and its digital hardware design based on the massively parallel architecture is proposed, and up to 256 competing units (16 × 16 map) can be implemented.

Abstract: We propose a new fast learning algorithm for SOM and its digital hardware design based on the massively parallel architecture. When this proposed algorithm is realized by using Xilinx XC2V6000-6 FPGA, a maximum performance of 17500 MCUPS is achieved and up to 256 competing units (16 × 16 map) can be implemented. Each competing unit has a weight vector which is represented by 128 elements of 16 bits accuracy. Furthermore, we applied the proposed hardware to a realtime digital image enlargement system. In the case of full color (24 bits) image enlargement from QQVGA (160 × 120 pixel) to QVGA (320 × 240 pixel), the proposed hardware requires only 0.12 second per image, while the personal computer (Intel XEON, 2.8 GHz Dual) requires more than 5 seconds per image.

28 citations

Book•

The Unabridged Pentium 4: IA32 Processor Genealogy

Tom Shanley

23 Jul 2004

TL;DR: Tom Shanley covers not only the hardware design and software enhancements of Intel's latest processors but also the relationship between these hardware and software characteristics; readers will come away with a complete understanding of the processor's internal architecture, the Front Side Bus (FSB), the processor's relationship to the system, and the processor's software architecture.

Abstract: “In this monumental new book, Tom Shanley pulls together 15 years of history of Intel's mainline microprocessors, the most popular and important computer architecture in history. Shanley has a keen eye for the salient facts, and an outstanding sense for how to organize and display the material for easy accessibility by the reader. If you want to know what does this bit control, what does that feature do, and how did those instructions evolve through several generations of x86, this is the reference book for you. This is the book Intel should have written, but now they don't have to.” -Bob Colwell, Intel Fellow. The Unabridged Pentium 4 offers unparalleled coverage of Intel's IA32 family of processors, from the 386 through the Pentium 4 and Pentium M processors. Unlike other texts, which address solely a hardware or software audience, this book serves as a comprehensive technical reference for both audiences. Inside, Tom Shanley covers not only the hardware design and software enhancements of Intel's latest processors, he also explains the relationship between these hardware and software characteristics. As a result, readers will come away with a complete understanding of the processor's internal architecture, the Front Side Bus (FSB), the processor's relationship to the system, and the processor's software architecture. Essential topics covered include: goals of single-task and multi-task operating systems; the 386 processor, the baseline ancestor of the IA32 processor family; the 486 processor, including a cache primer; the Pentium processor; the P6 roadmap, P6 processor core, and P6 FSB; the Pentium Pro processor, including the Microcode Update feature; the Pentium II and the Pentium II Xeon and Celeron processors; the Pentium III and the Pentium III Xeon and Celeron processors; the Pentium 4 processor family; the Pentium M processor; and processor identification, System Management Mode, and the IO and Local APICs. An “at-a-glance” table of contents allows readers to quickly find topics ranging from 386 Demand Mode Paging to Pentium 4 CPU Arbitration. The accompanying CD-ROM contains 16 extra chapters. Whether you design software or hardware or are responsible for system maintenance or customer support, The Unabridged Pentium 4 will prove an invaluable reference to the world's most widely used microprocessor chips. MindShare's PC System Architecture series is a crisply written and comprehensive set of guides to the most important PC hardware standards. Books in the series are intended for use by hardware and software designers, programmers, and support personnel. One of the leading technical training companies in the hardware industry, MindShare, Inc., provides innovative courses for dozens of companies, including HP, AMD, IBM, and Compaq. Through these classes and by writing the highly regarded PC System Architecture Series for Addison-Wesley, MindShare trainers emphasize the relationships of hardware subsystems to each other as well as the relationship between software and hardware.

26 citations

Book Chapter•10.1007/11558958_32•

Parallel algorithms for balanced truncation model reduction of sparse systems

José M. Badía¹, Peter Benner², Rafael Mayo¹, Enrique S. Quintana-Ortí¹•Institutions (2)

James I University¹, Chemnitz University of Technology²

20 Jun 2004

TL;DR: In this article, the authors describe the parallelization of an efficient algorithm for balanced truncation that allows to reduce models with state-space dimension up to $\mathcal{O}(10^5)$.

Abstract: We describe the parallelization of an efficient algorithm for balanced truncation that allows to reduce models with state-space dimension up to $\mathcal{O}(10^5)$. The major computational task in this approach is the solution of two large-scale sparse Lyapunov equations, performed via a coupled LR-ADI iteration with (super-)linear convergence. Experimental results on a cluster of Intel Xeon processors illustrate the efficacy of our parallel model reduction algorithm.

21 citations

Proceedings Article•10.5555/1025127.1026014•

TO-Lock: Removing Lock Overhead Using the Owners' Temporal Locality

Takeshi Ogasawara¹, Hideaki Komatsu¹, Toshio Nakatani¹•Institutions (1)

IBM¹

29 Sep 2004

TL;DR: The experimental results of the benchmarking on an SMP machine using Intel Xeon processors showed that the proposed algorithm can significantly improve the performance by 83% on average compared to the case using a complex atomic instruction.

Abstract: The performance of locking is critical, as programming languages with built-in thread support are coming into wide use. Many techniques for optimizing Java monitors have been proposed, based on the observation that the locks are rarely contended for in many applications. However, the problem of the performance degradation in SMP environments caused by necessary serializations of the processors' execution has not been addressed for shared objects. We propose a new algorithm for this problem. It uses simple instructions to acquire the lock by exploiting the owner locality for objects even if the ownership has migrated among the threads. Our algorithm is particularly effective for SMP environments because we can remove the overhead of the serialization caused by complex atomic operations for uncontended locks by allowing the lock operation and the code protected by the lock to be executed in parallel. We verified the safety of the algorithm by using a software tool, Spin. The experimental results of our benchmarking on an SMP machine using Intel Xeon processors showed that our algorithm can significantly improve the performance by 83% on average compared to the case using a complex atomic instruction.

11 citations

Proceedings Article•10.1109/ICPADS.2004.62•

Parallelization of Bayesian network based SNPs pattern analysis and performance characterization on SMP/HT

Justin J. Song¹, Eric Li¹, Wei Hu¹, S. Ge¹, Chunrong Lai¹, Yimin Zhang¹, Xuegong Zhang², Wenguang Chen², Weimin Zheng²•Institutions (2)

Intel¹, Tsinghua University²

7 Jul 2004

TL;DR: Workload profiling shows that parallel SNPs' data sharing nature matches hyper-threading's cache sharing mechanism and thus greatly reduces cache coherency protocol traffic on the shared front side bus, while scalability analysis shows that imbalance and locks are two major factors that may limit the parallel workload speedup on platforms with more processors.

Abstract: Single nucleotide polymorphisms (SNPs) are subtle variations in the genomic DNA sequence of individuals of the same species. They play a key role in the pharmaceutical industry in understanding variations in drug treatment responses between individuals at the molecular level. Discovering patterns around SNP loci is very important for better understanding the possible origin of SNPs in evolution. Bayesian networks have been applied to this problem with promising results. Since Bayesian network based SNPs pattern analysis demonstrates high computational complexity, we parallelized this workload on Intel Xeon SMP systems. SNPs' task level parallelism is exploited. Experimental results show that memory is the bottleneck: on an 8-way hyper-threading enabled Xeon SMP system, system memory bandwidth is fully saturated and memory load access latency is roughly 50% longer than on a single processor system. Another interesting result is that Intel's hyper-threading technology improves the multithreaded workload's performance by a 1.6X speedup. Workload profiling shows that parallel SNPs' data sharing nature matches hyper-threading's cache sharing mechanism, and thus greatly reduces cache coherency protocol traffic on the shared front side bus. Scalability analysis shows that imbalance and locks are two major factors that may limit the parallel workload speedup on platforms with more processors.

10 citations

Proceedings Article•10.1109/CCGRID.2004.1336698•

High performance LU factorization for non-dedicated clusters

Toshio Endo¹, Kenji Kaneda¹, Kenjiro Taura¹, Akinori Yonezawa²•Institutions (2)

University of Tokyo¹, University of Virginia²

19 Apr 2004

TL;DR: An implementation of parallel LU factorization that achieves high performance on non-dedicated clusters by a combination of techniques including a latency tolerant communication and data partitioning that achieves both load balance and small communication volume for arbitrary and dynamically changing number of processors.

Abstract: This paper describes an implementation of parallel LU factorization. The focus is to achieve high performance on non-dedicated clusters, where the number of available computing resources may be arbitrary and even dynamically changing. We accommodate joining/leaving processes by describing the algorithm in the Phoenix programming model. We achieve high performance in this setting by a combination of techniques including a latency tolerant communication and data partitioning that achieves both load balance and small communication volume for arbitrary and dynamically changing number of processors. We observed 130 GFlops with 128 processes on a 70-node dual 2.4GHz Xeon cluster, at matrix size = 46080. This performance is comparable to that of the High Performance Linpack (HPL). When cluster nodes are loaded by background processes, our implementation surpasses HPL.

Optimized Lattice Boltzmann Kernels as Testbeds for Processor Performance

Thomas Zeiser, Gerhard Wellein, Georg Hager, S. Donath, Frank Deserno, Peter Lammers, Monika Wierse

1 Jan 2004

TL;DR: It is demonstrated that vector systems can outperform COTS architectures by more than one order of magnitude, and comparing different programming models of the LBM kernel shows the Cray X1 to deliver good performance even on the standard implementation of this kernel.

Abstract: Delivering high sustained performance for scientific, memory-intensive applications is a well known problem in high performance computing (HPC). The main objective in the design of vector computers is to resolve this challenge. Commodity “off-the-shelf” (COTS) architectures do not mainly focus on HPC requirements, but dominate the HPC market due to their (often) moderate price-performance ratio. In our report we present a comprehensive survey of modern processor architectures ranging from IA32 compatible (Intel Xeon, AMD Opteron), superscalar RISC (IBM Power4), IA64 (Intel Itanium2) to classical vector (NEC SX6) and novel vector (Cray X1) architectures. Using a kernel from the lattice Boltzmann method (LBM), we point out different architecture dependent optimization strategies and discuss single processor performance numbers. Our results demonstrate that vector systems can outperform COTS architectures by more than one order of magnitude. The NEC SX6 and Cray X1 achieve comparable performance levels on large problem sizes. Comparing different programming models of the LBM kernel shows the Cray X1 to deliver good performance even on the standard implementation of this kernel.

Proceedings Article•10.1109/HPCASIA.2004.1324023•

PM/InfiniBand-FJ: a high performance communication facility using InfiniBand for large scale PC clusters

Shinji Sumimoto¹, Akira Naruse¹, Kouichi Kumon¹, K. Hosoe¹, Toshiyuki Shimizu¹•Institutions (1)

Fujitsu¹

20 Jul 2004

TL;DR: This work describes the design of a high performance communication facility called PM/InfiniBand-FJ, which uses the InfiniBand interconnect for large scale PC clusters and expands the original InfiniBand specification.

Abstract: This work describes a design of high performance communication facility called the PM/InfiniBand-FJ using InfiniBand interconnect for large scale PC clusters. The PM/InfiniBand-FJ has been developed to realize higher application performance than commercial supercomputers and comparable availability to them. Since the specification of InfiniBand interconnect is designed for communication among servers and I/Os, there are some issues to use InfiniBand for high performance computation on over 1000 node PC clusters. Therefore, the PM/InfiniBand-FJ solves the issues by expanding the original specification of InfiniBand. We have implemented the PM/InfiniBand-FJ on SCore cluster system software, and evaluated the communication and application performance. The performance results show that a 913.2 MB/s of bandwidth and 15.6 μs round trip time have been achieved on Xeon 2.8GHz PC with ServerWorks GC LE chipset. The result of NAS parallel benchmark shows that the 128 node result of IS Class B on PM/InfiniBand-FJ is 1.52 times faster than that of PM/MyrinetXP using Fujitsu PRIMERGY RX200 PC cluster (Xeon 3.06GHz).

Book Chapter•10.1007/978-3-540-27866-5_93•

Execution Schemes for Parallel Adams Methods

Thomas Rauber, Gudula Rünger¹•Institutions (1)

Chemnitz University of Technology¹

31 Aug 2004

TL;DR: Mixed parallel execution schemes for specific (explicit and implicit) variants of general linear methods, the Parallel Adams-Bashforth methods and the Parallel Adams-Moulton methods, are studied, which are new methods providing additional method parallelism.

Abstract: Many recent solvers for ordinary differential equations (ODEs) have been designed with an additional potential of method parallelism, but the actual effectiveness of exploiting method parallelism depends on the specific communication and computation requirements induced by the equation to be solved. In this paper we study mixed parallel execution schemes for specific (explicit and implicit) variants of general linear methods, the Parallel Adams-Bashforth methods and the Parallel Adams-Moulton methods, which are new methods providing additional method parallelism. The implementations are realized with a library for multiprocessor task programming. Experiments on a Cray T3E and a dual Xeon cluster show good efficiency results, also for sparse application problems.

Book Chapter•10.1007/978-3-540-30566-8_115•

Evaluating performance of BLAST on Intel Xeon and Itanium2 processors

Ramesh Radhakrishnan, Rizwan Ali, G. Kochhar, Kalyana Chadalavada, Ramesh Rajagopalan, Jenwei Hsieh, Onur Celebioglu

13 Dec 2004

TL;DR: The aim is to understand the performance impact of the different features associated with each processor/platform technology when running the BLAST workload.

Abstract: High-performance computing (HPC) has increasingly adopted the use of clustered Intel architecture–based servers. This paper compares the performance characteristics of three Dell PowerEdge (PE) servers that are based on three different Intel processor technologies. They are the PE1750 which is an IA-32 based Xeon system, PE1850 which uses the new 90nm technology Xeon processor at faster frequencies and the PE3250 which is an Itanium2 based system. BLAST (Basic Local Alignment Search Tool), a high performance computing application used in the field of biological research, is used as the workload for this study. The aim is to understand the performance benefits of the different features associated with each processor/platform technology to BLAST and explain the observations using other standard micro-benchmarks like STREAM and LMBench.

Proceedings Article•10.1109/PARELEC.2004.68•

The Modified Speculative Method for the Transient States Analysis

J. Forenc¹, Andrzej Jordan•Institutions (1)

Białystok Technical University¹

7 Sep 2004

TL;DR: The article presents the modified speculative method, followed by the analysis of a non-linear model of a DC motor supplied by a solar generator as an example of the method's application.

Abstract: The speculative method is intended to conduct transient states analysis in electrical circuits in which the transient state is described by a system of ordinary differential equations (ODE), linear or non-linear. A general idea of this method is based on the decomposition of the time domain. Computations in subintervals of time are conducted in parallel with the use of one of the well-known sequential numerical methods of solving ODE systems. In the article the modified speculative method, then the analysis of a non-linear model of a DC-motor supplied by a solar generator, as an example of the application of this method, will be presented. The computations were carried out with the use of a cluster of 5 workstations based on the Intel Xeon 2.66 GHz processor.

10.5170/CERN-2005-002.1153•

Lattice QCD clusters at Fermilab

Don Holmgren, P. B. Mackenzie, Anitoj Singh, Jim Simone

1 Dec 2004

TL;DR: This paper describes production clusters for lattice QCD simulations, and discusses the investigations of various commodity processors, including Pentium 4E, Xeon, Opteron, and PPC970.

Abstract: As part of the DOE SciDAC "National Infrastructure for Lattice Gauge Computing" project, Fermilab builds and operates production clusters for lattice QCD simulations. This paper will describe these clusters. The design of lattice QCD clusters requires careful attention to balancing memory bandwidth, floating point throughput, and network performance. We will discuss our investigations of various commodity processors, including Pentium 4E, Xeon, Opteron, and PPC970. We will also discuss our early experiences with the emerging Infiniband and PCI Express architectures. Finally, we will present our predictions and plans for future clusters.

Proceedings Article•10.1109/DOD_UGC.2004.20•

Heterogeneous high performance computer emulation of a space based radar on-board processor

A. Salama, R. Linderman¹, J. Rooks¹, A. Leider¹•Institutions (1)

Air Force Research Laboratory¹

7 Jun 2004

TL;DR: The successful emulation of an on-board processor (OBP) to support space based radar (SBR) is presented; the framework allows for experimenting with architecture enhancements and changes, and ultimately will ensure a low cost, reliable, fully reprogrammable product produced without re-spins.

Abstract: This paper presents the successful emulation of an on-board processor (OBP) to support space based radar (SBR). The emulation is demonstrated on the forty-eight node dual Xeon heterogeneous high performance computer (HHPC) operated by the Air Force Research Laboratory (AFRL) located in Rome, New York. Each node in the HHPC supports one Annapolis Wildstar II board composed of 2 Xilinx Virtex II 6 Million gate field programmable gate arrays (FPGAs). As system complexity increases, debugging the software of tera-scale systems with hundreds to thousands of processors is poorly supported by time consuming simulations. However, the advent of large FPGAs allows a powerful new tool to assist in the architecture development effort - emulation. For the case at hand, the 96 FPGAs of the HHPC are capable of emulating at 8% of the actual system clock speed (20 MHz of 250 MHz) and close to 15% of the 2560 individual processors of the proposed SBR system. Even at this reduced scale, this emulation provides a testing environment roughly a million times more capable than HPC-based simulation for early software bug detection and correction. Further, this framework allows for experimenting with architecture enhancements and changes, and ultimately will ensure a low cost, reliable, fully reprogrammable product produced without re-spins. The embedded system architecture of this SBR OBP is based on AFRL's dual processor, power efficient, programmable, wafer scale signal processor (WSSP). Target tracking and discrimination algorithms were developed and demonstrated on an earlier 96-processor embodiment of this architecture. For SBR, the algorithm set is being extended to include synthetic aperture radar (SAR) image formation and moving target indication (MTI) algorithms.

Proceedings Article•10.1109/ISPDC.2004.33•

Load balancing multi-zone applications on a heterogeneous cluster with multi-level parallelism

P. Wong¹, Haoqiang Jin¹, J. Becker¹•Institutions (1)

Ames Research Center¹

5 Jul 2004

TL;DR: This work ran the Multi-zone versions of the NAS Parallel Benchmarks on a cluster composed of two SGI Origin 2000 servers, and an Intel SMP Xeon server connected by Gigabit Ethernet, and reported on the results and their implications for running parallel applications on heterogeneous clusters.

Abstract: We investigate the feasibility of running parallel applications on heterogeneous clusters. The motivation for doing so is twofold. First, it is practical to be able to pull together existing machines to run a job that is too big for any one of them, especially if such jobs are run rarely. Second, in the event of an emergency, where a very large problem must be solved in a few days, it may not be feasible to purchase and install a new machine in time, and any existing machines will have to be brought to bear on the problem. We ran the Multi-zone versions of the NAS Parallel Benchmarks (NPB) on a cluster composed of two SGI Origin 2000 servers, and an Intel SMP Xeon server connected by Gigabit Ethernet. We report on the results and their implications for running parallel applications on heterogeneous clusters.

Performance of Scientific Applications on Linux Clusters

Swamy Kandadai, Suga Sugavanam

1 Jan 2004

TL;DR: This paper presents performance comparisons of several scientific applications on the HPC Benchmark center Linux® clusters, including Stream, a benchmark that measures memory bandwidth; PALLAS, a Message Passing Interface (MPI) benchmark that measures single, parallel, and collective communications; and LINPACK, a dense linear algebra equations solver.

Abstract: This paper presents performance comparisons of several scientific applications on the HPC Benchmark center Linux® clusters. The applications selected are Stream, a benchmark that measures memory bandwidth; PALLAS, a Message Passing Interface (MPI) benchmark that measures single, parallel, and collective communications; LINPACK, a dense linear algebra equations solver; HIMENO, a benchmark kernel that appears in a linear solver of Pressure Poisson included in an incompressible Navier-Stokes solver; and the Numerical Aerodynamic Simulation (NAS) benchmarks. We present the results obtained on 2.8 GHz Xeon clusters, 3.06 GHz Xeon clusters, and 2.0 GHz AMD Opteron™ clusters.

Book Chapter•10.1007/11558958_139•

An implementation of parallel 3-d FFT using short vector SIMD instructions on clusters of PCs

Daisuke Takahashi¹, Taisuke Boku¹, Mitsuhisa Sato¹•Institutions (1)

University of Tsukuba¹

20 Jun 2004

TL;DR: In this article, the authors propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on clusters of PCs, and achieve performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.

Abstract: In this paper, we propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on clusters of PCs. We vectorized FFT kernels using Intel's Streaming SIMD Extensions 2 (SSE2) instructions. We show that a combination of the vectorization and block three-dimensional FFT algorithm improves performance effectively. Performance results of three-dimensional FFTs on a dual Xeon 2.8 GHz PC SMP cluster are reported. We successfully achieved performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.

Proceedings Article•10.1117/12.528763•

Implementation of projection-type autostereoscopic multiview 3D display system for real-time applications

Young-Gyoo Park¹, Seung-Chul Kim¹, Sang-Tae Lee¹, Eun-Soo Kim¹•Institutions (1)

Kwangwoon University¹

21 May 2004-electronic imaging

TL;DR: Experimental results show that the proposed system can display four-view VGA images with 16-bit full color at a frame rate of 15 fps in real time.

Abstract: In this paper, a new projection-type autostereoscopic multiview 3D display system for real-time applications is proposed, using IEEE 1394 digital cameras, an Intel Xeon server computer system, a projection-type 3D display system, and Microsoft's DirectShow programming library. Its performance is analyzed in terms of image-grabbing frame rate, displayed image resolution, possible color depth, and number of views. In the proposed system, four-view color images are first captured by four IEEE 1394 digital cameras, then processed in the Intel Xeon server computer system and transmitted in real time to a graphics card with four output ports that supports the four-view stereoscopic display system. These outputs are finally projected through four projectors onto a specially designed Fresnel screen to form a four-view autostereoscopic image. The overall system control program is developed based on Microsoft's DirectShow programming library. Experimental results show that the proposed system can display four-view VGA images with 16-bit full color at a frame rate of 15 fps in real time.

Proceedings Article•10.1145/968280.968321•

Using an FPGA coprocessor for improving execution speed of TRT-LUT: one of the feature extraction algorithms for ATLAS LVL2 trigger

C. Hinkelbein¹, A. Khomich¹, Andreas Kugel¹, Reinhard Männer¹, M. Muller¹•Institutions (1)

University of Mannheim¹

22 Feb 2004

TL;DR: This work investigates the suitability of using an FPGA coprocessor to speed up the track-finding algorithm for the ATLAS Level 2 trigger using the TRT-LUT algorithm, and finds that a hybrid FPGA/CPU realization gives a speed-up of a factor of ~2 in comparison with a CPU-only implementation.

Abstract: This work investigates the suitability of using an FPGA coprocessor to speed up the track-finding algorithm for the ATLAS Level 2 trigger. Two realizations of the same algorithm have been compared: a C++ realization tested on a computer equipped with dual Xeon 2.4 GHz CPUs, a 64-bit/66 MHz PCI bus, and 1024 MB of DDR RAM main memory, running Red Hat Linux 7.1; and a hybrid C++ and VHDL realization tested on the same PC additionally equipped with an MPRACE board (an FPGA coprocessor board based on a Xilinx Virtex-2 FPGA, built as a 64-bit/66 MHz PCI card and developed at the University of Mannheim). In the TRT-LUT algorithm, the most time-consuming parts were implemented in VHDL and run on the FPGA coprocessor. The hybrid FPGA/CPU realization gives a speed-up of a factor of ~2 in comparison with the CPU-only implementation.

Book Chapter•10.1016/S0927-5452(04)80056-X•

Efficient parallel search in video databases with dynamic feature extraction

Stefan Geisler

1 Jan 2004

TL;DR: The chapter presents an approach, in which the feature vectors are calculated after the user has sent his query to the video database, which follows a suggestion of KAO for image databases.

Abstract: This chapter discusses efficient parallel search in video databases with dynamic feature extraction. Video databases with dynamic feature extraction offer the possibility for powerful and flexible queries. However, the retrieval is very time consuming. Efficient algorithms are required to manage the search for objects in one single video, and parallel methods are necessary to cope with the large number of videos in a database. The chapter presents an approach in which the feature vectors are calculated after the user has sent his query to the video database. This follows a suggestion of KAO for image databases. Objects or persons can be found because the template matching algorithm is performed on each image in the database. Obviously, such systems need a lot of computational power, which at present cannot be realized without parallelism and efficient algorithms for searching in digital videos. Both are presented in the following sections after a short description of the video retrieval process. The efficiency of different levels of parallelism and of different parallel architectures is compared. Two architectures are considered: a dual Xeon 2.2 GHz system and an Alpha workstation with 4 PEs at 600 MHz. The Xeon system is tested with and without hyper-threading. The hyper-threading technology allows some functional units to be used in parallel by appearing to provide an additional PE.

Proceedings Article•10.1145/977091.977130•

Improving the execution time of global communication operations

Matthias Kühnemann¹, Thomas Rauber², Gudula Rünger¹•Institutions (2)

Chemnitz University of Technology¹, University of Bayreuth²

14 Apr 2004

TL;DR: It is demonstrated that the optimized communication operations can be used to reduce the execution time of data parallel implementations of complex application programs without any other reordering of the computation and communication structure.

Abstract: Many parallel applications from scientific computing use MPI global communication operations to collect or distribute data. Since the execution times of these communication operations increase with the number of participating processors, scalability problems might occur. In this article, we show for different MPI implementations how the execution time of global communication operations can be significantly improved by a restructuring based on orthogonal processor structures. As platforms, we consider a dual Xeon cluster, a Beowulf cluster and a Cray T3E with different MPI implementations. We show that the execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on the dual Xeon cluster and the Beowulf cluster. But also on a Cray T3E a significant improvement can be obtained by a careful selection of the processor groups. We demonstrate that the optimized communication operations can be used to reduce the execution time of data parallel implementations of complex application programs without any other reordering of the computation and communication structure.
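
The idea of restructuring a broadcast over an orthogonal (row and column) processor grid can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: it uses no MPI at all, the function name `two_phase_bcast` and the choice of the root's column as the leader group are hypothetical, and each phase uses a naive one-to-all send. In a real MPI program the row and column groups would typically be built with `MPI_Comm_split`, and the library's own `MPI_Bcast` would run inside each group.

```python
from collections import Counter


def two_phase_bcast(value, root, rows, cols):
    """Simulate a broadcast over a rows x cols processor grid.

    Ranks are numbered row by row. Returns the final per-rank buffers
    and the list of (sender, receiver) messages for each phase.
    """
    buffers = [None] * (rows * cols)
    buffers[root] = value
    root_col = root % cols

    # Phase 1: the root sends to one leader per row (the ranks in its column).
    phase1 = []
    for r in range(rows):
        leader = r * cols + root_col
        if leader != root:
            buffers[leader] = buffers[root]
            phase1.append((root, leader))

    # Phase 2: every row leader forwards to the rest of its row.
    # The rows are independent, so on real hardware they can proceed concurrently.
    phase2 = []
    for r in range(rows):
        leader = r * cols + root_col
        for c in range(cols):
            dest = r * cols + c
            if dest != leader:
                buffers[dest] = buffers[leader]
                phase2.append((leader, dest))

    return buffers, phase1, phase2


def max_sends_per_rank(messages):
    """Largest number of messages any single rank has to send."""
    counts = Counter(sender for sender, _ in messages)
    return max(counts.values()) if counts else 0


if __name__ == "__main__":
    buffers, phase1, phase2 = two_phase_bcast("payload", root=5, rows=4, cols=4)
    print(all(b == "payload" for b in buffers))
    print(len(phase1) + len(phase2), max_sends_per_rank(phase1 + phase2))
```

In this model the total number of messages is unchanged (one per non-root rank), but the busiest sender handles (rows - 1) + (cols - 1) messages instead of all of them. How much of that translates into the measured savings depends on the MPI implementation and the network, which is what the article evaluates.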

Proceedings Article•10.1145/1050330.1050393•

Research on a system performance with parallel modes of execution in a multiprocessor computer system with Hyper-Threading Technology

Ognjan Nakov¹, Stefan Stojchev¹•Institutions (1)

Technical University of Sofia¹

17 Jun 2004

TL;DR: The objective of this paper is to provide results, analyses and conclusions from research into the mechanism for managing threads of execution and common system performance.

Abstract: This paper presents research into the new Hyper-Threading Technology (HT) introduced with the Intel Xeon processor family. Many applications expect maximally efficient exploitation of the hardware capability in the computer configuration. That depends on the ability of the operating system to manage concurrent execution of multiple instruction streams. The objective of this paper is to provide results, analyses and conclusions from research into the mechanism for managing threads of execution and common system performance.

Proceedings Article•10.1109/WWC.2004.1437403•

Evaluating performance of BLAST on Intel Xeon and Itanium2 processors

R. Ali, Ramesh Radhakrishnan, G. Kochhar, J. Hsieh, O. Celebioglu, Kalyana Chadalavada, R. Rajagopalan

25 Oct 2004

TL;DR: In this article, the performance characteristics of three Dell PowerEdge (PE) servers that are based on three different Intel processor technologies are compared using the Basic Local Alignment Search Tool (BLAST) workload.

Abstract: High-performance computing has increasingly adopted the use of clustered Intel architecture-based servers. This increase in adoption has been largely fueled by a number of technological enhancements in the Intel architecture-based servers, primarily due to substantial improvement in the Intel processor and memory technology over the past few years. This paper compares the performance characteristics of three Dell PowerEdge (PE) servers that are based on three different Intel processor technologies. They are the PE1750 which is an IA-32 based Xeon system, PE1850 which uses the new 90nm technology Xeon processor at faster frequencies and the PE3250 which is an Itanium2 based system. BLAST (Basic Local Alignment Search Tool), a high performance computing application used in the field of biological research, is used as the workload for this study. The aim is to understand the performance impact of the different features associated with each processor/platform technology when running the BLAST workload.

Using the GeoFEST faulted region simulation system

Jay Parker¹, Gregory A. Lyzenga, Andrea Donnellan, Michele A. Judd, Charles Norton², Teresa Baker, Edwin Tisdale, Peggy Li•Institutions (2)

California Institute of Technology¹, Massachusetts Institute of Technology²

9 Jul 2004

TL;DR: Many new capabilities and means of access for GeoFEST are now supported, including MPI-based cluster parallel computing using automatic PYRAMID/Parmetis-based mesh partitioning, automatic mesh generation for layered media with rectangular faults, and results visualization that is integrated with remote sensing data.

Abstract: GeoFEST (the Geophysical Finite Element Simulation Tool) simulates stress evolution, fault slip and plastic/elastic processes in realistic materials, and so is suitable for earthquake cycle studies in regions such as Southern California. Many new capabilities and means of access for GeoFEST are now supported. New abilities include MPI-based cluster parallel computing using automatic PYRAMID/Parmetis-based mesh partitioning, automatic mesh generation for layered media with rectangular faults, and results visualization that is integrated with remote sensing data. The parallel GeoFEST application has been successfully run on over a half-dozen computers, including Intel Xeon clusters, Itanium II and Altix machines, and the Apple G5 cluster. It is not separately optimized for different machines, but relies on good domain partitioning for load balance and low communication, and on careful writing of the parallel diagonally preconditioned conjugate gradient solver to keep communication overhead low. Demonstrated thousand-step solutions for over a million finite elements on 64 processors require under three hours, and scaling tests show high efficiency when using more than (order of) 4000 elements per processor. The source code and documentation for GeoFEST are available at no cost from Open Channel Foundation. In addition, GeoFEST may be used through a browser-based portal environment available to approved users. That environment includes semi-automated geometry creation and mesh generation tools, GeoFEST, and RIVA-based visualization tools that include the ability to generate a flyover animation showing deformations and topography. Work is in progress to support simulation of a region with several faults using 16 million elements, using a strain energy metric to adapt the mesh to faithfully represent the solution in a region of widely varying strain.
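
For readers unfamiliar with the solver named in the abstract, a diagonally (Jacobi) preconditioned conjugate gradient iteration can be sketched as follows. This is a serial, dense, pure-Python illustration under stated assumptions, not GeoFEST's code: the function names, the tolerance, and the small test matrix are hypothetical. In a distributed-memory version the diagonal preconditioner needs no communication, while the dot products are the points that require a global reduction.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product; in a parallel solver this is a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Uses conjugate gradients with the inverse diagonal of A as the
    preconditioner. Returns the solution and the iteration count.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    for it in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            return x, it
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                   # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz                        # keeps directions A-conjugate
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [1.0, 2.0, 3.0]
    x, iters = jacobi_pcg(A, b)
    residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    print(iters, max(abs(v) for v in residual))
```

Each iteration costs one matrix-vector product and a few dot products, which is why careful handling of those reductions matters for keeping communication overhead low at scale.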

Journal Article•10.5303/PKAS.2004.19.1.077•

A high performance cluster for astronomical computations

Kim Jongsoo, Kim Bong Gyu, Yim In Sung, Baek Chang Hyun, Nam Hyun Woong, Ryu Dongsu, Kang Young Woon

1 Dec 2004

TL;DR: The cluster's performance for parallel computations was measured with a three-dimensional hydrodynamic code and showed quite good scalability as the number of computational cells increases.

Abstract: A high performance computing cluster for astronomical computations has been built at Korea Astronomy Observatory. The 64-node cluster, interconnected with Gigabit Ethernet, is composed of 128 Intel Xeon processors, 160 GB of memory, 6 TB of global storage space, and an LTO (Linear Tape-Open) tape library. The cluster was installed and has been managed with the Open Source Cluster Application Resource (OSCAR) framework. Its performance for parallel computations was measured with a three-dimensional hydrodynamic code and showed quite good scalability as the number of computational cells increases. The cluster has already been utilized for several computational research projects, some of which resulted in publications, even though it has been in full operation for less than one year. As a major resource of the testbed, the cluster has also been used for Grid computations.

Journal Article•

RIKEN Super Combined Cluster (RSCC) System

Kouichi Kumon, Toshiyuki Kimura, Kohichiro Hotta, Takayuki Hoshiya

01 Jan 2004-Fujitsu Scientific & Technical Journal

TL;DR: The technologies for realizing such a high-performance cluster system are described and the implementation of an example large-scale cluster and its performance are described.

Abstract: Recently, Linux cluster systems have been replacing conventional vector processors in the area of high-performance computing. A typical Linux cluster system consists of high-performance commodity CPUs such as the Intel Xeon processor and commodity networks such as Gigabit Ethernet and Myrinet. Therefore, Linux cluster systems can directly benefit from the drastic performance improvement of these commodity components. It seems easy to build a large-scale Linux cluster by connecting these components, and the theoretical peak performance may be high. However, gaining stable operation and the expected performance from a large-scale cluster needs dedicated technologies from the hardware level to the software level. In February 2004, Fujitsu shipped one of the world's largest clusters to a customer. This system is currently the fastest Linux cluster in Japan and the seventh fastest supercomputer in the world. In this paper, we describe the technologies for realizing such a high-performance cluster system and then describe the implementation of an example large-scale cluster and its performance.
