TL;DR: In this article, the authors use the term packet processing engine (PPE) as a generic term for the computing and memory resources necessary for communication-centric processing, and show that software partitioning can significantly increase the overall communication performance of a standard multiprocessor server.
Abstract: Server-based networks have well-documented performance limitations. These limitations outline a major goal of Intel's embedded transport acceleration (ETA) project, the ability to deliver high-performance server communication and I/O over standard Ethernet and transmission control protocol/Internet protocol (TCP/IP) networks. By developing this capability, Intel hopes to take advantage of the large knowledge base and ubiquity of these standard technologies. With the advent of 10 gigabit Ethernet, these standards promise to provide the bandwidth required of the most demanding server applications. We use the term packet processing engine (PPE) as a generic term for the computing and memory resources necessary for communication-centric processing. Such PPEs have certain desirable attributes; the ETA project focuses on developing PPEs with such attributes, which include scalability, extensibility, and programmability. General-purpose processors, such as the Intel Xeon in our prototype, are extensible and programmable by definition. Our results show that software partitioning can significantly increase the overall communication performance of a standard multiprocessor server. Specifically, partitioning the packet processing onto a dedicated set of compute resources allows for optimizations that are otherwise impossible when time sharing the same compute resources with the operating system and applications.
TL;DR: Two efficient methods for multilevel data partitioning are described that can improve the performance of the multithreaded H.264 encoder using the OpenMP programming model, which allows us to leverage the advanced compiler technologies in the Intel® C++ compiler for Intel hyper-threading architectures.
Abstract: Summary form only given. Exploiting thread-level parallelism is a promising way to improve the performance of multimedia applications that are running on multithreading general-purpose processors. We describe the work in developing our threaded H.264 encoder. We parallelize the H.264 encoder using the OpenMP programming model, which allows us to leverage the advanced compiler technologies in the Intel® C++ compiler for Intel hyper-threading architectures. After we present our design considerations in the parallelization process, we describe two efficient methods for multilevel data partitioning, which can improve the performance of our multithreaded H.264 encoder. Furthermore, we exploit different options in the OpenMP programming. While one implementation that uses the task queuing model is slightly slower than the other implementation, it is easier to read than the other one. The results have shown good speedups ranging from 3.74x to 4.53x over the well-optimized sequential code performance on a system of 4 Intel Xeon™ processors with hyper-threading technology.
TL;DR: This work ports a supercomputer application benchmark onto Xilinx Virtex II and Virtex II Pro FPGAs and compares performance with three Pentium IV Xeon microprocessors; the application-specific pipeline, with 12 multiply, 10 add/subtract, one divide, and two compare modules of single precision floating point data type, shows a speed-up.
Abstract: Recently, the appearance of very large (3 - 10M gate) FPGAs with embedded arithmetic units has opened the door to the possibility of floating point computation on these devices. While previous researchers have described peak performance or kernel matrix operations, there is as yet relatively little experience with mapping an application-specific floating point loop onto FPGAs. In this work, we port a supercomputer application benchmark onto Xilinx Virtex II and Virtex II Pro FPGAs and compare performance with three Pentium IV Xeon microprocessors. Our results show that this application-specific pipeline, with 12 multiply, 10 add/subtract, one divide, and two compare modules of single precision floating point data type, shows speed up of 10.37×. We analyze the trade-offs between hardware and software to characterize the algorithms that will perform well on current and future FPGA architectures.
TL;DR: The process for detecting post-silicon performance issues, developing associated optimizations, and communicating these to application engineers, compiler teams, and microprocessor architects for appropriate action is described.
Abstract: In addition to the considerable effort spent on functional validation of Intel processors, a separate parallel activity is conducted to verify that processor performance meets or exceeds specifications. In this paper, we discuss both the pre-silicon and post-silicon performance validation processes carried out on the 90nm version of the Intel Pentium 4 processor. For pre-silicon performance work, we describe how a detailed performance simulator is used to ensure that the processor specification meets the product’s performance targets and also that the implementation matches the defined specification. Additionally, we describe how we project the performance of the Pentium 4 processor on key applications and benchmarks. Once silicon arrives, the second phase of performance verification work starts. We describe the process for detecting post-silicon performance issues, developing associated optimizations, and communicating these to application engineers, compiler teams, and microprocessor architects for appropriate action. We also discuss the tools used to gather performance metrics and characterize application performance.
TL;DR: In this paper, the authors ported a supercomputer application benchmark onto Xilinx Virtex II and Virtex II Pro FPGAs and compared performance with three Pentium IV Xeon microprocessors.
Abstract: Recently, the appearance of very large (3 – 10M gate) FPGAs with embedded arithmetic units has opened the door to the possibility of floating point computation on these devices. While previous researchers have described peak performance or kernel matrix operations, there is as yet relatively little experience with mapping an application-specific floating point loop onto FPGAs. In this work, we port a supercomputer application benchmark onto Xilinx Virtex II and Virtex II Pro FPGAs and compare performance with three Pentium IV Xeon microprocessors. Our results show that this application-specific pipeline, with 12 multiply, 10 add/subtract, one divide, and two compare modules of single precision floating point data type, shows speed up of 10.37×. We analyze the trade-offs between hardware and software to characterize the algorithms that will perform well on current and future FPGA architectures.
TL;DR: A new fast learning algorithm for SOM and its digital hardware design based on the massively parallel architecture is proposed, and up to 256 competing units (16 × 16 map) can be implemented.
Abstract: We propose a new fast learning algorithm for SOM and its digital hardware design based on the massively parallel architecture. When this proposed algorithm is realized by using Xilinx XC2V6000-6 FPGA, a maximum performance of 17500 MCUPS is achieved and up to 256 competing units (16 × 16 map) can be implemented. Each competing unit has a weight vector which is represented by 128 elements of 16 bits accuracy. Furthermore, we applied the proposed hardware to a real-time digital image enlargement system. In the case of full color (24 bits) image enlargement from QQVGA (160 × 120 pixel) to QVGA (320 × 240 pixel), the proposed hardware requires only 0.12 second per image, while the personal computer (Intel XEON, 2.8 GHz Dual) requires more than 5 seconds per image.
TL;DR: Inside, Tom Shanley covers not only the hardware design and software enhancements of Intel's latest processors, he also explains the relationship between these hardware and software characteristics, and readers will come away with a complete understanding of the processor's internal architecture, the Front Side Bus (FSB), the processor's relationship to the system, and the processor’s software architecture.
Abstract: “In this monumental new book, Tom Shanley pulls together 15 years of history of Intel's mainline microprocessors, the most popular and important computer architecture in history. Shanley has a keen eye for the salient facts, and an outstanding sense for how to organize and display the material for easy accessibility by the reader. If you want to know what does this bit control, what does that feature do, and how did those instructions evolve through several generations of x86, this is the reference book for you. This is the book Intel should have written, but now they don't have to.” -Bob Colwell, Intel Fellow. The Unabridged Pentium 4 offers unparalleled coverage of Intel's IA32 family of processors, from the 386 through the Pentium 4 and Pentium M processors. Unlike other texts, which address solely a hardware or software audience, this book serves as a comprehensive technical reference for both audiences. Inside, Tom Shanley covers not only the hardware design and software enhancements of Intel's latest processors, he also explains the relationship between these hardware and software characteristics.
As a result, readers will come away with a complete understanding of the processor's internal architecture, the Front Side Bus (FSB), the processor's relationship to the system, and the processor's software architecture. Essential topics covered include: goals of single-task and multi-task operating systems; the 386 processor, the baseline ancestor of the IA32 processor family; the 486 processor, including a cache primer; the Pentium processor; the P6 roadmap, P6 processor core, and P6 FSB; the Pentium Pro processor, including the Microcode Update feature; the Pentium II and the Pentium II Xeon and Celeron processors; the Pentium III and the Pentium III Xeon and Celeron processors; the Pentium 4 processor family; the Pentium M processor; and processor identification, System Management Mode, and the IO and Local APICs. An “at-a-glance” table of contents allows readers to quickly find topics ranging from 386 Demand Mode Paging to Pentium 4 CPU Arbitration. The accompanying CD-ROM contains 16 extra chapters. Whether you design software or hardware or are responsible for system maintenance or customer support, The Unabridged Pentium 4 will prove an invaluable reference to the world's most widely used microprocessor chips. MindShare's PC System Architecture series is a crisply written and comprehensive set of guides to the most important PC hardware standards. Books in the series are intended for use by hardware and software designers, programmers, and support personnel. One of the leading technical training companies in the hardware industry, MindShare, Inc., provides innovative courses for dozens of companies, including HP, AMD, IBM, and Compaq. Through these classes and by writing the highly regarded PC System Architecture Series for Addison-Wesley, MindShare trainers emphasize the relationships of hardware subsystems to each other as well as the relationship between software and hardware.
TL;DR: In this article, the authors describe the parallelization of an efficient algorithm for balanced truncation that makes it possible to reduce models with state-space dimension up to $\mathcal{O}(10^5)$.
Abstract: We describe the parallelization of an efficient algorithm for balanced truncation that makes it possible to reduce models with state-space dimension up to $\mathcal{O}(10^5)$. The major computational task in this approach is the solution of two large-scale sparse Lyapunov equations, performed via a coupled LR-ADI iteration with (super-)linear convergence. Experimental results on a cluster of Intel Xeon processors illustrate the efficacy of our parallel model reduction algorithm.
TL;DR: The experimental results of the benchmarking on an SMP machine using Intel Xeon processors showed that the proposed algorithm can significantly improve the performance by 83% on average compared to the case using a complex atomic instruction.
Abstract: The performance of locking is critical, as programming languages with built-in thread support are coming into wide use. Many techniques for optimizing Java monitors have been proposed, based on the observation that the locks are rarely contended for in many applications. However, the problem of the performance degradation in SMP environments caused by necessary serializations of the processors' execution has not been addressed for shared objects. We propose a new algorithm for this problem. It uses simple instructions to acquire the lock by exploiting the owner locality for objects even if the ownership has migrated among the threads. Our algorithm is particularly effective for SMP environments because we can remove the overhead of the serialization caused by complex atomic operations for uncontended locks by allowing the lock operation and the code protected by the lock to be executed in parallel. We verified the safety of the algorithm by using a software tool, Spin. The experimental results of our benchmarking on an SMP machine using Intel Xeon processors showed that our algorithm can significantly improve the performance by 83% on average compared to the case using a complex atomic instruction.
TL;DR: Workload profiling shows that parallel SNPs' data sharing nature matches hyper-threading's cache sharing mechanism and thus greatly reduces cache coherency protocol traffic on the shared front side bus; scalability analysis shows that imbalance and locks are two major factors that may limit the parallel workload speedup on platforms with more processors.
Abstract: Single nucleotide polymorphisms (SNPs) are subtle variations in the genomic DNA sequence of individuals of the same species. They play a key role in the pharmaceutical industry in understanding variations in drug treatment responses between individuals at the molecular level. Discovering patterns around SNP loci is very important for better understanding the possible origin of SNPs in evolution. Bayesian networks have been applied to this problem with promising results. Since Bayesian network based SNP pattern analysis has high computational complexity, we parallelized this workload on Intel Xeon SMP systems, exploiting the task-level parallelism of SNPs. Experimental results show that memory is the bottleneck: on an 8-way Xeon SMP system with hyper-threading enabled, system memory bandwidth is fully saturated and memory load access latency is roughly 50% longer than on a single-processor system. Another interesting result is that Intel's hyper-threading technology improves the multithreaded workload's performance by a 1.6X speedup. Workload profiling shows that parallel SNPs' data sharing nature matches hyper-threading's cache sharing mechanism, and thus greatly reduces cache coherency protocol traffic on the shared front side bus. Scalability analysis shows that imbalance and locks are two major factors that may limit the parallel workload speedup on platforms with more processors.
TL;DR: An implementation of parallel LU factorization that achieves high performance on non-dedicated clusters by a combination of techniques including a latency tolerant communication and data partitioning that achieves both load balance and small communication volume for arbitrary and dynamically changing number of processors.
Abstract: This paper describes an implementation of parallel LU factorization. The focus is to achieve high performance on non-dedicated clusters, where the number of available computing resources may be arbitrary and even dynamically changing. We accommodate joining/leaving processes by describing the algorithm in the Phoenix programming model. We achieve high performance in this setting by a combination of techniques including a latency tolerant communication and data partitioning that achieves both load balance and small communication volume for arbitrary and dynamically changing number of processors. We observed 130 GFlops with 128 processes on a 70-node dual 2.4GHz Xeon cluster, at matrix size = 46080. This performance is comparable to that of the High Performance Linpack (HPL). When cluster nodes are loaded by background processes, our implementation surpasses HPL.
TL;DR: It is demonstrated that vector systems can outperform COTS architectures by more than one order of magnitude, and comparing different programming models of the LBM kernel shows the Cray X1 to deliver good performance even on the standard implementation of this kernel.
Abstract: Delivering high sustained performance for scientific, memory-intensive applications is a well known problem in high performance computing (HPC). The main objective in the design of vector computers is to resolve this challenge. Commodity “off-the-shelf” (COTS) architectures do not mainly focus on HPC requirements, but dominate the HPC market due to their (often) moderate price-performance ratio. In our report we present a comprehensive survey of modern processor architectures ranging from IA32 compatible (Intel Xeon, AMD Opteron), superscalar RISC (IBM Power4), IA64 (Intel Itanium2) to classical vector (NEC SX6) and novel vector (Cray X1) architectures. Using a kernel from the lattice Boltzmann method (LBM), we point out different architecture dependent optimization strategies and discuss single processor performance numbers. Our results demonstrate that vector systems can outperform COTS architectures by more than one order of magnitude. The NEC SX6 and Cray X1 achieve comparable performance levels on large problem sizes. Comparing different programming models of the LBM kernel shows the Cray X1 to deliver good performance even on the standard implementation of this kernel.
TL;DR: This work describes the design of a high performance communication facility called PM/InfiniBand-FJ, which uses the InfiniBand interconnect for large scale PC clusters and expands the original InfiniBand specification.
Abstract: This work describes the design of a high performance communication facility called PM/InfiniBand-FJ, which uses the InfiniBand interconnect for large scale PC clusters. PM/InfiniBand-FJ was developed to realize higher application performance than commercial supercomputers and availability comparable to theirs. Since the specification of the InfiniBand interconnect is designed for communication among servers and I/Os, there are some issues in using InfiniBand for high performance computation on PC clusters of over 1000 nodes. PM/InfiniBand-FJ therefore solves these issues by expanding the original specification of InfiniBand. We have implemented PM/InfiniBand-FJ on the SCore cluster system software, and evaluated the communication and application performance. The performance results show that a bandwidth of 9132 MB/s and a round trip time of 156 μs have been achieved on a Xeon 2.8 GHz PC with the ServerWorks GC LE chipset. The result of the NAS parallel benchmark shows that the 128 node result of IS Class B on PM/InfiniBand-FJ is 152 times faster than that of PM/MyrinetXP using a Fujitsu PRIMERGY RX200 PC cluster (Xeon 3.06 GHz).
TL;DR: Mixed parallel execution schemes for specific (explicit and implicit) variants of general linear methods, the Parallel Adams-Bashforth methods and the Parallel Adams-Moulton methods, are studied, which are new methods providing additional method parallelism.
Abstract: Many recent solvers for ordinary differential equations (ODEs) have been designed with an additional potential of method parallelism, but the actual effectiveness of exploiting method parallelism depends on the specific communication and computation requirements induced by the equation to be solved. In this paper we study mixed parallel execution schemes for specific (explicit and implicit) variants of general linear methods, the Parallel Adams-Bashforth methods and the Parallel Adams-Moulton methods, which are new methods providing additional method parallelism. The implementations are realized with a library for multiprocessor task programming. Experiments on a Cray T3E and a dual Xeon cluster show good efficiency results, also for sparse application problems.
TL;DR: The aim is to understand the performance impact of the different features associated with each processor/platform technology when running the BLAST workload.
Abstract: High-performance computing (HPC) has increasingly adopted the use of clustered Intel architecture–based servers. This paper compares the performance characteristics of three Dell PowerEdge (PE) servers that are based on three different Intel processor technologies. They are the PE1750 which is an IA-32 based Xeon system, PE1850 which uses the new 90nm technology Xeon processor at faster frequencies and the PE3250 which is an Itanium2 based system. BLAST (Basic Local Alignment Search Tool), a high performance computing application used in the field of biological research, is used as the workload for this study. The aim is to understand the performance benefits of the different features associated with each processor/platform technology to BLAST and explain the observations using other standard micro-benchmarks like STREAM and LMBench.
TL;DR: The article presents the modified speculative method and then, as an example of its application, the analysis of a non-linear model of a DC motor supplied by a solar generator.
Abstract: The speculative method is intended for transient state analysis in electrical circuits in which the transient state is described by a system of ordinary differential equations (ODEs), linear or non-linear. The general idea of this method is based on the decomposition of the time domain. Computations in subintervals of time are conducted in parallel using one of the well-known sequential numerical methods of solving ODE systems. The article presents the modified speculative method and then, as an example of its application, the analysis of a non-linear model of a DC motor supplied by a solar generator. The computations were carried out on a cluster of 5 workstations based on the Intel Xeon 2.66 GHz processor.
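The time-domain decomposition described above can be illustrated with a small sketch. The abstract does not say how the initial values of the later subintervals are speculated or corrected, so everything below is an assumption made for illustration: the function names, the choice of classical RK4 as the sequential method, the initial guess (the global initial value reused for every subinterval), and the simple fixed-point correction loop are all hypothetical and not the authors' algorithm.

```python
import math


def rk4_segment(f, t, y, h, steps):
    """Integrate y' = f(t, y) with classical RK4 for `steps` steps of size h."""
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y


def speculative_solve(f, y0, t_end, segments, steps, tol=1e-12):
    """Solve on [0, t_end] split into `segments` subintervals.

    Each subinterval starts from a speculated initial value. All subintervals
    are integrated independently, then each guess is replaced by the end value
    of its predecessor. The loop stops once the guesses no longer change.
    Returns (final value, number of sweeps).
    """
    seg_len = t_end / segments
    h = seg_len / steps
    starts = [y0] * segments  # assumed speculation: reuse y0 everywhere
    ends = []
    for sweep in range(1, segments + 1):
        # These integrations are independent of each other, so in a real
        # setting each one could run on its own worker or cluster node.
        ends = [
            rk4_segment(f, k * seg_len, starts[k], h, steps)
            for k in range(segments)
        ]
        new_starts = [y0] + ends[:-1]
        if max(abs(a - b) for a, b in zip(new_starts, starts)) <= tol:
            return ends[-1], sweep
        starts = new_starts
    return ends[-1], segments


if __name__ == "__main__":
    value, sweeps = speculative_solve(lambda t, y: -y, 1.0, 4.0, 4, 100)
    print(value, math.exp(-4.0), sweeps)
```

With such a naive initial guess the loop needs as many sweeps as there are subintervals, so it gains nothing over a sequential run; any practical benefit depends on how good the speculated initial values are.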
TL;DR: This paper describes production clusters for lattice QCD simulations, and discusses the investigations of various commodity processors, including Pentium 4E, Xeon, Opteron, and PPC970.
Abstract: As part of the DOE SciDAC "National Infrastructure for Lattice Gauge Computing" project, Fermilab builds and operates production clusters for lattice QCD simulations. This paper will describe these clusters. The design of lattice QCD clusters requires careful attention to balancing memory bandwidth, floating point throughput, and network performance. We will discuss our investigations of various commodity processors, including Pentium 4E, Xeon, Opteron, and PPC970. We will also discuss our early experiences with the emerging Infiniband and PCI Express architectures. Finally, we will present our predictions and plans for future clusters.
TL;DR: The successful emulation of an on-board processor (OBP) to support space based radar (SBR) is presented; the framework allows for experimenting with architecture enhancements and changes, and ultimately will ensure a low cost, reliable, fully reprogrammable product produced without re-spins.
Abstract: This paper presents the successful emulation of an on-board processor (OBP) to support space based radar (SBR). The emulation is demonstrated on the forty-eight node dual Xeon heterogeneous high performance computer (HHPC) operated by the Air Force Research Laboratory (AFRL) located in Rome, New York. Each node in the HHPC supports one Annapolis Wildstar II board composed of 2 Xilinx Virtex II 6 Million gate field programmable gate arrays (FPGAs). As system complexity increases, debugging the software of tera-scale systems with hundreds to thousands of processors is poorly supported by time consuming simulations. However, the advent of large FPGAs allows a powerful new tool to assist in the architecture development effort - emulation. For the case at hand, the 96 FPGAs of the HHPC are capable of emulating at 8% of the actual system clock speed (20 MHz of 250 MHz) and close to 15% of the 2560 individual processors of the proposed SBR system. Even at this reduced scale, this emulation provides a testing environment roughly a million times more capable than HPC-based simulation for early software bug detection and correction. Further, this framework allows for experimenting with architecture enhancements and changes, and ultimately will ensure a low cost, reliable, fully reprogrammable product produced without re-spins. The embedded system architecture of this SBR OBP is based on AFRL's dual processor, power efficient, programmable, wafer scale signal processor (WSSP). Target tracking and discrimination algorithms were developed and demonstrated on an earlier 96-processor embodiment of this architecture. For SBR, the algorithm set is being extended to include synthetic aperture radar (SAR) image formation and moving target indication (MTI) algorithms.
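The scale figures quoted in the abstract above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the published numbers. The even spread of emulated processors across the FPGAs is an assumption made here for illustration; the abstract does not state how they are mapped.

```python
# Figures quoted in the abstract.
EMULATION_CLOCK_MHZ = 20
TARGET_CLOCK_MHZ = 250
TARGET_PROCESSORS = 2560
EMULATED_FRACTION = 0.15  # "close to 15%"
NODES = 48
FPGAS_PER_NODE = 2        # one Wildstar II board with 2 FPGAs per node


def emulation_scale():
    """Return (clock ratio, emulated processor count, processors per FPGA)."""
    clock_ratio = EMULATION_CLOCK_MHZ / TARGET_CLOCK_MHZ
    emulated = round(EMULATED_FRACTION * TARGET_PROCESSORS)
    fpga_count = NODES * FPGAS_PER_NODE
    # Assumption: emulated processors are spread evenly over all FPGAs.
    per_fpga = emulated / fpga_count
    return clock_ratio, emulated, per_fpga


if __name__ == "__main__":
    ratio, emulated, per_fpga = emulation_scale()
    print(f"clock ratio: {ratio:.0%}")
    print(f"emulated processors: {emulated}")
    print(f"processors per FPGA (if evenly spread): {per_fpga}")
```

The clock ratio reproduces the 8% stated in the abstract, and 48 nodes with 2 FPGAs each give the 96 FPGAs it mentions.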
TL;DR: This work ran the Multi-zone versions of the NAS Parallel Benchmarks on a cluster composed of two SGI Origin 2000 servers, and an Intel SMP Xeon server connected by Gigabit Ethernet, and reported on the results and their implications for running parallel applications on heterogeneous clusters.
Abstract: We investigate the feasibility of running parallel applications on heterogeneous clusters. The motivation for doing so is twofold. First, it is practical to be able to pull together existing machines to run a job that is too big for any one of them, especially if such jobs are run rarely. Second, in the event of an emergency, where a very large problem must be solved in a few days, it may not be feasible to purchase and install a new machine in time, and any existing machines will have to be brought to bear on the problem. We ran the Multi-zone versions of the NAS Parallel Benchmarks (NPB) on a cluster composed of two SGI Origin 2000 servers, and an Intel SMP Xeon server connected by Gigabit Ethernet. We report on the results and their implications for running parallel applications on heterogeneous clusters.
TL;DR: This paper presents performance comparisons of several scientific applications on the HPC Benchmark center Linux® clusters, selecting Stream, a benchmark that measures memory bandwidth; PALLAS, a Message Passing Interface (MPI) benchmark to measure single, parallel and collective communications; and LINPACK, a dense linear algebra equations solver.
Abstract: This paper presents performance comparisons of several scientific applications on the HPC Benchmark center Linux® clusters. The applications selected are Stream, a benchmark that measures memory bandwidth; PALLAS, a Message Passing Interface (MPI) benchmark to measure single, parallel and collective communications; LINPACK, a dense linear algebra equations solver; HIMENO, a benchmark kernel that appears in a linear solver of Pressure Poisson included in an incompressible Navier-Stokes solver; and Numerical Aerodynamic Simulation (NAS) benchmarks. We present the results obtained on the 2.8 GHz Xeon clusters, 3.06 GHz Xeon clusters, and on 2.0 GHz AMD Opteron™ clusters.
TL;DR: In this article, the authors propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on clusters of PCs, and achieve performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.
Abstract: In this paper, we propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on clusters of PCs. We vectorized FFT kernels using Intel's Streaming SIMD Extensions 2 (SSE2) instructions. We show that a combination of the vectorization and block three-dimensional FFT algorithm improves performance effectively. Performance results of three-dimensional FFTs on a dual Xeon 2.8 GHz PC SMP cluster are reported. We successfully achieved performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.
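As background for the abstract above, a three-dimensional FFT can be computed by applying one-dimensional FFTs along each axis in turn. The sketch below shows only that plain decomposition in pure Python. It does not reproduce the paper's blocking scheme, its SSE2 vectorization, or its distribution across cluster nodes, and the function names are hypothetical.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft3d(a):
    """3D FFT of a cubic nested list a[x][y][z] via 1D FFTs along z, y, x."""
    n = len(a)
    # z axis: each innermost list is already contiguous (builds a new cube,
    # so the caller's input is left untouched).
    a = [[fft1d(a[x][y]) for y in range(n)] for x in range(n)]
    # y axis: gather a column, transform it, scatter it back.
    for x in range(n):
        for z in range(n):
            col = fft1d([a[x][y][z] for y in range(n)])
            for y in range(n):
                a[x][y][z] = col[y]
    # x axis: same gather, transform, scatter pattern.
    for y in range(n):
        for z in range(n):
            col = fft1d([a[x][y][z] for x in range(n)])
            for x in range(n):
                a[x][y][z] = col[x]
    return a


def dft3d(a):
    """Direct O(n^6) 3D DFT, intended only for checking fft3d on tiny cubes."""
    n = len(a)

    def w(p):
        return cmath.exp(-2j * cmath.pi * p / n)

    return [[[sum(a[x][y][z] * w(x * u + y * v + z * s)
                  for x in range(n) for y in range(n) for z in range(n))
              for s in range(n)] for v in range(n)] for u in range(n)]
```

The gather and scatter steps along the y and x axes access memory with large strides, which is the kind of access pattern that cache blocking is generally used to improve.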
TL;DR: From some experimental results, it is found that the proposed system can display four-view VGA images with a full color of 16bits and a frame rate of 15fps in real-time.
TL;DR: This work investigates the suitability of using an FPGA coprocessor to speed up the track-finding algorithm for the ATLAS Level 2 trigger using the TRT-LUT algorithm, and finds that a hybrid FPGA/CPU realization gives a speed-up by a factor of ~2 in comparison with a CPU-only implementation.
Abstract: This work investigates the suitability of using an FPGA coprocessor to speed up the track-finding algorithm for the ATLAS Level 2 trigger. Two realizations of the same algorithm have been compared: a C++ realization tested on a computer equipped with a dual Xeon 2.4 GHz CPU, a 64-bit/66 MHz PCI bus, and 1024 Mb of DDR RAM main memory, running Red Hat Linux 7.1; and a hybrid C++ and VHDL realization tested on the same PC additionally equipped with an MPRACE board (an FPGA coprocessor board based on a Xilinx Virtex-2 FPGA, made as a 64-bit/66 MHz PCI card and developed at the University of Mannheim). In the TRT-LUT algorithm, the most time-consuming parts were implemented in VHDL and run on the FPGA coprocessor. The hybrid FPGA/CPU realization gives a speed-up by a factor of ~2 in comparison with the CPU-only implementation.
TL;DR: The chapter presents an approach, in which the feature vectors are calculated after the user has sent his query to the video database, which follows a suggestion of KAO for image databases.
Abstract: This chapter discusses efficient parallel search in video databases with dynamic feature extraction. Video databases with dynamic feature extraction offer the possibility for powerful and flexible queries. However, the retrieval is very time consuming. Efficient algorithms are required to manage the search for objects in one single video, and parallel methods are necessary to cope with the large number of videos in a database. The chapter presents an approach in which the feature vectors are calculated after the user has sent his query to the video database. This follows a suggestion of KAO for image databases. Objects or persons can be found because the template matching algorithm is performed on each image in the database. Obviously such systems will need a lot of computational power, which at present cannot be realized without parallelism and efficient algorithms for searching in digital videos. Both are presented in the following sections after a short description of the video retrieval process. The efficiency of different levels of parallelism and of different parallel architectures is compared. Two different architectures are compared, a dual Xeon 2.2 GHz and an Alpha workstation with 4 PE at 600 MHz. The Xeon system is tested with and without hyper-threading. The hyper-threading technology allows some functional units to be used in parallel by presenting an additional PE to the software.
TL;DR: It is demonstrated that the optimized communication operations can be used to reduce the execution time of data parallel implementations of complex application programs without any other reordering of the computation and communication structure.
Abstract: Many parallel applications from scientific computing use MPI global communication operations to collect or distribute data. Since the execution times of these communication operations increase with the number of participating processors, scalability problems might occur. In this article, we show for different MPI implementations how the execution time of global communication operations can be significantly improved by a restructuring based on orthogonal processor structures. As platform, we consider a dual Xeon cluster, a Beowulf cluster and a Cray T3E with different MPI implementations. We show that the execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on the dual Xeon cluster and the Beowulf cluster. But also on a Cray T3E a significant improvement can be obtained by a careful selection of the processor groups. We demonstrate that the optimized communication operations can be used to reduce the execution time of data parallel implementations of complex application programs without any other reordering of the computation and communication structure.
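One common way to restructure a broadcast over orthogonal processor groups is to arrange the ranks in a two-dimensional grid, broadcast along the root's row first, and then broadcast down every column. The sketch below simulates that schedule in plain Python without MPI. The row-major grid layout, the function names, and the flat list of sends inside each group are assumptions made for illustration; the abstract does not specify the exact grouping the authors use. In an MPI program the row and column groups would typically be sub-communicators (for example created with MPI_Comm_split), each running its own MPI_Bcast.

```python
def grid_broadcast_schedule(root, rows, cols):
    """Return two phases of (source, destination) rank pairs.

    Ranks are numbered row-major on a rows x cols grid.
    Phase 1: the root sends to every other rank in its own row.
    Phase 2: every rank in the root's row sends down its own column.
    """
    root_row, _ = divmod(root, cols)
    row_ranks = [root_row * cols + c for c in range(cols)]
    phase1 = [(root, r) for r in row_ranks if r != root]
    phase2 = [
        (src, row * cols + (src % cols))
        for src in row_ranks
        for row in range(rows)
        if row != root_row
    ]
    return phase1, phase2


def simulate(root, rows, cols):
    """Replay the schedule and return the set of ranks that hold the data."""
    has_data = {root}
    for phase in grid_broadcast_schedule(root, rows, cols):
        # Every sender must already hold the data when its phase starts.
        assert all(src in has_data for src, _ in phase)
        has_data |= {dst for _, dst in phase}
    return has_data


if __name__ == "__main__":
    phase1, phase2 = grid_broadcast_schedule(root=5, rows=4, cols=4)
    print("row phase:", phase1)
    print("column phase:", phase2)
    print("all ranks reached:", simulate(5, 4, 4) == set(range(16)))
```

On a 4 x 4 grid, each group broadcast involves 4 ranks instead of 16, and the column broadcasts in the second phase are independent of one another, so they can proceed concurrently.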
TL;DR: The objective of this paper is to provide results, analyses and conclusions from research into the mechanism for managing threads of execution and its effect on overall system performance.
Abstract: This paper presents research into the new Hyper-Threading Technology (HT) being introduced with the Intel Xeon family of processors. Many applications expect the hardware capabilities of the computer configuration to be exploited as efficiently as possible. That depends on the ability of the operating system to manage the concurrent execution of multiple instruction streams. The objective of this paper is to provide results, analyses and conclusions from research into the mechanism for managing threads of execution and its effect on overall system performance.
TL;DR: In this article, the performance characteristics of three Dell PowerEdge (PE) servers that are based on three different Intel processor technologies are compared using the Basic Local Alignment Search Tool (BLAST) workload.
Abstract: High-performance computing has increasingly adopted clustered Intel architecture-based servers. This adoption has been largely fueled by a number of technological enhancements in Intel architecture-based servers, primarily substantial improvements in Intel processor and memory technology over the past few years. This paper compares the performance characteristics of three Dell PowerEdge (PE) servers that are based on three different Intel processor technologies: the PE1750, an IA-32 based Xeon system; the PE1850, which uses the new 90 nm technology Xeon processor at faster frequencies; and the PE3250, an Itanium2 based system. BLAST (Basic Local Alignment Search Tool), a high-performance computing application used in the field of biological research, is used as the workload for this study. The aim is to understand the performance impact of the different features associated with each processor/platform technology when running the BLAST workload.
TL;DR: Many new capabilities and means of access for GeoFEST are now supported, including MPI-based cluster parallel computing using automatic PYRAMID/Parmetis-based mesh partitioning, automatic mesh generation for layered media with rectangular faults, and results visualization that is integrated with remote sensing data.
Abstract: GeoFEST (the Geophysical Finite Element Simulation Tool) simulates stress evolution, fault slip and plastic/elastic processes in realistic materials, and so is suitable for earthquake cycle studies in regions such as Southern California. Many new capabilities and means of access for GeoFEST are now supported. New abilities include MPI-based cluster parallel computing using automatic PYRAMID/Parmetis-based mesh partitioning, automatic mesh generation for layered media with rectangular faults, and results visualization that is integrated with remote sensing data. The parallel GeoFEST application has been successfully run on over a half-dozen computers, including Intel Xeon clusters, Itanium II and Altix machines, and the Apple G5 cluster. It is not separately optimized for different machines, but relies on good domain partitioning for load balance and low communication, and on careful writing of the parallel diagonally preconditioned conjugate gradient solver to keep communication overhead low. Demonstrated thousand-step solutions for over a million finite elements on 64 processors require under three hours, and scaling tests show high efficiency when using more than (order of) 4000 elements per processor. The source code and documentation for GeoFEST are available at no cost from Open Channel Foundation. In addition, GeoFEST may be used through a browser-based portal environment available to approved users. That environment includes semi-automated geometry creation and mesh generation tools, GeoFEST, and RIVA-based visualization tools that include the ability to generate a flyover animation showing deformations and topography. Work is in progress to support simulation of a region with several faults using 16 million elements, using a strain energy metric to adapt the mesh to faithfully represent the solution in a region of widely varying strain.
TL;DR: The cluster's performance for parallel computations was measured with a three-dimensional hydrodynamic code and showed quite good scalability as the number of computational cells increases.
Abstract: A high-performance computing cluster for astronomical computations has been built at Korea Astronomy Observatory. The 64-node cluster, interconnected with Gigabit Ethernet, is composed of 128 Intel Xeon processors, 160 GB of memory, 6 TB of global storage space, and an LTO (Linear Tape-Open) tape library. The cluster was installed and has been managed with the Open Source Cluster Application Resource (OSCAR) framework. Its performance for parallel computations was measured with a three-dimensional hydrodynamic code and showed quite good scalability as the number of computational cells increases. The cluster has already been utilized for several computational research projects, some of which resulted in a few publications, even though it has been in full operation for less than one year. As a major resource of the testbed, the cluster has also been used for Grid computations.
TL;DR: The technologies for realizing such a high-performance cluster system are described and the implementation of an example large-scale cluster and its performance are described.
Abstract: Recently, Linux cluster systems have been replacing conventional vector processors in the area of high-performance computing. A typical Linux cluster system consists of high-performance commodity CPUs such as the Intel Xeon processor and commodity networks such as Gigabit Ethernet and Myrinet. Therefore, Linux cluster systems can directly benefit from the drastic performance improvement of these commodity components. It seems easy to build a large-scale Linux cluster by connecting these components, and the theoretical peak performance may be high. However, gaining stable operation and the expected performance from a large-scale cluster needs dedicated technologies from the hardware level to the software level. In February 2004, Fujitsu shipped one of the world's largest clusters to a customer. This system is currently the fastest Linux cluster in Japan and the seventh fastest supercomputer in the world. In this paper, we describe the technologies for realizing such a high-performance cluster system and then describe the implementation of an example large-scale cluster and its performance.