TL;DR: This paper describes co-operating escape, thread structure, and delay set analyses that enable high performance for sequentially consistent programs.
Abstract: The rise of Java, C#, and other explicitly parallel languages has increased the importance of compiling for different software memory models. This paper describes co-operating escape, thread structure, and delay set analyses that enable high performance for sequentially consistent programs. We compare the performance of a set of Java programs compiled for sequential consistency (SC) with the performance of the same programs compiled for weak consistency. For SC, we observe a slowdown of 10% on average for an architecture based on the Intel Xeon processor, and 26% on average for an architecture based on the IBM Power3.
TL;DR: The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.
Abstract: This paper presents a method for producing hardware designs for elliptic curve cryptography (ECC) systems over the finite field GF(2^m), using the optimal normal basis for the representation of numbers. Our field multiplier design is based on a parallel architecture containing multiple m-bit serial multipliers; by changing the number of such serial multipliers, designers can obtain implementations with different tradeoffs in speed, size and level of security. A design generator has been developed which can automatically produce a customised ECC hardware design that meets user-defined requirements. To facilitate performance characterization, we have developed a parametric model for estimating the number of cycles for our generic ECC architecture. The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.
TL;DR: The atomistic molecular dynamics program YASP has been parallelized for shared-memory computer architectures; most of the sequential FORTRAN code was kept; parallel constructs were inserted as compiler directives using the OpenMP standard.
Abstract: The atomistic molecular dynamics program YASP has been parallelized for shared-memory computer architectures. Parallelization was restricted to the most CPU-time-consuming parts: neighbor-list construction, calculation of nonbonded, angle and dihedral forces, and constraints. Most of the sequential FORTRAN code was kept; parallel constructs were inserted as compiler directives using the OpenMP standard. Only in the case of the neighbor list did the data structure have to be changed. The parallel code achieves a useful speedup over the sequential version for systems of several thousand atoms and above. On an IBM Regatta p690+, the throughput increases with the number of processors up to a maximum of 12−16 processors depending on the characteristics of the simulated systems. On dual-processor Xeon systems, the speedup is about 1.7.
TL;DR: Adaptive execution techniques to find the optimal execution mode for SMT multiprocessor architectures are presented and code is, on average, about 2 and 18 times faster than the original code executed on 4 and 8 logical processors, respectively.
Abstract: In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize performance in an SMT multiprocessor, finding the optimal number of threads is important. This paper presents adaptive execution techniques to find the optimal execution mode for SMT multiprocessor architectures. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the application. Using 10 standard numerical applications and running them with our techniques on an Intel 4-processor Hyper-Threading Xeon SMP with 8 logical processors, our code is, on average, about 2 and 18 times faster than the original code executed on 4 and 8 logical processors, respectively.
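The idea of choosing a thread count from run-time feedback can be illustrated with a minimal sketch. It is not the paper's preprocessor-generated code: the function names, the candidate list, and the synthetic cost model (`fake_loop_time`, standing in for real timing measurements) are all assumptions made for illustration.

```python
def pick_thread_count(run_loop, candidates):
    """Return the candidate thread count with the lowest measured time.

    run_loop(n) is expected to run one trial of the parallel loop with
    n threads and return the elapsed time in seconds.
    """
    timings = {n: run_loop(n) for n in candidates}
    best = min(timings, key=timings.get)
    return best, timings


def fake_loop_time(n, work=8.0, overhead=0.5):
    # HYPOTHETICAL cost model: ideal work split plus a per-thread
    # interference/overhead term. A real system would time the loop.
    return work / n + overhead * (n - 1)


best, timings = pick_thread_count(fake_loop_time, [1, 2, 4, 8])
print(best, timings)
```

With this synthetic model, using all 8 logical processors is slower than using 4, which mirrors the abstract's point that more threads are not always beneficial.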
TL;DR: It is found that while the general trends for microarchitectural behavior agree with real hardware, differences in sizing assumptions and performance models yield much more optimistic benefits for SMT than the authors observe.
Abstract: This paper examines the performance of simultaneous multithreading (SMT) for network servers using actual hardware, multiple network server applications, and several workloads. Using three versions of the Intel Xeon processor with Hyper-Threading, we perform macroscopic analysis as well as microarchitectural measurements to understand the origins of the performance bottlenecks for SMT processors in these environments. The results of our evaluation suggest that the current SMT support in the Xeon is application and workload sensitive, and may not yield significant benefits for network servers. In general, we find that enabling SMT on real hardware usually produces only slight performance gains, and can sometimes lead to performance loss. In the uniprocessor case, previous studies appear to have neglected the OS overhead in switching from a uniprocessor kernel to an SMT-enabled kernel. The performance loss associated with such support is comparable to the gains provided by SMT. In the 2-way multiprocessor case, the higher number of memory references from SMT often causes the memory system to become the bottleneck, offsetting any processor utilization gains. This effect is compounded by the growing gap between processor speeds and memory latency. In trying to understand the large gains shown by simulation studies, we find that while the general trends for microarchitectural behavior agree with real hardware, differences in sizing assumptions and performance models yield much more optimistic benefits for SMT than we observe.
TL;DR: A methodology for evaluating different systems with respect to their maximum capture rate is described and preliminary results of comparing Intel Xeon against AMD Opteron based systems running either Linux or FreeBSD are presented.
Abstract: Using commodity systems for capturing packets in Gigabit environments is a challenging task. This applies especially to a full capture of packet headers along with their data. Today's commodity PC systems come in different flavors in terms of processor hardware as well as in terms of operating systems. In this paper we describe a methodology for evaluating different systems with respect to their maximum capture rate and present our preliminary results of comparing Intel Xeon against AMD Opteron based systems running either Linux or FreeBSD.
TL;DR: A hardware architecture is presented that can be used to accelerate a number of linear and elastic image registration algorithms that use mutual information as an image similarity measure and achieved speedups of 30 for linear registration and 100 for elastic registration against a 3.2 GHz Pentium III Xeon workstation.
Abstract: Real-time image registration is potentially an enabling technology for the effective and efficient use of many image-guided diagnostic and treatment procedures relying on multimodality image fusion or serial image comparison. Mutual information is currently the best-known image similarity measure for intensity-based multimodality image registration. The calculation of mutual information is memory intensive and does not benefit from cache-based memory architectures in standard software implementations, i.e., the calculation incurs a large number of cache misses. Previous attempts to perform image registration in real time focused on parallel supercomputer implementations, which achieved significant speedups using large, expensive supercomputers that are impractical for clinical deployment. We present a hardware architecture that can be used to accelerate a number of linear and elastic image registration algorithms that use mutual information as an image similarity measure. A proof-of-concept implementation of the architecture achieved speedups of 30 for linear registration and 100 for elastic registration against a 3.2 GHz Pentium III Xeon workstation. Further speedup can be achieved by using several modules in parallel.
TL;DR: The relative performance of Kraken and the new 2000-processor Intel XEON EM64T cluster (Jvn) at ARL is discussed, pointing out differences which amount primarily to inter-node network performance.
Abstract: Shear turbulence induced by the Kelvin-Helmholtz (KH) instability in a stratified fluid is simulated in support of the Air Force High-Energy-Laser Joint Tactical Office (HEL-JTO) project using the new 3000-processor NAVO IBM Power 4+ system (Kraken). The results are used to 1) compare with and improve a dynamic LES method we have developed, and 2) provide the high-resolution simulation component of a new subgrid-scale (SGS) model we have developed for optical turbulence forecasting. We also discuss the relative performance of Kraken and the new 2000-processor Intel XEON EM64T cluster (Jvn) at ARL, pointing out differences which amount primarily to inter-node network performance. We suggest "% of wall time spent communicating" as a standard DoD benchmark criterion for very large distributed systems, as other more conventional criteria do not necessarily represent this important system and algorithm metric.
TL;DR: This paper deploys PVFS2, GPFS, Lustre, and TerraFS for shared deployment across multiple Linux clusters running with different hardware architectures and operating systems and shows that all of the parallel filesystems outperform a legacy NFS system but with different levels of complexity.
Abstract: In this paper, we examine parallel filesystems for shared deployment across multiple Linux clusters running with different hardware architectures and operating systems. Specifically, we deploy PVFS2, GPFS, Lustre, and TerraFS in our test environment containing Intel Xeon, Intel x86-64, and IBM PPC970 systems. We comment on the feature sets of each filesystem, describe our implementation and configuration experiences, and present initial performance benchmark results. Our analysis shows that all of the parallel filesystems outperform a legacy NFS system but with different levels of complexity. Each of the filesystems demonstrates the best performance under certain conditions. Three of the systems – GPFS, Lustre and TerraFS – depend on specific kernel versions that increase administrative complexity and can reduce interoperability.
TL;DR: A new reactive spin-lock algorithm that is completely self-tuning, which means that neither experimentally tuned parameters nor probability distributions of inputs are needed, and is built on a competitive online algorithm.
Abstract: Reactive spin-lock algorithms that can automatically adapt to contention variation on the lock have received great attention in the field of multiprocessor synchronization, since they can help applications achieve good performance in all possible contention conditions. However, in existing reactive spin-locks the reaction relies on (i) some fixed experimentally tuned thresholds, which may frequently become inappropriate in dynamic environments like multiprogramming/multiprocessor systems, or (ii) known probability distributions of inputs. This paper presents a new reactive spin-lock algorithm that is completely self-tuning, which means that neither experimentally tuned parameters nor probability distributions of inputs are needed. The new spin-lock is built on a competitive online algorithm. Our experiments, which use the Spark98 kernels and the SPLASH-2 applications as application benchmarks, on a multiprocessor machine SGI Origin2000 and on an Intel Xeon workstation show that the new self-tuning spin-lock helps applications with different characteristics achieve good performance in a wide range of contention levels.
TL;DR: In this article, the authors employ two efficient parallel approaches to reduce a model arising from a semi-discretization of a controlled heat transfer process for optimal cooling of a steel profile.
Abstract: We employ two efficient parallel approaches to reduce a model arising from a semi-discretization of a controlled heat transfer process for optimal cooling of a steel profile. Both algorithms are based on balanced truncation but differ in the numerical method that is used to solve two dual generalized Lyapunov equations, which is the major computational task. Experimental results on a cluster of Intel Xeon processors compare the efficacy of the parallel model reduction algorithms.
TL;DR: The implementation of a stochastic biochemical simulation algorithm called Next Reaction Method for Virtex-II PRO is shown, and the FPGA-based simulator outperforms the software implementation on Xeon 2.40 GHz by 17.1 times.
Abstract: Biochemical simulations including whole-cell models require high performance computing systems. Reconfigurable systems are expected to be an alternative solution for conventional methods by PC clusters or vector computers. This paper shows the implementation of a stochastic biochemical simulation algorithm called Next Reaction Method for Virtex-II PRO. As the result of benchmarking with a small reaction system, the FPGA-based simulator outperforms the software implementation on Xeon 2.40 GHz by 17.1 times.
TL;DR: RAxML-OMP is an efficient OpenMP parallelization for Symmetric Multi-Processing machines (SMPs) based on the sequential program RAxML-V (Randomized Axelerated Maximum Likelihood).
Abstract: Inference of phylogenetic trees comprising hundreds or even thousands of organisms based on the Maximum Likelihood (ML) method is computationally extremely intensive. In order to accelerate computations we implemented RAxML-OMP, an efficient OpenMP-parallelization for Symmetric Multi-Processing machines (SMPs) based on the sequential program RAxML-V (Randomized Axelerated Maximum Likelihood). RAxML-V is a program for inference of evolutionary trees based upon the ML method and incorporates several advanced search algorithms like fast hill-climbing and simulated annealing. We assess performance of RAxML-OMP on the widely used Intel Xeon, Intel Itanium, and AMD Opteron architectures. RAxML-OMP scales particularly well on the AMD Opteron architecture and achieves even super-linear speedups for large datasets (with a length > 5.000 base pairs) due to improved cache-efficiency and data locality. RAxML-OMP is freely available as open source code.
TL;DR: This work focuses on analyzing the underlying architectural characteristics of iSCSI packet processing and quantifying its compute/memory requirements, and does a detailed analysis of the architectural characteristics in terms of path length, cycles spent per instruction, cache misses at all levels and branch mispredictions.
Abstract: The iSCSI protocol is a key building block for enabling IP-based network storage. High performance iSCSI implementations that can support multi-gigabit storage traffic throughput at low latencies are important in facilitating the widespread deployment of this technology. Motivated by this, our work presented in this paper focuses on analyzing the underlying architectural characteristics of iSCSI packet processing and quantifying its compute/memory requirements. Our analysis and characterization methodology is based on in-depth measurement experiments of iSCSI packet processing performance on an Intel® Xeon™ processor running the Red Hat Linux operating system. Our measurement data shows the achievable throughput and consumed CPU utilization at different disk I/O sizes. We also study the overhead of integrity checks on iSCSI performance by enabling CRC computation. To understand the source of the iSCSI processing costs, we then do a detailed analysis of the architectural characteristics in terms of path length, cycles spent per instruction, cache misses at all levels and branch mispredictions.
TL;DR: The main goal of this paper is to present new capabilities that have been added to a simulation code, NEMO3-D, to make it one of the premier simulation tools for design and analysis of realistically-sized nanoelectronic devices, and therefore to make it a valid tool for the computational nanotechnology community.
Abstract: The rapid progress in nanofabrication technologies has led to the emergence of new classes of nanodevices, in which the quantum nature of charge carriers dominates the device properties and performance. The device sizes have already reached the level of hundreds down to even tens of nanometers, where the atomistic granularity of constituent materials cannot be neglected. This has led to new challenges in Computational Nanotechnology. The main goal of this paper is to present new capabilities that have been added to a simulation code, NEMO3-D, to make it one of the premier simulation tools for design and analysis of realistically-sized nanoelectronic devices, and therefore to make it a valid tool for the computational nanotechnology community. Memory requirements for strain and electronic structure calculations are described. Computational performance experiments are conducted on several cluster (Intel Xeon, Apple G4) and shared-memory (IBM Regatta) architectures. The simulation of electronic structure in a 21-million system is demonstrated which corresponds to a complex Hermitian matrix of order 4 x 10.
TL;DR: A comparative characterization of these two workloads based on detailed measurements on an Intel® Xeon™ processor-based commercial server is presented and recommendations to JVM developers and hardware designers are provided.
Abstract: SPEC has released SPECjbb2005, a new server-side Java benchmark which supersedes SPECjbb2000. SPECjbb2005 is a substantial update to SPECjbb2000, intended to make the workload more representative based on current Java development practices. SPECjbb2000 has been in existence for about five years and it has been a valuable tool for optimizing the performance of commercial JVMs as well as supporting research activities. Since SPECjbb2005 replaces SPECjbb2000, it is important to understand the key differences between the two, as well as implications for JVM and hardware designers. In this paper, we present a comparative characterization of these two workloads based on detailed measurements on an Intel® Xeon™ processor-based commercial server. First, we describe key functional changes introduced in SPECjbb2005. Using low-intrusion application profiling tools we compare application execution profiles. Through JVM monitoring tools, we compare JVM behavior including JIT optimization and garbage collection. Using operating system monitoring tools we compare key system level metrics including CPU utilization. With the aid of processor performance monitoring events, we compare key architectural characteristics such as cache miss rates, memory/bus utilization, and branch behavior. Finally, we summarize key findings, provide recommendations to JVM developers and hardware designers, and suggest areas for future work.
TL;DR: An FPGA-based hardware coprocessor for the SPHINX Speech Recognition System implements a critical part of the Baum-Welch Algorithm to assist in the Gaussian probability calculations, currently with a peak performance of 264 MFlops.
Abstract: An FPGA-based hardware coprocessor for the SPHINX Speech Recognition System is presented. The coprocessor operates at 66MHz and implements a critical part of the Baum-Welch Algorithm to assist in the Gaussian probability calculations, currently with a peak performance of 264 MFlops. Results are presented in comparison with a Xeon 2.66GHz computer and a similar ASIC project, together with guidelines for future development.
TL;DR: The results show many limitations of these networks including the memory contention within a node as the number of communicating processors increased and the limitations of the network interface for communication between multiple processors of different nodes.
Abstract: We study the performance of high-speed interconnects using a set of communication microbenchmarks. The goal is to identify certain limiting factors and bottlenecks with these interconnects. Our microbenchmarks are based on dense communication patterns with different communicating partners and varying degrees of these partners. We tested our microbenchmarks on five platforms: an IBM system of 68-node 16-way Power3, interconnected by a SP switch2; another IBM system of 264-node 4-way Power PC 604e, interconnected by a SP switch; a Compaq cluster of 128-node 4-way ES40/EV6 7 processor, interconnected by a Quadrics interconnect; an Intel cluster of 16-node dual-CPU Xeon, interconnected by a Quadrics interconnect; and a cluster of 22-node Sun Ultra Sparc, interconnected by an Ethernet network. Our results show many limitations of these networks including the memory contention within a node as the number of communicating processors increased and the limitations of the network interface for communication between multiple processors of different nodes.
TL;DR: A large-scale cluster system with a peak speed of 14.3Tflops for lattice QCD at the Center for Computational Sciences, University of Tsukuba, as a successor to the current 0.6Tflops CP-PACS computer is described.
Abstract: We describe our plan to develop a large-scale cluster system with a peak speed of 14.3Tflops for lattice QCD at the Center for Computational Sciences, University of Tsukuba, as a successor to the current 0.6Tflops CP-PACS computer. The system consists of 2560 nodes connected by a 16x16x10 three-dimensional hyper crossbar network. Each node has a single low-voltage 2.8GHz Xeon processor and 2GBytes of memory with 6.4GBytes/sec bandwidth, and 160 GBytes of disk in RAID1 mode. The network link in each of the three directions is made of dual Gigabit Ethernet with the peak throughput of 250MByte/sec. Hence each node has an aggregate network bandwidth of 750MByte/sec. The system will run under Linux and SCore, and an extension of the PM driver is developed for the network. The system will be developed jointly with Hitachi Limited. The installation is scheduled in the first quarter of Japanese Fiscal 2006 (April-June 2006) and the start of operation is expected in July 2006.
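The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes 2 double-precision floating-point operations per clock cycle per processor, which is not stated in the abstract; the variable names are illustrative only.

```python
# Cross-check of the cluster figures quoted in the abstract.
nodes = 16 * 16 * 10             # dimensions of the 3-D hyper crossbar
clock_hz = 2.8e9                 # one 2.8 GHz Xeon per node
flops_per_cycle = 2              # ASSUMPTION: not given in the abstract

peak_tflops = nodes * clock_hz * flops_per_cycle / 1e12

link_mbytes_per_s = 250          # dual Gigabit Ethernet, per direction
node_bandwidth = 3 * link_mbytes_per_s   # three network directions

print(nodes, round(peak_tflops, 1), node_bandwidth)
```

Under that assumption the node count (2560), the peak speed (about 14.3 Tflops), and the aggregate per-node bandwidth (750 MByte/sec) are mutually consistent.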
TL;DR: Development continues on the embedded transport acceleration software prototype, which uses one of the Intel® Xeon™ processors in a multi-processor server as a packet processing engine that is closely tied to the server's core CPU and memory complex.
Abstract: Intel Labs has continued development of the embedded transport acceleration (ETA) software prototype that uses one of the Intel® Xeon™ processors in a multi-processor server as a packet processing engine (PPE) that is closely tied to the server's core CPU and memory complex. We have further developed the prototype to provide support for user-level, asynchronous interface for sockets. The direct user socket interface (DUSI) allows user-level applications to interface directly to the PPE using familiar socket commands and semantics. The prototype runs in an asymmetric multiprocessing mode, in that the PPE does not run as a general computing resource for the host operating system. We describe the prototype software architecture, the DUSI application interface, and detail our measurement and analysis of some micro-benchmarks. In particular, we measure throughput for transactions and end-to-end latency as the key metrics for the analysis.
TL;DR: A data-driven algorithmic structure on a standard PC was developed for a block-based motion compensated temporal filtering in real time and the cache capacity miss rate is reduced to less than 0.8%.
Abstract: A data-driven algorithmic structure on a standard PC was developed for a block-based motion compensated temporal filtering in real time. The major time limiting factor of the algorithm was identified as the irregular memory access mainly caused by the layered multi-resolution representation of the input frames. As a result, data is transferred from main memory to cache multiple times leading to memory-dominated critical paths in execution. In order to improve the cache utilization, the computations have been rearranged to process the complete signal on the cached subset of data. The input frames are now divided into super-lines, which are subsets of data containing the relevant information to calculate one line of motion vectors and to filter the corresponding image lines. Only when a set of data is no longer needed, either for motion vector analysis or for filtering the images themselves, is it replaced by data of different layers or lines. Due to these data-driven techniques the cache capacity miss rate is reduced to less than 0.8%. As a result, images are processed at a rate of more than 44 fps on a standard PC (Intel dual-processor Xeon, 1.8 GHz), as opposed to 1 fps in the standard implementation.
TL;DR: In every case PCI-X Infiniband provided significantly superior reconstruction performance at all cluster sizes, and in all cases but one extended the range of cluster sizes over which the impact of improving communication performance was observed.
Abstract: Assuming balanced computation, the scalability of an iterative reconstruction algorithm on a cluster computer will be determined by the tradeoff between shorter computation times on larger clusters versus longer communication times. To investigate the impact of improving communication performance, we assembled a cluster equipped with dual communication networks: Gigabit Ethernet and PCI-X Infiniband. The cluster consisted of 8 compute nodes and one fileserver node; each node had dual 3.6 GHz Intel Xeon processors and the two dual-ported communication networks. For image synchronization alone, PCI-X Infiniband was 3.8 times faster than Gigabit Ethernet. We benchmarked two parallel OSEM-3D reconstruction algorithms representing a range of image and sinogram sizes, for both a brain-only scanner (HRRT) and a whole-body scanner (HIREZ). In every case PCI-X Infiniband provided significantly superior reconstruction performance at all cluster sizes, and in all cases but one extended the range of cluster sizes over which we observed performance improvements.
TL;DR: A parallel implementation of the module network is proposed, a less time-consuming, learning algorithm based on the message-passing model, which groups computations by modules and then distributes them cyclically.
Abstract: As an extension of the Bayesian network, the module network is used in situations where there are many variables but only a small set of data available. However, using this network is still time-consuming. In this paper, the authors proposed a parallel implementation of the module network, a less time-consuming, learning algorithm based on the message-passing model. In order to solve the load-imbalance problem introduced by either result caching or intrinsic computation, a grouping strategy was proposed, which groups computations by modules and then distributes them cyclically. The algorithm was tested on eight 4-way Intel Xeon multiprocessors. Speedups of 29.26 on 32 processors have been observed. The result shows that our algorithm is effective.
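A cyclic distribution of grouped computations, and the parallel efficiency implied by the reported speedup, can be sketched as follows. How the authors form the groups is not described in the abstract, so the group names and helper functions here are hypothetical; the code only illustrates a plain round-robin assignment.

```python
def distribute_cyclically(groups, num_procs):
    """Assign group i to processor i % num_procs (round-robin)."""
    assignment = [[] for _ in range(num_procs)]
    for i, group in enumerate(groups):
        assignment[i % num_procs].append(group)
    return assignment


def parallel_efficiency(speedup, num_procs):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / num_procs


# HYPOTHETICAL groups, one per module.
modules = [f"module_{k}" for k in range(10)]
plan = distribute_cyclically(modules, 4)

# Figures from the abstract: speedup of 29.26 on 32 processors.
eff = parallel_efficiency(29.26, 32)
print(plan[0], round(eff, 3))
```

The reported speedup corresponds to a parallel efficiency of roughly 91%.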
TL;DR: The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks, and a discussion of general trends in the field of high performance computing.
Abstract: I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in IBM SP 3 and IBM SP 4 systems; 3. The Intel Itanium and Xeon, used in the SGI Altix systems and clusters respectively; 4. IBM System-on-a-Chip used in IBM BlueGene/L; 5. HP Alpha EV68 processor used in DOE ASCI Q cluster; 6. SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in NEC SX-6/7; 8. Power 4+ processor, which is used in Hitachi SR11000; 9. NEC proprietary processor, which is used in Earth Simulator. The IBM POWER5 and Red Storm Computing Systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on latest NAS Parallel Benchmark results (MPI, OpenMP/HPF, and hybrid MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing (quantum computing, DNA computing, cellular engineering, and neural networks).
TL;DR: A Monte Carlo method, combined with proxies to avoid excessive data processing, is employed to identify reservoir simulation models that best match the oilfield production history and these models are used to forecast future productions with uncertainty estimates of unprecedented precision.
Abstract: We have developed a parallel and distributed computing framework to solve an inverse problem, which involves massive data sets and is of great importance to petroleum industry. A Monte Carlo method, combined with proxies to avoid excessive data processing, is employed to identify reservoir simulation models that best match the oilfield production history. Subsequently, the selected models are used to forecast future productions with uncertainty estimates. The parallelization framework combines: (1) message passing for tightly coupled intra-simulation decomposition; and (2) scheduler/Grid remote procedure calls for model parameter sweeps. A preliminary numerical test has included 3,159 simulations on a 256-processor Intel Xeon cluster at the USC-CACS. The results provide uncertainty estimates of unprecedented precision.
TL;DR: This paper compares the performance of two servers that are typical of those used to build Linux clusters, both are two-way servers based on 64-bit versions of x86 processors and describes the architecture and performance of each server.
Abstract: The performance of Linux clusters used for High-Performance Computation (HPC) applications is affected by the performance of three important components of server architecture: the Arithmetic Logic Unit (ALU) or processor core, the memory, and the high-speed network used to interconnect the cluster servers or nodes. The behavior of these three subsystems is in turn affected by the choice of processor used in the server. In this paper we compare the performance of two servers that are typical of those used to build Linux clusters. Both are two-way servers based on 64-bit versions of x86 processors. The servers are each packaged in a 1U (1.75 inch high) rack-mounted chassis. The first server we describe is the IBM® eServer™ 326, based on the AMD Opteron™ processor. The second is the IBM eServer xSeries™ 336, based on the Intel® Xeon processor with Extended Memory 64 Technology (EM64T) enabled. Both are powerful servers designed and optimized to be used as the building blocks of a Linux cluster that may be as small as a few nodes or as large as several thousand nodes. We describe the architecture and performance of each server. We use results from the popular SPEC® CPU2000 and Linpack benchmarks to present different aspects of the performance of the processor core. We use results from the STREAM benchmark to present memory performance. Finally, we discuss how characteristics of the I/O slots affect the interconnect performance, whether the choice is Gigabit Ethernet, Myrinet, InfiniBand, or some other interconnect.
TL;DR: An advanced multi-objective genetic algorithm is implemented using the message passing interface (MPI) standard to distribute genetic algorithm computations over a cluster of 50 Intel Xeon processors to enable the optimization of large-scale infrastructure projects.
Abstract: This paper presents the development of an advanced Information Technology Framework for Optimizing Construction Utilization of Resources in Civil Infrastructure Systems, named IT-FOCUS. The main objectives of this framework are to: (1) develop robust optimization models for minimizing construction cost, and duration, while maximizing quality; and (2) formulate scalable methodologies for solving large-scale construction optimization problems. To this end, the present framework is implemented using an advanced multi-objective genetic algorithm that is capable of generating optimal trade-offs between construction cost, duration, and quality. To enable the optimization of large-scale infrastructure projects, the algorithm is parallelized using the manager-worker paradigm of parallel and distributed computing. The parallel implementation of the algorithm utilizes the message passing interface (MPI) standard to distribute genetic algorithm computations over a cluster of 50 Intel Xeon processors. A number of large-scale construction projects with sizes ranging from 180 to 720 activities were evaluated using the 50 processor cluster to evaluate the computational requirements for optimizing real-life construction projects. The results of this evaluation highlight the significant computational savings that can be achieved by the implemented parallel computing framework.
TL;DR: Testing the parallel performance of an Intel Xeon-based Linux PC cluster using a finite element code for direct numerical simulation (DNS) of incompressible fluid turbulence found that PC Linux clusters are thus affordable platforms, compared with more expensive supercomputers, to conduct large-scale scientific computing for fluid turbulence research.
Abstract: In this paper, we have tested the parallel performance of an Intel Xeon-based Linux PC cluster using a finite element code for direct numerical simulation (DNS) of incompressible fluid turbulence. The parallel performance of the PC cluster, which used up to 64 2.8 GHz processors, was evaluated by comparing three scales of DNS trial runs consisting of 3.3, 5.8, and 10.1 million elements. Subroutines of different natures were contrasted to investigate the scalability of the DNS code. For DNS calculation of sufficiently large scale, the subroutines showed reasonable parallel efficiency. Doubling the number of processors reduced the CPU time by about 40%. Of particular interest was the CPU time required by the two subroutines handling interprocessor communication that was fairly constant within the range of processors tested. PC Linux clusters are thus affordable platforms, compared with more expensive supercomputers, to conduct large-scale scientific computing for fluid turbulence research.
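The statement that doubling the processor count cuts CPU time by about 40% can be converted into an incremental speedup. The short calculation below uses only that figure from the abstract; the variable names are illustrative, and nothing here extrapolates beyond a single doubling.

```python
# Convert "CPU time reduced by about 40% per doubling" into a speedup.
time_reduction = 0.40                      # figure from the abstract
remaining_time = 1.0 - time_reduction      # fraction of time still needed

speedup_per_doubling = 1.0 / remaining_time
efficiency_per_doubling = speedup_per_doubling / 2.0   # ideal would be 2x

print(round(speedup_per_doubling, 2), round(efficiency_per_doubling, 2))
```

Each doubling therefore yields a speedup of about 1.67 rather than the ideal 2, or roughly 83% incremental efficiency.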
TL;DR: A highly optimized domain wall fermion inverter developed as part of the SciDAC lattice initiative achieves high cache reuse and performance in excess of 2 GFlops for out of L2 cache problem sizes on a GigE cluster with 2.66 GHz Xeon processors.
Abstract: A highly optimized domain wall fermion inverter has been developed as part of the SciDAC lattice initiative. By designing the code to minimize memory bus traffic, it achieves high cache reuse and performance in excess of 2 GFlops for out of L2 cache problem sizes on a GigE cluster with 2.66 GHz Xeon processors. The code uses the SciDAC QMP communication library.