TL;DR: To speed up the Smith-Waterman algorithm, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize the algorithm at the instruction level.
Abstract: Motivation: The only algorithm guaranteed to find the optimal local alignment is the Smith-Waterman algorithm. It is also one of the slowest due to the number of computations required for the search. To speed up the algorithm, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize the algorithm at the instruction level.
Results: A faster implementation of the Smith-Waterman algorithm is presented. This algorithm achieved a 2-8 times performance improvement over other SIMD-based Smith-Waterman implementations. On a 2.0 GHz Xeon Core 2 Duo processor, speeds of >3.0 billion cell updates/s were achieved.
Availability: http://farrar.michael.googlepages.com/Smith-waterman
Contact: farrar.michael@gmail.com
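For orientation, the recurrence that such SIMD implementations accelerate can be sketched in scalar form. This is a minimal Python reference sketch, not the SIMD method of the paper; the function name and the scoring parameters (match, mismatch, and a linear gap penalty) are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=2):
    """Best local alignment score of strings a and b (scalar reference).

    Assumed scoring: +match / mismatch per aligned pair, linear gap penalty.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]               # column 0 is always 0 for local alignment
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # One "cell update": diagonal, up, left, or restart at 0.
            h = max(0, prev[j - 1] + s, prev[j] - gap, cur[j - 1] - gap)
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GTTAC", "GTTGAC"))
```

Each pass through the inner loop is one cell update, the unit used in the throughput figure quoted above; the work is len(a) * len(b) cell updates per sequence pair.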
TL;DR: This paper discusses the design and high‐performance implementation of collective communications operations on distributed‐memory computer architectures and develops implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI.
Abstract: We discuss the design and high-performance implementation of collective communications operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Xeon/Pentium 4 (R) processor cluster are included.
TL;DR: This paper describes a dual-core 64-b Xeon MP processor implemented in a 65-nm eight-metal process that implements both sleep and shut-off leakage reduction modes and employs multiple voltage and clock domains to reduce power.
Abstract: This paper describes a dual-core 64-b Xeon MP processor implemented in a 65-nm eight-metal process. The 435-mm² die has 1.328-B transistors. Each core has two threads and a unified 1-MB L2 cache. The 16-MB shared, 16-way set-associative L3 cache implements both sleep and shut-off leakage reduction modes. Long channel transistors are used to reduce subthreshold leakage in cores and uncore (all portions of the die that are outside the cores) control logic. Multiple voltage and clock domains are employed to reduce power.
TL;DR: A reconfigurable accelerator for the SW algorithm is presented; a modified equation is proposed to improve the mapping efficiency of a processing element (PE), and a special floor plan is applied to a fine-grain parallel PE array and interface components to cut down their routing delay.
Abstract: Scanning bio-sequence databases and finding similarities among DNA and protein sequences is basic and important work in bioinformatics. For this task, the Needleman-Wunsch (NW) algorithm is a classical and precise tool, and the Smith-Waterman (SW) algorithm is more practical because it can find similarities between subsequences. Such algorithms have computational complexity proportional to the product of the lengths of the two sequences involved, hence processing time becomes insufferable given the exponential growth and sheer size of bio-sequence databases. To alleviate this serious problem, a reconfigurable accelerator for the SW algorithm is presented. In the accelerator, a modified equation is proposed to improve the mapping efficiency of a processing element (PE), and a special floor plan is applied to a fine-grain parallel PE array and interface components to cut down their routing delay. Based on these two techniques, the proposed accelerator can reach a frequency of 82 MHz in an Altera EP1S30 device. Experiments demonstrate that the accelerator provides a speedup of more than 330 compared to a standard desktop platform with a 2.8-GHz Xeon processor and 4 GB of memory, and a 50% improvement in peak performance over a transferred traditional implementation that does not use the two special techniques. Our implementation is also about 9% faster than the fastest implementation in a most recent family of SW algorithm accelerators.
TL;DR: Detailed analysis on the memory hierarchy performance and on the performance scalability between single and dual cores suggests that for the best performance and scalability, it is important to have fast cache-to-cache communication, large L2 or shared capacity, fast L2 to core latency, and fair cache resource sharing.
Abstract: As chip multiprocessor (CMP) has become the mainstream in processor architectures, Intel and AMD have introduced their dual-core processors to the PC market. In this paper, performance studies on an Intel Core 2 Duo, an Intel Pentium D and an AMD Athlon 64 X2 processor are reported. According to the design specifications, key derivations exist in the critical memory hierarchy architecture among these dual-core processors. In addition to the overall execution time and throughput measurement using both multiprogrammed and multi-threaded workloads, this paper provides detailed analysis on the memory hierarchy performance and on the performance scalability between single and dual cores. Our results indicate that for the best performance and scalability, it is important to have (1) fast cache-to-cache communication, (2) large L2 or shared capacity, (3) fast L2 to core latency, and (4) fair cache resource sharing. The three dual-core processors that we studied have shown benefits of some of these factors, but not all of them. Core 2 Duo has the best performance for most of the workloads because of its microarchitecture features such as shared L2 cache. Pentium D shows the worst performance in many aspects due to its technology-remap of Pentium 4.
TL;DR: Experimental results show that windowed stream joins can achieve high scalability by making efficient use of the extensive hardware parallelism provided by the Cell processor and significantly surpass the performance obtained from conventional high-end processors.
Abstract: Low-latency and high-throughput processing are key requirements of data stream management systems (DSMSs). Hence, multi-core processors that provide high aggregate processing capacity are ideal matches for executing costly DSMS operators. The recently developed Cell processor is a good example of a heterogeneous multi-core architecture and provides a powerful platform for executing data stream operators with high-performance. On the down side, exploiting the full potential of a multi-core processor like Cell is often challenging, mainly due to the heterogeneous nature of the processing elements, the software managed local memory at the co-processor side, and the unconventional programming model in general.
In this paper, we study the problem of scalable execution of windowed stream join operators on multi-core processors, and specifically on the Cell processor. By examining various aspects of join execution flow, we determine the right set of techniques to apply in order to minimize the sequential segments and maximize parallelism. Concretely, we show that basic windows coupled with low-overhead pointer-shifting techniques can be used to achieve efficient join window partitioning, column-oriented join window organization can be used to minimize scattered data transfers, delay-optimized double buffering can be used for effective pipelining, rate-aware batching can be used to balance join throughput and tuple delay, and finally SIMD (single-instruction multiple-data) optimized operator code can be used to exploit data parallelism. Our experimental results show that, following the design guidelines and implementation techniques outlined in this paper, windowed stream joins can achieve high scalability (linear in the number of co-processors) by making efficient use of the extensive hardware parallelism provided by the Cell processor (reaching data processing rates of 13 GB/sec) and significantly surpass the performance obtained from conventional high-end processors (supporting a combined input stream rate of 2000 tuples/sec using 15-minute windows and without dropping any tuples, resulting in an 8.3 times higher output rate compared to an SSE implementation on a dual 3.2 GHz Intel Xeon).
TL;DR: To the best of the authors' knowledge, this is the first parallel marking algorithm that completely avoids synchronization primitives, and it has better scalability than the work-stealing technique on the pseudojbb and GCOld server-kind Java benchmarks.
Abstract: This paper describes a scalable parallel marking technique for garbage collection that does not employ any synchronization operation. To achieve good scalability, two major design issues have to be resolved in a parallel marking algorithm, i.e., the overhead of synchronization operations and load balance. This paper presents task-pushing, a novel parallel marking algorithm where each thread proactively gives up its spare tasks to other threads. Inspired by the idea of communicating sequential processes (CSP), task-pushing arranges the computation into a process network, eliminating synchronization operations in the whole marking process. Load balance is achieved by dripping tasks from the thread-local mark-stack for other threads to execute. To the best of our knowledge, this is the first parallel marking algorithm that completely avoids synchronization primitives. We evaluated task-pushing in terms of queuing efficiency, load balancing strategy, synchronization overhead, and overall scalability. The results on a 16-way Intel Xeon machine showed that task-pushing has better scalability than the work-stealing technique with the pseudojbb and GCOld server-kind Java benchmarks.
TL;DR: This paper introduces a scalable FPGA implementation of a stochastic simulation algorithm (SSA) called the next reaction method, which aims to perform a data-driven multi-threading simulation.
Abstract: This paper introduces a scalable FPGA implementation of a stochastic simulation algorithm (SSA) called the next reaction method. Some hardware approaches to SSAs have obtained high throughput on reconfigurable devices such as FPGAs, but these works lacked scalability. The design in this work can accommodate the increasing size of target biochemical models and make use of the increasing capacity of FPGAs. An interconnection network between arithmetic circuits and multiple simulation circuits aims to perform a data-driven multi-threading simulation. Approximately 8 times speedup was obtained compared to an execution on a 2.80 GHz Xeon.
TL;DR: An OpenMP implementation capable of using large pages is designed and results show an improvement in performance of up to 25% for some applications and it also helps improve the scalability of these applications.
Abstract: Modern multi-core architectures have become popular because of the limitations of deep pipelines and heating and power concerns. Some of these multi-core architectures such as the Intel Xeon have the ability to run several threads on a single core. The OpenMP standard for compiler directive based shared memory programming allows the developer an easy path to writing multi-threaded programs and is a natural fit for multi-core architectures. The OpenMP standard uses loop parallelism as a basis for work division among multiple threads. These loops usually use arrays in their computation with different data distributions and access patterns. The performance of accesses to these arrays may be impacted by the underlying page size depending on the frequency and strides of these accesses. In this paper, we discuss the issues and potential benefits from using large pages for OpenMP applications. We design an OpenMP implementation capable of using large pages and evaluate the impact of using large page support available in most modern processors on the performance and scalability of parallel OpenMP applications. Results show an improvement in performance of up to 25% for some applications. It also helps improve the scalability of these applications.
TL;DR: This work shows how to make the bisection algorithm for eigenvalues of symmetric tridiagonal matrices (sstebz from LAPACK) run both fast and correctly on an ATI Radeon X1900 GPU.
Abstract: Graphical Processing Units (GPUs) potentially promise widespread and inexpensive high performance computation. However, architectural limitations (only some operations and memory access patterns can be performed quickly, partial support for IEEE floating point arithmetic) make it necessary to change existing algorithms to attain high performance and correctness. Here we show how to make the bisection algorithm for eigenvalues of symmetric tridiagonal matrices (sstebz from LAPACK) run both fast and correctly on an ATI Radeon X1900 GPU. Our fastest algorithm takes up to 156x less time than Intel's Math Kernel Library version of sstebz running on the CPU, but does so by doing many redundant floating point operations compared to the CPU version. We use an automatic tuning procedure analogous to ATLAS or PHiPAC to decide the optimal redundancy. Correctness despite partial IEEE floating point semantics required explicitly adding 0 in the inner loop. The problems and solutions discussed here are of interest on other GPU architectures.
1 Motivation and Objectives
Modern graphics processors (GPUs) are data parallel architectures that can run general-purpose computations in single precision (so far) at high computational rates. They are capable of achieving 110 GFLOPS in matrix-matrix multiplication [Segal and Peercy 2006] and show 30-40x speedups compared to the recent Intel Xeon processors in computationally intensive applications such as Black-Scholes option pricing [McCool et al. 2006] and gas dynamics solvers [Hagen et al. 2007]. It is tempting to exploit this computational power in solving other common numerical problems. In this work we consider an implementation of another widely used linear algebra routine: the bisection algorithm for finding the eigenvalues of symmetric tridiagonal matrices. A numerically robust, vectorized implementation of this algorithm in single precision is available in LAPACK’s sstebz routine [Anderson et al. 1999].
Our goal is to port the vectorized segments of the code to the GPU. In order to increase the utilization of the parallel resources, we use the Multi-section with Multiple Eigenvalues method used previously by Katagiri et al. [2006]. For the purpose of this study we restrict our attention to finding all eigenvalues of the matrix. The extension to finding a subset of the eigenvalues, as done in LAPACK’s sstebz routine, is straightforward.
2 The Bisection Algorithm
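As a rough illustration of the bisection idea for symmetric tridiagonal matrices, the following Python sketch counts eigenvalues below a shift with the standard Sturm-sequence recurrence and then bisects on that count. It is a plain double-precision CPU sketch, not the GPU or multi-section code described above; the function names, the zero-pivot replacement value, and the tolerance are assumptions made for the example.

```python
def count_below(diag, off, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix that are < x.

    diag: main diagonal (length n); off: off-diagonal (length n - 1).
    """
    count = 0
    d = 1.0
    for i, a in enumerate(diag):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        d = a - x - b2 / d
        if d == 0.0:
            d = -1e-300  # assumed tiny negative pivot to avoid dividing by zero
        if d < 0.0:
            count += 1
    return count


def kth_eigenvalue(diag, off, k, tol=1e-12):
    """k-th smallest eigenvalue (k is 0-based), found by bisection."""
    n = len(diag)
    # Gershgorin discs give an interval that contains every eigenvalue.
    radius = [(abs(off[i - 1]) if i > 0 else 0.0) +
              (abs(off[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(diag[i] - radius[i] for i in range(n))
    hi = max(diag[i] + radius[i] for i in range(n))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(diag, off, mid) > k:
            hi = mid   # eigenvalue k lies below mid
        else:
            lo = mid   # eigenvalue k lies at or above mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # [[2, -1], [-1, 2]] has eigenvalues 1 and 3.
    print(kth_eigenvalue([2.0, 2.0], [-1.0], 0))
    print(kth_eigenvalue([2.0, 2.0], [-1.0], 1))
```

Every evaluation of `count_below` is independent of the others, which is what makes the method attractive for data-parallel hardware: many shifts (or many eigenvalue indices) can be processed at once.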
TL;DR: In this paper, a two-way dual-core HyperThreaded (HT) Intel Xeon SMP server under single program and multi-program multithreaded workloads using the NAS OpenMP benchmark suite was evaluated.
Abstract: Hybrid chip multithreaded SMPs present new challenges as well as new opportunities to maximize performance. Our intention is to discover the optimal operating configuration of such systems for scientific applications and to identify the shared resources that might become a bottleneck to performance under the different hardware configurations. This knowledge will be useful to the research community in developing software techniques to improve the performance of shared memory programs on modern multi-core multiprocessors. In this paper, we study a two-way dual-core HyperThreaded (HT) Intel Xeon SMP server under single program and multi-program multithreaded workloads using the NAS OpenMP benchmark suite. Our performance results indicate that in the single-program case, the CMP-based SMP and CMT-based SMP configurations have the highest average speedup across all of the applications. The most efficient architecture is a single HT-enabled dual-core processor that is almost comparable to the performance of a 2-way dual-core HT-disabled system.
TL;DR: This paper studies a two-way dual-core Hyper-Threaded (HT) Intel Xeon SMP server under single program and multi-program multithreaded workloads using the NAS OpenMP benchmark suite, and results indicate that in the single-program case, the CMP-based SMP and CMT-based SMP configurations have the highest average speedup across all of the applications.
Abstract: Hybrid chip multithreaded SMPs present new challenges as well as new opportunities to maximize performance. Our intention is to discover the optimal operating configuration of such systems for scientific applications and to identify the shared resources that might become a bottleneck to performance under the different hardware configurations. This knowledge will be useful to the research community in developing software techniques to improve the performance of shared memory programs on modern multi-core multiprocessors. In this paper, we study a two-way dual-core Hyper-Threaded (HT) Intel Xeon SMP server under single program and multi-program multithreaded workloads using the NAS OpenMP benchmark suite. Our performance results indicate that in the single-program case, the CMP-based SMP and CMT-based SMP configurations have the highest average speedup across all of the applications. The most efficient architecture is a single HT-enabled dual-core processor that is almost comparable to the performance of a 2-way dual-core HT-disabled system.
TL;DR: In this article, an implementation of a parallel two-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on multi-core processors is proposed, where the combination of vectorization and the block two dimensional FFT algorithm is shown to effectively improve performance.
Abstract: In this paper, an implementation of a parallel two-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on multi-core processors is proposed. Combination of vectorization and the block two-dimensional FFT algorithm is shown to effectively improve performance. We vectorized FFT kernels using Intel's Streaming SIMD Extensions 3 (SSE3) instructions. The performance results for two-dimensional FFTs on multi-core processors are reported. We succeeded in obtaining a performance of over 2.7 GFLOPS on a dual-core Intel Xeon (2.8 GHz, two CPUs, four cores) and over 3.3 GFLOPS on an Intel Core2 Duo E6600 (2.4 GHz, one CPU, two cores) for a 2^10 × 2^10-point FFT.
TL;DR: The results show a significant improvement in dual-core Pentium M processor over Hyperthreaded Xeon processor for AON workload, which will not only provide insight to processor designers, but also help architects of AON devices to select from alternative processors with restrictions to use one or two physical CPUs due to space and power consumption limitations.
Abstract: There is a growing trend to insert application intelligence into network devices. Processors in this type of application-oriented networking (AON) device are required to handle both packet-level network I/O intensive operations and XML message-level CPU intensive operations. In this paper, we investigate the performance effect of dual processing via (1) hyperthreading, (2) uni-processor to dual-processor, and (3) single-core to dual-core, on both packet-level and XML message-level traffic. We analyze and cross-examine the dual processing effect from both high-level performance and processor microarchitectural perspectives. We employ on-chip performance counters to measure cycles per instruction, cache misses, bus utilization, and branch mispredictions for this work. Our results show a significant improvement in dual-core Pentium M processor over Hyperthreaded Xeon processor for AON workload. These results will not only provide insight to processor designers, but also help architects of AON devices to select from alternative processors with restrictions to use one or two physical CPUs due to space and power consumption limitations.
TL;DR: A new reactive spin-lock algorithm that is completely self-tuning, meaning that neither experimentally tuned parameters nor a probability distribution of inputs is needed, and that performs as well as the best of the hand-tuned spin-lock representatives over a wide range of contention levels.
TL;DR: This work presents a case study showing the development of a shared memory parallel code with minimum effort, and presents two implementations, one with minimal code modification and one without modification to the original SPICE3 program using Intel’s taskq construct.
Abstract: In this paper, we describe our experience of creating an OpenMP implementation of the SPICE3 circuit simulator program. Given the irregular patterns of access to dynamic data structures in the SPICE code, a parallelization using current standard OpenMP directives is impossible without major rewriting of the original program. The aim of this work is to present a case study showing the development of a shared memory parallel code with minimum effort. We present two implementations, one with minimal code modification and one without modification to the original SPICE3 program using Intel's taskq construct. We also discuss the results of the case study in terms of what future compiler tools may be needed to help OpenMP application developers with similar porting goals. Our experiments using SPICE3, based on SRAM model simulation, used code compiled with the SUN compiler on a SunFire V880 (UltraSPARC-III, 750 MHz) and with the Intel icc compiler on both a four-CPU IBM Itanium machine and a two-processor Intel Xeon machine. The results are promising.
TL;DR: In this paper, the authors ran several benchmark test cases using the same code on a variety of common hardware platforms and compilers, and the results indicate surprisingly little variation in performance given the considerable number of vendors, architectures and compilers.
Abstract: Timings have been conducted for several benchmark testcases using the same code on a variety of common hardware platforms and compilers. The results indicate surprisingly little variation in the performance given the considerable number of vendors, architectures and compilers. The largest differences amounted to less than 30% of run-times. For the single processor runs, an increase in cache size reduced run-times, though not dramatically (approximately 10%). Going from 32 bits to 64 bits (with the same clockspeed, cache size, memory and compiler) in most cases produced a gain of 10%, although in some cases no gain was recorded. Overall, the chip with the slowest clocktime (Intel It II at 1.50 GHz) achieved the best performance. In some cases, it beat the 64-bit, 3.40 GHz P4/Xeon machines by a considerable margin. For shared memory parallelism, the best scaling was achieved by the SGI Altix. This should not come as a surprise, as SGI’s CCNUMA technology has matured over the last decade. For the AMD Opteron, the SUN compiler exhibited the best scaling.
TL;DR: Becciani et al. present the new version of the tree N-body parallel code FLY, which runs on a PC Linux cluster using the one-sided communication paradigm of MPI-2, and report the performance obtained.
TL;DR: An additional text processing application was developed that used an FPGA to accelerate n-gram profiling for language classification and showed order of magnitude speedup over software by using co-processor technology to offload the CPU-intensive kernels.
Abstract: Critical data science applications requiring frequent access to storage perform poorly on today's computing architectures. This project addresses efficient computation of data-intensive problems in national security and basic science by exploring, advancing, and applying a new form of computing called storage-intensive supercomputing (SISC). Our goal is to enable applications that simply cannot run on current systems, and, for a broad range of data-intensive problems, to deliver an order of magnitude improvement in price/performance over today's data-intensive architectures. This technical report documents much of the work done under LDRD 07-ERD-063 Storage Intensive Supercomputing during the period 05/07-09/07. The following chapters describe: (1) a new file I/O monitoring tool iotrace developed to capture the dynamic I/O profiles of Linux processes; (2) an out-of-core graph benchmark for level-set expansion of scale-free graphs; (3) an entity extraction benchmark consisting of a pipeline of eight components; and (4) an image resampling benchmark drawn from the SWarp program in the LSST data processing pipeline. The performance of the graph and entity extraction benchmarks was measured in three different scenarios: data sets residing on the NFS file server and accessed over the network; data sets stored on local disk; and data sets stored on the Fusion I/O parallel NAND Flash array. The image resampling benchmark compared software-only performance with GPU-accelerated performance. In addition to the work reported here, an additional text processing application was developed that used an FPGA to accelerate n-gram profiling for language classification. The n-gram application will be presented at SC07 at the High Performance Reconfigurable Computing Technologies and Applications Workshop. The graph and entity extraction benchmarks were run on a Supermicro server housing the NAND Flash 40GB parallel disk array, the Fusion-io.
The Fusion system specs are as follows: SuperMicro X7DBE Xeon Dual Socket Blackford Server Motherboard; 2 Intel Xeon Dual-Core 2.66 GHz processors; 1 GB DDR2 PC2-5300 RAM (2 x 512); 80GB Hard Drive (Seagate SATA II Barracuda). The Fusion board is presently capable of 4X in a PCIe slot. The image resampling benchmark was run on a dual Xeon workstation with NVIDIA graphics card (see Chapter 5 for full specification). An XtremeData Opteron+FPGA was used for the language classification application. We observed that these benchmarks are not uniformly I/O intensive. The only benchmark that spent greater than 50% of its time in I/O was the graph algorithm when it accessed data files over NFS. When local disk was used, the graph benchmark spent at most 40% of its time in I/O. The other benchmarks were CPU dominated. The image resampling benchmark and language classification showed order of magnitude speedup over software by using co-processor technology to offload the CPU-intensive kernels. Our experiments to date suggest that emerging hardware technologies offer significant benefit to boosting the performance of data-intensive algorithms. Using GPU and FPGA co-processors, we were able to improve performance by more than an order of magnitude on the benchmark algorithms, eliminating the processor bottleneck of CPU-bound tasks. Experiments with a prototype solid state nonvolatile memory available today show 10X better throughput on random reads than disk, with a 2X speedup on a graph processing benchmark when compared to the use of local SATA disk.
TL;DR: This paper describes the design of the hardware architecture for several primitive intersection testing components implemented on a multi-FPGA Xilinx Virtex-II prototyping system and demonstrates that the proposed approach could prove to be faster than current GPU-based algorithms as well as CPU based algorithms for ray-triangle intersection.
Abstract: We present a novel FPGA-accelerated architecture for fast collision detection among rigid bodies. This paper describes the design of the hardware architecture for several primitive intersection testing components implemented on a multi-FPGA Xilinx Virtex-II prototyping system. We focus on the acceleration of the ray-triangle intersection operation, which is one of the most important operations in various applications such as collision detection and ray tracing.
Our implementation result is a hardware-accelerated ray-triangle intersection engine that is capable of out-performing a 2.8 GHz Xeon processor, running a well-known high performance software ray-triangle intersection algorithm, by up to a factor of seventy. In addition, we demonstrate that the proposed approach could prove to be faster than current GPU-based algorithms as well as CPU based algorithms for ray-triangle intersection.
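To make the ray-triangle intersection operation concrete, here is a small Python sketch of the Möller-Trumbore test. The abstract does not name the software algorithm it benchmarks against, so the choice of Möller-Trumbore, the function name, and the epsilon threshold are assumptions made for illustration only.

```python
def ray_triangle(orig, direction, v0, v1, v2, eps=1e-9):
    """Return the ray parameter t of the hit point, or None if there is no hit.

    Möller-Trumbore test; points and vectors are 3-tuples of floats.
    """
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    e1, e2 = sub(v1, v0), sub(v2, v0)      # triangle edges from v0
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:                     # ray is parallel to the triangle
        return None
    inv = 1.0 / det
    t_vec = sub(orig, v0)
    u = dot(t_vec, p) * inv                # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, e1)
    v = dot(direction, q) * inv            # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv                   # distance along the ray
    return t if t > eps else None


if __name__ == "__main__":
    tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    print(ray_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri))
```

The test consists of a fixed sequence of dot and cross products with early exits, which is the kind of regular arithmetic that maps naturally onto a hardware pipeline.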
TL;DR: The parallel performance of classical molecular dynamics simulations of the thermal properties of solid-state materials is evaluated, and the popular Lennard-Jones potential is used to investigate the role of cutoff distance on parallel performance.
Abstract: The parallel performance of classical molecular dynamics simulations of the thermal properties of solid-state materials is evaluated. Computations are validated by predicting the bulk silicon thermal conductivity as a function of temperature. The performance of the computational algorithm and software is tested on three different architectures, including the IBM BlueGene, the IBM Power 4+, and an Intel Xeon Linux cluster, corresponding to different combinations of processor speeds, communications bandwidth, and latency. Two popular three-body potentials used for silicon simulation are evaluated and compared. In addition, the popular Lennard-Jones potential is used to investigate the role of cutoff distance on parallel performance.
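The effect of the cutoff distance on the amount of work can be illustrated with a small Python sketch of a truncated Lennard-Jones potential. This is a naive all-pairs loop in reduced units, not the simulation code evaluated in the paper; the function names, the default parameters, and the plain (unshifted) truncation are assumptions.

```python
import math


def lj_pair(r, epsilon=1.0, sigma=1.0, r_cut=2.5):
    """Lennard-Jones pair energy, truncated (not shifted) at r_cut."""
    if r >= r_cut:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_energy(positions, r_cut=2.5):
    """Sum pair energies; also return how many pairs fell inside the cutoff."""
    energy = 0.0
    evaluated = 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            if r < r_cut:
                energy += lj_pair(r, r_cut=r_cut)
                evaluated += 1
    return energy, evaluated


if __name__ == "__main__":
    chain = [(float(x), 0.0, 0.0) for x in range(4)]  # atoms 1 sigma apart
    print(total_energy(chain, r_cut=1.5)[1])  # nearest neighbours only
    print(total_energy(chain, r_cut=2.5)[1])  # also second neighbours
```

A larger cutoff pulls more neighbours into each atom's interaction list, which raises the computation per atom and, in a domain-decomposed parallel code, the amount of boundary data that must be exchanged.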
TL;DR: Evaluated Woodcrest processors show excellent performance compared to a test system that uses Opteron processors from Advanced Micro Devices (AMD), though its performance advantage for full applications was less definitive.
Abstract: Intel recently began shipping its Xeon 5100 series processors, formerly known by their 'Woodcrest' code name. To evaluate the suitability of the Woodcrest processor for high-end scientific computing, we obtained access to a Woodcrest-based system at Intel and measured its performance first using computation and memory micro-benchmarks, followed by full applications from the areas of climate modeling and molecular dynamics. For computational benchmarks, the Woodcrest showed excellent performance compared to a test system that uses Opteron processors from Advanced Micro Devices (AMD), though its performance advantage for full applications was less definitive. Nevertheless, our evaluation suggests the Woodcrest to be a compelling foundation for future leadership class systems for scientific computing.
TL;DR: This paper characterizes the core, cache and memory behavior of two significant enterprise Java workloads on the newly released Intel Core 2 Duo Xeon platform (dual-core, dual-socket) and proposes architectural optimizations along three dimensions that will guide future CMP architectures for enterprise Java servers.
Abstract: As we enter the era of chip multiprocessor (CMP) architectures, it is important that we explore the scaling characteristics of mainstream server workloads on these platforms. In this paper, we analyze the performance of two significant enterprise Java workloads (SPECjAppServer2004 and SPECjbb2005) on CMP platforms, present and future. We start by characterizing the core, cache and memory behavior of these workloads on the newly released Intel Core 2 Duo Xeon platform (dual-core, dual-socket). Our findings from these measurements indicate that these workloads have a significant performance dependence on cache and memory subsystems. In order to guide the evolution of future CMP platforms, we perform a detailed investigation of potential cache and memory architecture choices. This includes analyzing the effects of thread sharing and migration, object allocation and garbage collection. Based on the observed behavior, we propose architectural optimizations along three dimensions: (a) data-less cache line initialization (DCLI), (b) hardware-guided thread collocation (HGTC) and (c) on-socket DRAM caches (OSDC). In this paper, we describe these optimizations in detail and validate their performance potential based on trace-driven simulations and execution-driven emulation. Overall, we expect that the findings in this paper will guide future CMP architectures for enterprise Java servers.
TL;DR: A memory efficient, practical, systolic, parallel architecture for the complete 0/1 knapsack dynamic programming problem, including backtracking, using a divide-and-conquer technique that results in a pseudo-linear memory requirement.
Abstract: We present a memory efficient, practical, systolic, parallel architecture for the complete 0/1 knapsack dynamic programming problem, including backtracking. This problem was intentionally selected because its dynamic dependencies introduce difficulties in hardware implementation. The architecture uses a divide-and-conquer technique that results in a pseudo-linear memory requirement. This memory reduction comes in exchange for a factor of two slowdown due to redundant computation. The architecture uses Theta(n + p(C + Wmax)) memory and the run time is Theta(nC/p + n log(n/p)). The heart of the architecture is a systolic module to compute the optimal profit for any problem that fits in available hardware resources. We implemented the module using 64 processors on an Alpha Data coprocessor board using a Xilinx VirtexII FPGA (2001 technology). Our implementation showed a factor of 32 improvement on the total execution time over a sequential algorithm running on a 1.5 GHz Xeon processor (2000 technology) and a factor of 16 improvement over a 3.2 GHz Pentium 4 (2004 technology) and a 64 bit 3.4 GHz Pentium 4 (2006 technology). We measured complete wall-clock time, including the time to download the problem to the board, but not the time to download the bit stream to the FPGA.
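For orientation, the sequential baseline that such an architecture is measured against can be sketched as the textbook 0/1 knapsack dynamic program with backtracking. This is an illustrative sketch only, not the systolic design described above: it keeps the full table, so it needs Theta(nC) memory rather than the pseudo-linear memory of the divide-and-conquer scheme. The function name `knapsack` and its argument names are assumptions made for this example.

```python
def knapsack(weights, profits, capacity):
    """Textbook 0/1 knapsack: return (best_profit, chosen_item_indices)."""
    n = len(weights)
    # table[i][c] = best profit using only the first i items with capacity c
    table = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, p = weights[i - 1], profits[i - 1]
        for c in range(capacity + 1):
            table[i][c] = table[i - 1][c]          # skip item i-1
            if w <= c:                              # or take it, if it fits
                table[i][c] = max(table[i][c], table[i - 1][c - w] + p)

    # Backtracking: walk the table from the last row to recover the items.
    chosen = []
    c = capacity
    for i in range(n, 0, -1):
        if table[i][c] != table[i - 1][c]:          # item i-1 was taken
            chosen.append(i - 1)
            c -= weights[i - 1]
    chosen.reverse()
    return table[n][capacity], chosen


if __name__ == "__main__":
    print(knapsack([1, 3, 4, 5], [1, 4, 5, 7], 7))
```

The backtracking pass is what forces the full table to be kept in this naive version; the abstract's divide-and-conquer approach trades roughly a factor of two in redundant computation to avoid storing it.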
TL;DR: This paper implements and optimizes a complete data-intensive hydrodynamics application, QNJ-5, on a stream processor designed for computation-intensive applications, and shows an ultimate speedup of 2.97 and 1.11 over the original FORTRAN QNJ-5 on a Xeon and an Itanium processor, respectively.
Abstract: Several representative scientific computing applications have been mapped on the stream processor. But most of them are computation-intensive kernels or synthetic benchmarks. In this paper, we implement and optimize a complete data-intensive hydrodynamics application, QNJ-5, on the stream processor, which is designed for computation-intensive applications. Unlike in other stream programs, relieving memory access pressure is especially important in this stream program. Simulation results show that StreamQNJ-5 achieves an ultimate speedup of 2.97 and 1.11 over the original FORTRAN QNJ-5 on a Xeon and an Itanium processor, respectively.
TL;DR: In this article, a parallel and distributed computing framework is developed to solve an inverse problem that involves massive data sets and is of great importance to the petroleum industry; a Monte Carlo method combined with proxies is employed to identify reservoir simulation models that best match the oilfield production history.
Abstract: We have developed a parallel and distributed computing framework to solve an inverse problem, which involves massive data sets and is of great importance to the petroleum industry. A Monte Carlo method, combined with proxies to avoid excessive data processing, is employed to identify reservoir simulation models that best match the oilfield production history. Subsequently, the selected models are used to forecast future production with uncertainty estimates. The parallelization framework combines: (1) message passing for tightly coupled intra-simulation decomposition; and (2) scheduler/Grid remote procedure calls for model parameter sweeps. A preliminary numerical test has included 3,159 simulations on a 256-processor Intel Xeon cluster at the USC-CACS. The results provide uncertainty estimates of unprecedented precision.
TL;DR: The reliability and implementability of the conception of creating an intelligent computer was experimentally tested on the Inparcom-16 workstation.
Abstract: The conception of MIMD computers is considered from the viewpoint of automatic analysis of characteristics of engineering and scientific problems with approximate initial data, development of parallel computational algorithms and programs, solution of problems, and estimation of the reliability of computer solutions. The conception was experimentally tested on the Inparcom-16 workstation. This workstation consists of a host system (two host computers), 16 processing nodes that use Xeon processors (3.2 GHz, 64 bit), and a communication environment that consists of Gigabit Ethernet, Infiniband, and a hypercube. Programming languages are C, C++, and Fortran. The tests proved the reliability and implementability of the conception of creating an intelligent computer.
TL;DR: This paper proposes a partially overlapping block-parallel active search method (POBPAS) that uses a proper data segmentation to achieve a high level of parallelism with little additional work on shared-memory multiprocessor systems (SMPs).
Abstract: Audio search plays an important role in analyzing audio data and retrieving useful audio information. In this paper, a partially overlapping block-parallel active search method (POBPAS) is proposed to perform quick audio search on shared-memory multiprocessor systems (SMPs). This method uses a proper data segmentation to achieve a high level of parallelism with little additional work. Several techniques including I/O optimization, proper data partition and dynamic scheduling are also introduced to maximize its scalability. In addition, we conduct a detailed performance characterization analysis of the parallel implementation of the POBPAS for three data sets on two Intel Xeon SMPs. Experimental results indicate that there are no obvious parallel limiting factors in the implementation except memory bandwidth. As a result, it can achieve 11.3X speedup for a larger data set (searching for a 15-second clip in a 27-hour audio stream) on the 16-way processor system.
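The idea of partially overlapping blocks can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper names (`overlapping_blocks`, `search_block`, `parallel_search`) are hypothetical, exact subsequence matching stands in for the active search step that the abstract does not detail, and the overlap of `clip_len - 1` samples is simply the smallest overlap that guarantees no window straddling a block boundary is missed.

```python
from concurrent.futures import ThreadPoolExecutor


def overlapping_blocks(stream_len, clip_len, num_blocks):
    """Split [0, stream_len) into blocks that overlap by clip_len - 1 samples."""
    starts_total = stream_len - clip_len + 1      # number of candidate window starts
    if starts_total <= 0:
        return []
    per_block = -(-starts_total // num_blocks)    # ceiling division
    blocks = []
    for first in range(0, starts_total, per_block):
        last = min(first + per_block, starts_total)   # exclusive bound on window starts
        blocks.append((first, last + clip_len - 1))   # extend so the last window fits
    return blocks


def search_block(stream, clip, block):
    """Stand-in matcher: exact comparison of every window inside one block."""
    lo, hi = block
    n = len(clip)
    return [i for i in range(lo, hi - n + 1) if stream[i:i + n] == clip]


def parallel_search(stream, clip, num_blocks=4):
    """Search each block independently and merge the match positions."""
    blocks = overlapping_blocks(len(stream), len(clip), num_blocks)
    with ThreadPoolExecutor(max_workers=num_blocks) as pool:
        parts = pool.map(lambda b: search_block(stream, clip, b), blocks)
    return sorted(i for part in parts for i in part)


if __name__ == "__main__":
    stream = [0, 1, 2, 3, 1, 2, 3, 4, 1, 2]
    print(overlapping_blocks(len(stream), 3, 3))
    print(parallel_search(stream, [1, 2, 3], num_blocks=3))
```

Each window start position belongs to exactly one block, so the merged result contains no duplicates; the only extra work is re-reading the overlapped samples. The thread pool here only shows the structure of the decomposition, since CPython threads will not speed up this pure-Python loop.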
TL;DR: The performance results from running the NASPB and from the memory bandwidth benchmarks show that better performance can sometimes be achieved using 1 ppn, while performance results show that the AMD Opteron/Myri...
Abstract: Purpose – The purpose of this paper is to evaluate how to use nodes in a cluster efficiently by studying the NAS Parallel Benchmarks (NASPB) on Intel Xeon and AMD Opteron dual CPU Linux clusters. Design/methodology/approach – The performance results of the NASPB are presented both with one MPI process per node (1 ppn) and with two MPI processes per node (2 ppn). These benchmark results were analyzed by considering the impact of cache effects, code scalability, memory bandwidth within nodes, and the impact of MPI and the MPI communication network. Memory bandwidth was benchmarked using MPI versions of the Streams benchmarks. The impact of MPI and the MPI communication network are evaluated by benchmarking the performance of MPI sends and receives, MPI broadcast, and the MPI all‐to‐all routines. Findings – The performance results from running the NASPB and from the memory bandwidth benchmarks show that better performance can sometimes be achieved using 1 ppn. Performance results show that the AMD Opteron/Myri