TL;DR: This work is an overview of the preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors and techniques for speeding up sparse matrix-vector product (SpMV) kernels and finding suitable preconditioning methods are discussed.
Abstract: This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology to perform sparse linear algebra computations. Techniques for speeding up sparse matrix-vector product (SpMV) kernels and finding suitable preconditioning methods are discussed. Our experiments with an NVIDIA TESLA M2070 show that for unstructured matrices SpMV kernels can be up to 8 times faster on the GPU than the Intel MKL on the host Intel Xeon X5675 Processor. Overall performance of the GPU-accelerated Incomplete Cholesky (IC) factorization preconditioned CG method can outperform its CPU counterpart by a smaller factor, up to 3, and GPU-accelerated The incomplete LU (ILU) factorization preconditioned GMRES method can achieve a speed-up nearing 4. However, with better suited preconditioning techniques for GPUs, this performance can be further improved.
TL;DR: This work describes an efficient implementation of SpMV on the IntelR Xeon PhiTM Coprocessor, codenamed Knights Corner (KNC), that exploits the salient architectural features of KNC, such as large caches and hardware support for irregular memory accesses.
Abstract: Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory bandwidth limited. On modern processors with wide SIMD and large numbers of cores, we identify and address several bottlenecks which may limit performance even before memory bandwidth: (a) low SIMD efficiency due to sparsity, (b) overhead due to irregular memory accesses, and (c) load-imbalance due to non-uniform matrix structures.We describe an efficient implementation of SpMV on the IntelR Xeon PhiTM Coprocessor, codenamed Knights Corner (KNC), that addresses the above challenges. Our implementation exploits the salient architectural features of KNC, such as large caches and hardware support for irregular memory accesses. By using a specialized data structure with careful load balancing, we attain performance on average close to 90% of KNC's achievable memory bandwidth on a diverse set of sparse matrices. Furthermore, we demonstrate that our implementation is 3.52x and 1.32x faster, respectively, than the best available implementations on dual IntelR XeonR Processor E5-2680 and the NVIDIA Tesla K20X architecture.
TL;DR: The importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware is demonstrated and the considerable performance increase afforded by the use of Intel®Xeon Phi™coprocessors for highly parallel workloads is demonstrated.
Abstract: We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorithmic improvements to Sandia's miniMD benchmark, which we demonstrate using three SIMD widths (128-, 256and 512bit). The applicability of these optimisations to wider SIMD is discussed, and we show that the conventional approach of exposing more parallelism through redundant computation is not necessarily best. In single precision, our optimised implementation is up to 5x faster than the original scalar code running on Intel®Xeon®processors with 256-bit SIMD, and adding a single Intel®Xeon Phi™coprocessor provides up to an additional 2x performance increase. These results demonstrate: (i) the importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware; and (ii) the considerable performance increase afforded by the use of Intel®Xeon Phi™coprocessors for highly parallel workloads.
TL;DR: This paper describes how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™1 co-processor (code-named Knights Corner) in both native and hybrid configurations.
Abstract: Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™1 co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering the total performance of 107 TFLOPS.
TL;DR: Three techniques to speedup fundamental problems in data mining algorithms on the CUDA platform are proposed: scalable thread scheduling scheme for irregular pattern, parallel distributed top-k scheme, and parallel high dimension reduction scheme.
Abstract: Recent development in Graphics Processing Units (GPUs) has enabled inexpensive high performance computing for general-purpose applications. Compute Unified Device Architecture (CUDA) programming model provides the programmers adequate C language like APIs to better exploit the parallel power of the GPU. Data mining is widely used and has significant applications in various domains. However, current data mining toolkits cannot meet the requirement of applications with large-scale databases in terms of speed. In this paper, we propose three techniques to speedup fundamental problems in data mining algorithms on the CUDA platform: scalable thread scheduling scheme for irregular pattern, parallel distributed top-k scheme, and parallel high dimension reduction scheme. They play a key role in our CUDA-based implementation of three representative data mining algorithms, CU-Apriori, CU-KNN, and CU-K-means. These parallel implementations outperform the other state-of-the-art implementations significantly on a HP xw8600 workstation with a Tesla C1060 GPU and a Core-quad Intel Xeon CPU. Our results have shown that GPU + CUDA parallel architecture is feasible and promising for data mining applications.
TL;DR: The design and optimizations of MPI collectives for clusters of NUMA nodes are investigated, performance models for collective communication using shared memory are developed, and several algorithms for various collectives are developed.
Abstract: As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimizations of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory, and we develop several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the model and indicate that different algorithms dominate for short vectors and long vectors. We compare our shared-memory allreduce with several traditional MPI implementations -- Open MPI, MPICH2, and MVAPICH2 -- that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and 8-node Opteron cluster, our implementation achieves on average 2.5X and 2.3X speedup over MVAPICH2, respectively. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
TL;DR: This work compares a Xeon-based two-socket compute node with the Xeon Phi stand-alone in scalability and performance using OpenMP codes and shows significant differences in absolute application performance and scalability.
Abstract: The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well-established in the HPC community, namely MPI and OpenMP, thus removing the necessity to refactor codes for using accelerator-specific programming paradigms. Because of its native x86 support, the Xeon Phi may also be used stand-alone, meaning codes can be executed directly on the device without the need for interaction with a host. In this sense, the Xeon Phi resembles a big SMP on a chip if its 240 logical cores are compared to a common Xeon-based compute node offering up to 32 logical cores. In this work, we compare a Xeon-based two-socket compute node with the Xeon Phi stand-alone in scalability and performance using OpenMP codes. Considering both as individual SMP systems, they come at a very similar price and power envelope, but our results show significant differences in absolute application performance and scalability. We also show in how far common programming idioms for the Xeon multi-core architecture are applicable for the Xeon Phi many-core architecture and which challenges the changing ratio of core count to single core performance poses for the application programmer.
TL;DR: This study examines the design of several novel specialized multicore neural processors based on SRAM cores and memristor devices and indicates that these specialized systems can be between two to five orders more energy efficient compared to the traditional HPC systems.
Abstract: This study examines the design of several novel specialized multicore neural processors. Systems based on SRAM cores and memristor devices were examined. Detailed circuit simulations were used to ensure that the systems could be compared accurately. Two types of memristor cores were examined: digital and analog cores. Novel circuits were designed for both of these memristor systems. Additionally full system evaluation of multicore processors based on these cores and specialized routing circuits were developed. Our results show that the memristor systems yield the highest throughput and lowest power. We compared these specialized systems to more traditional HPC systems. Two commodity high performance processors were examined: a six core Intel Xeon processor, and an NVIDIA Tesla M2070 GPGPU. Care was taken to ensure the code on each platform was very efficient (multi-threaded on the Xeon processor, and a high device utilization CUDA program on the GPGPU). Our results indicate that the specialized systems can be between two to five orders more energy efficient compared to the traditional HPC systems. Additionally the specialized cores take up much less die area - allowing in some cases a reduction from 179 Xeon six-core processor chips to 1 memristor based multicore chip and a corresponding reduction in power from 17 kW down to 0.07 W.
TL;DR: The results show that the Intel® Composer XE 2013 compiler can make effective use of software prefetching instructions to hide memory latencies and special store instructions to save bandwidth on streaming non-temporal store operations to achieve significant performance improvements.
Abstract: The Intel® Xeon Phi™ coprocessor has software prefetching instructions to hide memory latencies and special store instructions to save bandwidth on streaming non-temporal store operations. In this work, we provide details on compiler-based generation of these instructions and evaluate their impact on the performance of the Intel® Xeon Phi™ coprocessor using a wide range of parallel applications with different characteristics. Our results show that the Intel® Composer XE 2013 compiler can make effective use of these mechanisms to achieve significant performance improvements.
TL;DR: An extensive experimental study shows that MPTS R-tree traversal algorithm on NVIDIA Tesla M2090 GPU consistently outperforms traditional recursive R-trees search algorithm on Intel Xeon E5506 processors.
TL;DR: This paper proposes to generate the large sparse linear systems arising in finite‐element analysis in an iterative manner on several GPUs and to use the graphics accelerators concurrently with CPUs performing collection and addition of the matrix fragments using a fast multithreaded procedure.
TL;DR: This work uses the NPB-OpenMP version to examine the performance of the Intel's new Xeon Phi co-processor and focus in particular on the many core aspect of the Xeon Phi architecture.
Abstract: NAS parallel benchmarks (NPB) are a set of applications commonly used to evaluate parallel systems. We use the NPB-OpenMP version to examine the performance of the Intel's new Xeon Phi co-processor and focus in particular on the many core aspect of the Xeon Phi architecture. A first analysis studies the scalability up to 244 threads on 61 cores and the impact of affinity settings on scaling. It also compares performance characteristics of Xeon Phi and traditional Xeon CPUs. The application of several well-established optimization techniques allows us to identify common bottlenecks that can specifically impede performance on the Xeon Phi but are not as severe on multi-core CPUs. We also find that many of the OpenMP-parallel loops are too short (in terms of the number of loop iterations) for a balanced execution by 244 threads. New or redesigned benchmarks will be needed to accommodate the greatly increased number of cores and threads. At the end, we summarize our findings in a set recommendations for performance optimization for Xeon Phi.
TL;DR: A novel kernel-level memory allocator, called M3 (M-cube, Multi-core Multi-bank Memory allocator), that has the following two features: it introduces and makes use of a notion of a memory container, which is defined as a unit of memory that comprises the minimum number of page frames that can cover all the banks of the memory organization.
Abstract: We propose a novel kernel-level memory allocator, called M3 (M-cube, Multi-core Multi-bank Memory allocator), that has the following two features. First, it introduces and makes use of a notion of a memory container, which is defined as a unit of memory that comprises the minimum number of page frames that can cover all the banks of the memory organization, by exclusively assigning a container to a core so that each core achieves bank parallelism as much as possible. Second, it orchestrates page frame allocation so that pages that threads access are dispersed randomly across multiple banks so that each thread's access pattern is randomized. The development of M3 is based on a tool that we develop to fully understand the architectural characteristics of the underlying memory organization. Using an extension of this tool, we observe that the same application that accesses pages in a random manner outperforms one that accesses pages in a regular pattern such as sequential or same ordered accesses. This is because such randomized accesses reduces inter-thread access interference on the row-buffer in memory banks. We implement M3 in the Linux kernel version 2.6.32 on the Intel Xeon system that has 16 cores and 32GB DRAM. Performance evaluation with various workloads show that M3 improves the overall performance for memory intensive benchmarks by up to 85% with an average of about 40%.
TL;DR: Estimates show that the performance of the novel architecture designed to realize a million-bit multiplication architecture matches that of previously reported software implementations on a high-end 3 Ghz Intel Xeon processor, while requiring only a tiny fraction of the area.
Abstract: In this work we present the first full and complete evaluation of a very large multiplication scheme in custom hardware. We designed a novel architecture to realize a million-bit multiplication architecture based on the Schonhage-Strassen Algorithm and the Number Theoretical Transform (NTT). The construction makes use of an innovative cache architecture along with processing elements customized to match the computation and access patterns of the FFT-based recursive multiplication algorithm. When synthesized using a 90nm TSMC library operating at a frequency of 666 MHz, our architecture is able to compute the product of integers in excess of a million bits in 7.74 milliseconds. Estimates show that the performance of our design matches that of previously reported software implementations on a high-end 3 Ghz Intel Xeon processor, while requiring only a tiny fraction of the area.
TL;DR: This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations by leveraging a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches.
Abstract: This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5x than achievable on a same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.
TL;DR: This chapter provides an overview of the Intel Xeon Phi Coprocessor including its hardware and software architecture and the key usages that enable these highly parallel applications, like many in Geoscience to achieve new levels of performance while using familiar, standard programming models.
Abstract: Intel recently launched the Intel® Xeon Phi™ Coprocessor to enhance the performance of the growing category of highly parallel scientific applications This chapter provides an overview of the Intel Xeon Phi Coprocessor including its hardware and software architecture and the key usages that enable these highly parallel applications, like many in Geoscience to achieve new levels of performance while using familiar, standard programming models
TL;DR: This paper presents a Graphics Processing Units (GPUs) implementation of the Semiclassical Initial Value Representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations, showing a reduction in computational time and power consumption and semiclassical GPU calculations are shown to be environment friendly.
Abstract: This paper presents a Graphics Processing Units (GPUs) implementation of the Semiclassical Initial Value Representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details about the GPU implementation of the semiclassical code are provided. Four molecules with an increasing number of atoms are considered and the GPU-calculated vibrational frequencies perfectly match the benchmark values. The computational time scaling of two GPUs (NVIDIA Tesla C2075 and Kepler K20) respectively versus two CPUs (Intel Core i5 and Intel Xeon E5-2687W) and the critical issues related to the GPU implementation are discussed. The resulting reduction in computational time and power consumption is significant and semiclassical GPU calculations are shown to be environment friendly.
TL;DR: It is demonstrated that RnR can be implemented efficiently on a real multicore IA system, and it is shown that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead.
Abstract: There has been significant interest in hardware-assisted deterministic Record and Replay (RnR) systems for multithreaded programs on multiprocessors. However, no proposal has implemented this technique in a hardware prototype with full operating system support. Such an implementation is needed to assess RnR practicality.This paper presents QuickRec, the first multicore Intel Architecture (IA) prototype of RnR for multithreaded programs. QuickRec is based on QuickIA, an Intel emulation platform for rapid prototyping of new IA extensions. QuickRec is composed of a Xeon server platform with FPGA-emulated second-generation Pentium cores, and Capo3, a full software stack for managing the recording hardware from within a modified Linux kernel.This paper's focus is understanding and evaluating the implementation issues of RnR on a real platform. Our effort leads to some lessons learned, as well as to some pointers for future research. We demonstrate that RnR can be implemented efficiently on a real multicore IA system. In particular, we show that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead. However, the software stack incurs an average recording overhead of nearly 13%, which must be reduced to enable always-on use of RnR.
TL;DR: The new high-performance BRD achieves up to a 30-fold speedup against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2.
Abstract: This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that allows to cast most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000× 12000 matrix size against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2.
TL;DR: A new, user-level middleware called COSMIC is proposed that improves performance and reliability of multiprocessing on coprocessors like the Xeon Phi, and increases multiprocessioning reliability by exploiting programmer-specified per-processCoprocessor memory requirements to completely avoid memory oversubscription and crashes.
Abstract: It is remarkably easy to offload processing to Intel's newest manycore coprocessor, the Xeon-Phi: it supports a popular ISA (x86-based), a popular OS (Linux) and a popular programming model (OpenMP). Unfortunately, easy portability does not automatically ensure high performance. Additional programmer effort is necessary to leverage the new performance-oriented hardware features. But programmer optimizations alone are insufficient. Multiprocessing is also necessary to improve hardware utilization, and Linux makes it easy for processes to share the manycore coprocessor. However multiprocessing inefficiencies can easily offset gains made by the programmer. Our experiments on a production, high-performance Xeon server with multiple Xeon Phi coprocessors show that multiprocessing on coprocessors not only slows down the processes but also introduces unreliability (some processes crash unexpectedly). We propose a new, user-level middleware called COSMIC that improves performance and reliability of multiprocessing on coprocessors like the Xeon Phi. COSMIC seamlessly fits in the existing Xeon Phi software stack and is transparent to programmers. It manages Xeon Phi processes that execute parallel regions offloaded to the coprocessors. Offloads typically have programmer-driven performance directives like thread and affinity requirements. Unlike the existing Xeon Phi software stack, COSMIC does fair scheduling of both processes and offloads, and takes into account conflicting requirements of offloads belonging to different processes. By doing so, COSMIC has two clear benefits. First, it improves multiprocessing performance by preventing thread and memory oversubscription, by avoiding inter-offload interference and by reducing load imbalance on coprocessors and cores. Second, it increases multiprocessing reliability by exploiting programmer-specified per-process coprocessor memory requirements to completely avoid memory oversubscription and crashes. Our experiments on several representative Xeon Phi workloads show that, in a multiprocessing environment, COSMIC improves average core utilization by up to 3 times, reduces make-span by up to 52%, reduces average process latency (turn-around-time) by 70%, and completely eliminates process crashes.
TL;DR: This is the first study that explores the potential of accelerating the ADER-DG method for seismic wave-propagation simulations using a GPU and the implementation obtained a speedup factor of about 24.3 for the single-precision version of the GPU code.
TL;DR: Proposed mapping of subdomains to CPU/cores was evaluated on massively paralleled Intel Xeon PC cluster and confirmed that it could reduce communication time and achieve higher parallel performance than without mapping in several benchmark tests.
TL;DR: This work addresses the L1 cache bandwidth problem in SMT processors experimentally on real hardware and proposes two L1 bandwidth aware thread to core (t2c) allocation policies, namely Static and Dynamic t2c allocation, respectively.
Abstract: Improving the utilization of shared resources is a key issue to increase performance in SMT processors. Recent work has focused on resource sharing policies to enhance the processor performance, but their proposals mainly concentrate on novel hardware mechanisms that adapt to the dynamic resource requirements of the running threads. This work addresses the L1 cache bandwidth problem in SMT processors experimentally on real hardware. Unlike previous work, this paper concentrates on thread allocation, by selecting the proper pair of co-runners to be launched to the same core. The relation between L1 bandwidth requirements of each benchmark and its performance (IPC) is analyzed. We found that for individual benchmarks, performance is strongly connected to L1 bandwidth consumption, and this observation remains valid when several co-runners are launched to the same SMT core. Based on these findings we propose two L1 bandwidth aware thread to core (t2c) allocation policies, namely Static and Dynamic t2c allocation, respectively. The aim of these policies is to properly balance L1 bandwidth requirements of the running threads among the processor cores. Experiments on a Xeon E5645 processor show that the proposed policies significantly improve the performance of the Linux OS kernel regardless the number of cores considered.
TL;DR: A performance evaluation of Pleiades based on the Intel Xeon E5-2670 processor, a fourth-generation eight-core Sandy Bridge architecture, and compare it with the previous third generation Nehalem architecture is presented.
Abstract: We present a performance evaluation of Pleiades based on the Intel Xeon E5-2670 processor, a fourth-generation eight-core Sandy Bridge architecture, and compare it with the previous third generation Nehalem architecture. Several architectural features have been incorporated in Sandy Bridge: (a) four memory channels as opposed to three in Nehalem; (b) memory speed increased from 1333 MHz to 1600 MHz; (c) ring to connect on-chip L3 cache with cores, system agent, memory controller, and QPI agent and I/O controller to increase the scalability; (d) new AVX unit with wider vector registers of 256 bit; (e) integration of PCI-Express 3.0 controllers into the I/O subsystem on chip; (f) new Turbo Boost version 2.0 where base frequency of processor increased from 2.6 to 3.2 GHz; and (g) QPI link rate from 6.4 to 8 GT/s and two QPI links to second socket. We critically evaluate these new features using several low-level benchmarks, and four full-scale scientific and engineering applications.
TL;DR: This paper describes the first userspace scheduler that supports preemptive, dynamic-priority, migrating real-time tasks on multicore hardware, and reports empirical latency and overhead measurements.
Abstract: As multicore computing hardware has become more ubiquitous, real-time scheduling theory aimed at multicore systems has become increasingly sophisticated and diverse. Real-time operating systems (RTOSs) are ill-suited for this kind of rapid change, and the slow-moving RTOS ecosystem is falling further and further behind advances in real-time scheduling theory. Thus, supporting new functionality in a layer of middleware software running in userspace (i.e., outside the RTOS kernel) has been proposed. In this paper, we describe the first userspace scheduler that supports preemptive, dynamic-priority, migrating real-time tasks on multicore hardware, and report empirical latency and overhead measurements. On an eight-core Intel Xeon platform, these measurements are in the range of ones to tens of microseconds under most tested configurations. We believe that this approach may prove superior to a kernel-based approach for supporting a subset of future real-world realtime applications.
TL;DR: A head-body finite automaton (HBFA) is proposed, which implements SPM in two parts: a head DFA (H-DFA) and a body NFA (B-NFA), which achieves 3x to 8x throughput when matching real-life large dictionaries against inputs with high match ratios.
Abstract: Conventionally, dictionary-based string pattern matching (SPM) has been implemented as Aho-Corasick deterministic finite automaton (AC-DFA). Due to its large memory footprint, a large-dictionary AC-DFA can experience poor cache performance when matching against inputs with high match ratio on multicore processors. We propose a head-body finite automaton (HBFA), which implements SPM in two parts: a head DFA (H-DFA) and a body NFA (B-NFA). The H-DFA matches the dictionary up to a predefined prefix length in the same way as AC-DFA, but with a much smaller memory footprint. The B-NFA extends the matching to full dictionary lengths in a compact variable-stride branch data structure, accelerated by single-instruction multiple-data (SIMD) operations. A branch grafting mechanism is proposed to opportunistically advance the state of the H-DFA with the matching progress in the B-NFA. Compared with a fully populated AC-DFA, our HBFA prototype has <;1/5 construction time, requires <;1/20 runtime memory, and achieves 3x to 8x throughput when matching real-life large dictionaries against inputs with high match ratios. The throughput scales up 27x to over 34 Gbps on a 32-core Intel Manycore Testing Lab machine based on the Intel Xeon X7560 processors.
TL;DR: The paper addresses why offload to a coprocessor is useful, how it is specified, and what the conditions for the profitability of offload are, and enumerates the key performance features for this heterogeneous computing stack, related to initialization, data movement and invocation.
Abstract: The Intel® Xeon PhiTM coprocessor platform enables offload of computation from a host processor to a coprocessor that is a fully-functional Intel® Architecture CPU. This paper presents the C/C++ and Fortran compiler offload runtime for that coprocessor. The paper addresses why offload to a coprocessor is useful, how it is specified, and what the conditions for the profitability of offload are. It also serves as a guide to potential third-party developers of offload runtimes, such as a gcc-based offload compiler, ports of existing commercial offloading compilers to Intel® Xeon PhiTM coprocessor such as CAPS®, and third-party offload library vendors that Intel is working with, such as NAG® and MAGMA®. It describes the software architecture and design of the offload compiler runtime. It enumerates the key performance features for this heterogeneous computing stack, related to initialization, data movement and invocation. Finally, it evaluates the performance impact of those features for a set of directed micro-benchmarks and larger workloads.
TL;DR: A sliding window method is proposed to extract feature points in parallel at selected scale levels and the time cost in feature extraction can be greatly reduced, and data reuse strategy is proposed in orientation generation and descriptor generation to reduce the memory access times.
Abstract: This paper proposes a high performance hardware architecture of Speeded Up Robust Features (SURF) algorithm based on OpenSURF. In order to achieve high processing frame rate, the hardware architecture is designed with several characteristics. Firstly, a sliding window method is proposed to extract feature points in parallel at selected scale levels. As a result, the time cost in feature extraction can be greatly reduced. Secondly, data reuse strategy is proposed in orientation generation and descriptor generation to reduce the memory access times. In this way, 3.87x and 2.25X speedup are achieved respectively. Thirdly, the integral image is segmented to buffer in different memory blocks in order to support multiple data accessing in one clock cycle, which will further reduce the whole calculating time of our implementation. The hardware architecture is implemented on an XC6VSX475T FPGA with 156 MHz and its maximal frame rate for VGA format image can reach 356 frames per second (fps), which is 6.25 times frame rate of OpenSURF running on a server with a Xeon 5650 processor, and 6 times the reported frame rate of the recent implementation on three Vritex4 FPGAs [8].
TL;DR: The PCIT algorithm is re-implemented with exemplary parallel, vector, I/O, memory and instruction optimizations for today's multi- and many-core architectures, targeting the processor architectures of the Stampede supercomputer, but will also benefit other architectures.
Abstract: The PCIT method is an important technique for detecting interactions between networks. The PCIT algorithm has been used in the biological context to infer complex regulatory mechanisms and interactions in genetic networks, in genome wide association studies, and in other similar problems. In this work, the PCIT algorithm is re-implemented with exemplary parallel, vector, I/O, memory and instruction optimizations for today's multi- and many-core architectures. The evolution and performance of the new code targets the processor architectures of the Stampede supercomputer, but will also benefit other architectures. The Stampede system consists of an Intel Xeon E5 processor base system with an innovative component comprised of Intel Xeon Phi Coprocessors. Optimized results and an analysis are presented for both the Xeon and the Xeon Phi.
TL;DR: Per-die post silicon RWA fusing and per-diePost silicon redundancy flow have significantly improved L3 cache Vccmin by more than 150mv.
Abstract: The 20-way set associative 2.5MB slice ported L3 cache for the multi-core Xeon® Processor uses 0.108 um2 cell in a 22nm tri-gate technology with 2.7TB maximum bandwidth. It is protected by double-error correction/triple-error detection ECC. The basic building block is designed to support floorplan style on each processor with large L3 cache. On die fuse storage enables high resolution repair coverage. Programmable transient voltage collapse writes assist (TVC-WA), wordline underdrive read assist (WLUD-RA) circuits (RWA) is used to achieve aggressive Vccmin goal. Per-die post silicon RWA fusing and per-die post silicon redundancy flow have significantly improved L3 cache Vccmin by more than 150mv.Shutoff per slice features has reduced SRAM leakage.