TL;DR: A cache timing attack against the scatter-gather implementation used in the modular exponentiation routine in OpenSSL version 1.0.2f, which can fully recover the private key after observing 16,000 decryptions.
Abstract: The scatter-gather technique is a commonly implemented approach to prevent cache-based timing attacks. In this paper we show that scatter-gather is not constant time. We implement a cache timing attack against the scatter-gather implementation used in the modular exponentiation routine in OpenSSL version 1.0.2f. Our attack exploits cache-bank conflicts on the Sandy Bridge microarchitecture. We have tested the attack on an Intel Xeon E5-2430 processor. For 4096-bit RSA our attack can fully recover the private key after observing 16,000 decryptions.
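The claim that scatter-gather is not constant time can be illustrated with a toy model of the memory layout. The sketch below is not OpenSSL's code: the 64-byte line size, the 4-byte bank width, the 8-entry table, and all function names are assumptions chosen for illustration. It shows that a gather touches the same set of cache lines for every index, while the offset within each line (and hence the bank) still depends on the index.

```python
LINE_SIZE = 64  # assumed cache-line size in bytes
BANK_SIZE = 4   # assumed cache-bank width in bytes


def scatter(values, spacing):
    """Interleave equal-length byte strings: byte i of value k goes to offset i * spacing + k."""
    length = len(values[0])
    buf = bytearray(length * spacing)
    for k, value in enumerate(values):
        for i, byte in enumerate(value):
            buf[i * spacing + k] = byte
    return buf


def gather(buf, k, spacing):
    """Read value k back and also return the byte offsets that were accessed."""
    offsets = list(range(k, len(buf), spacing))
    return bytes(buf[o] for o in offsets), offsets


def lines_touched(offsets):
    """Set of cache-line indices covered by the accessed offsets."""
    return {o // LINE_SIZE for o in offsets}


def banks_touched(offsets):
    """Set of bank indices (position within a line) covered by the accessed offsets."""
    return {(o % LINE_SIZE) // BANK_SIZE for o in offsets}


if __name__ == "__main__":
    table = [bytes([k]) * 64 for k in range(8)]
    buf = scatter(table, spacing=8)
    for k in (0, 5):
        _, offsets = gather(buf, k, spacing=8)
        print(k, sorted(lines_touched(offsets)), sorted(banks_touched(offsets)))
```

With these assumed parameters, indices 0 to 3 and indices 4 to 7 map to disjoint sets of banks, so an observer who can distinguish bank usage learns part of the index even though the line-level access pattern is identical.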
TL;DR: This work integrates the hardware accelerator into MonetDB, a main-memory column store, and demonstrates a significant improvement in response time and throughput, and provides a novel and efficient implementation of two commonly used SQL operators for strings.
Abstract: Taking advantage of recently released hybrid multicore architectures, such as Intel's Xeon+FPGA machine, where the FPGA has coherent access to the main memory through the QPI bus, we explore the benefits of specializing operators to hardware. We focus on two commonly used SQL operators for strings, LIKE and REGEXP_LIKE, and provide a novel and efficient implementation of these operators in reconfigurable hardware. We integrate the hardware accelerator into MonetDB, a main-memory column store, and demonstrate a significant improvement in response time and throughput. Our Hardware User Defined Function (HUDF) can speed up complex pattern matching by an order of magnitude in comparison to the database running on a 10-core CPU. The insights gained from integrating hardware-based string operators into MonetDB should also be useful for future designs combining hardware specialization and databases.
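For readers unfamiliar with the two operators, a software reference for their matching semantics can be sketched as follows. This is only an illustration in Python, not the hardware implementation or MonetDB code: the function names are hypothetical, the optional ESCAPE clause of LIKE is not handled, and REGEXP_LIKE is assumed to succeed when the pattern matches anywhere in the string.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern: % matches any sequence, _ matches one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else is a literal
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value, pattern):
    """LIKE must match the whole string."""
    return like_to_regex(pattern).fullmatch(value) is not None


def regexp_like(value, regex):
    """Assumed semantics: true if the regular expression matches anywhere in value."""
    return re.search(regex, value) is not None


if __name__ == "__main__":
    print(sql_like("Xeon+FPGA", "Xeon%"))
    print(regexp_like("Xeon+FPGA", r"F[A-Z]{3}"))
```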
TL;DR: In this article, the authors evaluate the performance of OpenMP, OpenACC, OpenCL, and CUDA with respect to program productivity, performance, and energy consumption on heterogeneous architectures.
Abstract: Many modern parallel computing systems are heterogeneous at their node level. Such nodes may comprise general purpose CPUs and accelerators (such as, GPU, or Intel Xeon Phi) that provide high performance with suitable energy-consumption characteristics. However, exploiting the available performance of heterogeneous architectures may be challenging. There are various parallel programming frameworks (such as, OpenMP, OpenCL, OpenACC, CUDA) and selecting the one that is suitable for a target context is not straightforward. In this paper, we study empirically the characteristics of OpenMP, OpenACC, OpenCL, and CUDA with respect to programming productivity, performance, and energy. To evaluate the programming productivity we use our homegrown tool CodeStat, which enables us to determine the percentage of code lines required to parallelize the code using a specific framework. We use our tools MeterPU and x-MeterPU to evaluate the energy consumption and the performance. Experiments are conducted using the industry-standard SPEC benchmark suite and the Rodinia benchmark suite for accelerated computing on heterogeneous systems that combine Intel Xeon E5 Processors with a GPU accelerator or an Intel Xeon Phi co-processor.
TL;DR: This article synthesizes several extensive research efforts focusing exclusively on power and energy efficiency models and techniques for the processors composing extreme-scale computing systems, with absolute concentration on predictive power and energy models and prime emphasis on node architecture.
Abstract: Power and energy efficiency are now critical concerns in extreme-scale high-performance scientific computing. Many extreme-scale computing systems today (for example: Top500) have tight integration of multicore CPU processors and accelerators (mix of Graphical Processing Units, Intel Xeon Phis, or Field Programmable Gate Arrays) empowering them to provide not just unprecedented computational power but also to address these concerns. However, such integration renders these systems highly heterogeneous and hierarchical, thereby necessitating design of novel performance, power, and energy models to accurately capture these inherent characteristics. There are now several extensive research efforts focusing exclusively on power and energy efficiency models and techniques for the processors composing these extreme-scale computing systems. This article synthesizes these research efforts with absolute concentration on predictive power and energy models and prime emphasis on node architecture. Through this survey, we also intend to highlight the shortcomings of these models to correctly and comprehensively predict the power and energy consumptions by taking into account the hierarchical and heterogeneous nature of these tightly integrated high-performance computing systems.
TL;DR: This architecture exploits the reconfigurability of FPGAs to allow the development of fast yet flexible alignment designs, and is implemented and evaluated on a 1U Maxeler MPC-X2000 dataflow node with eight Altera Stratix-V FPGAs.
Abstract: One of the key challenges facing genomics today is how to efficiently analyze the massive amounts of data produced by next-generation sequencing platforms. With general-purpose computing systems struggling to address this challenge, specialized processors such as the Field-Programmable Gate Array (FPGA) are receiving growing interest. The means by which to leverage this technology for accelerating genomic data analysis is, however, largely unexplored. In this paper, we present a runtime reconfigurable architecture for accelerating short read alignment using FPGAs. This architecture exploits the reconfigurability of FPGAs to allow the development of fast yet flexible alignment designs. We apply this architecture to develop an alignment design which supports exact and approximate alignment with up to two mismatches. Our design is based on the FM-index, with optimizations to improve the alignment performance. In particular, the $n$-step FM-index, index oversampling, a seed-and-compare stage, and bi-directional backtracking are included. Our design is implemented and evaluated on a 1U Maxeler MPC-X2000 dataflow node with eight Altera Stratix-V FPGAs. Measurements show that our design is 28 times faster than Bowtie2 running with 16 threads on dual Intel Xeon E5-2640 CPUs, and nine times faster than Soap3-dp running on an NVIDIA Tesla C2070 GPU.
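As background for the FM-index mentioned above, the following is a minimal software sketch of exact matching by backward search. It covers only the basic one-step index; the $n$-step variant, oversampling, seed-and-compare, and backtracking from the abstract are not modeled. The naive suffix-array construction and all function names are assumptions made for illustration.

```python
def build_fm_index(text):
    """Build a suffix array, the C table, and occurrence counts for text + '$'."""
    text += "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = "".join(text[i - 1] for i in sa)                  # Burrows-Wheeler transform
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):        # C[ch] = number of characters smaller than ch
        C[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):      # occ[ch][i] = occurrences of ch in bwt[:i]
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, C, occ


def backward_search(pattern, C, occ, n):
    """Return the half-open suffix-array interval of suffixes that start with pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):     # consume the pattern from its last character
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find(text, pattern):
    """Sorted start positions of all exact occurrences of pattern in text."""
    sa, C, occ = build_fm_index(text)
    lo, hi = backward_search(pattern, C, occ, len(sa))
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    print(find("ACGTACGTAC", "ACG"))
```

Each pattern character costs two table lookups regardless of the reference length, which is the property that makes the structure attractive for a hardware pipeline.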
TL;DR: This work presents doppioDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs), and evaluates it on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables.
Abstract: Relational databases provide a wealth of functionality to a wide range of applications. Yet, there are tasks for which they are less than optimal, for instance when processing becomes more complex (e.g., matching regular expressions) or the data is less structured (e.g., text or long strings). In this demonstration we show the benefit of using specialized hardware for such tasks and highlight the importance of a flexible, reusable mechanism for extending database engines with hardware-based operators. We present doppioDB which consists of MonetDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs). In our demonstration the HUDFs are used to provide seamless acceleration of two string operators, LIKE and REGEXP_LIKE, and two analytics operators, SKYLINE and SGD (stochastic gradient descent). We evaluate doppioDB on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables. For integration we rely on HUDFs as a unit of scheduling and management on the FPGA. In the demonstration we show the acceleration benefits of hardware operators, as well as their flexibility in accommodating changing workloads.
TL;DR: The system uses a hardware bit-sieve engine that performs a Markov-chain Monte-Carlo search with a parallel-evaluation of the energy increment prior to the bit selection, achieving a speedup while guaranteeing convergence.
Abstract: We propose a hardware architecture for solving combinatorial optimization problems and implement it on an FPGA. The hardware minimizes the energy of an Ising model with 1,024 state variables fully connectable through 16-bit weights, which ease restrictions on mapping problems onto the Ising model. The system uses a hardware bit-sieve engine that performs a Markov-chain Monte-Carlo search with a parallel-evaluation of the energy increment prior to the bit selection, achieving a speedup while guaranteeing convergence. The engine is implemented on an Arria 10 GX FPGA and solves 32-city traveling salesman problems 104 times faster than simulated annealing running on a 3.5-GHz Intel Xeon E5-1620v3 processor.
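The idea of evaluating every energy increment before selecting a bit can be sketched in software. The code below is a sequential illustration, not the bit-sieve engine itself: the energy convention, the Metropolis acceptance rule, the uniform choice among accepted candidates, and all names are assumptions. It requires a symmetric weight matrix with a zero diagonal.

```python
import math
import random


def energy(W, b, s):
    """E(s) = -1/2 * sum_ij W[i][j] s_i s_j - sum_i b[i] s_i, with spins s_i in {-1, +1}."""
    n = len(s)
    pair = sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    return -0.5 * pair - sum(b[i] * s[i] for i in range(n))


def flip_deltas(W, b, s):
    """Energy change of flipping each spin, all evaluated before any selection."""
    n = len(s)
    return [2 * s[i] * (sum(W[i][j] * s[j] for j in range(n)) + b[i]) for i in range(n)]


def step(W, b, s, temperature, rng):
    """One trial: test every flip with the Metropolis rule, then flip one accepted candidate."""
    deltas = flip_deltas(W, b, s)
    accepted = [i for i, d in enumerate(deltas)
                if d <= 0 or rng.random() < math.exp(-d / temperature)]
    if accepted:
        i = rng.choice(accepted)
        s[i] = -s[i]
    return s


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[0, 1, -1], [1, 0, 2], [-1, 2, 0]]
    b = [0, 1, 0]
    s = [1, -1, 1]
    for _ in range(20):
        step(W, b, s, temperature=0.5, rng=rng)
    print(s, energy(W, b, s))
```

In hardware the per-bit increments can be computed concurrently; here the list comprehension only stands in for that parallel evaluation.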
TL;DR: This paper empirically studies the characteristics of OpenMP, OpenACC, OpenCL, and CUDA with respect to programming productivity, performance, and energy, and uses the homegrown tool CodeStat to evaluate programming productivity.
Abstract: Many modern parallel computing systems are heterogeneous at their node level. Such nodes may comprise general purpose CPUs and accelerators (such as, GPU, or Intel Xeon Phi) that provide high performance with suitable energy-consumption characteristics. However, exploiting the available performance of heterogeneous architectures may be challenging. There are various parallel programming frameworks (such as, OpenMP, OpenCL, OpenACC, CUDA) and selecting the one that is suitable for a target context is not straightforward.
In this paper, we study empirically the characteristics of OpenMP, OpenACC, OpenCL, and CUDA with respect to programming productivity, performance, and energy. To evaluate the programming productivity we use our homegrown tool CodeStat, which enables us to determine the percentage of code lines that was required to parallelize the code using a specific framework. We use our tool x-MeterPU to evaluate the energy consumption and the performance. Experiments are conducted using the industry-standard SPEC benchmark suite and the Rodinia benchmark suite for accelerated computing on heterogeneous systems that combine Intel Xeon E5 Processors with a GPU accelerator or an Intel Xeon Phi co-processor.
TL;DR: This paper proposes a family of clustering-based cache partitioning policies to address fairness in systems that feature Intel’s Cache Allocation Technology (CAT), a hardware mechanism that can be controlled from userspace software and that allows partitions to be created in the LLC and different groups of applications to be assigned to them.
Abstract: Achieving system fairness is a major design concern in current multicore processors. Unfairness arises due to contention in the shared resources of the system, such as the LLC and main memory. To address this problem, many research works have proposed novel cache partitioning policies aimed at addressing system fairness without harming performance. Unfortunately, existing proposals targeting fairness require extra hardware, which makes them impractical in commercial processors. Recent Intel Xeon processors feature Cache Allocation Technology (CAT), a hardware cache partitioning mechanism that can be controlled from userspace software and that allows partitions to be created in the LLC and different groups of applications to be assigned to them. In this paper we propose a family of clustering-based cache partitioning policies to address fairness in systems that feature Intel’s CAT. The proposal acts at two levels: applications showing a similar number of core stalls due to LLC accesses are first grouped into clusters, after which each cluster is given a number of ways using a simple mathematical model. To the best of our knowledge, this is the first attempt to address system fairness using the cache partitioning hardware in a real product. Results show that our best performing policy reduces system unfairness by up to 80% (39% on average) for 8-application workloads and by up to 45% (25% on average) for 12-application workloads compared to a non-partitioning approach.
TL;DR: This paper describes the best strategy for an efficient sorting algorithm, which is a novel two-part hybrid sort, based on the well-known Quicksort algorithm, and demonstrates that this approach is faster than the GNU C++ sort algorithm and the Intel IPP library.
Abstract: The modern CPU’s design, which is composed of hierarchical memory and SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of the AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort, based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two libraries of reference: the GNU C++ sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4.
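The two-part structure described above (Quicksort-style partitioning on large inputs, a bitonic network on small ones) can be sketched in scalar Python. This is only a structural illustration: it uses no AVX-512, the partition step is a plain three-way split rather than the paper's new partitioning algorithm, and the threshold of 16 and all names are assumptions. The compare-exchange uses min/max so that the sequence of operations never depends on the data.

```python
def bitonic_sort_pow2(a):
    """Sort a list whose length is a power of two, in place, with a fixed compare-exchange network."""
    n = len(a)
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    lo, hi = min(a[i], a[partner]), max(a[i], a[partner])
                    # The direction depends only on the index, never on the data.
                    if (i & k) == 0:
                        a[i], a[partner] = lo, hi
                    else:
                        a[i], a[partner] = hi, lo
            j //= 2
        k *= 2
    return a


def sort_small(values):
    """Pad to the next power of two with +infinity, run the network, drop the padding."""
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    padded = list(values) + [float("inf")] * (size - n)
    return bitonic_sort_pow2(padded)[:n]


def hybrid_sort(values, small=16):
    """Quicksort-style recursion that hands small partitions to the bitonic network."""
    if len(values) <= small:
        return sort_small(values)
    pivot = values[len(values) // 2]
    less = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    greater = [v for v in values if v > pivot]
    return hybrid_sort(less, small) + equal + hybrid_sort(greater, small)


if __name__ == "__main__":
    print(hybrid_sort([5, 3, 9, 1, 4, 8, 2, 7, 6, 0, 11, 10, 15, 13, 12, 14, 16]))
```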
TL;DR: This study presented the work of porting and optimizing the GNAQPMS model on the second generation Intel Xeon Phi processor codenamed “Knights Landing” (KNL), and described the five optimizations applied to the key modules of GNAQPMS: CBM-Z gas chemistry, advection, convection and wet deposition.
Abstract: The GNAQPMS model is the global version of the Nested Air Quality Prediction Modelling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present our work of porting and optimizing the GNAQPMS model on the second generation Intel Xeon Phi processor codenamed “Knights Landing” (KNL). Compared with the first generation Xeon Phi coprocessor, KNL introduced many new hardware features such as a bootable processor, high performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimizations we applied to the key modules of GNAQPMS: CBM-Z gas chemistry, advection, convection and wet deposition. These optimizations work well on both the KNL 7250 processor and the Intel Xeon processor E5-2697 V4. They include: 1) updating the pure MPI parallel mode to hybrid parallel mode with MPI and OpenMP in emission, advection, convection and chemistry modules; 2) fully employing the 512-bit wide vector processing units (VPU) on the KNL platform; 3) reducing unnecessary memory access to improve cache efficiency; 4) reducing thread local storage (TLS) in CBM-Z gas phase chemistry module to improve its OpenMP performance; 5) changing global communication from interface-files writing/reading to using Message Passing Interface (MPI) functions to improve the performance and the parallel scalability. These optimizations greatly improved GNAQPMS performance. The same optimizations also work well for the Intel Xeon Broadwell processor, specifically, E5-2697v4. Compared with the baseline version of GNAQPMS, the optimized version is 3.34x faster on KNL and 2.39x faster on CPU. Furthermore, the optimized version on KNL runs at 26 % lower average power compared to CPU. Combining the performance and energy improvement, the KNL platform is 47% more efficient compared to the CPU platform.
The optimizations also enable much further parallel scalability on both the CPU cluster and the KNL cluster, scaling to 40 CPU nodes and 30 KNL nodes with a parallel efficiency of 70.4 % and 42.2 %, respectively.
TL;DR: A high-performance and scalable Data Partitioning-based Multi-Leader (DPML) solution that can take advantage of the parallelism offered by multi-/many-core architectures in conjunction with the high throughput and high-end features offered by InfiniBand and Omni-Path to significantly enhance the performance of MPI_Allreduce on modern HPC systems is proposed.
Abstract: Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors like Intel Xeon/Xeon Phis or the increases in communication throughput and recent advances in high-end features seen with modern interconnects like InfiniBand and Omni-Path. In this paper, we propose a high-performance and scalable Data Partitioning-based Multi-Leader (DPML) solution for MPI_Allreduce that can take advantage of the parallelism offered by multi-/many-core architectures in conjunction with the high throughput and high-end features offered by InfiniBand and Omni-Path to significantly enhance the performance of MPI_Allreduce on modern HPC systems. We also model DPML-based designs to analyze the communication costs theoretically. Microbenchmark level evaluations show that the proposed DPML-based designs are able to deliver up to 3.5 times performance improvement for MPI_Allreduce for multiple HPC systems at scale. At the application-level, up to 35% and 60% improvement is seen in communication for HPCG and miniAMR respectively.
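One way to picture a data-partitioning, multi-leader allreduce is the sequential simulation below. The abstract does not spell out the phases, so the three-step structure (intra-node reduction per chunk, inter-node reduction among same-index leaders, intra-node distribution) is an assumption, as are the function name and the nested-list data layout. No MPI calls are made.

```python
def dpml_allreduce(data, leaders):
    """data[node][proc] is a list of numbers; returns the same shape holding the global sum."""
    length = len(data[0][0])
    # Split the vector into one contiguous chunk per leader.
    bounds = [(l * length // leaders, (l + 1) * length // leaders) for l in range(leaders)]

    # Step 1 (intra-node): leader l reduces chunk l over all processes on its node.
    node_partial = [[[sum(proc[i] for proc in node) for i in range(lo, hi)]
                     for lo, hi in bounds] for node in data]

    # Step 2 (inter-node): leaders with the same index reduce their chunk across nodes.
    global_chunks = [[sum(node_partial[n][l][i] for n in range(len(data)))
                      for i in range(hi - lo)] for l, (lo, hi) in enumerate(bounds)]

    # Step 3 (intra-node): every process receives the concatenated chunks.
    result = [x for chunk in global_chunks for x in chunk]
    return [[list(result) for _ in node] for node in data]


if __name__ == "__main__":
    data = [[[1, 2, 3, 4], [10, 20, 30, 40]],
            [[100, 200, 300, 400], [1000, 2000, 3000, 4000]]]
    print(dpml_allreduce(data, leaders=2)[0][0])
```

In this reading, each leader handles only a fraction of the vector, so the inter-node exchanges of different chunks can proceed concurrently instead of funnelling through a single process per node.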
TL;DR: The results of the evaluation reveal that the proposal is able to identify the key objects to be promoted into fast on-package memory in order to optimize performance, leading to even surpassing hardware-based solutions.
Abstract: Multi-tiered memory systems, such as those based on Intel® Xeon Phi™ processors, are equipped with several memory tiers with different characteristics including, among others, capacity, access latency, bandwidth, energy consumption, and volatility. The proper distribution of the application data objects into the available memory layers is key to shorten the time-to-solution, but the way developers and end-users determine the most appropriate memory tier to place the application data objects has not been properly addressed to date. In this paper we present a novel methodology to build an extensible framework to automatically identify and place the application’s most relevant memory objects into the Intel Xeon Phi fast on-package memory. Our proposal works on top of in-production binaries by first exploring the application behavior and then substituting the dynamic memory allocations. This makes this proposal valuable even for end-users who do not have the possibility of modifying the application source code. We demonstrate the value of a framework based on our methodology for several relevant HPC applications using different allocation strategies to help end-users improve performance with minimal intervention. The results of our evaluation reveal that our proposal is able to identify the key objects to be promoted into fast on-package memory in order to optimize performance, leading to even surpassing hardware-based solutions.
TL;DR: Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations on a wide range of sparse matrices from real applications.
Abstract: Exploiting spatial and temporal localities is investigated for efficient row-by-row parallelization of general sparse matrix-matrix multiplication (SpGEMM) operation of the form $C=A\,B$ on many-core architectures. Hypergraph and bipartite graph models are proposed for 1D rowwise partitioning of matrix $A$ to evenly partition the work across threads with the objective of reducing the number of $B$-matrix words to be transferred from the memory and between different caches. A hypergraph model is proposed for $B$-matrix column reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for $C$-matrix rows. A similarity graph model is proposed for $B$-matrix row reordering to increase temporal reuse of these accumulation array entries. The proposed models and methods are tested on a wide range of sparse matrices from real applications and the experiments were carried out on a 60-core Intel Xeon Phi processor, as well as a two-socket Xeon processor. Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations.
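The row-by-row formulation and the temporary accumulation array mentioned above can be sketched as follows. This is a sequential illustration only: the dictionary-of-rows storage, the marker array used to reset the accumulator lazily, and the function name are assumptions, and none of the partitioning or reordering models from the abstract are included.

```python
def spgemm_rowwise(A, B, n_cols):
    """A and B map row index -> list of (column, value); returns C = A * B in the same form."""
    C = {}
    accumulator = [0.0] * n_cols  # dense temporary array (one per thread in a parallel version)
    marker = [None] * n_cols      # marker[j] == i means accumulator[j] is valid for row i
    for i, a_row in A.items():
        touched = []
        for k, a_ik in a_row:              # each nonzero A[i][k] selects row k of B
            for j, b_kj in B.get(k, []):
                if marker[j] != i:         # first contribution to C[i][j]
                    marker[j] = i
                    accumulator[j] = 0.0
                    touched.append(j)
                accumulator[j] += a_ik * b_kj
        C[i] = [(j, accumulator[j]) for j in sorted(touched)]
    return C


if __name__ == "__main__":
    A = {0: [(0, 1.0), (2, 2.0)], 1: [(1, 3.0)]}
    B = {0: [(1, 4.0)], 1: [(0, 5.0), (2, 6.0)], 2: [(1, 7.0), (2, 8.0)]}
    print(spgemm_rowwise(A, B, n_cols=3))
```

The inner loops show where locality matters: which rows of B are fetched depends on the column indices in each row of A, and which accumulator entries are written depends on the column indices in B.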
TL;DR: This work investigates how the novel architectural features offered by KNL can be used in the context of decomposing sparse, unstructured tensors using the canonical polyadic decomposition (CPD), and develops problem decompositions for the CPD which are amenable to hundreds of concurrent threads while maintaining load balance and low synchronization costs.
Abstract: HPC systems are increasingly used for data intensive computations which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands. To address these challenging demands, HPC systems are turning to many-core architectures that feature a large number of energy-efficient cores backed by high-bandwidth memory. These features are exemplified in Intel's recent Knights Landing many-core processor (KNL), which typically has 68 cores and 16GB of on-package multi-channel DRAM (MCDRAM). This work investigates how the novel architectural features offered by KNL can be used in the context of decomposing sparse, unstructured tensors using the canonical polyadic decomposition (CPD). The CPD is used extensively to analyze large multi-way datasets arising in various areas including precision healthcare, cybersecurity, and e-commerce. Towards this end, we (i) develop problem decompositions for the CPD which are amenable to hundreds of concurrent threads while maintaining load balance and low synchronization costs; and (ii) explore the utilization of architectural features such as MCDRAM. Using one KNL processor, our algorithm achieves up to 1.8x speedup over a dual socket Intel Xeon system with 44 cores.
TL;DR: The design of the Marcher system is presented and the usage of Marcher power measurement tools is demonstrated to obtain detailed power consumption data in various research projects.
TL;DR: Performance results prove that Singularity-based container technology can achieve near-native performance for both Intel Xeon and Intel Xeon Knights Landing platforms with different memory access modes and shows very little overhead for running MPI-based HPC applications on both Omni-Path and InfiniBand networks.
Abstract: The Message Passing Interface (MPI) standard has become the de facto programming model for parallel computing after 25 years of continuous community effort. With the development of building efficient HPC clouds, more and more MPI-based HPC applications start running on cloud-based environments. Singularity is one of the most attractive container technologies to build HPC clouds due to the claimed reproducible environments across the HPC centers. However, our investigations in the literature show that there is a lack of a systematic study on evaluating the performance of Singularity with various benchmarks and applications on different types of HPC platforms. Without these studies, it remains difficult to tell the community whether Singularity-based container technology is ready or not for running MPI applications on HPC clouds to gain desired performance. To fill this gap in the literature, as a third party, we first propose a four-dimension evaluation methodology to cover various aspects and based on that, we conduct extensive studies on evaluating the performance of Singularity on modern processors, and high-performance interconnects. Performance results prove that Singularity-based container technology can achieve near-native performance for both Intel Xeon and Intel Xeon Knights Landing (KNL) platforms with different memory access modes (i.e., cache, flat). Singularity also shows very little overhead for running MPI-based HPC applications on both Omni-Path and InfiniBand networks. With the verification of our results, we believe that Singularity can be used for building next-generation HPC clouds with near-native performance as well as desired cloud features such as easy management and deployment.
TL;DR: A data locality-aware sparsification scheme that optimizes the structure of the sparse CNN during training phase to make it friendly for hardware mapping is introduced and a distributed architecture composed of the customized processing elements (PEs) that enables high computation parallelism and data reuse rate of the compressed network is developed.
Abstract: Convolutional neural networks (CNNs) have recently broken many performance records in image recognition and object detection problems. The success of CNNs, to a great extent, is enabled by the fast scaling-up of the networks that learn from a huge volume of data. The deployment of big CNN models can be both computation-intensive and memory-intensive, leaving severe challenges to hardware implementations. In recent years, sparsification techniques that prune redundant connections in the networks while still retaining similar accuracy have emerged as promising solutions to alleviate the computation overheads associated with CNNs [1]. However, imposing sparsity in CNNs usually generates random network connections and thus, the irregular data access pattern results in poor data locality. The low computation efficiency of the sparse networks, which is caused by the incurred unbalance in computing resource consumption and low memory bandwidth usage, significantly offsets the theoretical reduction of the computation complexity and limits the execution scalability of CNNs on general-purpose architectures [2]. For instance, the sparse convolution, an important computation kernel in CNNs, is usually accelerated by using data compression schemes where only nonzero elements of the kernel weights are stored and sent to multiplication-accumulation computations (MACs) at runtime. However, the relevant executions on CPUs and GPUs reach only 0.1% to 10% of the system peak performance even when designated software libraries are applied (e.g., MKL library for CPUs and cuSPARSE library for GPUs). Field programmable gate arrays (FPGAs) have been also extensively studied as an important hardware platform for CNN computations [3]. Different from general-purpose architectures, FPGA allows users to customize the functions and organization of the designed hardware in order to adapt to various resource needs and data usage patterns.
This characteristic, as we identified in this work, can be leveraged to effectively overcome the main challenges in the execution of sparse CNNs through close coordination between software and hardware. In particular, the reconfigurability of FPGA helps to 1) better map the sparse CNN onto the hardware for improving computation parallelism and execution efficiency and 2) eliminate the computation cost associated with zero weights and enhance data reuse to alleviate the adverse impacts of the irregular data accesses. In this work, we propose a hardware-software co-design framework to address the above challenges in sparse CNN accelerations. First, we introduce a data locality-aware sparsification scheme that optimizes the structure of the sparse CNN during training phase to make it friendly for hardware mapping. Both memory allocation and data access regularization are considered in the optimization process. Second, we develop a distributed architecture composed of the customized processing elements (PEs) that enables high computation parallelism and data reuse rate of the compressed network. Moreover, a holistic sparse optimization is introduced to our design framework for hardware platforms with different requirements. We evaluate our proposed framework by executing AlexNet on Xilinx Zynq ZC706. Our FPGA accelerator obtains a processing power of 71.2 GOPS, corresponding to 271.6 GOPS on the dense CNN model. On average, our FPGA design runs 11.5× faster than a well-tuned CPU implementation on Intel Xeon E5-2630, and has 3.2× better energy efficiency over the GPU realization on Nvidia Pascal Titan X. Compared to state-of-the-art FPGA designs [4], our accelerator reduces the classification time by 2.1×, with
TL;DR: This work proposes PEASE, a Programmable Event-driven processor Architecture for SNN Evaluation, and a method to map any given SNN to PEASE such that the workload is balanced across SPUs and SPU clusters, while pipelining across layers of the network to improve performance.
Abstract: Spiking neural networks (SNNs) represent the third generation of neural networks and are expected to enable new classes of machine learning applications. However, evaluating large-scale SNNs (e.g., of the scale of the visual cortex) on power-constrained systems requires significant improvements in computing efficiency. A unique attribute of SNNs is their event-driven nature: information is encoded as a series of spikes, and work is dynamically generated as spikes propagate through the network. Therefore, parallel implementations of SNNs on multi-cores and GPGPUs are severely limited by communication and synchronization overheads. Recent years have seen great interest in deep learning accelerators for non-spiking neural networks; however, these architectures are not well suited to the dynamic, irregular parallelism in SNNs. Prior efforts on specialized SNN hardware utilize spatial architectures, wherein each neuron is allocated a dedicated processing element, and large networks are realized by connecting multiple chips into a system. While suitable for large-scale systems, this approach is not a good match to size or cost constrained mobile devices. We propose PEASE, a Programmable Event-driven processor Architecture for SNN Evaluation. PEASE comprises Spike Processing Units (SPUs) that are dynamically scheduled to execute computations triggered by a spike. Instructions to the SPUs are dynamically generated by Spike Schedulers (SSs) that utilize event queues to track unprocessed spikes and identify neurons that need to be evaluated. The memory hierarchy in PEASE is fully software managed, and the processing elements are interconnected using a two-tiered bus-ring topology matching the communication characteristics of SNNs. We propose a method to map any given SNN to PEASE such that the workload is balanced across SPUs and SPU clusters, while pipelining across layers of the network to improve performance.
We implemented PEASE at the RTL level and synthesized it to IBM 45 technology. Across 6 SNN benchmarks, our 64-SPU configuration of PEASE achieves 7.1×−17.5× and 2.6×−5.8× speedups, respectively, over software implementations on an Intel Xeon E5-2680 CPU and NVIDIA Tesla K40C GPU. The energy reductions over the CPU and GPU are 71×−179× and 198×−467×, respectively.
TL;DR: The porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL), greatly improved the GNAQPMS performance and enabled much further parallel scalability on both the CPU cluster and the KNL cluster, with a parallel efficiency of 70.4 and 42.2 %, respectively.
Abstract: The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU.
With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.
TL;DR: SWhybrid is a hybrid computing framework for large-scale biological sequence database search on heterogeneous computing environments with multi-core or many-core processing units (PUs) based on the Smith-Waterman (SW) algorithm that achieves an efficiency of over 80% on all tested CPUs and GPUs and over 70% on Xeon Phis.
Abstract: Computer architectures continue to develop rapidly towards massively parallel and heterogeneous systems. Thus, easily extensible yet highly efficient parallelization approaches for a variety of platforms are urgently needed. In this paper, we present SWhybrid, a hybrid computing framework for large-scale biological sequence database search on heterogeneous computing environments with multi-core or many-core processing units (PUs) based on the Smith-Waterman (SW) algorithm. To incorporate a diverse set of PUs such as combinations of CPUs, GPUs and Xeon Phis, we abstract them as SIMD vector execution units with different numbers of lanes. We propose a machine model, associated with a unified programming interface implemented in C++, to abstract underlying architectural differences. Performance evaluation reveals that SWhybrid (i) outperforms all other tested state-of-the-art tools on both homogeneous and heterogeneous computing platforms, (ii) achieves an efficiency of over 80% on all tested CPUs and GPUs and over 70% on Xeon Phis, and (iii) achieves utilization rates of over 80% on all tested heterogeneous platforms. Our results demonstrate that there is enough commonality between vector-like instructions across CPUs and GPUs that one can develop higher-level abstractions and still specialize with close-to-peak performance. SWhybrid is open-source software and freely available at https://github.com/turbo0628/swhybrid.
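For readers unfamiliar with the Smith-Waterman algorithm named in the abstract above, the following is a minimal scalar sketch of its scoring recurrence. It is not SWhybrid's vectorized C++ implementation. The function name `smith_waterman_score`, the linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of two strings.

    Hypothetical helper: uses a linear gap penalty and illustrative
    scoring values, not parameters taken from the SWhybrid paper.
    """
    # Only the previous row of the dynamic-programming matrix is kept.
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] + gap        # gap in the subject
            left = curr[j - 1] + gap  # gap in the query
            # Clamping at 0 lets an alignment restart anywhere.
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Each query/subject pair in a database search is independent of the others, which is one reason this workload maps naturally onto many SIMD lanes or devices.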
TL;DR: Modern OpenMP threading techniques are used to convert the MPI-only Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm, which was shown to run up to six times faster than the original for a range of molecular system sizes.
Abstract: Modern OpenMP threading techniques are used to convert the MPI-only Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm. Two separate implementations are considered, which differ in whether key data structures (the density and Fock matrices) are shared or replicated among threads. All implementations are benchmarked on a supercomputer of 3,000 Intel® Xeon Phi™ processors. With 64 cores per processor, scaling numbers are reported on up to 192,000 cores. The hybrid MPI/OpenMP implementation reduces the memory footprint by approximately 200 times compared to the legacy code. The MPI/OpenMP code was shown to run up to six times faster than the original for a range of molecular system sizes.
TL;DR: This paper presents a GPU-accelerated implementation of the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method with an inexact nullspace filtering approach to find eigenvalues in electromagnetics analysis with higher-order FEM.
Abstract: This paper presents a GPU-accelerated implementation of the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method with an inexact nullspace filtering approach to find eigenvalues in electromagnetics analysis with higher-order FEM. The performance of the proposed approach is verified using the Kepler (Tesla K40c) graphics accelerator, and is compared to the performance of the implementation based on functions from the Intel MKL on the Intel Xeon (E5-2680 v3, 12 threads) central processing unit (CPU) executed in parallel mode. Compared to the CPU reference implementation based on the Intel MKL functions, the proposed GPU-based LOBPCG method with inexact nullspace filtering allowed us to achieve up to 2.9-fold acceleration.
TL;DR: Several parallelization and optimization techniques have been presented in this paper with special emphasis on the shuffle intrinsic specific to the Nvidia Kepler architecture (or later), which significantly improves the performance compared to existing GPU implementations in the literature.
Abstract: The implementation of 2D-3v (2D in space and 3D in velocity space) PIC-MCC (Particle-In-Cell Monte Carlo Collision) method described in this paper involves the computational solution of Vlasov-Poisson equations, which provides the spatial and temporal evolution of the charged-particle velocity distribution functions in plasmas under the effect of self-consistent electromagnetic (EM) fields and collisions. Stringent numerical constraints associated with a PIC code make it computationally prohibitive on CPUs for large problem sizes (total number of particles, number of grid points and simulation time-scale). We present the design and implementation of a Graphics Processing Unit (GPU) based 2D-3v PIC code using the CUDA C APIs for the Kepler architecture. Several parallelization and optimization techniques have been presented in this paper with special emphasis on the shuffle intrinsic specific to the Nvidia Kepler architecture (or later), which significantly improves the performance compared to existing GPU implementations in the literature. On a test bed comprising a serial implementation on a Xeon E5 CPU and parallel implementations on an Nvidia Tesla K40 graphics card, we have achieved a speedup of up to 60x in double precision mode. The effect of important numerical parameters on speedup has been investigated. Finally, we compare the performance of our best parallel implementation on different GPUs (Kepler as well as Maxwell) and analyze the effect of hardware architecture on the performance of the PIC code.
TL;DR: Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS) is a parallelization scheme for training convolutional neural networks (CNN) that is tailored for parallel computing systems accelerated with the Intel Xeon Phi.
Abstract: Deep learning is an important component of big-data analytic tools and intelligent applications, such as self-driving cars, computer vision, speech recognition, or precision medicine. However, the training process is computationally intensive, and often requires a large amount of time if performed sequentially. Modern parallel computing systems provide the capability to reduce the required training time of deep neural networks. In this paper, we present our parallelization scheme for training convolutional neural networks (CNN) named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS). Major features of CHAOS include the support for thread and vector parallelism, non-instant updates of weight parameters during back-propagation without a significant delay, and implicit synchronization in arbitrary order. CHAOS is tailored for parallel computing systems that are accelerated with the Intel Xeon Phi. We evaluate our parallelization approach empirically using measurement techniques and performance modeling for various numbers of threads and CNN architectures. Experimental results for the MNIST dataset of handwritten digits using the total number of threads on the Xeon Phi show speedups of up to 103x compared to the execution on one thread of the Xeon Phi, 14x compared to the sequential execution on Intel Xeon E5, and 58x compared to the sequential execution on Intel Core i5.
TL;DR: Comparison results indicate that the parallel SCE-UA significantly improves computational efficiency compared to the original serial version, and that the OpenCL implementation obtains the best overall acceleration results, albeit with the most complex source code.
TL;DR: This paper studies the performance/energy characteristics of OpenCL-generated FPGA designs on irregular memory access patterns, targeting XSBench, a memory-intensive Monte Carlo simulation code, as a case study, and implements XSBench in OpenCL and studies optimization strategies for FPGAs.
Abstract: FPGAs are becoming an attractive choice as a heterogeneous computing unit for scientific computing because FPGA vendors are adding floating-point-optimized architectures to their product lines. Additionally, high-level synthesis (HLS) tools such as Altera OpenCL SDK are emerging, which could potentially break the FPGA programming wall and provide a streamlined flow for domain experts in scientific computing. On the other hand, providing high performance in the presence of irregular memory access patterns to off-chip memory remains a challenge for the automated synthesis flows. In this paper, we study the performance/energy characteristics of OpenCL-generated FPGA designs on irregular memory access patterns, targeting XSBench, a memory-intensive Monte Carlo simulation code, as a case study. To complete our study, we implement XSBench in OpenCL and study optimization strategies for FPGAs. We observe that our OpenCL implementation of XSBench achieves 50 % higher energy efficiency on an Intel Arria10-based FPGA platform than that on an Intel Xeon 8-core CPU while trading off 35 % of performance.
TL;DR: This paper discusses how the performance of GPU-based CBC-AES decryption is influenced by four key parameters: the size of the input data, the number of threads per block, the memory allocation style and the parallel granularity.
Abstract: The Advanced Encryption Standard (AES) is now widely used in security applications. However, there is still considerable room to improve its execution efficiency. Since graphics processing units (GPUs), with their strong parallel computing capability, have been applied to general-purpose computation, people have tried to use them to speed up various cryptographic algorithms. This paper discusses how the performance of GPU-based CBC-AES decryption is influenced by four key parameters: the size of the input data, the number of threads per block, the memory allocation style and the parallel granularity. Furthermore, we compare the performance of AES on the GPU with that of standard AES and AES-NI, and find that the parameter settings that achieve the best performance vary with the size of the input data. We therefore provide several recommendations on how to implement CBC-AES on a GPU for different input data sizes. In particular, with our optimization method, our best experimental result on a GPU (NVIDIA Tesla K40m) is about 112 times faster than the AES implementation on a CPU (Intel Xeon E5-2650).
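As background for why CBC decryption, unlike CBC encryption, suits a GPU: each plaintext block depends only on two ciphertext blocks, P_i = D(C_i) XOR C_{i-1}, so blocks can be recovered independently and in any order. The sketch below illustrates this with a toy byte-wise block transform standing in for AES. The toy cipher and all function names are assumptions for illustration only; they are not secure and not the paper's implementation. Inputs are assumed to be a multiple of 16 bytes.

```python
BLOCK = 16  # block size in bytes, matching AES

def toy_encrypt_block(block, key):
    # Toy stand-in for AES encryption (NOT secure): byte-wise addition.
    return bytes((b + k) % 256 for b, k in zip(block, key))

def toy_decrypt_block(block, key):
    # Inverse of the toy transform above.
    return bytes((b - k) % 256 for b, k in zip(block, key))

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(plaintext, key, iv):
    # Encryption is inherently sequential: block i needs ciphertext i-1.
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        prev = toy_encrypt_block(xor_bytes(plaintext[i:i + BLOCK], prev), key)
        out.append(prev)
    return b"".join(out)

def cbc_decrypt_block(ciphertext, key, iv, index):
    # Needs only ciphertext blocks index and index-1, both already known,
    # so one thread per block (or per group of blocks) is possible.
    start = index * BLOCK
    current = ciphertext[start:start + BLOCK]
    previous = iv if index == 0 else ciphertext[start - BLOCK:start]
    return xor_bytes(toy_decrypt_block(current, key), previous)

def cbc_decrypt_any_order(ciphertext, key, iv, order):
    # Decrypt blocks in the given order, then reassemble by index.
    blocks = {i: cbc_decrypt_block(ciphertext, key, iv, i) for i in order}
    return b"".join(blocks[i] for i in sorted(blocks))
```

How many blocks each thread handles in such a scheme is one way to read the "parallel granularity" parameter mentioned in the abstract.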
TL;DR: The numerical results performed on the Xeon multi-core processor and two generations of the Xeon Phi many-core platform validate the proposed implementation and highlight the importance of vectorization necessary to exploit the features of modern hardware.
Abstract: In the paper we study the performance of the regularized boundary element quadrature routines implemented in the BEM4I library developed by the authors. Apart from the results obtained on the classical multi-core architecture represented by the Intel Xeon processors we concentrate on the portability of the code to the many-core family Intel Xeon Phi. Contrary to the GP-GPU programming accelerating many scientific codes, the standard x86 architecture of the Xeon Phi processors allows the existing multi-core implementation to be reused. Although in many cases a simple recompilation would lead to an inefficient utilization of the Xeon Phi, the effort invested in the optimization usually leads to a better performance on the multi-core Xeon processors as well. This makes the Xeon Phi an interesting platform for scientists developing a software library aimed at both modern portable PCs and high performance computing environments. Here we focus on the manually vectorized assembly of the local element contributions and the parallel assembly of the global matrices on shared memory systems. Due to the quadratic complexity of the standard assembly we also present an assembly sparsified by the adaptive cross approximation based on the same acceleration techniques. The numerical results performed on the Xeon multi-core processor and two generations of the Xeon Phi many-core platform validate the proposed implementation and highlight the importance of vectorization necessary to exploit the features of modern hardware.
TL;DR: It is shown that in most cases the performance of ROSS scales well with the best results achieved when thread affinity is assigned, CPU cores are evenly loaded, cache sharing is exploited and communication is limited to small clusters of cores.
Abstract: Performance and scalability of Parallel Discrete Event Simulation (PDES) is often limited by fine-grain communication, especially in execution environments with high communication cost. However, the low cost of on-chip communication in emerging many-core processors offers a promise to substantially alleviate conventional PDES bottlenecks. In this paper, we present a detailed evaluation and characterization of the multi-threaded ROSS simulator on Intel's Knights Landing (KNL) processor. KNL is the second generation of the Intel Xeon Phi family of processors offering significant architecture improvements including 64 out-of-order multithreaded cores, sharing of some levels of the cache hierarchy among the cores, a fast 2D mesh interconnect network and the ability to reconfigure the processor to support various clustering modes. We analyze the performance and scalability of the ROSS simulator on the KNL processor under different thread counts, communication patterns, event processing granularities, synchronization periods, thread placement policies, and workload partitioning schemes. We conclude that within a single KNL processor, up to 2X performance improvement can be achieved compared to commodity Xeon multicore processors. We show that in most cases the performance of ROSS scales well with the best results achieved when thread affinity is assigned, CPU cores are evenly loaded, cache sharing is exploited and communication is limited to small clusters of cores.