TL;DR: This work characterizes the hybrid execution patterns of GCNs on an Intel Xeon CPU and designs HyGCN, a GCN accelerator with a hybrid architecture whose two processing engines alleviate the irregularity of the Aggregation phase and leverage the regularity of the Combination phase.
Abstract: Inspired by the great success of neural networks, graph convolutional neural networks (GCNs) are proposed to analyze graph data. GCNs mainly include two phases with distinct execution patterns. The Aggregation phase behaves like graph processing, showing a dynamic and irregular execution pattern. The Combination phase acts more like a neural network, presenting a static and regular execution pattern. The hybrid execution patterns of GCNs require a design that alleviates irregularity and exploits regularity. Moreover, to achieve higher performance and energy efficiency, the design needs to leverage the high intra-vertex parallelism in the Aggregation phase, the highly reusable inter-vertex data in the Combination phase, and the opportunity to fuse phase-by-phase execution introduced by the new features of GCNs. However, existing architectures fail to address these demands. In this work, we first characterize the hybrid execution patterns of GCNs on an Intel Xeon CPU. Guided by the characterization, we design a GCN accelerator, HyGCN, using a hybrid architecture to efficiently perform GCNs. Specifically, first, we build a new programming model to exploit the fine-grained parallelism for our hardware design. Second, we propose a hardware design with two efficient processing engines to alleviate the irregularity of the Aggregation phase and leverage the regularity of the Combination phase. Besides, these engines can exploit various forms of parallelism and reuse highly reusable data efficiently. Third, we optimize the overall system via an inter-engine pipeline for inter-phase fusion and priority-based off-chip memory access coordination to improve off-chip bandwidth utilization. Compared to the state-of-the-art software framework running on an Intel Xeon CPU and an NVIDIA V100 GPU, our work achieves on average 1509× speedup with 2500× energy reduction and on average 6.5× speedup with 10× energy reduction, respectively.
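The two phases named in the abstract can be illustrated with a minimal sketch. It assumes a plain sum over each vertex and its neighbors for Aggregation and a dense weight multiplication with ReLU for Combination; the function names and the toy graph are hypothetical, and real GCN variants typically normalize the aggregated sum.

```python
# Minimal sketch of one GCN layer split into its two phases.
# Function names and the toy graph are illustrative assumptions.

def aggregate(neighbors, features):
    """Aggregation: sum the feature vectors of each vertex and its neighbors.

    The memory accesses depend on the graph structure, which is what makes
    this phase irregular.
    """
    out = []
    for v, nbrs in enumerate(neighbors):
        acc = list(features[v])  # start from the vertex's own features
        for u in nbrs:
            for i, value in enumerate(features[u]):
                acc[i] += value
        out.append(acc)
    return out


def combine(aggregated, weights):
    """Combination: multiply every vertex vector by the same dense weights.

    The weights are shared by all vertices, so the access pattern is regular
    and the weight matrix is highly reusable.
    """
    out = []
    for vec in aggregated:
        row = []
        for col in range(len(weights[0])):
            s = sum(vec[k] * weights[k][col] for k in range(len(vec)))
            row.append(max(0.0, s))  # ReLU activation (an assumption)
        out.append(row)
    return out


neighbors = [[1, 2], [0], [0]]           # adjacency lists of a 3-vertex graph
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 2.0]]      # 2 input features -> 2 output features
hidden = combine(aggregate(neighbors, features), weights)
```

Running the phases back to back like this is the phase-by-phase execution that the abstract proposes to fuse with an inter-engine pipeline.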
TL;DR: FlexTensor can optimize tensor computation programs without human interference, allowing programmers to only work on high-level programming abstraction without considering the hardware platform details.
Abstract: Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost has led to high demand for flexible, portable, and high-performance library implementations on heterogeneous hardware accelerators such as GPUs and FPGAs. However, current tensor library implementations mainly require programmers to manually design low-level implementations and optimize them from the algorithm, architecture, and compilation perspectives. Such a manual development process often takes months or even years, which falls far behind the rapid evolution of the application algorithms. In this paper, we introduce FlexTensor, which is a schedule exploration and optimization framework for tensor computation on heterogeneous systems. FlexTensor can optimize tensor computation programs without human interference, allowing programmers to only work on high-level programming abstraction without considering the hardware platform details. FlexTensor systematically explores the optimization design spaces that are composed of many different schedules for different hardware. Then, FlexTensor combines different exploration techniques, including heuristic and machine learning methods, to find the optimized schedule configuration. Finally, based on the results of exploration, customized schedules are automatically generated for different hardware. In the experiments, we test 12 different kinds of tensor computations with hundreds of test cases in total, and FlexTensor achieves an average 1.83x performance speedup on NVIDIA V100 GPU compared to cuDNN; 1.72x performance speedup on Intel Xeon CPU compared to MKL-DNN for 2D convolution; 1.5x performance speedup on Xilinx VU9P FPGA compared to OpenCL baselines; 2.21x speedup on NVIDIA V100 GPU compared to the state-of-the-art.
TL;DR: In a performance comparison with Marvell (Cavium) ThunderX2 processor and Intel Xeon Skylake processor, the A64FX achieved higher performance in a memory bandwidth-intensive application thanks to its high memory bandwidth.
Abstract: RIKEN Center for Computational Science has been installing the supercomputer Fugaku. The Fujitsu A64FX, based on the Armv8.2-A+SVE architecture, is used in the system. In this paper, we evaluated the seven HPC applications and benchmarks on the A64FX. In a performance comparison with Marvell (Cavium) ThunderX2 processor and Intel Xeon Skylake processor, the A64FX achieved higher performance in a memory bandwidth-intensive application thanks to its high memory bandwidth. However, we confirmed that the performance of the A64FX decreased from a lack of out-of-order resources. To mitigate this problem, the “loop fission” function of the Fujitsu compiler was used to improve the performance.
TL;DR: An embedded system for fast and accurate license plate segmentation and recognition using a modified single shot detector (SSD) with a feature extractor based on depthwise separable convolutions and linear bottlenecks is proposed.
Abstract: As Intelligent Transportation Systems applications increase in prevalence, Automatic License Plate Recognition solutions must be made continually faster and more accurate. The authors propose an embedded system for fast and accurate license plate segmentation and recognition using a modified single shot detector (SSD) with a feature extractor based on depthwise separable convolutions and linear bottlenecks. The feature extractor requires fewer parameters than the original SSD + VGG implementation, enabling fast inference. Tested on the Caltech Cars dataset, the proposed model achieves 96.46% segmentation and 96.23% recognition accuracy. Tested on the UCSD-Stills dataset, the proposed model achieves 99.79% segmentation and 99.79% recognition accuracy. The authors achieve a per-plate (resized to 300 × 300 px) processing time of 59 ms on an Intel Xeon CPU with 12 cores (2.60 GHz per core), 14 ms using the same CPU and OpenVINO (a neural network acceleration platform), and 66 ms using the proposed low-cost embedded system, a Raspberry Pi 3 and Intel Neural Compute Stick 2 with OpenVINO.
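To make the parameter saving of depthwise separable convolutions concrete, the sketch below counts weights for a single layer. The layer sizes are hypothetical and biases are ignored; it does not reproduce the authors' actual network.

```python
# Weight counts for one convolution layer (biases ignored).
# The layer sizes used below are hypothetical examples.

def standard_conv_params(kernel, c_in, c_out):
    """A standard convolution has one kernel x kernel filter per (input, output) channel pair."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Depthwise step: one kernel x kernel filter per input channel.
    Pointwise step: a 1 x 1 convolution mixing c_in channels into c_out."""
    return kernel * kernel * c_in + c_in * c_out


standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
reduction = standard / separable
```

For this hypothetical 3 × 3 layer with 64 input and 128 output channels, the separable form needs roughly one eighth of the weights.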
TL;DR: This paper presents early experiences with both the compatibility tool and oneAPI, as well as the employed extension to the SYCL programming standard, for the tsunami simulation code easyWave, and compares the original code running on Xeon processors using OpenMP as well as CUDA with the performance of the DPC++ counterpart.
Abstract: Recently, Intel released the oneAPI programming environment. With Data Parallel C++ (DPC++), oneAPI enables codes to target multiple hardware architectures like multi-core CPUs, GPUs, and even FPGAs or other hardware using a single source. For legacy codes that were written for Nvidia GPUs, a compatibility tool is provided which facilitates the transition to the SYCL-based DPC++ programming language. This paper presents early experiences with both the compatibility tool and oneAPI, as well as the employed extension to the SYCL programming standard, for the tsunami simulation code easyWave. A performance study compares the original code running on Xeon processors using OpenMP as well as CUDA with the performance of the DPC++ counterpart on multicore CPUs as well as integrated GPUs.
TL;DR: The Kernel Tuning Toolkit as discussed by the authors enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data changes, which is key to performance portability.
TL;DR: In this article, the authors use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications, and compare Intel Arria 10 and Xilinx U280 performance against Intel Xeon, Intel Xeon Phi, and NVIDIA V100 GPUs.
Abstract: Hardware specialization is a promising direction for the future of digital computing. Reconfigurable technologies enable hardware specialization with modest non-recurring engineering cost. In this paper, we use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications. In order to properly evaluate performance, we not only compare Intel Arria 10 and Xilinx U280 performance against Intel Xeon, Intel Xeon Phi, and NVIDIA V100 GPUs, but we also extend the Empirical Roofline Toolkit (ERT) to FPGAs in order to assess our results in terms of the Roofline Model. Although FPGA performance is known to be far less than that of a GPU, we also benchmark the energy efficiency of each platform for the scientific kernels comparing to microbenchmark and technological limits. Results show that while FPGAs struggle to compete in absolute terms with GPUs on memory- and compute-intensive kernels, they require far less power and can deliver nearly the same energy efficiency.
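The Roofline Model mentioned above bounds attainable performance by either peak compute or memory bandwidth times arithmetic intensity. A minimal sketch follows; the platform numbers are hypothetical and are not measurements from the paper.

```python
# Roofline bound: performance is limited by compute or by memory traffic.
# All platform numbers below are hypothetical.

def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Return the roofline bound in GFLOP/s.

    arithmetic_intensity is in FLOPs per byte moved to or from memory.
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


memory_bound = attainable_gflops(1000.0, 100.0, 0.5)   # low-intensity kernel
compute_bound = attainable_gflops(1000.0, 100.0, 20.0)  # high-intensity kernel
```

Kernels whose intensity falls below the ridge point are limited by bandwidth, so a lower-power platform with similar bandwidth can come close in energy efficiency.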
TL;DR: This paper addresses the challenges of executing quantized deep learning models on diverse hardware platforms by proposing an augmented compiler approach that created a new dialect called Quantized Neural Network (QNN) that extends the compiler's internal representation with a quantization context.
Abstract: A growing number of applications implement predictive functions using deep learning models, which require heavy use of compute and memory. One popular technique for increasing resource efficiency is 8-bit integer quantization, in which 32-bit floating point numbers (fp32) are represented using shorter 8-bit integer numbers. Although deep learning frameworks such as TensorFlow, TFLite, MXNet, and PyTorch enable developers to quantize models with only a small drop in accuracy, they are not well suited to execute quantized models on a variety of hardware platforms. For example, TFLite is optimized to run inference on ARM CPU edge devices but it does not have efficient support for Intel CPUs and Nvidia GPUs. In this paper, we address the challenges of executing quantized deep learning models on diverse hardware platforms by proposing an augmented compiler approach. A deep learning compiler such as Apache TVM can enable the efficient execution of models from various frameworks on various targets. Many deep learning compilers today, however, are designed primarily for fp32 computation and cannot optimize a pre-quantized INT8 model. To address this issue, we created a new dialect called Quantized Neural Network (QNN) that extends the compiler's internal representation with a quantization context. With this quantization context, the compiler can generate efficient code for pre-quantized models on various hardware platforms. As implemented in Apache TVM, we observe that the QNN-augmented deep learning compiler achieves speedups of 2.35x, 2.15x, 1.35x and 1.40x on Intel Xeon Cascade Lake CPUs, Nvidia Tesla T4 GPUs, ARM Raspberry Pi3 and Pi4, respectively, against well-optimized fp32 execution, and comparable performance to the state-of-the-art framework-specific solutions.
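The 8-bit integer quantization described above is commonly expressed as an affine mapping with a scale and a zero point. The sketch below assumes that common scheme with a signed INT8 range; it is not the QNN dialect or any Apache TVM API, and the function names are hypothetical.

```python
# Affine INT8 quantization sketch (assumed scheme, not the QNN dialect itself).

INT8_MIN, INT8_MAX = -128, 127


def quantize(x, scale, zero_point):
    """Map a float to a signed 8-bit integer: q = round(x / scale) + zero_point."""
    q = round(x / scale) + zero_point   # Python rounds exact halves to even
    return max(INT8_MIN, min(INT8_MAX, q))  # saturate to the INT8 range


def dequantize(q, scale, zero_point):
    """Recover an approximate float from the quantized value."""
    return (q - zero_point) * scale


q = quantize(1.0, 0.5, 0)
restored = dequantize(q, 0.5, 0)
saturated = quantize(1000.0, 0.5, 0)
```

The scale and zero point are the kind of per-tensor metadata that a quantization context in the compiler's representation would need to carry.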
TL;DR: This research implements an FPGA-based accelerator called FPGAN for graph attention networks that achieves significant improvement on performance and energy efficiency without losing accuracy compared with PyTorch baseline.
Abstract: Graph Attention Networks (GATs) exhibit outstanding performance in multiple authoritative node classification benchmark tests (both transductive and inductive). The purpose of this research is to implement an FPGA-based accelerator called FPGAN for graph attention networks that achieves significant improvement in performance and energy efficiency without losing accuracy compared with the PyTorch baseline. It eliminates the dependence on digital signal processors (DSPs) and large amounts of on-chip memory and can even work well on low-end FPGA devices. We design FPGAN with software and hardware co-optimization across the full stack from algorithm through architecture. Specifically, we compress the model to reduce its size, quantize features to perform fixed-point calculation, replace multiplication addition cells (MACs) with shift addition units (SAUs) to eliminate the dependence on DSPs, and design an efficient algorithm to approximate the SoftMax function. We also adjust the activation functions and fuse operations to further reduce the computation requirement. Moreover, all data is vectorized and aligned for scalable vector computation and efficient memory access. All the above optimizations are integrated into a universal hardware pipeline for various structures of GATs. We evaluate our design on an Inspur F10A board with an Intel Arria 10 GX1150 and 16 GB DDR3 memory. Experimental results show that FPGAN can achieve 7.34 times speedup over Nvidia Tesla V100 and 593 times over Xeon CPU Gold 5115 while maintaining accuracy, and 48 times and 2400 times on energy efficiency, respectively.
TL;DR: The impact of reducing voltage margins beyond the nominal level on the efficiency of CPU power capping mechanisms is investigated, for three commercial systems, two Applied Micro ARMv8 micro-servers (X-Gene2 and X-Gene3) and an Intel x86-64 (Xeon E3).
Abstract: CPUs typically operate at a voltage which is higher than what is strictly required, using voltage margins to account for process variability and anticipate any combination of adverse operating conditions. However, these worst-case scenarios occur rarely, if ever, thus the operating voltage is overly pessimistic resulting in excessive power dissipation which leads to decreased performance under power capping. In this paper, we investigate the impact of reducing voltage margins beyond the nominal level on the efficiency of CPU power capping mechanisms, for three commercial systems, two Applied Micro ARMv8 micro-servers (X-Gene2 and X-Gene3) and an Intel x86-64 (Xeon E3). We show that CPU power capping at reduced voltage margins compared with Intel's RAPL and Dynamic Frequency Scaling (DFS) mechanisms results in performance improvement by up to 64% and 24% on average, respectively. In combination with state-of-the-art thread packing, the reduction of CPU voltage margins results in 36%, 33% and 27% performance improvement compared with RAPL and DFS for the Xeon E3 and the X-Gene processors, respectively. Also, we validate the robustness of our approach with a set of long-running experiments and show that significant energy gains can be achieved even when considering the cost of checkpointing and recovery in large-scale systems.
TL;DR: In this paper, the authors evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis (HLS) and show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, their design technique with combined spatial and temporal blocking remains effective.
Abstract: In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, compute performance compared to first-order stencils. We use an OpenCL-based design that, apart from parameterizing performance knobs, also parameterizes the stencil radius. Furthermore, we show that our performance model exhibits the same accuracy as first-order stencils in predicting the performance of high-order ones. On an Intel Arria 10 GX 1150 device, for 2D and 3D star-shaped stencils, we achieve over 700 and 270 GFLOP/s of compute performance, respectively, up to a stencil radius of four. These results outperform the state-of-the-art YASK framework on a modern Xeon for 2D and 3D stencils, and outperform a modern Xeon Phi for 2D stencils, while achieving competitive performance in 3D. Furthermore, our FPGA design achieves better power efficiency in almost all cases.
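For readers unfamiliar with high-order stencils, the sketch below applies one sweep of a 2D star-shaped stencil with a configurable radius. It is a plain reference loop with hypothetical weights and none of the spatial or temporal blocking used in the paper's OpenCL design.

```python
# One sweep of a 2D star-shaped stencil with configurable radius.
# Plain reference loop with hypothetical weights; no blocking is applied.

def star_stencil_2d(grid, radius, center_w, ring_w):
    """ring_w[d - 1] weights the four points at distance d along the two axes.

    Cells closer than `radius` to the border are copied unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(radius, rows - radius):
        for x in range(radius, cols - radius):
            acc = center_w * grid[y][x]
            for d in range(1, radius + 1):
                acc += ring_w[d - 1] * (
                    grid[y][x - d] + grid[y][x + d]
                    + grid[y - d][x] + grid[y + d][x]
                )
            out[y][x] = acc
    return out


def points_per_cell(radius, dims):
    """Number of input points read per output cell for a star-shaped stencil."""
    return 2 * dims * radius + 1


impulse = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
swept = star_stencil_2d(impulse, 1, 0.5, [0.125])
```

The point count grows linearly with the radius (17 points in 2D and 25 in 3D at radius four), which is the source of the higher computation intensity and on-chip memory requirement noted in the abstract.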
TL;DR: This work proposes an efficient FPGA-based reconfigurable framework, called FP-AMG, for high-performance AMG calculation, and proposes a novel and scalable architecture that can be reused for all kernels in AMG.
Abstract: Partial Differential Equations (PDEs) are fundamental to many real-world scientific computing applications and so their optimization has undergone decades of study. Algebraic multigrid (AMG) is one of the most well-known solvers, being widely adopted in High Performance Computing (HPC) due to its good scalability. Acceleration of AMG is known to be very challenging, due to the following reasons: (1) irregular computation patterns, (2) random memory access, and (3) a large number of kernels with various computation types. To the best of our knowledge, there is no prior work on FPGA-based acceleration of AMG. To tackle these challenges, we propose an efficient FPGA-based reconfigurable framework, called FP-AMG, for high-performance AMG calculation. In order to obtain full pipeline utilization, we propose a novel and scalable architecture that can be reused for all kernels in AMG. Given that AMG is strictly memory-bound, we propose algorithmic and architectural optimizations to ensure nearly ideal use of memory bandwidth. The efficiency of FP-AMG is evaluated with six well-known benchmarks on two FPGA devices: one with and one without high bandwidth memory (HBM). The experimental results are compared with a highly optimized Intel Xeon E5-2680-V4 implementation of the state-of-the-art HYPRE library. Our experiments show that FP-AMG can achieve average speedups of $2.5\times$ and $6.6\times$, for FPGAs without and with HBM, respectively.
TL;DR: This paper studies the performance characteristics of the Amazon Graviton Processor – an ARM64 processor with the Cortex-A72 micro-architecture – using the A1 (Graviton) product family on AWS EC2, with comparisons to the I3 and M5 product families based on Intel Xeon processors.
Abstract: ARM processors, with their low power consumption and heat dissipation, have been highly successful in embedded systems. In the recent past, there have been attempts to adopt these energy-efficient processors for servers in data centers. However, a fundamental question that remains open for ARM-based systems on the server side is whether they are capable of handling compute-intensive workloads at scale. This paper gives our answer to this question with an empirical approach. We study the performance characteristics of the Amazon Graviton Processor – an ARM64 processor with the Cortex-A72 micro-architecture – using the A1 (Graviton) product family on AWS EC2, with comparisons to the I3 and M5 product families based on Intel Xeon processors. We use a combination of micro benchmarks and performance counters to identify that the lack of L3 cache and the slower memory access speed limit Graviton’s capability to achieve higher performance. We confirm Graviton’s capability in handling various large-scale horizontally scalable compute-intensive workloads, including multi-tier web service, video transcoding and terabyte scale sorting. In our large-scale evaluations, the test worker fleet has up to 1600 vCPU cores, which is by far the largest ARM64 cluster that has been reported. We observe that the A1 product family achieves the same price-performance in multi-tier web service, up to 37% cost saving in video transcoding, and up to 65% cost saving in terabyte scale sorting, as compared with the I3 and M5 product families.
TL;DR: A novel approach to parallelize the SpMV kernel included in LASs (Linear Algebra routines on OmpSs) library, after a deep review and analysis of several well-known approaches.
Abstract: We present a novel approach to parallelize the SpMV kernel included in LASs (Linear Algebra routines on OmpSs) library, after a deep review and analysis of several well-known approaches. LASs is based on OmpSs, a task-based runtime that extends OpenMP directives, providing more flexibility to apply new strategies. Based on tasking and nesting, with the aim of improving the workload imbalance inherent to the SpMV operation, we present a strategy especially useful for highly imbalanced input matrices. In this approach, the number of created tasks is dynamically decided in order to maximize the use of the resources of the platform. Throughout this paper, SpMV behavior depending on the selected strategy (state of the art and proposed strategies) is deeply analyzed, setting in this way the base for a future auto-tunable code that is able to select the most suitable approach depending on the input matrix. The experiments of this work were carried out for a set of 12 matrices from the Suite Sparse Matrix Collection, all of them with different characteristics regarding their sparsity. The experiments of this work were performed on a node of Marenostrum 4 supercomputer (with two sockets Intel Xeon, 24 cores each) and on a node of Dibona cluster (using one ARM ThunderX2 socket with 32 cores). Our tests show that, for Intel Xeon, the best parallelization strategy reduces the execution time of the reference MKL multi-threaded version up to 67%. On ARM ThunderX2, the reduction is up to 56% with respect to the OmpSs parallel reference.
TL;DR: A new sparse matrix storage format CSR2 (Compressed Sparse Row 2) suitable for SIMD (Single Instruction Multiple Data)-accelerated SpMV (Sparse matrix-vector multiplication) and suitable for use on processor platforms with SIMD vectorization is proposed.
Abstract: SpMV (Sparse matrix-vector multiplication) has attracted the attention of researchers in related fields at home and abroad, and improving SpMV performance has been a research hot spot. In this paper, we propose a new sparse matrix storage format CSR2 (Compressed Sparse Row 2) suitable for SIMD (Single Instruction Multiple Data)-accelerated SpMV. First, the format operation of CSR2 is easy to implement and has a low conversion overhead. Second, CSR2 is a new single format and suitable for use on processor platforms with SIMD vectorization. We compare the SpMV algorithm based on CSR2 with the one based on the current most advanced single format CSR5 (Compressed Sparse Row 5) on two mainstream high-performance processors: Intel Core i7-7700HQ CPU and Intel Xeon CPU E5-2670 v3. We choose 10 sets of regular matrices and 3 sets of irregular matrices as the benchmark suite. Experiments show that for the 13 sets of regular and irregular matrices in the benchmark suite, CSR2 has an average performance improvement of more than 50% compared to CSR5 (up to 125% on Intel Core i7-7700HQ CPU and 303% on Intel Xeon CPU E5-2670 v3). For real applications with multiple iterations, CSR2 can bring low-overhead format conversion and high-throughput computing performance.
TL;DR: This work presents a set of implementations for parallelizing quaternion moment image representations on different parallel architectures and proposes the loop mitigation technique to boost the level of parallelism in massively parallel environments, balance the parallel workload, and reduce both the space complexity and synchronization overhead.
TL;DR: This paper proposes and implements an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: how to parallelize the entire Huffman encoding algorithm, including codebook construction, and how to fully utilize the high memory-bandwidth feature of modern GPU architectures.
Abstract: Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient Entropy coding algorithm in information theory, such that it could be found as a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today's HPC applications are more and more relying on the accelerators such as GPU on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the entire data processing. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory-bandwidth feature of modern GPU architectures. The detailed contribution is four-fold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging the state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare with our implemented multi-threaded Huffman encoder. Experiments show that our solution can improve the encoding throughput by up to 5.0X and 6.8X on NVIDIA RTX 5000 and V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3X over the multi-thread encoder on two 28-core Xeon Platinum 8280 CPUs.
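As background for the two stages the abstract parallelizes, the sketch below shows a conventional sequential Huffman encoder: codebook construction with a priority queue, followed by concatenating codewords. It is the textbook serial baseline, not the paper's GPU scheme, and the function names are hypothetical.

```python
# Sequential Huffman baseline: codebook construction, then encoding.
# This is the textbook serial algorithm, not the paper's parallel GPU design.
import heapq
from collections import Counter


def huffman_codebook(data):
    """Build a {symbol: bitstring} codebook from symbol frequencies."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}  # a lone symbol still needs one bit
    # Heap entries are (weight, unique tie-breaker, partial codebook).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(data, codebook):
    """Concatenate the variable-length codewords of all input symbols."""
    return "".join(codebook[s] for s in data)


book = huffman_codebook("aaaabbc")
bits = encode("aaaabbc", book)
```

Because codewords have different lengths, the output position of each symbol depends on all preceding symbols, which is one reason the encoding step is hard to parallelize directly.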
TL;DR: In this article, a fast preprocessing method is proposed to divide the matrix into sub-matrices and determine the critical performance bound of submatrices according to the data distribution characteristics.
Abstract: Sparse Matrix-Vector Multiplication (SpMV) is a fundamental workload of numerous applications. However, for today's high-end superscalar CPUs, such as the Intel Xeon series, it is usually difficult to efficiently perform SpMV due to the irregular, matrix-dependent data access and computation pattern. While much research focuses on optimizing the memory bandwidth bound by improving data locality, this work dives into the execution of SpMV computation on Intel Xeon CPU and reveals that the bad-speculation penalty is significant in many sparse matrices and too expensive to be ignored. We study and characterize sparsity structure types that are more vulnerable to the cache miss penalty or the bad speculation penalty, respectively. Based on this insight, we propose a fast preprocessing method, which divides the matrix into sub-matrices and determines the critical performance bound of sub-matrices according to the data distribution characteristics. On each sub-matrix, a combination of dedicated row reordering strategies is performed to efficiently alleviate its key performance bounds: bad speculation, cache miss, or both. Our matrix representation is based on the standard Compressed Sparse Row (CSR) format, and can be easily adapted to existing SpMV libraries. Our approach is evaluated on an Intel Xeon Gold 6146 Processor with a wide range of matrices from the SuiteSparse benchmarks. The results demonstrate that the proposed approach achieves an average 1.8× speedup (up to 2.5×) over multi-threaded MKL Sparse Routines, with a quite low pre-processing cost. Additionally, when used in conjunction with MKL's original optimization method, our approach can further improve the speedup to an average of 3.6× (up to 8.3×). This result indicates that our method can serve as a fast and wide-spectrum optimization method which is compatible with existing routines.
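The standard CSR kernel that this line of work builds on can be sketched as follows. The array names `row_ptr`, `col_idx`, and `values` follow common convention and are assumptions here; the paper's row reordering is not shown.

```python
# Baseline SpMV over the standard CSR format (no reordering applied).
# Array names follow common convention and are assumptions.

def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x where A is stored in CSR.

    row_ptr[r] .. row_ptr[r + 1] delimits the non-zeros of row r.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # The trip count of this inner loop varies per row, and x is read
        # through col_idx, so both branching and memory access depend on
        # the matrix.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
```

The data-dependent inner loop length is a plausible source of the bad-speculation penalty the abstract describes, while the indirect reads of `x` account for cache misses.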
TL;DR: This chapter investigates the performance of parallel implementations of the Jacobi method on Knights Corner (KNC), the first generation of the Intel MIC architectures, and measures their performance in terms of execution time, offloading time, and speedup.
Abstract: Many important scientific, engineering, and smart city applications require solving large sparse linear equation systems. The numerical methods for solving linear equations can be categorised into direct methods and iterative methods. Jacobi method is one of the iterative solvers that has been widely used due to its simplicity and efficiency. Its performance is affected by factors including the storage format, the specific computational algorithm, and its implementation. While the performance of Jacobi has been studied extensively on conventional CPU architectures, research on its performance on emerging architectures, such as the Intel Many Integrated Core (MIC) architecture, is still in its infancy. In this chapter, we investigate the performance of parallel implementations of the Jacobi method on Knights Corner (KNC), the first generation of the Intel MIC architectures. We implement Jacobi with two storage formats, Compressed Sparse Row (CSR) and Modified Sparse Row (MSR), and measure their performance in terms of execution time, offloading time, and speedup. We report results of sparse matrices with over 28 million rows and 640 million non-zero elements acquired from 13 diverse application domains. The experimental results show that our Jacobi parallel implementation on MIC achieves speedups of up to 27.75× compared to the sequential implementation. It also delivers a speedup of up to 3.81× compared to a powerful node comprising 24 cores in two Intel Xeon E5-2695v2 processors.
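The Jacobi update itself is short enough to sketch. The version below is sequential and uses a dense matrix for brevity, whereas the chapter stores the matrix in CSR or MSR and runs in parallel; the example system is hypothetical.

```python
# Sequential Jacobi iteration on a small dense system.
# The chapter uses CSR/MSR storage and parallel execution; this sketch does not.

def jacobi(a, b, iterations=50):
    """Approximate the solution of a @ x = b.

    Each sweep computes x_i = (b_i - sum_{j != i} a_ij * x_j) / a_ii using
    only values from the previous sweep, so all rows can be updated
    independently.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = []
        for i in range(n):
            off_diag = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x_new.append((b[i] - off_diag) / a[i][i])
        x = x_new
    return x


# Diagonally dominant example whose exact solution is x = [1, 2].
solution = jacobi([[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0])
```

The independence of the row updates within a sweep is what makes the method a natural fit for many-core hardware such as KNC.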
TL;DR: This paper focuses on the algorithm and hardware co-design for the Base Quality Score Re-calibration (BQSR) step in GATK, which is an important and time-consuming step to correct systematic errors made by a sequencing machine.
Abstract: Genome sequencing is one of the key applications in healthcare and has a great potential to realize precision medicine and personalized healthcare. However, its computing process is very time consuming. Even pre-processing the raw sequence data of a whole genome for a single person to the analysis ready data can take several days on a single-core CPU. In this paper, we propose to accelerate the performance of the widely used Genome Analysis ToolKit (GATK) using FPGAs. More specifically, we focus on the algorithm and hardware co-design for the Base Quality Score Re-calibration (BQSR) step in GATK, which is an important and time-consuming step to correct systematic errors made by a sequencing machine. Prior studies did not consider hardware acceleration for BQSR because it requires a large amount of memory with random access and has a lot of control flow. To address these challenges, we first adapt the algorithm to resolve the random memory access conflicts to achieve a fully pipelined accelerator design and reduce its dataset size. Second, we leverage the newly introduced large-capacity UltraRAM (URAM) in Xilinx UltraScale+ FPGAs to buffer BQSR’s large dataset on chip, and further optimize its operating frequency. Finally, we also explore the coarse-grained pipeline and parallelism to improve the overall performance of the BQSR accelerator. Compared to the latest software implementation of BQSR on GATK 4.1, running on single-thread and 56-thread CPUs (14nm Xeon E5-2680 v4), our FPGA accelerator running on a Xilinx 16nm UltraScale+ VCU1525 board achieves up to 40.7x and 8.5x speedups, respectively.
TL;DR: This paper proposes a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology.
Abstract: FM-index is a compact data structure suitable for fast matches of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to the data access. As a result of the analysis, we propose a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95 percent of the peak random access bandwidth limit when executed on the KNL and almost all of the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the obtained throughput in KNL is much higher than the results reported for GPUs in the literature.
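The irregular access pattern described above comes from the backward search over the index. The sketch below is a textbook, uncompressed FM-index with a full occurrence table. It is not the reorganized layout proposed in the paper, and it assumes the text does not contain the `$` sentinel or any character that sorts before it.

```python
# Textbook FM-index with a full occurrence table and backward search.
# Not the paper's reorganized layout; assumes "$" is absent from the text.

def build_fm_index(text):
    text = text + "$"  # sentinel that sorts before every other character
    n = len(text)
    suffix_array = sorted(range(n), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, n


def count_matches(index, pattern):
    """Count occurrences of pattern by narrowing a suffix-array interval."""
    first, occ, n = index
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # lo and hi jump to unrelated positions of occ on every step.
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("banana")
hits = count_matches(index, "ana")
```

Each pattern character triggers two lookups at effectively random positions of the occurrence table, which is why the search is bound by random-access memory bandwidth rather than by computation.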
TL;DR: This article designs a message-padding scheme for Horovod, illustrates significantly smoother allreduce latency profiles, and reports cases where end-to-end training improves.
Abstract: Heterogeneous high-performance computing systems with GPUs are equipped with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed deep learning (DL). In this article, we choose Horovod, a distributed training middleware, to analyze and profile various DNN training workloads using TensorFlow and PyTorch in addition to standard MPI microbenchmarks. We use a wide variety of systems with CPUs like Intel Xeon and IBM POWER9, GPUs like Volta V100, and various interconnects to analyze the following metrics: 1) message-size with Horovod's tensor-fusion; 2) message-size without tensor-fusion; 3) number of MPI/NCCL calls; and 4) time taken by each MPI/NCCL call. We observed extreme performance variations for non-power-of-two message sizes on different platforms. To address this, we design a message-padding scheme for Horovod, illustrate significantly smoother allreduce latency profiles, and report cases where we observed improvement for end-to-end training.
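The idea of message padding can be sketched as follows. The abstract does not state the padding rule, so this sketch assumes the buffer is rounded up to the next power of two and filled with zeros, which leaves a sum-based allreduce unchanged. The helper names are hypothetical and no Horovod, MPI, or NCCL API is used.

```python
def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be positive)."""
    if n <= 0:
        raise ValueError("message size must be positive")
    return 1 << (n - 1).bit_length()


def pad_message(buf, fill=0.0):
    """Pad buf to a power-of-two length; return (padded buffer, original length)."""
    original_len = len(buf)
    padded_len = next_power_of_two(original_len)
    return buf + [fill] * (padded_len - original_len), original_len


def unpad_message(buf, original_len):
    """Drop the padding elements after the collective has completed."""
    return buf[:original_len]


# Assumed flow: pad, run the allreduce on the padded buffer, then unpad.
padded, n = pad_message([1.0, 2.0, 3.0, 4.0, 5.0])
print(len(padded), unpad_message(padded, n))
```

Padding trades a few extra transferred elements for a message size the collective handles predictably, so it only pays off where the non-power-of-two latency penalty exceeds that extra traffic.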
TL;DR: This paper proposes a hierarchical two-level data partitioning algorithm minimizing the parallel execution time of data-parallel applications on clusters of $h$ identical nodes where each node has $c$ heterogeneous processors, and proposes an extension of the algorithm for clusters of non-identical nodes.
Abstract: Modern HPC platforms are highly heterogeneous with tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to address the twin critical concerns of performance and energy efficiency. Due to this inherent characteristic, processing elements contend for shared on-chip resources such as Last Level Cache (LLC), interconnect, etc. and shared nodal resources such as DRAM, PCI-E links, etc., resulting in complexities such as resource contention, non-uniform memory access (NUMA), and accelerator-specific limitations such as limited main memory thereby necessitating support for efficient out-of-card execution. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we propose a hierarchical two-level data partitioning algorithm minimizing the parallel execution time of data-parallel applications on clusters of $h$ identical nodes where each node has $c$ heterogeneous processors. This algorithm takes as input $c$ discrete speed functions of cardinality $m$ corresponding to the $c$ heterogeneous processors. It does not make any assumptions about the shapes of these functions. Unlike load balancing algorithms, optimal solutions found by the algorithm may not load-balance an application in terms of execution time. The proposed algorithm has low time complexity of $O(m^{2} \times h + m^{3} \times c^{3})$ unlike the state-of-the-art algorithm solving the same problem with the complexity of $O(m^{3} \times c^{3} \times h^{3})$ . We also propose an extension of the algorithm for clusters of $h$ non-identical nodes where each node has $c$ heterogeneous processors. 
We experimentally demonstrate the optimality of our algorithm using two well-known and highly optimized multi-threaded data-parallel applications, matrix-matrix multiplication and 2D fast Fourier transform, on a heterogeneous multi-accelerator NUMA node containing an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor and a simulated homogeneous cluster of such nodes.
TL;DR: This article introduces ReSQM, a novel ReCAM-based accelerator that can dramatically reduce the response time of database systems, and presents a new data mapping mechanism that enables in-situ in-memory computation for SELECTION operating upon intermediate results.
Abstract: The huge amount of data puts great pressure on the processing efficiency of database systems. By leveraging the in-situ computing ability of emerging nonvolatile memory, processing-in-memory (PIM) technology shows great potential in accelerating database operations against traditional architectures without data movement overheads. In this article, we introduce ReSQM, a novel ReCAM-based accelerator, which can dramatically reduce the response time of database systems. The key novelty of ReSQM is that some commonly used database queries that would be otherwise processed inefficiently in previous studies can be in-situ accomplished with massively high parallelism by exploiting the PIM-enabled ReCAM array. ReSQM supports some typical database queries (such as SELECTION, SORT, and JOIN) effectively based on the limited computational mode of the ReCAM array. ReSQM is also equipped with a series of hardware-algorithm co-designs to maximize efficiency. We present a new data mapping mechanism that enables in-situ in-memory computation for SELECTION operating upon intermediate results. We also develop a count-based ReCAM-specific algorithm to enable the in-memory sorting without any row swapping. The relational comparisons are integrated for accelerating inequality join by making a few modifications to the ReCAM cells with negligible hardware overhead. The experimental results show that ReSQM can improve the (energy) efficiency by $611\times $ ( $193\times $ ), $19\times $ ( $17\times $ ), $59\times $ ( $43\times $ ), and $307\times $ ( $181\times $ ) in comparison to a 10-core Intel Xeon E5-2630v4 processor for SELECTION, SORT, equi-join, and inequality join, respectively. In contrast to state-of-the-art CMOS-based CAM, GPU, FPGA, NDP, and PIM solutions, ReSQM can also offer $2.2\times $ to $39\times $ speedups.
TL;DR: A communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges is described.
Abstract: We describe the application of a communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges. Our variable-precision strategy, using a customized precision format based on mantissa segmentation (CPMS), abandons the IEEE 754 single- and double-precision number representation formats employed in the standard implementation of PageRank, and instead handles the data in memory using a customized floating-point format. The customized format enables fast data access at different accuracies, prevents overflow/underflow by preserving the IEEE 754 double-precision exponent, and efficiently avoids data duplication, since all bits of the original IEEE 754 double-precision mantissa are preserved in memory, but re-organized for efficient reduced-precision access. With this approach, the truncated values (omitting significand bits), as well as the original IEEE double-precision values, can be retrieved without duplicating the data in different formats. Our numerical experiments on an NVIDIA V100 GPU (Volta architecture) and a server equipped with two Intel Xeon Platinum 8168 CPUs (48 cores in total) show that, compared with a standard IEEE double-precision implementation, the CPMS-based PageRank completes about 10% faster if high-accuracy output is needed, and about 30% faster if reduced output accuracy is acceptable.
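Mantissa segmentation can be illustrated with a small sketch. It assumes a split of each double into two 32-bit segments; the abstract does not give the actual segment count or memory layout, and the function names are hypothetical. The head segment keeps the sign, the full 11-bit exponent, and the top 20 mantissa bits, so reading it alone yields a truncated value with the original exponent.

```python
import struct


def split_double(x: float):
    """Split an IEEE 754 double into a 32-bit head and a 32-bit tail."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    head = bits >> 32          # sign, exponent, top 20 mantissa bits
    tail = bits & 0xFFFFFFFF   # remaining 32 mantissa bits
    return head, tail


def read_reduced(head: int) -> float:
    """Reduced-precision read: use only the head, with zeroed low bits."""
    return struct.unpack("<d", struct.pack("<Q", head << 32))[0]


def read_full(head: int, tail: int) -> float:
    """Full-precision read: recombine both segments into the original double."""
    return struct.unpack("<d", struct.pack("<Q", (head << 32) | tail))[0]


head, tail = split_double(0.1)
print(read_reduced(head), read_full(head, tail))
```

In a real layout the heads and tails of a vector would be stored in separate arrays, so early iterations could stream only the heads and move half the data, while later iterations fetch both arrays to recover full double precision.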
TL;DR: This paper presents the design methodology and implementation of an elastic 32-bit RISC pipelined processor, inspired by Intel Xeon and MIPS, to function as a standard integrated platform for server models.
Abstract: The server models functioning in the industry are required to be more elastic in nature. They are constantly scaling up and scaling down the required computation power depending on different conditions. These elastic cloud platforms use accelerators like DSPs, TPUs, GPUs, FPGAs, and multi-core processors to provide exponential computing power and outsource their services. This process is not only costly and inefficient but is also responsible for damaging the server's hardware architecture. Furthermore, these additions degrade the level of threading and symmetric parallel processing capability of the architecture. Intel uses hyperthreading technology (HTT) to split the workload between hardware and operating system to avoid additions, but that too is only possible up to a certain limit. This paper presents the design methodology and implementation of an elastic 32-bit RISC pipelined processor, inspired by Intel Xeon and MIPS, to function as a standard integrated platform for server models. It implements the concepts of hyperthreading technology (HTT) and virtualization in hardware. This allows multiple outputs to be derived from units in hardware to enhance security and performance without compromising compatibility. The designed elastic core uses a probabilistic node-based closed-queuing network model for server analysis and implementation. Hence, elastic behavior from individual core microarchitecture to server model architecture enables a generic automated scaling self-aware optimization architecture.
TL;DR: This paper proposes a multi-pipelined, high-cardinality HyperLogLog implementation on an FPGA for processing data streams. HyperLogLog does not count every data item but provides probabilistic guarantees on the result, thereby reducing its memory footprint.
Abstract: Data sketches are a set of widely used approximated data summarizing techniques. Their fundamental property is sub-linear memory complexity on the input cardinality, an important aspect when processing streams or data sets with a vast base domain such as URLs, IP addresses, user IDs, etc. Among the many data sketches available, HyperLogLog has become the reference for cardinality counting, i.e., how many distinct data items there are in a data set. It does not count every data item but provides probabilistic guarantees on the result, thereby reducing its memory footprint, and the result is often used to analyze data streams. In this paper, we explore how to implement HyperLogLog on an FPGA to benefit from the parallelism available and the ability to process data streams coming from high-speed networks. Our multi-pipelined high-cardinality HyperLogLog implementation delivers 1.8x higher throughput than the best-optimized multi-thread HyperLogLog running on a dual-socket Intel Xeon E5-2630 v3 system with a total of 16 cores and 32 hyper-threads.
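How HyperLogLog estimates cardinality in sub-linear memory can be shown with a minimal software sketch of the standard algorithm. It is not the paper's FPGA pipeline. The class name, the use of SHA-1 as the hash, and the default of 2^12 registers are assumptions made for illustration, and the bias constant used is the usual approximation for 128 or more registers.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with m = 2**p registers (assumes p >= 7)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash: the top p bits select a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining 64 - p bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias correction for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))
```

Each update touches one register and depends only on the item's hash, so independent pipelines can process disjoint parts of a stream and merge by taking the element-wise maximum of their registers.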
TL;DR: A thorough performance analysis of the image-domain gridding algorithm for an Intel Xeon CPU, Intel Xeon Phi, and GPUs from AMD and NVIDIA shows that, by evaluating trigonometric functions in hardware, GPUs are both much faster and more energy efficient than a CPU or Xeon Phi.