TL;DR: GPUTreeShap as mentioned in this paper is a reformulated TreeShap algorithm suitable for massively parallel computation on graphics processing units (GPUs) and achieves speedups of up to 19× for SHAP values and up to 340× for interaction values over a state-of-the-art multi-core CPU implementation.
Abstract: SHapley Additive exPlanation (SHAP) values (Lundberg & Lee, 2017) provide a game theoretic interpretation of the predictions of machine learning models based on Shapley values (Shapley, 1953). While exact calculation of SHAP values is computationally intractable in general, a recursive polynomial-time algorithm called TreeShap (Lundberg et al., 2020) is available for decision tree models. However, despite its polynomial time complexity, TreeShap can become a significant bottleneck in practical machine learning pipelines when applied to large decision tree ensembles. Unfortunately, the complicated TreeShap algorithm is difficult to map to hardware accelerators such as GPUs. In this work, we present GPUTreeShap, a reformulated TreeShap algorithm suitable for massively parallel computation on graphics processing units. Our approach first preprocesses each decision tree to isolate variable sized sub-problems from the original recursive algorithm, then solves a bin packing problem, and finally maps sub-problems to single-instruction, multiple-thread (SIMT) tasks for parallel execution with specialised hardware instructions. With a single NVIDIA Tesla V100-32 GPU, we achieve speedups of up to 19× for SHAP values, and speedups of up to 340× for SHAP interaction values, over a state-of-the-art multi-core CPU implementation executed on two 20-core Xeon E5-2698 v4 2.2 GHz CPUs. We also experiment with multi-GPU computing using eight V100 GPUs, demonstrating throughput of 1.2 M rows per second; equivalent CPU-based performance is estimated to require 6850 CPU cores.
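The bin packing step mentioned in the abstract above can be pictured with a small sketch. The abstract does not say which packing heuristic is used, so the first-fit decreasing rule, the function name `first_fit_decreasing`, and the capacity of 32 (the warp width on NVIDIA GPUs) are assumptions chosen for illustration only; this is not the GPUTreeShap implementation.

```python
def first_fit_decreasing(sizes, capacity=32):
    """Pack variable-sized sub-problems into fixed-capacity bins.

    sizes    -- sub-problem sizes (hypothetical: threads needed per sub-problem)
    capacity -- bin capacity (assumed here to be one 32-thread warp)
    Returns a list of bins, each a list of indices into `sizes`.
    """
    bins = []  # each entry is [remaining_capacity, [item indices]]
    # Visit items from largest to smallest; sorted() is stable, so ties keep index order.
    for idx, size in sorted(enumerate(sizes), key=lambda pair: -pair[1]):
        if size > capacity:
            raise ValueError(f"item {idx} of size {size} exceeds capacity {capacity}")
        for b in bins:
            if b[0] >= size:          # first bin with enough room
                b[0] -= size
                b[1].append(idx)
                break
        else:                         # no existing bin fits: open a new one
            bins.append([capacity - size, [idx]])
    return [items for _, items in bins]


if __name__ == "__main__":
    # Six sub-problems packed into 32-slot bins.
    print(first_fit_decreasing([20, 12, 10, 10, 6, 6]))
```

Tighter packing means fewer idle lanes per SIMT task, which is the reason a packing step appears in the pipeline at all.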
TL;DR: Sapphire Rapids (SPR) as mentioned in this paper is the next-generation Xeon® Processor with increased core count, greater than 100MB shared L3 cache, 8 DDR5 channels, 32GT/s PCIe/CXL lanes, 16GT/s UPI lanes and integrated accelerators supporting cryptography, compression and data streaming.
Abstract: Sapphire Rapids (SPR) is the next-generation Xeon® Processor with increased core count, greater than 100MB shared L3 cache, 8 DDR5 channels, 32GT/s PCIe/CXL lanes, 16GT/s UPI lanes and integrated accelerators supporting cryptography, compression and data streaming. The processor is made up of 4 die (Fig. 2.2.7) manufactured on Intel 7 process technology which features dual-poly-pitch SuperFin (SF) transistors with performance enhancements beyond 10SF, >25% additional MIM density over SuperMIM and a metal stack with a 400nm pitch routing layer optimized for global interconnects. This layer achieves ~30% delay reduction at the same signal density and is key for achieving the required latency. The core provides better performance via a programmable power management controller. New technologies include Intel Advanced Matrix Extensions (AMX), a matrix multiplication capability for acceleration of AI workloads and new virtualization technologies to address new and emerging workloads.
TL;DR: A first systematic performance study of Intel SGXv2 is conducted and it is compared to the previous generation of SGX to answer the question whether previous efforts to overcome the limitations of SGX for DBMSs are still applicable and if the new generation of SGX can truly deliver on the promise to secure data without compromising on performance.
Abstract: In recent years, trusted execution environments (TEEs) such as Intel Software Guard Extensions (SGX) have gained a lot of attention in the database community. This is because TEEs provide an interesting platform for building trusted databases in the cloud. However, until recently SGX was only available on low-end single socket servers built on the Intel Xeon E3 processor generation and came with many restrictions for building DBMSs. With the availability of the new Ice Lake processors, Intel provides a new implementation of the SGX technology that supports high-end multi-socket servers. With this new implementation, which we refer to as SGXv2 in this paper, Intel promises to address several limitations of SGX enclaves. This raises the question whether previous efforts to overcome the limitations of SGX for DBMSs are still applicable and if the new generation of SGX can truly deliver on the promise to secure data without compromising on performance. To answer this question, in this paper we conduct a first systematic performance study of Intel SGXv2 and compare it to the previous generation of SGX.
TL;DR: ReFlip as mentioned in this paper leverages PIM-featured crossbar architectures to build a unified architecture for supporting the two types of GCN kernels simultaneously and adopts novel algorithm mappings that can maximize potential performance gains reaped from the unified architecture by exploiting the massive crossbar-structured parallelism.
Abstract: Graph convolutional networks (GCNs) are promising to enable machine learning on graphs. GCNs exhibit mixed computational kernels, involving regular neural-network-like computing and irregular graph-analytics-like processing. Existing GCN accelerators obey a divide-and-conquer philosophy to architect two separate types of hardware to accelerate these two types of GCN kernels, respectively. This hybrid architecture improves intra-kernel efficiency but gives little consideration to inter-kernel interactions in a holistic view for improving overall efficiency. In this paper, we present a new GCN accelerator, ReFlip, with three key innovations in terms of architecture design, algorithm mappings, and practical implementations. First, ReFlip leverages PIM-featured crossbar architectures to build a unified architecture for supporting the two types of GCN kernels simultaneously. Second, ReFlip adopts novel algorithm mappings that can maximize potential performance gains reaped from the unified architecture by exploiting the massive crossbar-structured parallelism. Third, ReFlip assembles software/hardware co-optimizations to process real-world graphs efficiently. Compared to the state-of-the-art software frameworks running on Intel Xeon E5-2680v4 CPU and NVIDIA Tesla V100 GPU, ReFlip achieves average speedups of 6,432× and 86.32× and average energy savings of 9,817× and 302.44×, respectively. In addition, ReFlip also outperforms a state-of-the-art GCN hardware accelerator, AWB-GCN, by achieving an average speedup of 5.06× and an average energy saving of 15.63×.
TL;DR: In this article, the authors present HARU, a resource-efficient hardware-software codesign-based method that exploits a low-cost and portable heterogeneous multiprocessor system-on-chip platform with on-chip field-programmable gate arrays (FPGA) to accelerate the sDTW-based Read Until algorithm.
Abstract: Background: Third-generation nanopore sequencers offer selective sequencing or “Read Until” that allows genomic reads to be analyzed in real time and abandoned halfway if not belonging to a genomic region of “interest.” This selective sequencing opens the door to important applications such as rapid and low-cost genetic tests. The latency in analyzing should be as low as possible for selective sequencing to be effective so that unnecessary reads can be rejected as early as possible. However, existing methods that employ a subsequence dynamic time warping (sDTW) algorithm for this problem are so computationally intensive that a massive workstation with dozens of CPU cores still struggles to keep up with the data rate of a mobile phone–sized MinION sequencer. Results: In this article, we present Hardware Accelerated Read Until (HARU), a resource-efficient hardware–software codesign-based method that exploits a low-cost and portable heterogeneous multiprocessor system-on-chip platform with on-chip field-programmable gate arrays (FPGA) to accelerate the sDTW-based Read Until algorithm. Experimental results show that HARU on a Xilinx FPGA embedded with a 4-core ARM processor is around 2.5× faster than a highly optimized multithreaded software version (around 85× faster than the existing unoptimized multithreaded software) running on a sophisticated server with a 36-core Intel Xeon processor for a SARS-CoV-2 dataset. The energy consumption of HARU is 2 orders of magnitude lower than that of the same application executing on the 36-core server. Conclusions: HARU demonstrates that nanopore selective sequencing is possible on resource-constrained devices through rigorous hardware–software optimizations. The source code for the HARU sDTW module is available as open source at https://github.com/beebdev/HARU, and an example application that uses HARU is at https://github.com/beebdev/sigfish-haru.
TL;DR: The findings illustrate the complex NUMA properties and how data placement and cache coherence states impact access latencies to local and remote locations; the paper also compares theoretical and effective bandwidths for accessing data at the different memory levels and main memory bandwidth saturation at reduced core counts.
Abstract: Modern processors, in particular within the server segment, integrate more cores with each generation. This increases their complexity in general, and that of the memory hierarchy in particular. Software executed on such processors can suffer from performance degradation when data is distributed disadvantageously over the available resources. To optimize data placement and access patterns, an in-depth analysis of the processor design and its implications for performance is necessary. This paper describes and experimentally evaluates the memory hierarchy of AMD EPYC Rome and Intel Xeon Cascade Lake SP server processors in detail. Their distinct microarchitectures cause different performance patterns for memory latencies, in particular for remote cache accesses. Our findings illustrate the complex NUMA properties and how data placement and cache coherence states impact access latencies to local and remote locations. This paper also compares theoretical and effective bandwidths for accessing data at the different memory levels and main memory bandwidth saturation at reduced core counts. The presented insight is a foundation for modeling performance of the given microarchitectures, which enables practical performance engineering of complex applications. Moreover, security research on side-channel attacks can also leverage the presented findings.
TL;DR: The FastGeodis package provides an efficient implementation for computing Geodesic and Euclidean distance transforms targeting efficient utilisation of CPU and GPU hardware and implements the parallelisable raster scan method from Criminisi et al. (2009), where elements in a row (2D) or plane (3D) can be computed with parallel threads.
Abstract: The FastGeodis package provides an efficient implementation for computing Geodesic and Euclidean distance transforms (or a mixture of both), targeting efficient utilisation of CPU and GPU hardware. In particular, it implements the parallelisable raster scan method from Criminisi et al. (2009), where elements in a row (2D) or plane (3D) can be computed with parallel threads. This package is able to handle 2D as well as 3D data, where it achieves up to a 20x speedup on a CPU and up to a 74x speedup on a GPU as compared to an existing open-source library (Wang, 2020) that uses a non-parallelisable single-thread CPU implementation. The performance speedups reported here were evaluated using 3D volume data on an Nvidia GeForce Titan X (12 GB) with a 6-Core Intel Xeon E5-1650 CPU. Further in-depth comparisons of performance improvements are discussed in the FastGeodis documentation: https://fastgeodis.readthedocs.io
TL;DR: In this article, the authors describe the concept, system architecture, supporting system software, and applications on their world-first supercomputer with multihybrid accelerators using GPU and FPGA coupling, named Cygnus, which runs at Center for Computational Sciences, University of Tsukuba.
Abstract: In this paper, we describe the concept, system architecture, supporting system software, and applications on our world-first supercomputer with multihybrid accelerators using GPU and FPGA coupling, named Cygnus, which runs at Center for Computational Sciences, University of Tsukuba. Although Cygnus is constructed with over 80 computation nodes as a GPU-accelerated PC cluster, a special group of 32 nodes, named the Albireo part, is configured as a multihybrid accelerated computing system. Each node of the Albireo part is equipped with four NVIDIA V100 GPU cards and two Intel Stratix10 FPGA cards in addition to two sockets of Intel Xeon Gold CPU, where all nodes are connected by four lanes of InfiniBand HDR100 interconnection HCA in the full bisection bandwidth of NVIDIA HDR200 switches. Besides this ordinary interconnection network, all FPGA cards in the Albireo part are connected by a special 2-Dimensional Torus network with direct optical links on each FPGA, constructing a very high throughput and low latency FPGA-centric interconnection network. To the best of our knowledge, Cygnus is the world’s first production-level PC cluster to realize multihybrid acceleration with the GPU and FPGA combination. Unlike other GPU-accelerated clusters, users can program parallel codes where each process exploits either or both of the GPU and FPGA devices based on the characteristics of their applications. We developed various supporting system software such as an inter-FPGA network routing system, a DMA engine for GPU-FPGA direct communication managed by FPGA, and a multihybrid accelerated programming framework because the programming method of such a complicated system has not been standardized. Further, we developed the first real application on Cygnus for fundamental astrophysics simulation to fully utilize GPU and FPGA together for very efficient acceleration.
We describe the overall concept and construction of the Cygnus cluster with a brief introduction of the several underlying hardware and software research studies that have already been published. We summarize how such a concept of GPU/FPGA coworking will usher in a new era of accelerated supercomputing.
TL;DR: In this paper, a CPU-FPGA based heterogeneous acceleration system and a subgraph segmentation scheme for CNN-RNN hybrid neural networks are proposed to accelerate CNNs.
Abstract: Scene text detection networks, such as the Connectionist Text Proposal Network (CTPN), use a CNN-RNN hybrid neural network as their main body, which can effectively recognize horizontally arranged text in an image. However, complex structures and intensive calculations lead to longer network inference time. FPGAs have been widely used in data centers to accelerate neural network inference due to their flexibility and low power consumption. In this brief, a CPU-FPGA based heterogeneous acceleration system and a subgraph segmentation scheme for CNN-RNN hybrid neural networks are proposed. The Winograd algorithm is applied to accelerate the CNN. For RNN acceleration, fixed-point quantization, loop tiling, and piecewise linear approximation of the activation function are used to reduce hardware resource usage and achieve a high degree of parallelization. CTPN is tested in our system with an Intel Xeon 4116 CPU and an Arria10 GX1150 FPGA, and the throughput reaches 1223.53 GOP/s.
TL;DR: This work investigates an octree-based method combined with traditional finite-difference algorithms specifically designed to execute structured mesh refinement applications efficiently on modern cluster architectures to demonstrate the performance of the octree construction and balance algorithms to scale to billions of mesh cells.
TL;DR: ReaDy as discussed by the authors is the first DGCN accelerator with an integrated architecture to accelerate DGCNs based on emerging PIM-featured ReRAM architectures, which is equipped with a redundancy-free scheduling mechanism to alleviate intrinsic dynamic irregularity for the GCN kernel.
Abstract: Dynamic graph convolutional networks (DGCNs) have emerged as an effective approach to analyzing graph data that is constantly changing. The typical DGCNs incorporate not only graph convolutional networks (GCNs) to extract the structural information but also recurrent neural networks (RNNs) to capture the temporal information from evolving graph data. These two alternative execution kernels of DGCNs impose unique architecture challenges for both types of kernels to be implemented efficiently. The presence of complex execution patterns of DGCNs renders existing architectures unsuitable. In this article, we present the first DGCN accelerator with an integrated architecture, named ReaDy, to accelerate DGCNs based on emerging PIM-featured ReRAM architectures. ReaDy is novel with an integrated architecture that enables running the GCN and RNN kernels of DGCNs simultaneously. Specifically, ReaDy is equipped with a redundancy-free scheduling mechanism to alleviate intrinsic dynamic irregularity for the GCN kernel, improving hardware utilization. In addition, ReaDy also includes a locality-aware dataflow strategy to exploit the inherent intervertex data locality for the RNN kernel, reducing superfluous data accesses to vertices and weight parameters. In a holistic view, ReaDy further enhances the entire system via an interkernel pipeline to reduce the off-chip accesses of intermediate results, boosting the overall efficiency of DGCNs significantly. Compared to the state-of-the-art software framework, PyGT, running on Intel Xeon E5-2680v4 CPU and NVIDIA Ampere A100 GPU, ReaDy achieves average speedups of 955× and 27.33×, and average energy savings of 1,093× and 80.21×, respectively. In addition, ReaDy outperforms ReFlip-ERA, which is obtained by combining a state-of-the-art GCN accelerator ReFlip and RNN accelerator ERA-LSTM, by an average speedup of 8.30× and an average energy saving of 7.29×.
TL;DR: In this paper, the impact of various state-of-the-art C/C++ compilers available for novel AMD EPYC Rome and Milan multi-core processors on the performance of real-world scientific codes is assessed; the codes correspond to the solidification modeling application using the phase-field (PF) method and a generalized finite difference scheme for solving the governing PDEs.
Abstract: • For a real-life scientific application, we assess the impact of five state-of-the-art C/C++ compilers available for novel AMD EPYC processors. • Besides performance, the numerical accuracy of computation is verified using statistical metrics. • For objectivity, we evaluate various versions of the code with the increasing complexity of memory access patterns and control structures. • Our assessment confirms the advantage of the Intel compiler over other compilers. • We show that the Intel compiler's performance advantage over the AMD compiler results from better utilization of AVX2 vector instructions. • The Intel compiler usage permits us to fully reveal the performance potential of the AMD EPYC against the Intel Xeon architecture. The phase-field (PF) method is a powerful tool for solving interfacial problems in materials science. This paper’s primary goal is to assess the impact of various state-of-the-art C/C++ compilers available for novel AMD EPYC Rome and Milan multi-core processors on the performance of real-world scientific codes corresponding to the solidification modeling application using the PF method and generalized finite difference scheme for solving governing PDEs. Among the studied compilers are AOCC, Clang, GCC, Intel compiler, and PGI. Besides performance, the numerical accuracy of the simulation is verified since various optimizations used by the compilers can cause differences in the simulation results. For the objectivity of the assessment, we study different application versions with the increasing complexity of memory access patterns and control structures. All tested codes are executed on a dual-socket server equipped with 64-core AMD EPYC Rome 7742 CPUs. This assessment confirms the advantage of the Intel compiler over other compilers. 
In particular, while for the static intensity only GCC lags noticeably behind other compilers, in the case of the dynamic intensity, the clear winner is the Intel compiler, which allows increasing the performance up to about 1.3 times over the AMD compiler. Moreover, we reveal the sensitivity of the application version and compiler to selecting a different number of NUMA domains and enabling/disabling the NUMA balancing option in EPYC processors. By comparing the performance gain achieved by various compilers due to vectorization of the application codes, we show that the Intel compiler’s performance advantage results primarily from considerably better utilization of capabilities of AVX2 vector instructions. Finally, we show the importance of choosing a compiler when comparing the performance of AMD EPYC and Intel Xeon processors.
TL;DR: In this paper, the computation of the reactive force field (ReaxFF) potential from LAMMPS is optimized; the vectorized implementation for non-bonded interactions is nearly 100× faster than the management processing element on the Sunway TaihuLight supercomputer.
Abstract: Molecular dynamics (MD) simulations are playing an increasingly important role in many areas ranging from chemical materials to biological molecules. With the continuing development of MD models, the potentials are getting larger and more complex. In this article, we focus on the reactive force field (ReaxFF) potential from LAMMPS to optimize the computation of interactions. We present our efforts on refactoring for neighbor list building, bond order computation, as well as valence angles and torsion angles computation. After redesigning these kernels, we develop a vectorized implementation for non-bonded interactions, which is nearly 100× faster than the management processing element (MPE) on the Sunway TaihuLight supercomputer. Furthermore, we have implemented the three-body-list free torsion angles computation, and propose a line-locked software cache method to eliminate write conflicts in the torsion angle and valence angle interactions resulting in an order-of-magnitude speedup on a single Sunway TaihuLight node. In addition, we achieve a speedup of up to 3.5 compared to the KOKKOS package on an Intel Xeon Gold 6148 core. When executed on 1,024 processes, our implementation enables the simulation of 21,233,664 atoms on 66,560 cores with a performance of 0.032 ns/day and a weak scaling efficiency of 95.71 percent.
TL;DR: In this paper, a parallel ab initio quantum transport solver implemented in the C programming language is described, with bismuthene nanoribbon (BiNR) simulations used to demonstrate its performance.
Abstract: We describe our parallel ab initio quantum transport solver implemented in the C programming language, with bismuthene nanoribbon (BiNR) simulations used for the demonstration of its performance. The inputs are Hamiltonians obtained from ab initio density functional theory (DFT), which are wannierized into a localized basis to increase Hamiltonian matrix sparsity and to reduce the computational load without the loss of bandstructure accuracy. Numerical matrix operations are parallelized for cluster computation and optimized using Intel Message Passing Interface (MPI) and Intel oneAPI Math Kernel Library (MKL). We demonstrate that an acceleration of about 45× is achieved through parallelization on 64 Xeon Silver CPU cores compared to a single-core execution. Finally, we investigate the electronic, transport and device properties of ultra-scaled bismuthene nanodevices.
TL;DR: In this paper, scaling studies of ab initio molecular dynamics simulations using the popular CP2K code on both Intel Xeon CPU and NVIDIA V100 GPU architectures were conducted using a realistic molecular catalyst system.
Abstract: Using a realistic molecular catalyst system, we conduct scaling studies of ab initio molecular dynamics simulations using the popular CP2K code on both Intel Xeon CPU and NVIDIA V100 GPU architectures. Additional performance improvements were gained by finding more optimal process placement and affinity settings. Statistical methods were employed to understand performance changes in spite of the variability in runtime for each molecular dynamics timestep. Ideal conditions for CPU runs were found when running at least four MPI ranks per node, bound evenly across each socket. This study also showed that fully utilizing processing cores, with one OpenMP thread per core, performed better than when reserving cores for the system. The CPU-only simulations scaled at 70% or more of the ideal scaling up to 10 compute nodes, after which the returns began to diminish more quickly. Simulations on a single 40-core node with two NVIDIA V100 GPUs for acceleration achieved over 3.7× speedup compared to the fastest single 36-core node CPU-only version. These same GPU runs showed a 13% speedup over the fastest time achieved across five CPU-only nodes.
TL;DR: In this paper, an implementation of predefined MPI reduction operations using vector intrinsics (AVX and SVE) is proposed to improve the time-to-solution of these operations.
TL;DR: In this paper, the authors present a modern efficient parallel OpenMP+CUDA implementation of crowd simulation for hybrid CPU+GPU systems and demonstrate its higher performance over CPU-only and GPU-only implementations for several problem sizes including 10 000, 50 000, 100 000, 500 000 and 1 000 000 agents.
TL;DR: In this paper, an FPGA-based XGBoost accelerator designed with High-Level Synthesis (HLS) tools and design flow accelerating binary classification inference is presented.
Abstract: Advanced ensemble trees have proven quite effective in providing real-time predictions for ransomware detection, medical diagnosis, recommendation engines, fraud detection, failure predictions, and crime risk, to name a few. In particular, XGBoost, one of the most prominent and widely used decision tree methods, has gained popularity due to various optimizations on the gradient boosting framework that provide increased accuracy for classification and regression problems. XGBoost’s relatively fast training, handling of missing values, flexibility and parallel processing make it a better candidate to handle data center workloads. Today’s data centers with enormous Input/Output Operations per Second (IOPS) demand real-time accelerated inference with low latency and high throughput because of significant data processing due to applications such as ransomware detection or fraud detection. This paper showcases an FPGA-based XGBoost accelerator designed with High-Level Synthesis (HLS) tools and design flow accelerating binary classification inference. We employ Alveo U50 and U200 to demonstrate the performance of the proposed design and compare it with existing state-of-the-art CPU (Intel Xeon E5-2686 v4) and GPU (Nvidia Tensor Core T4) implementations with relevant datasets. We show a latency speedup of our proposed design over state-of-the-art CPU and GPU implementations, including energy efficiency and cost-effectiveness. The proposed accelerator is up to 65.8x and 5.3x faster in terms of latency than the CPU and GPU, respectively. The Alveo U50 is a more cost-effective device, and the Alveo U200 stands out as more energy-efficient.
TL;DR: In this paper, the reactive force field (ReaxFF) potential from LAMMPS was refactored to optimize the computation of non-bonded interactions, and the three-body-list free torsion angles computation was implemented on a single Sunway TaihuLight node.
Abstract: Molecular dynamics (MD) simulations are playing an increasingly important role in many areas ranging from chemical materials to biological molecules. With the continuing development of MD models, the potentials are getting larger and more complex. In this article, we focus on the reactive force field (ReaxFF) potential from LAMMPS to optimize the computation of interactions. We present our efforts on refactoring for neighbor list building, bond order computation, as well as valence angles and torsion angles computation. After redesigning these kernels, we develop a vectorized implementation for non-bonded interactions, which is nearly 100× faster than the management processing element (MPE) on the Sunway TaihuLight supercomputer. Furthermore, we have implemented the three-body-list free torsion angles computation, and propose a line-locked software cache method to eliminate write conflicts in the torsion angle and valence angle interactions resulting in an order-of-magnitude speedup on a single Sunway TaihuLight node. In addition, we achieve a speedup of up to 3.5 compared to the KOKKOS package on an Intel Xeon Gold 6148 core. When executed on 1,024 processes, our implementation enables the simulation of 21,233,664 atoms on 66,560 cores with a performance of 0.032 ns/day and a weak scaling efficiency of 95.71 percent.
TL;DR: In this paper, a multi-core architecture can parallelize the exact sequence alignment procedure of the sequences, significantly reducing the execution time while still maintaining high flexibility, and the implementation on a Xilinx Alveo U280 achieves up to $2.68
Abstract: Genome assembly is one of the most challenging tasks in bioinformatics, as it is the key to many applications. One of the fundamental tasks in genome assembly is exact sequence alignment. This process enables the identification of recurrent patterns and mutations inside the DNA, which can substantially support clinicians in providing a quicker diagnosis and producing individual-specific drugs. However, this procedure represents a bottleneck in genome analysis as it is computationally intensive and time-consuming. In this scenario, the efficiency of the chosen algorithm to perform this operation also plays a crucial role in speeding up the analysis process. In this paper, we present a high-performance, energy-efficient FPGA implementation of the Knuth-Morris-Pratt (KMP) algorithm. Our multi-core architecture can parallelize the alignment procedure of the sequences, significantly reducing the execution time while still maintaining high flexibility. Experimental results show that our implementation on a Xilinx Alveo U280 achieves up to 2.68× speedup and up to 7.46× improvement in energy efficiency against Bowtie2, a state-of-the-art application for sequence alignment run on a 40-thread Intel Xeon processor. Finally, our design also outperforms hardware-accelerated implementations of KMP in the state of the art by up to 19.38× and 15.63× in terms of throughput and energy efficiency, respectively.
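For readers unfamiliar with the Knuth-Morris-Pratt algorithm named in the abstract above, the following is a minimal single-threaded software sketch in Python. It illustrates only the textbook algorithm (a failure table plus a single left-to-right scan); it is not the paper's multi-core FPGA design, and the function names `build_failure` and `kmp_search` are chosen here for illustration.

```python
def build_failure(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]      # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return the start index of every exact occurrence of pattern in text."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]      # reuse previous comparisons instead of rescanning
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):       # full match ending at position i
            matches.append(i - k + 1)
            k = failure[k - 1]      # continue, allowing overlapping matches
    return matches


if __name__ == "__main__":
    print(kmp_search("ACGTACGTAC", "ACGTAC"))
```

Because the text pointer never moves backwards, the scan takes time linear in the combined length of text and pattern, which is what makes the algorithm attractive for streaming hardware.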
TL;DR: In this article, the authors proposed an implementation of the parallel number-theoretic transform (NTT) using Intel Advanced Vector Extensions 512 (AVX-512) instructions, which achieved a performance of over 83 giga-operations per second on an Intel Xeon Platinum 8368 (2.4 GHz, 38 cores) with a modulus of 51 bits.
Abstract: In this paper, we propose an implementation of the parallel number-theoretic transform (NTT) using Intel Advanced Vector Extensions 512 (AVX-512) instructions. The butterfly operation of the NTT can be performed using modular addition, subtraction, and multiplication. We show that a method known as the six-step fast Fourier transform algorithm can be applied to the NTT. We vectorized NTT kernels using the Intel AVX-512 instructions and parallelized the six-step NTT using OpenMP. We successfully achieved a performance of over 83 giga-operations per second on an Intel Xeon Platinum 8368 (2.4 GHz, 38 cores) for a $2^{20}$-point NTT with a modulus of 51 bits.
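The butterfly described in the abstract above (one modular multiplication, one modular addition, one modular subtraction) can be seen in a small scalar sketch. This is a plain radix-2 in-place NTT, not the six-step algorithm and not vectorized; the prime 998244353 with primitive root 3 is a commonly used NTT-friendly modulus picked here as an assumption, unrelated to the paper's 51-bit modulus.

```python
MOD = 998244353  # 119 * 2**23 + 1, an NTT-friendly prime (assumed for illustration)
G = 3            # a primitive root modulo MOD


def ntt(values, invert=False):
    """Radix-2 number-theoretic transform of a power-of-two-length list."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = list(values)

    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    length = 2
    while length <= n:
        # Primitive length-th root of unity (its inverse for the inverse transform).
        w_len = pow(G, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % MOD      # modular multiplication
                a[start + k] = (u + v) % MOD           # modular addition
                a[start + k + half] = (u - v) % MOD    # modular subtraction
                w = w * w_len % MOD
        length <<= 1

    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


if __name__ == "__main__":
    # Multiply (1 + 2x) by (3 + 4x) via pointwise products in the transform domain.
    fa, fb = ntt([1, 2, 0, 0]), ntt([3, 4, 0, 0])
    print(ntt([x * y % MOD for x, y in zip(fa, fb)], invert=True))
```

The three lines in the innermost loop are the butterfly; they are independent across `k` and `start`, which is the property a vectorized or multi-threaded implementation exploits.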
TL;DR: In this paper, a vectorized version of the banded Myers algorithm tailored to the 256-bit vector registers of the SW26010 was implemented on the TaihuLight supercomputer optimized for its fourth-generation ShenWei many-core architecture (SW26010).
TL;DR: In this article, the authors proposed a technique of tuning the block size to the given multi-core system, which involves profiling tools in the tuning process and allows the throughput of the parallel algorithm to be increased.
Abstract: Finding shortest paths in a weighted graph is one of the key problems in computer science, which has numerous practical applications in multiple domains. This paper analyzes the parallel blocked all-pairs shortest path algorithm with the aim of evaluating the influence of the multi-core system and its hierarchical cache memory on the parameters of the algorithm implementation, depending on the size of the graph and the size of the distance matrix’s block. It proposes a technique of tuning the block size to the given multi-core system. The technique involves profiling tools in the tuning process and allows the throughput of the parallel algorithm to be increased. Computational experiments carried out on a rack server equipped with two Intel Xeon E5-2620 v4 processors of 8 cores and 16 hardware threads each have convincingly shown for various graph sizes that the behavior and parameters of the hierarchical cache memory operation don’t depend on the graph size and are determined only by the distance matrix’s block size. To tune the algorithm to the target multi-core system, the preferable block size can be found once for a graph size whose in-memory matrix representation is larger than the size of the cache shared among all of the processor’s cores. This block size can then be reused on larger graphs to solve the all-pairs shortest path problem efficiently.
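The role of the block size can be seen in a minimal sequential sketch of the blocked Floyd-Warshall scheme commonly used for all-pairs shortest paths. This is an assumption about the general technique, not the paper's parallel implementation; `relax_block`, `blocked_apsp`, and the parameter `bs` (block size) are illustrative names.

```python
INF = float("inf")


def relax_block(d, ib, jb, kb, bs):
    """Relax block (ib, jb) through the intermediate vertices of block kb."""
    n = len(d)
    for k in range(kb * bs, min((kb + 1) * bs, n)):
        for i in range(ib * bs, min((ib + 1) * bs, n)):
            for j in range(jb * bs, min((jb + 1) * bs, n)):
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k


def blocked_apsp(weights, bs):
    """All-pairs shortest paths on a dense weight matrix (INF = no edge)."""
    d = [row[:] for row in weights]
    nb = (len(d) + bs - 1) // bs  # number of blocks per dimension
    for kb in range(nb):
        # Phase 1: the diagonal block depends only on itself.
        relax_block(d, kb, kb, kb, bs)
        # Phase 2: blocks sharing a row or column with the diagonal block.
        for b in range(nb):
            if b != kb:
                relax_block(d, kb, b, kb, bs)
                relax_block(d, b, kb, kb, bs)
        # Phase 3: all remaining blocks, which are mutually independent.
        for ib in range(nb):
            for jb in range(nb):
                if ib != kb and jb != kb:
                    relax_block(d, ib, jb, kb, bs)
    return d
```

Each call to `relax_block` touches at most three `bs`-by-`bs` tiles, so the choice of `bs` determines whether the working set fits in a given cache level; the blocks within phase 2 and within phase 3 are the natural units of parallel work.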
TL;DR: In this paper, a real-world implementation of linear regression, logistic regression, decision tree, and K-means clustering on a general-purpose PIM architecture is presented.
Abstract: Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is $27\times$ faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and $1.34\times$ faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is $2.8\times$ and $3.2\times$ faster than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems.
TL;DR: In this article, an approach that implements full kinetic particle-in-cell simulations on GPU architecture devices using the CUDA Fortran programming language for the first time was introduced.
Abstract: Emerging computing devices, graphics processing units (GPUs), are gradually being applied to simulations of space physics. In this paper, we introduce an approach that implements full kinetic particle-in-cell simulations on GPU architecture devices using the CUDA Fortran programming language for the first time. Using the latest high-performance computing NVIDIA GPUs, this program, which follows the second-order leap-frog iteration method, can speed up the computing process by a factor of 150–285 on a single device compared with the time cost of running with a single core of an Intel Xeon Gold processor. Our scheme gives faster access to the simulation results and provides valuable assistance in studying the physical process.
TL;DR: In this paper, the authors focus on the development of predictive analytics for urban traffic, which is based on deep learning techniques localized in the edge, where computing devices have very limited computational resources.
Abstract: Processing data generated at high volume and speed from the Internet of Things, smart cities, domotics, intelligent surveillance, and e-healthcare systems requires efficient data processing and analytics services at the Edge to reduce the latency and response time of the applications. The fog computing edge infrastructure consists of devices with limited computing, memory, and bandwidth resources, which challenge the construction of predictive analytics solutions that require resource-intensive tasks for training machine learning models. In this work, we focus on the development of predictive analytics for urban traffic. Our solution is based on deep learning techniques localized in the Edge, where computing devices have very limited computational resources. We present an innovative method for efficiently training gated recurrent units (GRUs) across available resource-constrained CPU and GPU Edge devices. Our solution employs distributed GRU model learning and dynamically stops the training process to make effective use of the low-power and resource-constrained Edge devices while ensuring good estimation accuracy. The proposed solution was extensively evaluated using low-powered ARM-based devices, including Raspberry Pi v3 and the low-powered GPU-enabled device NVIDIA Jetson Nano, and was also compared with single-CPU Intel Xeon machines. For the evaluation experiments, we used real-world Floating Car Data. The experiments show that the proposed solution delivers excellent prediction accuracy and computational performance on the Edge when compared to the baseline methods.
TL;DR: In this article, an FPGA implementation of the Wavefront Alignment (WFA) algorithm is presented, which exploits homologous regions between the sequences to speed up the alignment process and whose complexity is related to the score of the alignment, rather than to the length of the sequences.
Abstract: Pairwise sequence alignment represents a fundamental step in genome and molecular analysis applications, accounting for most of their runtime. Given the quadratic time complexity of alignment algorithms, the community presses for the development of more efficient algorithms. Moreover, current limitations of general-purpose architectures push users to use hardware accelerators to reduce the analysis time. In this context, we present an FPGA implementation of the Wavefront Alignment (WFA) algorithm, a recently introduced solution that exploits homologous regions between the sequences to speed up the alignment process and whose complexity is related to the score of the alignment, rather than to the lengths of the sequences. Our multicore design can achieve up to 8.09× speedup and 57.77× improvement in energy efficiency compared to the multithreaded software implementation run on a Xeon Gold Processor. Moreover, our design highly outperforms the current State-of-the-Art hardware-accelerated solution, reaching up to 2876 Giga Cell Updates Per Second (GCUPS) and 68.47 GCUPS/W on a single FPGA, with an improvement of up to 2.29× and 9.90× in terms of performance and energy efficiency, respectively.
TL;DR: This work takes a step forward in the direction of developing high performance codes for the convolution, based on the Winograd transformation, that are easy to customize for different processor architectures via the introduction of vector intrinsics to exploit the SIMD capabilities of current processors as well as OpenMP pragmas to exploit multi-thread parallelism.
Abstract: We take a step forward in the direction of developing high performance codes for the convolution, based on the Winograd transformation, that are easy to customize for different processor architectures. In our approach, augmenting the portability of the solution is achieved via the introduction of vector intrinsics to exploit the SIMD (single-instruction multiple-data) capabilities of current processors as well as OpenMP pragmas to exploit multi-thread parallelism. While this comes at the cost of sacrificing a fraction of the computational performance, our experimental results on two distinct processors, with Intel Xeon Skylake and ARM Cortex A57 architectures, show that the impact is affordable, and still renders a Winograd-based solution that is competitive with the general method for the convolution based on the so-called im2col transform followed by a matrix-matrix multiplication.
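As a small worked instance of the Winograd transformation, the sketch below implements the well-known one-dimensional minimal filtering case F(2, 3), which produces two outputs of a 3-tap filter with four multiplications instead of six. It is a scalar illustration only and does not reproduce the paper's vector-intrinsic or OpenMP code; the function names are assumptions.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over a 4-element input tile d,
    using 4 multiplications (Winograd minimal filtering F(2, 3))."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: can be precomputed once per filter.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # Input transform followed by the four element-wise products.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs by direct correlation (6 multiplies)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]
```

For the tile `d = [1, 2, 3, 4]` and filter `g = [1, 2, 3]`, both functions yield the outputs 14 and 20. In a convolution layer the filter transform is amortized across all tiles, which is where the saving in multiplications comes from.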
TL;DR: In this paper, the authors explore the potential of Intel's Memory Bandwidth Allocation (MBA) technology, available on Xeon Scalable processors and assess the indirect memory bandwidth limitation achievable by applying MBA delays, showing that only given delay values (namely 70, 80 and 90) are effective in their setting.
Abstract: Industries have recently been considering the adoption of cloud computing for hosting safety-critical applications. However, the multicore processors usually adopted in the cloud introduce temporal anomalies due to contention for shared resources, such as the memory subsystem. In this paper we explore the potential of Intel’s Memory Bandwidth Allocation (MBA) technology, available on Xeon Scalable processors. By adopting a systematic measurement approach on real hardware, we assess the indirect memory bandwidth limitation achievable by applying MBA delays, showing that only given delay values (namely 70, 80 and 90) are effective in our setting. We also test the derived bandwidth assured to a hypothetical critical core when interfering cores (e.g., generating a concurrent memory access workload) are present on the same machine. Our results can support designers by providing an understanding of the impact of shared memory, enabling predictable progress of safety-critical applications in cloud environments.