Top 110 papers published on the topic of Xeon in 2020

Showing papers on "Xeon" published in 2020

Proceedings Article•10.1109/HPCA47549.2020.00012•

HyGCN: A GCN Accelerator with Hybrid Architecture

Mingyu Yan¹, Lei Deng², Xing Hu², Ling Liang², Yujing Feng¹, Xiaochun Ye¹, Zhimin Zhang¹, Dongrui Fan¹, Yuan Xie²•Institutions (2)

Chinese Academy of Sciences¹, University of California, Santa Barbara²

7 Jan 2020

TL;DR: This work describes the hybrid execution patterns of GCNs on Intel Xeon CPU and proposes a hardware design with two efficient processing engines to alleviate the irregularity of Aggregation phase and leverage the regularity of Combination phase, and designs a GCN accelerator using a hybrid architecture to efficiently perform GCNs.

Abstract: Inspired by the great success of neural networks, graph convolutional neural networks (GCNs) are proposed to analyze graph data. GCNs mainly include two phases with distinct execution patterns. The Aggregation phase behaves like graph processing, showing a dynamic and irregular execution pattern. The Combination phase acts more like a neural network, presenting a static and regular execution pattern. The hybrid execution patterns of GCNs require a design that alleviates irregularity and exploits regularity. Moreover, to achieve higher performance and energy efficiency, the design needs to leverage the high intra-vertex parallelism in the Aggregation phase, the highly reusable inter-vertex data in the Combination phase, and the opportunity to fuse phase-by-phase execution introduced by the new features of GCNs. However, existing architectures fail to address these demands. In this work, we first characterize the hybrid execution patterns of GCNs on an Intel Xeon CPU. Guided by the characterization, we design a GCN accelerator, HyGCN, using a hybrid architecture to efficiently perform GCNs. Specifically, first, we build a new programming model to exploit the fine-grained parallelism for our hardware design. Second, we propose a hardware design with two efficient processing engines to alleviate the irregularity of the Aggregation phase and leverage the regularity of the Combination phase. Besides, these engines can exploit various kinds of parallelism and reuse highly reusable data efficiently. Third, we optimize the overall system via an inter-engine pipeline for inter-phase fusion and priority-based off-chip memory access coordination to improve off-chip bandwidth utilization. Compared to the state-of-the-art software framework running on an Intel Xeon CPU and an NVIDIA V100 GPU, our work achieves on average 1509× speedup with 2500× energy reduction and on average 6.5× speedup with 10× energy reduction, respectively.
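
To make the two phases concrete, here is a minimal pure-Python sketch of one GCN layer. It is an illustration only, not the HyGCN design: the sum aggregator (neighbors plus the vertex itself), the ReLU activation, and all names and numbers are assumptions chosen for the example.

```python
# Toy GCN layer split into the two phases named in the abstract.
# Assumptions (not from the paper): sum aggregation including the vertex
# itself, ReLU activation, adjacency stored as a list of neighbor lists.

def aggregate(adj_list, features):
    """Aggregation phase: gathers driven by the graph structure (irregular)."""
    out = []
    for v, neighbors in enumerate(adj_list):
        acc = list(features[v])          # start from the vertex's own features
        for u in neighbors:              # neighbor count differs per vertex
            for k, x in enumerate(features[u]):
                acc[k] += x
        out.append(acc)
    return out

def combine(aggregated, weights):
    """Combination phase: the same dense multiply for every vertex (regular)."""
    n_out = len(weights[0])
    result = []
    for row in aggregated:
        new_row = []
        for j in range(n_out):
            s = sum(row[k] * weights[k][j] for k in range(len(row)))
            new_row.append(max(0.0, s))  # ReLU
        result.append(new_row)
    return result

def gcn_layer(adj_list, features, weights):
    return combine(aggregate(adj_list, features), weights)

# Hypothetical 3-vertex graph: vertex 0 is connected to vertices 1 and 2.
adj_list = [[1, 2], [0], [0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 1.0]]
print(gcn_layer(adj_list, features, weights))
```

The inner loop of `aggregate` depends on each vertex's neighbor list, while `combine` does identical work per vertex and reuses `weights` across all of them, which is the contrast the abstract describes.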

378 citations

Proceedings Article•10.1145/3373376.3378508•

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System

Size Zheng¹, Yun Liang¹, Shuo Wang¹, Renze Chen¹, Kaiwen Sheng¹•Institutions (1)

Peking University¹

9 Mar 2020

TL;DR: FlexTensor can optimize tensor computation programs without human interference, allowing programmers to only work on high-level programming abstraction without considering the hardware platform details.

Abstract: Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost have led to high demand for flexible, portable, and high-performance library implementations on heterogeneous hardware accelerators such as GPUs and FPGAs. However, current tensor library implementations mainly require programmers to manually design low-level implementations and optimize them from the algorithm, architecture, and compilation perspectives. Such a manual development process often takes months or even years, which falls far behind the rapid evolution of the application algorithms. In this paper, we introduce FlexTensor, which is a schedule exploration and optimization framework for tensor computation on heterogeneous systems. FlexTensor can optimize tensor computation programs without human interference, allowing programmers to only work on high-level programming abstraction without considering the hardware platform details. FlexTensor systematically explores the optimization design spaces that are composed of many different schedules for different hardware. Then, FlexTensor combines different exploration techniques, including heuristic and machine learning methods, to find the optimized schedule configuration. Finally, based on the results of exploration, customized schedules are automatically generated for different hardware. In the experiments, we test 12 different kinds of tensor computations with hundreds of test cases in total, and FlexTensor achieves an average 1.83x performance speedup on NVIDIA V100 GPU compared to cuDNN; 1.72x performance speedup on Intel Xeon CPU compared to MKL-DNN for 2D convolution; 1.5x performance speedup on Xilinx VU9P FPGA compared to OpenCL baselines; 2.21x speedup on NVIDIA V100 GPU compared to the state-of-the-art.

193 citations

Proceedings Article•10.1109/CLUSTER49012.2020.00075•

Preliminary Performance Evaluation of the Fujitsu A64FX Using HPC Applications

Tetsuya Odajima, Yuetsu Kodama, Miwako Tsuji, Motohiko Matsuda, Yutaka Maruyama, Mitsuhisa Sato

1 Sep 2020

TL;DR: In a performance comparison with Marvell (Cavium) ThunderX2 processor and Intel Xeon Skylake processor, the A64FX achieved higher performance in a memory bandwidth-intensive application thanks to its high memory bandwidth.

Abstract: RIKEN Center for Computational Science has been installing the supercomputer Fugaku. The Fujitsu A64FX, based on the Armv8.2-A+SVE architecture, is used in the system. In this paper, we evaluated seven HPC applications and benchmarks on the A64FX. In a performance comparison with the Marvell (Cavium) ThunderX2 processor and the Intel Xeon Skylake processor, the A64FX achieved higher performance in a memory bandwidth-intensive application thanks to its high memory bandwidth. However, we confirmed that the performance of the A64FX decreased due to a lack of out-of-order resources. To mitigate this problem, the “loop fission” function of the Fujitsu compiler was used to improve the performance.

48 citations

Journal Article•10.1049/IET-ITS.2019.0481•

License plate segmentation and recognition system using deep learning and OpenVINO

Riel Castro-Zunti, Juan Yepez, Seok-Bum Ko

01 Feb 2020-IET Intelligent Transport Systems

TL;DR: An embedded system for fast and accurate license plate segmentation and recognition using a modified single shot detector (SSD) with a feature extractor based on depthwise separable convolutions and linear bottlenecks is proposed.

Abstract: As Intelligent Transportation Systems applications increase in prevalence, Automatic License Plate Recognition solutions must be made continually faster and more accurate. The authors propose an embedded system for fast and accurate license plate segmentation and recognition using a modified single shot detector (SSD) with a feature extractor based on depthwise separable convolutions and linear bottlenecks. The feature extractor requires fewer parameters than the original SSD + VGG implementation, enabling fast inference. Tested on the Caltech Cars dataset, the proposed model achieves 96.46% segmentation and 96.23% recognition accuracy. Tested on the UCSD-Stills dataset, the proposed model achieves 99.79% segmentation and 99.79% recognition accuracy. The authors achieve a per-plate (resized to 300 × 300 px) processing time of 59 ms on an Intel Xeon CPU with 12 cores (2.60 GHz per core), 14 ms using the same CPU and OpenVINO (a neural network acceleration platform), and 66 ms using the proposed low-cost embedded system, a Raspberry Pi 3 and Intel Neural Compute Stick 2 with OpenVINO.

43 citations

Proceedings Article•10.1109/IPDPSW50202.2020.00070•

Porting a Legacy CUDA Stencil Code to oneAPI

Steffen Christgau¹, Thomas Steinke¹•Institutions (1)

Zuse Institute Berlin¹

18 May 2020

TL;DR: This paper presents early experiences with both the compatibility tool and oneAPI, as well as the employed extension to the SYCL programming standard, for the tsunami simulation code easyWave, and compares the original code running on Xeon processors using OpenMP as well as CUDA with the performance of the DPC++ counterpart.

Abstract: Recently, Intel released the oneAPI programming environment. With Data Parallel C++ (DPC++), oneAPI enables codes to target multiple hardware architectures like multi-core CPUs, GPUs, and even FPGAs or other hardware using a single source. For legacy codes that were written for Nvidia GPUs, a compatibility tool is provided which facilitates the transition to the SYCL-based DPC++ programming language. This paper presents early experiences with both the compatibility tool and oneAPI, as well as the employed extension to the SYCL programming standard, for the tsunami simulation code easyWave. A performance study compares the original code running on Xeon processors using OpenMP as well as CUDA with the performance of the DPC++ counterpart on multicore CPUs as well as integrated GPUs.

42 citations

Journal Article•10.1016/J.FUTURE.2020.02.069•

A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit

Filip Petrovič¹, David Střelák², David Střelák¹, Jana Hozzová¹, Jaroslav Ol’ha¹, Richard Trembecký¹, Siegfried Benkner³, Jiří Filipovič¹, Jiří Filipovič³•Institutions (3)

Masaryk University¹, Spanish National Research Council², University of Vienna³

01 Jul 2020-Future Generation Computer Systems

TL;DR: The Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example when input data changes, which is key to performance portability.

35 citations

Proceedings Article•10.1109/PMBS51919.2020.00007•

The Performance and Energy Efficiency Potential of FPGAs in Scientific Computing

Tan Nguyen¹, Samuel Williams¹, Marco Siracusa², Colin MacLean¹, Douglas W. Doerfler¹, Nicholas J. Wright¹•Institutions (2)

Lawrence Berkeley National Laboratory¹, Polytechnic University of Milan²

1 Nov 2020

TL;DR: In this article, the authors use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications, and compare Intel Arria 10 and Xilinx U280 performance against Intel Xeon, Intel Xeon Phi, and NVIDIA V100 GPUs.

Abstract: Hardware specialization is a promising direction for the future of digital computing. Reconfigurable technologies enable hardware specialization with modest non-recurring engineering cost. In this paper, we use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications. In order to properly evaluate performance, we not only compare Intel Arria 10 and Xilinx U280 performance against Intel Xeon, Intel Xeon Phi, and NVIDIA V100 GPUs, but we also extend the Empirical Roofline Toolkit (ERT) to FPGAs in order to assess our results in terms of the Roofline Model. Although FPGA performance is known to be far less than that of a GPU, we also benchmark the energy efficiency of each platform for the scientific kernels comparing to microbenchmark and technological limits. Results show that while FPGAs struggle to compete in absolute terms with GPUs on memory- and compute-intensive kernels, they require far less power and can deliver nearly the same energy efficiency.
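
The Roofline Model mentioned above bounds attainable performance by the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below shows that bound in Python; the platform numbers are hypothetical placeholders, not measurements from the paper, and the function names are made up for this illustration.

```python
# Roofline bound: attainable GFLOP/s = min(peak, bandwidth * arithmetic intensity).
# All platform numbers below are hypothetical, not taken from the paper.

def roofline_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Upper bound on performance for a kernel with the given FLOPs per byte."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOPs/byte) where a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs

PEAK = 1000.0       # GFLOP/s, assumed
BANDWIDTH = 100.0   # GB/s, assumed

memory_bound = roofline_gflops(PEAK, BANDWIDTH, 0.25)   # stream-like kernel
compute_bound = roofline_gflops(PEAK, BANDWIDTH, 20.0)  # dense compute kernel
print(memory_bound, compute_bound, ridge_point(PEAK, BANDWIDTH))
```

Kernels whose arithmetic intensity lies below the ridge point are limited by bandwidth, so extra compute units do not help them.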

34 citations

Proceedings Article•10.1109/HCS49909.2020.9220434•

New 3rd Gen Intel® Xeon® Scalable Processor (Codename: Ice Lake-SP)

Irma Esmer Papazian¹•Institutions (1)

Intel¹

1 Aug 2020

34 citations

Posted Content•

Efficient Execution of Quantized Deep Learning Models: A Compiler Approach.

Animesh Jain, Shoubhik Bhattacharya, Masahiro Masuda, Vin Sharma, Yida Wang

18 Jun 2020-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: This paper addresses the challenges of executing quantized deep learning models on diverse hardware platforms by proposing an augmented compiler approach: a new dialect called Quantized Neural Network (QNN) that extends the compiler's internal representation with a quantization context.

Abstract: A growing number of applications implement predictive functions using deep learning models, which require heavy use of compute and memory. One popular technique for increasing resource efficiency is 8-bit integer quantization, in which 32-bit floating point numbers (fp32) are represented using shorter 8-bit integer numbers. Although deep learning frameworks such as TensorFlow, TFLite, MXNet, and PyTorch enable developers to quantize models with only a small drop in accuracy, they are not well suited to execute quantized models on a variety of hardware platforms. For example, TFLite is optimized to run inference on ARM CPU edge devices but it does not have efficient support for Intel CPUs and Nvidia GPUs. In this paper, we address the challenges of executing quantized deep learning models on diverse hardware platforms by proposing an augmented compiler approach. A deep learning compiler such as Apache TVM can enable the efficient execution of models from various frameworks on various targets. Many deep learning compilers today, however, are designed primarily for fp32 computation and cannot optimize a pre-quantized INT8 model. To address this issue, we created a new dialect called Quantized Neural Network (QNN) that extends the compiler's internal representation with a quantization context. With this quantization context, the compiler can generate efficient code for pre-quantized models on various hardware platforms. As implemented in Apache TVM, we observe that the QNN-augmented deep learning compiler achieves speedups of 2.35x, 2.15x, 1.35x and 1.40x on Intel Xeon Cascade Lake CPUs, Nvidia Tesla T4 GPUs, ARM Raspberry Pi3 and Pi4, respectively, against well optimized fp32 execution, and comparable performance to the state-of-the-art framework-specific solutions.
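
As a rough illustration of 8-bit integer quantization, the sketch below maps fp32 values to signed 8-bit integers with a scale and zero point and then maps them back. This is a generic affine scheme written for this listing, not the QNN dialect or any TVM API; the function names, scale, and sample values are assumptions.

```python
# Generic affine quantization sketch (assumed scheme, not the QNN dialect):
#   q = clamp(round(x / scale) + zero_point, -128, 127)
#   x ~ (q - zero_point) * scale

def quantize(values, scale, zero_point=0):
    """Map floats to signed 8-bit integers, saturating at the int8 range."""
    quantized = []
    for x in values:
        q = round(x / scale) + zero_point
        quantized.append(max(-128, min(127, q)))
    return quantized

def dequantize(quantized, scale, zero_point=0):
    """Recover approximate float values from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]

weights = [1.0, -3.0, 0.3, 100.0]          # hypothetical fp32 values
q = quantize(weights, scale=0.5)           # 100.0 saturates at 127
restored = dequantize(q, scale=0.5)
print(q, restored)
```

The round trip shows the two sources of error: rounding to the nearest step (0.3 becomes 0.5) and saturation for values outside the representable range (100.0 becomes 63.5).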

30 citations

Journal Article•10.1109/ACCESS.2020.3023946•

FPGAN: An FPGA Accelerator for Graph Attention Networks With Software and Hardware Co-Optimization

Weian Yan¹, Weiqin Tong¹, Xiaoli Zhi¹•Institutions (1)

Shanghai University¹

14 Sep 2020-IEEE Access

TL;DR: This research implements an FPGA-based accelerator called FPGAN for graph attention networks that achieves significant improvement on performance and energy efficiency without losing accuracy compared with PyTorch baseline.

Abstract: The Graph Attention Networks (GATs) exhibit outstanding performance in multiple authoritative node classification benchmark tests (including transductive and inductive). The purpose of this research is to implement an FPGA-based accelerator called FPGAN for graph attention networks that achieves significant improvement on performance and energy efficiency without losing accuracy compared with PyTorch baseline. It eliminates the dependence on digital signal processors (DSPs) and large amounts of on-chip memory and can even work well on low-end FPGA devices. We design FPGAN with software and hardware co-optimization across the full stack from algorithm through architecture. Specifically, we compress model to reduce the model size, quantify features to perform fixed-point calculation, replace multiplication addition cell (MAC) with shift addition units (SAUs) to eliminate the dependence on DSPs, and design an efficient algorithm to approximate SoftMax function. We also adjust the activation functions and fuse operations to further reduce the computation requirement. Moreover, all data is vectorized and aligned for scalable vector computation and efficient memory access. All the above optimizations are integrated into a universal hardware pipeline for various structures of GATs. We evaluate our design on an Inspur F10A board with an Intel Arria 10 GX1150 and 16 GB DDR3 memory. Experimental results show that FPGAN can achieve 7.34 times speedup over Nvidia Tesla V100 and 593 times over Xeon CPU Gold 5115 while maintaining accuracy, and 48 times and 2400 times on energy efficiency respectively.

25 citations

Journal Article•10.1109/TSUSC.2020.3045195•

The Impact of CPU Voltage Margins on Power-Constrained Execution

Panos Koutsovasilis¹, Christos D. Antonopoulos¹, Nikolaos Bellas¹, Spyros Lalis¹, George N. Papadimitriou², Athanasios Chatzidimitriou², Dimitris Gizopoulos²•Institutions (2)

University of Thessaly¹, National and Kapodistrian University of Athens²

16 Dec 2020

TL;DR: The impact of reducing voltage margins beyond the nominal level on the efficiency of CPU power capping mechanisms is investigated, for three commercial systems, two Applied Micro ARMv8 micro-servers (X-Gene2 and X-Gene3) and an Intel x86-64 (Xeon E3).

Abstract: CPUs typically operate at a voltage which is higher than what is strictly required, using voltage margins to account for process variability and anticipate any combination of adverse operating conditions. However, these worst-case scenarios occur rarely, if ever, thus the operating voltage is overly pessimistic resulting in excessive power dissipation which leads to decreased performance under power capping. In this paper, we investigate the impact of reducing voltage margins beyond the nominal level on the efficiency of CPU power capping mechanisms, for three commercial systems, two Applied Micro ARMv8 micro-servers (X-Gene2 and X-Gene3) and an Intel x86-64 (Xeon E3). We show that CPU power capping at reduced voltage margins compared with Intel's RAPL and Dynamic Frequency Scaling (DFS) mechanisms results in performance improvement by up to 64% and 24% on average, respectively. In combination with state-of-the-art thread packing, the reduction of CPU voltage margins results in 36%, 33% and 27% performance improvement compared with RAPL and DFS for the Xeon E3 and the X-Gene processors, respectively. Also, we validate the robustness of our approach with a set of long-running experiments and show that significant energy gains can be achieved even when considering the cost of checkpointing and recovery in large-scale systems.

Journal Article•10.1016/J.CPC.2020.107351•

Semi-Lagrangian Vlasov simulation on GPUs

Lukas Einkemmer¹•Institutions (1)

University of Innsbruck¹

01 Sep 2020-Computer Physics Communications

TL;DR: In this article, a semi-Lagrangian discontinuous Galerkin (SLDG) scheme is used for discretization of the Vlasov equation on GPUs.

Proceedings Article•10.1109/IPDPSW.2018.00027•

High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

Hamid Reza Zohouri¹, Artur Podobas¹, Satoshi Matsuoka¹•Institutions (1)

Tokyo Institute of Technology¹

14 Feb 2020-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: In this paper, the authors evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis (HLS) and show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, their design technique with combined spatial and temporal blocking remains effective.

Abstract: In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, compute performance compared to first-order stencils. We use an OpenCL-based design that, apart from parameterizing performance knobs, also parameterizes the stencil radius. Furthermore, we show that our performance model exhibits the same accuracy as first-order stencils in predicting the performance of high-order ones. On an Intel Arria 10 GX 1150 device, for 2D and 3D star-shaped stencils, we achieve over 700 and 270 GFLOP/s of compute performance, respectively, up to a stencil radius of four. These results outperform the state-of-the-art YASK framework on a modern Xeon for 2D and 3D stencils, and outperform a modern Xeon Phi for 2D stencils, while achieving competitive performance in 3D. Furthermore, our FPGA design achieves better power efficiency in almost all cases.

Proceedings Article•10.1109/FCCM48280.2020.00028•

FP-AMG: FPGA-Based Acceleration Framework for Algebraic Multigrid Solvers

Pouya Haghi¹, Tong Geng¹, Anqi Guo¹, Tianqi Wang², Martin C. Herbordt¹•Institutions (2)

Boston University¹, University of Science and Technology of China²

3 May 2020

TL;DR: This work proposes an efficient FPGA-based reconfigurable framework, called FP-AMG, for high-performance AMG calculation, and proposes a novel and scalable architecture that can be reused for all kernels in AMG.

Abstract: Partial Differential Equations (PDEs) are fundamental to many real-world scientific computing applications and so their optimization has undergone decades of study. Algebraic multigrid (AMG) is one of the most well-known solvers, being widely adopted in High Performance Computing (HPC) due to its good scalability. Acceleration of AMG is known to be very challenging, due to the following reasons: (1) irregular computation patterns, (2) random memory access, and (3) a large number of kernels with various computation types. To the best of our knowledge, there is no prior work on FPGA-based acceleration of AMG. To tackle these challenges, we propose an efficient FPGA-based reconfigurable framework, called FP-AMG, for high-performance AMG calculation. In order to obtain full pipeline utilization, we propose a novel and scalable architecture that can be reused for all kernels in AMG. Given that AMG is strictly memory-bound, we propose algorithmic and architectural optimizations to ensure nearly ideal use of memory bandwidth. The efficiency of FP-AMG is evaluated with six well-known benchmarks on two FPGA devices: one with and one without high bandwidth memory (HBM). The experimental results are compared with a highly optimized Intel Xeon E5-2680-V4 implementation of the state-of-the-art HYPRE library. Our experiments show that FP-AMG can achieve average speedups of 2.5× and 6.6×, for FPGAs without and with HBM, respectively.

Proceedings Article•10.1109/CCGRID49817.2020.00-47•

The Power of ARM64 in Public Clouds

Qingye Jiang¹, Young Choon Lee², Albert Y. Zomaya¹•Institutions (2)

University of Sydney¹, Macquarie University²

1 May 2020

TL;DR: This paper studies the performance characteristics of the Amazon Graviton Processor – an ARM64 processor with the Cortex-A72 micro-architecture – using the A1 (Graviton) product family on AWS EC2, with comparisons to the I3 and M5 product families based on Intel Xeon processors.

Abstract: ARM processors, with their low power consumption and heat dissipation, have been highly successful in embedded systems. In the recent past, there have been attempts to adopt these energy-efficient processors for servers in data centers. However, a fundamental question that remains open for ARM-based systems on the server side is whether they are capable of handling compute-intensive workloads at scale. This paper gives our answer to this question with an empirical approach. We study the performance characteristics of the Amazon Graviton Processor – an ARM64 processor with the Cortex-A72 micro-architecture – using the A1 (Graviton) product family on AWS EC2, with comparisons to the I3 and M5 product families based on Intel Xeon processors. We use a combination of micro benchmarks and performance counters to identify that the lack of L3 cache and the slower memory access speed limit Graviton’s capability in achieving higher performance. We confirm Graviton’s capability in handling various large-scale horizontally scalable compute-intensive workloads, including multi-tier web service, video transcoding and terabyte scale sorting. In our large-scale evaluations, the test worker fleet has up to 1600 vCPU cores, which is by far the largest ARM64 cluster that has been reported. We observe that the A1 product family achieves the same price-performance in multi-tier web service, up to 37% cost saving in video transcoding, and up to 65% cost saving in terabyte scale sorting, as compared with the I3 and M5 product families.

Book Chapter•10.1007/978-3-030-58144-2_8•

Towards an auto-tuned and task-based SpMV (LASs Library)

Sandra Catalán¹, Tetsuzo Usui², Leonel Toledo³, Xavier Martorell⁴, Jesús Labarta⁴, Pedro Valero-Lara³•Institutions (4)

Complutense University of Madrid¹, Fujitsu², Barcelona Supercomputing Center³, Polytechnic University of Catalonia⁴

22 Sep 2020

TL;DR: A novel approach to parallelize the SpMV kernel included in LASs (Linear Algebra routines on OmpSs) library, after a deep review and analysis of several well-known approaches.

Abstract: We present a novel approach to parallelize the SpMV kernel included in the LASs (Linear Algebra routines on OmpSs) library, after a deep review and analysis of several well-known approaches. LASs is based on OmpSs, a task-based runtime that extends OpenMP directives, providing more flexibility to apply new strategies. Based on tasking and nesting, with the aim of improving the workload imbalance inherent to the SpMV operation, we present a strategy especially useful for highly imbalanced input matrices. In this approach, the number of created tasks is dynamically decided in order to maximize the use of the resources of the platform. Throughout this paper, SpMV behavior depending on the selected strategy (state of the art and proposed strategies) is deeply analyzed, setting in this way the base for a future auto-tunable code that is able to select the most suitable approach depending on the input matrix. The experiments of this work were carried out for a set of 12 matrices from the Suite Sparse Matrix Collection, all of them with different characteristics regarding their sparsity. The experiments were performed on a node of the Marenostrum 4 supercomputer (with two Intel Xeon sockets, 24 cores each) and on a node of the Dibona cluster (using one ARM ThunderX2 socket with 32 cores). Our tests show that, for Intel Xeon, the best parallelization strategy reduces the execution time of the reference MKL multi-threaded version by up to 67%. On ARM ThunderX2, the reduction is up to 56% with respect to the OmpSs parallel reference.

Proceedings Article•10.1109/CCGRID49817.2020.00-58•

CSR2: A New Format for SIMD-accelerated SpMV

Bian Bian¹, Jianqiang Huang¹, Runting Dong¹, Lingbin Liu¹, Xiaoying Wang¹•Institutions (1)

Qinghai University¹

11 May 2020

TL;DR: A new sparse matrix storage format, CSR2 (Compressed Sparse Row 2), is proposed for SIMD (Single Instruction Multiple Data)-accelerated SpMV (sparse matrix-vector multiplication) on processor platforms with SIMD vectorization.

Abstract: SpMV (Sparse matrix-vector multiplication) has attracted the attention of researchers in related fields at home and abroad, and improving SpMV performance has been a research hot spot. In this paper, we propose a new sparse matrix storage format CSR2 (Compressed Sparse Row 2) suitable for SIMD (Single Instruction Multiple Data)-accelerated SpMV. First, the format operation of CSR2 is easy to implement and has a low overhead of conversion. Second, CSR2 is a new single format and suitable for use on processor platforms with SIMD vectorization. We compare the SpMV algorithm based on CSR2 with the one based on the current most advanced single format CSR5 (Compressed Sparse Row 5) on two mainstream high-performance processors: Intel Core i7-7700HQ CPU and Intel Xeon CPU E5-2670 v3. We choose 10 sets of regular matrices and 3 sets of irregular matrices to be used as the benchmark suite. Experiments show that for the 13 sets of regular and irregular matrices in the benchmark suite, CSR2 has an average performance improvement of more than 50% compared to CSR5 (up to 125% on Intel Core i7-7700HQ CPU and 303% on Intel Xeon CPU E5-2670 v3). For real applications with multiple iterations, using our CSR2 can bring low-overhead format conversion and high-throughput computing performance.

Journal Article•10.1016/J.FUTURE.2020.01.051•

Accelerated CPU–GPUs implementations for quaternion polar harmonic transform of color images

Ahmad Salah¹, Ahmad Salah², Kenli Li¹, Khalid M. Hosny², Mohamed M. Darwish³, Qi Tian⁴•Institutions (4)

Hunan University¹, Zagazig University², Assiut University³, University of Texas at San Antonio⁴

03 Feb 2020-Future Generation Computer Systems

TL;DR: This work presents a set of parallel implementations of quaternion moment image representations on different parallel architectures and proposes a loop mitigation technique to boost the level of parallelism in massively parallel environments, balance the parallel workload, and reduce both the space complexity and synchronization overhead.

Posted Content•

Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

Jiannan Tian¹, Cody Rivera², Sheng Di³, Jieyang Chen⁴, Xin Liang⁴, Dingwen Tao¹, Franck Cappello³•Institutions (4)

Washington State University¹, University of Alabama², Argonne National Laboratory³, Oak Ridge National Laboratory⁴

20 Oct 2020-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: This paper proposes and implements an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: how to parallelize the entire Huffman encoding algorithm, including codebook construction, and how to fully utilize the high memory bandwidth of modern GPU architectures.

Abstract: Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient entropy coding algorithm in information theory, such that it can be found as a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today's HPC applications increasingly rely on accelerators such as GPUs on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the entire data processing. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory-bandwidth feature of modern GPU architectures. The detailed contribution is four-fold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging the state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare with our implemented multi-threaded Huffman encoder. Experiments show that our solution can improve the encoding throughput by up to 5.0X and 6.8X on NVIDIA RTX 5000 and V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3X over the multi-thread encoder on two 28-core Xeon Platinum 8280 CPUs.
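
For reference, the codebook construction step that the paper parallelizes can be sketched sequentially in Python with a priority queue. This is the classic textbook construction, not the GPU algorithm from the paper; the function name and the sample input are made up for the illustration.

```python
# Sequential textbook Huffman codebook construction (not the paper's GPU method).
# Returns the code length in bits assigned to each symbol.
import heapq
from collections import Counter

def huffman_code_lengths(data):
    freq = Counter(data)
    if len(freq) == 1:                       # a lone symbol still needs one bit
        return {next(iter(freq)): 1}
    # Heap entries: (subtree frequency, unique tie-breaker, symbols in subtree).
    heap = [(count, idx, [sym]) for idx, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freq}
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, syms1 = heapq.heappop(heap)   # two least frequent subtrees
        c2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:              # merging pushes them one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, next_id, syms1 + syms2))
        next_id += 1
    return lengths

text = "aaaabbc"                             # hypothetical input
lengths = huffman_code_lengths(text)
encoded_bits = sum(lengths[ch] for ch in text)
print(lengths, encoded_bits)
```

Each merge depends on the result of the previous one, which hints at why this step is hard to parallelize and why the paper treats codebook construction as a challenge in its own right.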

Proceedings Article•10.1109/ICCD50377.2020.00105•

Exploring Better Speculation and Data Locality in Sparse Matrix-Vector Multiplication on Intel Xeon

Haoran Zhao¹, Tian Xia¹, Chenyang Li¹, Wenzhe Zhao¹, Nanning Zheng¹, Pengju Ren¹•Institutions (1)

Xi'an Jiaotong University¹

1 Oct 2020

TL;DR: In this article, a fast preprocessing method is proposed to divide the matrix into sub-matrices and determine the critical performance bound of submatrices according to the data distribution characteristics.

Abstract: Sparse Matrix-Vector Multiplication (SpMV) is a fundamental workload of numerous applications. However, on today's high-end superscalar CPUs, such as the Intel Xeon series, it is usually difficult to perform SpMV efficiently due to the irregular, matrix-dependent data access and computation pattern. While much research focuses on optimizing the memory bandwidth bound by improving data locality, this work dives into the execution of SpMV computation on an Intel Xeon CPU and reveals that the bad-speculation penalty is significant in many sparse matrices and too expensive to be ignored. We study and characterize sparsity structure types that are more vulnerable to the cache miss penalty or the bad speculation penalty, respectively. Based on this insight, we propose a fast preprocessing method, which divides the matrix into sub-matrices and determines the critical performance bound of sub-matrices according to the data distribution characteristics. On each sub-matrix, a combination of dedicated row reordering strategies is performed to efficiently alleviate its key performance bounds: bad speculation, cache miss, or both. Our matrix representation is based on the standard Compressed Sparse Row (CSR) format, and can be easily adapted to existing SpMV libraries. Our approach is evaluated on an Intel Xeon Gold 6146 processor with a wide range of matrices from the SuiteSparse benchmarks. The results demonstrate that the proposed approach achieves an average 1.8× speedup (up to 2.5×) over multi-threaded MKL Sparse Routines, with a quite low pre-processing cost. Additionally, when used in conjunction with MKL's original optimization method, our approach can further raise the speedup to 3.6× on average (up to 8.3×). This result indicates that our method can serve as a fast and wide-spectrum optimization method which is compatible with existing routines.

Book Chapter•10.1007/978-3-030-13705-2_16•

Parallel Iterative Solution of Large Sparse Linear Equation Systems on the Intel MIC Architecture

Hana Alyahya, Rashid Mehmood¹, Iyad Katib¹•Institutions (1)

King Abdulaziz University¹

1 Jan 2020

TL;DR: This chapter investigates the performance of parallel implementations of the Jacobi method on Knights Corner (KNC), the first generation of the Intel MIC architectures, and measures their performance in terms of execution time, offloading time, and speedup.

Abstract: Many important scientific, engineering, and smart city applications require solving large sparse linear equation systems. The numerical methods for solving linear equations can be categorised into direct methods and iterative methods. Jacobi method is one of the iterative solvers that has been widely used due to its simplicity and efficiency. Its performance is affected by factors including the storage format, the specific computational algorithm, and its implementation. While the performance of Jacobi has been studied extensively on conventional CPU architectures, research on its performance on emerging architectures, such as the Intel Many Integrated Core (MIC) architecture, is still in its infancy. In this chapter, we investigate the performance of parallel implementations of the Jacobi method on Knights Corner (KNC), the first generation of the Intel MIC architectures. We implement Jacobi with two storage formats, Compressed Sparse Row (CSR) and Modified Sparse Row (MSR), and measure their performance in terms of execution time, offloading time, and speedup. We report results of sparse matrices with over 28 million rows and 640 million non-zero elements acquired from 13 diverse application domains. The experimental results show that our Jacobi parallel implementation on MIC achieves speedups of up to 27.75× compared to the sequential implementation. It also delivers a speedup of up to 3.81× compared to a powerful node comprising 24 cores in two Intel Xeon E5-2695v2 processors.
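
As an illustration of the kind of kernel this chapter studies, the following is a minimal, sequential sketch of the Jacobi iteration over a matrix stored in CSR format. It is not the authors' MIC implementation; the function name `jacobi_csr`, the fixed iteration count, and the tiny test system are assumptions chosen for illustration only.

```python
def jacobi_csr(values, col_idx, row_ptr, b, iterations=50):
    """Solve A x = b with the Jacobi method, A given in CSR form.

    values  : non-zero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i in `values`
    Assumes every row has a non-zero diagonal entry.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [0.0] * n
        for i in range(n):
            diag = 0.0
            off_diag_sum = 0.0
            # Walk only the stored non-zeros of row i.
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = values[k]
                else:
                    off_diag_sum += values[k] * x[j]
            # Each row update uses only the previous iterate, so rows
            # are independent and could be processed in parallel.
            x_new[i] = (b[i] - off_diag_sum) / diag
        x = x_new
    return x


# Hypothetical 2x2 diagonally dominant system [[4, 1], [2, 5]] x = [6, 12].
solution = jacobi_csr(
    values=[4.0, 1.0, 2.0, 5.0],
    col_idx=[0, 1, 0, 1],
    row_ptr=[0, 2, 4],
    b=[6.0, 12.0],
)
```

Because every row update reads only the previous iterate, the outer loop over rows is the natural place to distribute work across cores.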

Proceedings Article•10.1109/FCCM48280.2020.00029•

Algorithm-Hardware Co-design for BQSR Acceleration in Genome Analysis ToolKit

Michael Lo¹, Zhenman Fang², Jie Wang¹, Peipei Zhou¹, Mau-Chung Frank Chang¹, Jason Cong¹•Institutions (2)

University of California, Los Angeles¹, Simon Fraser University²

3 May 2020

TL;DR: This paper focuses on the algorithm and hardware co-design for the Base Quality Score Re-calibration (BQSR) step in GATK, which is an important and time-consuming step to correct systematic errors made by a sequencing machine.

Abstract: Genome sequencing is one of the key applications in healthcare and has great potential to realize precision medicine and personalized healthcare. However, its computing process is very time consuming. Even pre-processing the raw sequence data of a whole genome for a single person into analysis-ready data can take several days on a single-core CPU. In this paper, we propose to accelerate the performance of the widely used Genome Analysis ToolKit (GATK) using FPGAs. More specifically, we focus on the algorithm and hardware co-design for the Base Quality Score Re-calibration (BQSR) step in GATK, which is an important and time-consuming step to correct systematic errors made by a sequencing machine. Prior studies did not consider hardware acceleration for BQSR because it requires a large amount of memory with random access and has a lot of control flow. To address these challenges, we first adapt the algorithm to resolve the random memory access conflicts to achieve a fully pipelined accelerator design and reduce its dataset size. Second, we leverage the newly introduced large-capacity UltraRAM (URAM) in Xilinx UltraScale+ FPGAs to buffer BQSR's large dataset on chip, and further optimize its operating frequency. Finally, we also explore the coarse-grained pipeline and parallelism to improve the overall performance of the BQSR accelerator. Compared to the latest software implementation of BQSR on GATK 4.1, running on single-thread and 56-thread CPUs (14nm Xeon E5-2680 v4), our FPGA accelerator running on a Xilinx 16nm UltraScale+ VCU1525 board achieves up to 40.7x and 8.5x speedups, respectively.

Journal Article•10.1109/TCBB.2018.2884701•

Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor

Jose Manuel Herruzo¹, Sonia Gonzalez-Navarro¹, Pablo Ibáñez-Marín², Víctor Viñals-Yúfera², Jesús Alastruey-Benedé², Oscar Plata¹•Institutions (2)

University of Málaga¹, University of Zaragoza²

01 Jul 2020-IEEE/ACM Transactions on Computational Biology and Bioinformatics

TL;DR: This paper proposes a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), which integrates ultra-high-bandwidth stacked memory technology.

Abstract: FM-index is a compact data structure suitable for fast matches of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to the data access. As a result of the analysis, we propose a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95 percent of the peak random access bandwidth limit when executed on the KNL and almost all of the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the obtained throughput in KNL is much higher than the results reported for GPUs in the literature.

Journal Article•10.1109/MM.2019.2949986•

Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects

Ammar Ahmad Awan¹, Arpan Jain¹, Ching-Hsiang Chu¹, Hari Subramoni¹, D.K. Panda¹•Institutions (1)

Ohio State University¹

01 Jan 2020-IEEE Micro

TL;DR: This article profiles distributed deep-learning workloads with Horovod across several interconnects, designs a message-padding scheme for Horovod that yields significantly smoother allreduce latency profiles, and reports cases where end-to-end training improves.

Abstract: Heterogeneous high-performance computing systems with GPUs are equipped with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed deep learning (DL). In this article, we choose Horovod, a distributed training middleware, to analyze and profile various DNN training workloads using TensorFlow and PyTorch in addition to standard MPI microbenchmarks. We use a wide variety of systems with CPUs like Intel Xeon and IBM POWER9, GPUs like Volta V100, and various interconnects to analyze the following metrics: 1) message-size with Horovod's tensor-fusion; 2) message-size without tensor-fusion; 3) number of MPI/NCCL calls; and 4) time taken by each MPI/NCCL call. We observed extreme performance variations for non-power-of-two message sizes on different platforms. To address this, we design a message-padding scheme for Horovod, illustrate significantly smoother allreduce latency profiles, and report cases where we observed improvement for end-to-end training.
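
The abstract does not spell out how the padding scheme works. The sketch below shows one plausible reading, assuming that a message whose element count is not a power of two is padded with zeros up to the next power of two before the allreduce and trimmed afterwards. The helper names `next_power_of_two`, `pad_message`, and `unpad_message` are hypothetical and are not part of Horovod's API.

```python
def next_power_of_two(n):
    """Return the smallest power of two that is >= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("message size must be at least 1")
    return 1 << (n - 1).bit_length()


def pad_message(buffer, fill=0.0):
    """Pad a list of values to a power-of-two length.

    Returns the padded list and the original length so the caller can
    trim the result after the collective completes. Zero is a safe fill
    value for a sum-based allreduce because it does not change the sums
    of the real elements.
    """
    original_len = len(buffer)
    padded_len = next_power_of_two(original_len)
    return buffer + [fill] * (padded_len - original_len), original_len


def unpad_message(buffer, original_len):
    """Drop the padding elements added by pad_message."""
    return buffer[:original_len]


# A 5-element gradient buffer is padded to 8 elements.
padded, n = pad_message([1.0, 2.0, 3.0, 4.0, 5.0])
restored = unpad_message(padded, n)
```

The trade-off in this reading is extra bytes on the wire in exchange for staying on the message sizes that the abstract reports as behaving more predictably.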

Journal Article•10.1109/ACCESS.2019.2959905•

A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes

Hamidreza Khaleghzadeh¹, Ravi Reddy Manumachu¹, Alexey Lastovetsky¹•Institutions (1)

University College Dublin¹

01 Jan 2020-IEEE Access

TL;DR: This paper proposes a hierarchical two-level data partitioning algorithm that minimizes the parallel execution time of data-parallel applications on clusters of $h$ identical nodes where each node has $c$ heterogeneous processors, and proposes an extension of the algorithm for clusters of non-identical nodes.

Abstract: Modern HPC platforms are highly heterogeneous with tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to address the twin critical concerns of performance and energy efficiency. Due to this inherent characteristic, processing elements contend for shared on-chip resources such as Last Level Cache (LLC), interconnect, etc. and shared nodal resources such as DRAM, PCI-E links, etc., resulting in complexities such as resource contention, non-uniform memory access (NUMA), and accelerator-specific limitations such as limited main memory thereby necessitating support for efficient out-of-card execution. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we propose a hierarchical two-level data partitioning algorithm minimizing the parallel execution time of data-parallel applications on clusters of $h$ identical nodes where each node has $c$ heterogeneous processors. This algorithm takes as input $c$ discrete speed functions of cardinality $m$ corresponding to the $c$ heterogeneous processors. It does not make any assumptions about the shapes of these functions. Unlike load balancing algorithms, optimal solutions found by the algorithm may not load-balance an application in terms of execution time. The proposed algorithm has low time complexity of $O(m^{2} \times h + m^{3} \times c^{3})$ unlike the state-of-the-art algorithm solving the same problem with the complexity of $O(m^{3} \times c^{3} \times h^{3})$ . We also propose an extension of the algorithm for clusters of $h$ non-identical nodes where each node has $c$ heterogeneous processors. 
We experimentally demonstrate the optimality of our algorithm using two well-known and highly optimized multi-threaded data-parallel applications, matrix-matrix multiplication and 2D fast Fourier transform, on a heterogeneous multi-accelerator NUMA node containing an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor and a simulated homogeneous cluster of such nodes.

Journal Article•10.1109/TCAD.2020.3012860•

ReSQM: Accelerating Database Operations Using ReRAM-Based Content Addressable Memory

Huize Li¹, Hai Jin¹, Long Zheng¹, Xiaofei Liao¹•Institutions (1)

Huazhong University of Science and Technology¹

02 Oct 2020-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: This article introduces ReSQM, a novel ReCAM-based accelerator that can dramatically reduce the response time of database systems, and presents a new data mapping mechanism that enables in-situ in-memory computation for SELECTION operating on intermediate results.

Abstract: The huge amount of data puts great pressure on the processing efficiency of database systems. By leveraging the in-situ computing ability of emerging nonvolatile memory, processing-in-memory (PIM) technology shows great potential in accelerating database operations against traditional architectures without data movement overheads. In this article, we introduce ReSQM, a novel ReCAM-based accelerator, which can dramatically reduce the response time of database systems. The key novelty of ReSQM is that some commonly used database queries that would otherwise be processed inefficiently in previous studies can be accomplished in situ with massively high parallelism by exploiting the PIM-enabled ReCAM array. ReSQM supports some typical database queries (such as SELECTION, SORT, and JOIN) effectively based on the limited computational mode of the ReCAM array. ReSQM is also equipped with a series of hardware-algorithm co-designs to maximize efficiency. We present a new data mapping mechanism that enables in-situ in-memory computation for SELECTION operating upon intermediate results. We also develop a count-based ReCAM-specific algorithm to enable in-memory sorting without any row swapping. The relational comparisons are integrated for accelerating inequality join by making a few modifications to the ReCAM cells with negligible hardware overhead. The experimental results show that ReSQM can improve the (energy) efficiency by 611× (193×), 19× (17×), 59× (43×), and 307× (181×) in comparison to a 10-core Intel Xeon E5-2630v4 processor for SELECTION, SORT, equi-join, and inequality join, respectively. In contrast to state-of-the-art CMOS-based CAM, GPU, FPGA, NDP, and PIM solutions, ReSQM can also offer 2.2×–39× speedups.
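
The idea of sorting by counting, without swapping rows, can be illustrated in software. The sketch below is only an analogy and not the ReCAM-specific algorithm from the article: it assumes that each element's final position equals the number of elements that must precede it, which is the kind of count a content-addressable comparison could produce in parallel. The function name `rank_sort` is hypothetical.

```python
def rank_sort(values):
    """Sort by computing each element's rank; no element is ever swapped.

    The rank of values[i] is the number of elements that are strictly
    smaller, plus the number of equal elements that appear earlier
    (which keeps the sort stable and gives every element a unique slot).
    """
    n = len(values)
    result = [None] * n
    for i, v in enumerate(values):
        rank = sum(
            1
            for j, w in enumerate(values)
            if w < v or (w == v and j < i)
        )
        # Each element is written exactly once, straight to its final slot.
        result[rank] = v
    return result


ordered = rank_sort([3, 1, 2, 1])
```

In software this costs a quadratic number of comparisons; the appeal in an associative memory is that the comparisons for one element can be evaluated across all rows at once.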

Journal Article•10.1145/3380934•

Acceleration of PageRank with Customized Precision Based on Mantissa Segmentation

Thomas Grützmacher¹, Terry Cojean¹, Goran Flegar², Hartwig Anzt¹, Enrique S. Quintana-Ortí³•Institutions (3)

Karlsruhe Institute of Technology¹, James I University², Polytechnic University of Valencia³

9 Mar 2020

TL;DR: A communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges is described.

Abstract: We describe the application of a communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges. Our variable-precision strategy, using a customized precision format based on mantissa segmentation (CPMS), abandons the IEEE 754 single- and double-precision number representation formats employed in the standard implementation of PageRank, and instead handles the data in memory using a customized floating-point format. The customized format enables fast data access in different accuracy, prevents overflow/underflow by preserving the IEEE 754 double-precision exponent, and efficiently avoids data duplication, since all bits of the original IEEE 754 double-precision mantissa are preserved in memory, but re-organized for efficient reduced precision access. With this approach, the truncated values (omitting significand bits), as well as the original IEEE double-precision values, can be retrieved without duplicating the data in different formats. Our numerical experiments on an NVIDIA V100 GPU (Volta architecture) and a server equipped with two Intel Xeon Platinum 8168 CPUs (48 cores in total) expose that, compared with a standard IEEE double-precision implementation, the CPMS-based PageRank completes about 10% faster if high-accuracy output is needed, and about 30% faster if reduced output accuracy is acceptable.
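
The mantissa-segmentation idea can be sketched in a few lines. The example below assumes a split of each IEEE 754 double into two 32-bit segments: a head holding the sign, the full 11-bit exponent, and the top 20 mantissa bits, and a tail holding the remaining 32 mantissa bits. The actual segment layout used by CPMS may differ, and the function names are hypothetical.

```python
import struct


def split_double(x):
    """Split a double into (head, tail) 32-bit integer segments."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    head = bits >> 32          # sign + 11-bit exponent + top 20 mantissa bits
    tail = bits & 0xFFFFFFFF   # low 32 mantissa bits
    return head, tail


def from_head(head):
    """Reduced-precision read: use the head only, missing bits are zero."""
    return struct.unpack("<d", struct.pack("<Q", head << 32))[0]


def from_segments(head, tail):
    """Full-precision read: recombine both segments into the original."""
    return struct.unpack("<d", struct.pack("<Q", (head << 32) | tail))[0]


value = 0.1
head, tail = split_double(value)
approx = from_head(head)            # truncated value, half the memory traffic
exact = from_segments(head, tail)   # bit-for-bit identical to the input
```

Because the exponent lives entirely in the head, the truncated value keeps the full double-precision range, and no second copy of the data is needed to serve both precisions.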

Journal Article•10.1007/S11227-020-03174-5•

Design and implementation of an elastic processor with hyperthreading technology and virtualization for elastic server models

Parth Bir, Shylaja Vinaykumar Karatangi, Amrita Rai

24 Jan 2020-The Journal of Supercomputing

TL;DR: This paper presents design methodology and implementation of an elastic-natured 32-bit RISC-pipelined processor inspired from Intel Xeon and MIPS to function as a standard integrated platform for server models.

Abstract: The server models functioning in the industry are required to be more elastic in nature. They are constantly scaling-up and scaling-down on required computation power depending on different conditions. These elastic cloud platforms use accelerators like DSP’s, TPU’s, GPU’s, FPGA’s, and multi-core processors to provide exponential computing power and outsource their services. This process is not only costly and non-efficient but is also responsible for damaging the server’s hardware architecture. Furthermore, these additions degrade the level of threading and symmetric parallel processing capability of the architecture. Intel uses hyperthreading technology (HTT) to split the workload between hardware and operating system to avoid additions, but that too is only possible up until a certain limit. This paper presents design methodology and implementation of an elastic-natured 32-bit RISC-pipelined processor inspired from Intel Xeon and MIPS to function as a standard integrated platform for server models. It implements concepts of hyperthreading technology (HTT) and virtualization on hardware basis. It will allow to derive multiple outputs from units on hardware basis to enhance security and performance without compromising compatibility. The designed elastic core uses a probabilistic node-based closed-queuing network model for server analysis and implementation. Hence, elastic behavior from individual core microarchitecture to server model architecture enables a generic automated scaling self-aware optimization architecture.

Proceedings Article•10.1109/FPL50879.2020.00019•

HyperLogLog Sketch Acceleration on FPGA

Amit Kulkarni¹, Monica Chiosa¹, Thomas B. Preußer¹, Kaan Kara¹, David Sidler², Gustavo Alonso¹•Institutions (2)

ETH Zurich¹, Microsoft²

1 Aug 2020

TL;DR: This paper explores how to implement HyperLogLog on an FPGA and presents a multi-pipelined, high-cardinality implementation that delivers 1.8x higher throughput than the best-optimized multi-threaded HyperLogLog running on a dual-socket Intel Xeon E5-2630 v3 system.

Abstract: Data sketches are a set of widely used approximate data summarizing techniques. Their fundamental property is sub-linear memory complexity on the input cardinality, an important aspect when processing streams or data sets with a vast base domain such as URLs, IP addresses, user IDs, etc. Among the many data sketches available, HyperLogLog has become the reference for cardinality counting, i.e., how many distinct data items there are in a data set. It does not count every data item but provides probabilistic guarantees on the result, thereby reducing its memory footprint, and the result is often used to analyze data streams. In this paper, we explore how to implement HyperLogLog on an FPGA to benefit from the parallelism available and the ability to process data streams coming from high-speed networks. Our multi-pipelined high-cardinality HyperLogLog implementation delivers 1.8x higher throughput than the best-optimized multi-threaded HyperLogLog running on a dual-socket Intel Xeon E5-2630 v3 system with a total of 16 cores and 32 hyper-threads.
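
For readers unfamiliar with the sketch itself, the following is a minimal software version of HyperLogLog, unrelated to the FPGA design in the paper. It assumes a 64-bit hash taken from SHA-1, 2^12 registers, and the standard small-range (linear counting) correction; the function name `hll_estimate` and these parameter choices are illustrative assumptions.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Estimate the number of distinct items using 2**p small registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        index = h >> (64 - p)                      # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in `rest` (1-based);
        # an all-zero `rest` gets the maximum rank 64 - p + 1.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)               # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting while many
    # registers are still empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)
    return raw


# 10,000 distinct values, each seen twice; duplicates do not change the sketch.
stream = list(range(10000)) * 2
estimate = hll_estimate(stream)
```

Memory use depends only on the number of registers, not on the stream length, which is the sub-linear property the abstract refers to.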

Journal Article•10.1016/J.ASCOM.2020.100386•

Radio-astronomical imaging on graphics processors

Bram Veenboer, J. W. Romein

01 Jul 2020-Astronomy and Computing

TL;DR: A thorough performance analysis of the image-domain gridding algorithm for an Intel Xeon CPU, Intel Xeon Phi, and GPUs from AMD and NVIDIA shows that, by evaluating trigonometric functions in hardware, GPUs are both much faster and more energy efficient than a CPU or Xeon Phi.
