TL;DR: This paper empirically evaluates the effectiveness on two neural networks: AlexNet and ResNet-50 trained with the ImageNet-1k dataset while preserving the state-of-the-art test accuracy, and uses large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, for efficient usage of massive computing resources.
Abstract: In this paper, we investigate large scale computers' capability of speeding up deep neural networks (DNN) training. Our approach is to use large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, for efficient usage of massive computing resources. Our approach is generic, as we empirically evaluate the effectiveness on two neural networks: AlexNet and ResNet-50 trained with the ImageNet-1k dataset while preserving the state-of-the-art test accuracy. Compared to the baseline of a previous study from a group of researchers at Facebook, our approach shows higher test accuracy on batch sizes that are larger than 16K. Using 2,048 Intel Xeon Platinum 8160 processors, we reduce the 100-epoch AlexNet training time from hours to 11 minutes. With 2,048 Intel Xeon Phi 7250 Processors, we reduce the 90-epoch ResNet-50 training time from hours to 20 minutes. Our implementation is open source and has been released in the Intel distribution of Caffe v1.0.7.
TL;DR: The Neural Cache architecture as mentioned in this paper re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for deep neural networks, which is capable of fully executing convolutional, fully connected, and pooling layers in-cache.
Abstract: This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks. Techniques to do in-situ arithmetic in SRAM arrays, create efficient data mapping and reducing data movement are proposed. The Neural Cache architecture is capable of fully executing convolutional, fully connected, and pooling layers in-cache. The proposed architecture also supports quantization in-cache. Our experimental results show that the proposed architecture can improve inference latency by 18.3X over state-of-art multi-core CPU (Xeon E5), 7.7X over server class GPU (Titan Xp), for Inception v3 model. Neural Cache improves inference throughput by 12.4X over CPU (2.2X over GPU), while reducing power consumption by 50% over CPU (53% over GPU).
TL;DR: OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM).
Abstract: Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm2.
TL;DR: This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks, and shows that the proposed architecture can improve inference latency and reduce power consumption.
Abstract: This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks. Techniques to do in-situ arithmetic in SRAM arrays, create efficient data mapping and reducing data movement are proposed. The Neural Cache architecture is capable of fully executing convolutional, fully connected, and pooling layers in-cache. The proposed architecture also supports quantization in-cache. Our experimental results show that the proposed architecture can improve inference latency by 18.3x over state-of-art multi-core CPU (Xeon E5), 7.7x over server class GPU (Titan Xp), for Inception v3 model. Neural Cache improves inference throughput by 12.4x over CPU (2.2x over GPU), while reducing power consumption by 50% over CPU (53% over GPU).
TL;DR: It is demonstrated that despite the relatively small amount of DRAM, GraFBoost achieves high performance with very large graphs no other system can handle, and rivals the performance of the fastest software platforms on sizes of graphs that existing platforms can handle.
Abstract: We describe GraFBoost, a flash-based architecture with hardware acceleration for external analytics of multi-terabyte graphs. We compare the performance of GraFBoost with 1 GB of DRAM against various state-of-the-art graph analytics software including FlashGraph, running on a 32-thread Xeon server with 128 GB of DRAM. We demonstrate that despite the relatively small amount of DRAM, GraFBoost achieves high performance with very large graphs no other system can handle, and rivals the performance of the fastest software platforms on sizes of graphs that existing platforms can handle. Unlike in-memory and semi-external systems, GraFBoost uses a constant amount of memory for all problems, and its performance decreases very slowly as graph sizes increase, allowing GraFBoost to scale to much larger problems than possible with existing systems while using much less resources on a single-node system. The key component of GraFBoost is the sort-reduce accelerator, which implements a novel method to sequentialize fine-grained random accesses to flash storage. The sort-reduce accelerator logs random update requests and then uses hardware-accelerated external sorting with interleaved reduction functions. GraFBoost also stores newly updated vertex values generated in each superstep of the algorithm lazily with the old vertex values to further reduce I/O traffic. We evaluate the performance of GraFBoost for PageRank, breadth-first search and betweenness centrality on our FPGA-based prototype (Xilinx VC707 with 1 GB DRAM and 1 TB flash) and compare it to other graph processing systems including a pure software implementation of GrapFBoost.
TL;DR: This work presents a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that includes support for both traditional single precision floating point and reduced precision workloads, and illustrates that reduced precision representations such as binary achieve the best performance.
Abstract: General Matrix to Matrix multiplication (GEMM) is the cornerstone for a wide gamut of applications in high performance computing (HPC), scientific computing (SC) and more recently, deep learning. In this work, we present a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that includes support for both traditional single precision floating point and reduced precision workloads. Our framework supports arbitrary size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software and (2) a highly customizable hardware template. The API provides both compile and runtime options for controlling key aspects of the hardware template including dynamic precision switching; interleaving and block size control; and fused deep learning specific operations. The framework currently supports single precision floating point (FP32), 16, 8, 4 and 2 bit Integer and Fixed Point (INT16, INT8, INT4, INT2) and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary, BinaryxBinary. We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet and ResNet) we illustrate that reduced precision representations such as binary achieve the best performance, and that the HARPv2 enables fine-grained partitioning of computations over both the Xeon and FPGA. We observe up to 50x improvement in execution time compared to single precision floating point, and that runtime configuration options can improve the efficiency of certain layers in AlexNet up to 4x, achieving an overall 1.3x improvement over the entire network.
TL;DR: This paper introduces direct convolution kernels for x86 architectures, in particular for Xeon and Xeon Phi systems, which are implemented via a dynamic compilation approach, and demonstrates how these JIT-optimized kernels can be integrated into a light-weight multi-node graph execution model.
Abstract: Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs) which provide state-of-the-art results for tasks like image recognition, neural machine translation and speech recognition. The computationally expensive nature of a convolution operation has led to the proliferation of implementations including matrix-matrix multiplication formulation, and direct convolution primarily targeting GPUs. In this paper, we introduce direct convolution kernels for x86 architectures, in particular for Xeon and Xeon Phi systems, which are implemented via a dynamic compilation approach. Our JIT-based implementation shows close to theoretical peak performance, depending on the setting and the CPU architecture at hand. We additionally demonstrate how these JIT-optimized kernels can be integrated into a light-weight multi-node graph execution model. This illustrates that single- and multi-node runs yield high efficiencies and high image-throughputs when executing state-of-the-art image recognition tasks on CPUs.
TL;DR: It has been observed that the GPU runs faster than the CPU in all tests performed, and in some cases, GPU is 4-5 times faster than CPU, according to the tests performed on GPU server and CPU server.
Abstract: Deep learning approaches are machine learning methods used in many application fields today. Some core mathematical operations performed in deep learning are suitable to be parallelized. Parallel processing increases the operating speed. Graphical Processing Units (GPU) are used frequently for parallel processing. Parallelization capacities of GPUs are higher than CPUs, because GPUs have far more cores than Central Processing Units (CPUs). In this study, benchmarking tests were performed between CPU and GPU. Tesla k80 GPU and Intel Xeon Gold 6126 CPU was used during tests. A system for classifying Web pages with Recurrent Neural Network (RNN) architecture was used to compare performance during testing. CPUs and GPUs running on the cloud were used in the tests because the amount of hardware needed for the tests was high. During the tests, some hyperparameters were adjusted and the performance values were compared between CPU and GPU. It has been observed that the GPU runs faster than the CPU in all tests performed. In some cases, GPU is 4-5 times faster than CPU, according to the tests performed on GPU server and CPU server. These values can be further increased by using a GPU server with more features.
TL;DR: This work focuses on inner loop parallelization for pull engines, which when performed naively leads to a significant increase in conflicting memory writes that must be synchronized, and a scheduler-aware interface for parallel loops that allows us to optimize for the common case in which each thread executes several consecutive iterations.
Abstract: Graph processing engines following either the push-based or pull-based pattern conceptually consist of a two-level nested loop structure. Parallelizing and vectorizing these loops is critical for high overall performance and memory bandwidth utilization. Outer loop parallelization is simple for both engine types but suffers from high load imbalance. This work focuses on inner loop parallelization for pull engines, which when performed naively leads to a significant increase in conflicting memory writes that must be synchronized. Our first contribution is a scheduler-aware interface for parallel loops that allows us to optimize for the common case in which each thread executes several consecutive iterations. This eliminates most write traffic and avoids all synchronization, leading to speedups of up to 50X. Our second contribution is the Vector-Sparse format, which addresses the obstacles to vectorization that stem from the commonly-used Compressed-Sparse data structure. Our new format eliminates unaligned memory accesses and bounds checks within vector operations, two common problems when processing low-degree vertices. Vectorization with Vector-Sparse leads to speedups of up to 2.5X. Our contributions are embodied in Grazelle, a hybrid graph processing framework. On a server equipped with four Intel Xeon E7-4850 v3 processors, Grazelle respectively outperforms Ligra, Polymer, GraphMat, and X-Stream by up to 15.2X, 4.6X, 4.7X, and 66.8X.
TL;DR: A new model-based data partitioning algorithm is proposed, which minimizes the execution time of computations in the parallel execution of the application, and is proved the correctness of the algorithm and its complexity of the cardinality of the input discrete speed functions.
Abstract: Modern HPC platforms have become highly heterogeneous owing to tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to maximize the dominant objectives of performance and energy efficiency. Due to this inherent characteristic, processing elements contend for shared on-chip resources such as Last Level Cache (LLC), interconnect, etc. and shared nodal resources such as DRAM, PCI-E links, etc. This has resulted in severe resource contention and Non-Uniform Memory Access (NUMA) that have posed serious challenges to model and algorithm developers. Moreover, the accelerators feature limited main memory compared to the multicore CPU host and are connected to it via limited bandwidth PCI-E links thereby requiring support for efficient out-of-card execution. To summarize, the complexities (resource contention, NUMA, accelerator-specific limitations, etc.) have introduced new challenges to optimization of data-parallel applications on these platforms for performance. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we formulate the problem of optimization of data-parallel applications on modern heterogeneous HPC platforms for performance. We then propose a new model-based data partitioning algorithm, which minimizes the execution time of computations in the parallel execution of the application. This algorithm takes as input a set of $p$ discrete speed functions corresponding to $p$ available heterogeneous processors. It does not make any assumptions about the shapes of these functions. We prove the correctness of the algorithm and its complexity of $O(m^3 \times p^3)$ , where $m$ is the cardinality of the input discrete speed functions. We experimentally demonstrate the optimality and efficiency of our algorithm using two data-parallel applications, matrix multiplication and fast Fourier transform, on a heterogeneous cluster of nodes where each node contains an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor.
TL;DR: An analytic multi-core processor-performance model that takes as inputs a parametric microarchitecture-independent characterization of the target workload, and a hardware configuration of the core and the memory hierarchy makes automated design-space exploration of exascale systems possible, which is shown using a real-world case study from radio astronomy.
Abstract: Simulators help computer architects optimize system designs. The limited performance of simulators even of moderate size and detail makes the approach infeasible for design-space exploration of future exascale systems. Analytic models, in contrast, offer very fast turn-around times. In this paper we propose an analytic multi-core processor-performance model that takes as inputs a) a parametric microarchitecture-independent characterization of the target workload, and b) a hardware configuration of the core and the memory hierarchy. The processor-performance model considers instruction-level parallelism (ILP) per type, models single instruction, multiple data (SIMD) features, and considers cache and memory-bandwidth contention between cores. We validate our model by comparing its performance estimates with measurements from hardware performance counters on Intel Xeon and ARM Cortex-A15 systems. We estimate multi-core contention with a maximum error of 11.4 percent. The average single-thread error increases from 25 percent for a state-of-the-art simulator to 59 percent for our model, but the correlation is still 0.8, a high relative accuracy, while we achieve a speedup of several orders of magnitude. With a much higher capacity than simulators and more reliable insights than back-of-the-envelope calculations it makes automated design-space exploration of exascale systems possible, which we show using a real-world case study from radio astronomy.
TL;DR: RoboX is created, an end-to-end solution which exposes a high-level domain-specific language to roboticists which allows roboticists to express the physics of the robot and its task in a form close to its concise mathematical expressions.
Abstract: Novel algorithmic advances have paved the way for robotics to transform the dynamics of many social and enterprise applications To achieve true autonomy, robots need to continuously process and interact with their environment through computationally-intensive motion planning and control algorithms under a low power budget Specialized architectures offer a potent choice to provide low-power, high-performance accelerators for these algorithms Instead of taking a traditional route which profiles and maps hot code regions to accelerators, this paper delves into the algorithmic characteristics of the application domain We observe that many motion planning and control algorithms are formulated as a constrained optimization problems solved online through Model Predictive Control (MPC) While models and objective functions differ between robotic systems and tasks, the structure of the optimization problem and solver remain fixed Using this theoretical insight, we create RoboX, an end-to-end solution which exposes a high-level domain-specific language to roboticists This interface allows roboticists to express the physics of the robot and its task in a form close to its concise mathematical expressions The RoboX backend then automatically maps this high-level specification to a novel programmable architecture, which harbors a programmable memory access engine and compute-enabled interconnects Hops in the interconnect are augmented with simple functional units that either operate on in-fight data or are bypassed according a micro-program Evaluations with six different robotic systems and tasks show that RoboX provides a 294 x (73 x) speedup and 221 x (794 x) performance-per-watt improvement over an ARM Cortex A57 (Intel Xeon E3) Compared to GPUs, RoboX attains 78x, 655x, and 718x higher Performance-per-Watt to Tegra X2, GTX 650 Ti, and Tesla K40 with a power envelope of only 34 Watts at 45 nm
TL;DR: This work implements a key kernel from a material science application using OpenMP 3.0, OpenMP 4.5, OpenACC, and CUDA on Intel architectures, Xeon and Xeon Phi, and NVIDIA GPUs, P100 and V100 to achieve high performance.
Abstract: In recent years, the HPC landscape has shifted away from traditional multi-core CPU systems to energy-efficient architectures, such as many-core CPUs and accelerators like GPUs, to achieve high performance. The goal of performance portability is to enable developers to rapidly produce applications which can run efficiently on a variety of these architectures, with little to no architecture specific code adoptions required. We implement a key kernel from a material science application using OpenMP 3.0, OpenMP 4.5, OpenACC, and CUDA on Intel architectures, Xeon and Xeon Phi, and NVIDIA GPUs, P100 and V100. We will compare the performance of the OpenMP 4.5 implementation with that of the more architecture-specific implementations, examine the performance of the OpenMP 4.5 implementation on CPUs after back-porting, and share our experience optimizing large reduction loops, as well as discuss the latest compiler status for OpenMP 4.5 and OpenACC.
TL;DR: This paper presents the efficient inference techniques of IntelCaffe, the first Intel(R) optimized deep learning framework that supports efficient 8-bit low precision inference and model optimization techniques of convolutional neural networks on Intel (R) Xeon( R) Scalable Processors.
Abstract: High throughput and low latency inference of deep neural networks are critical for the deployment of deep learning applications. This paper presents the efficient inference techniques of IntelCaffe, the first Intel(R) optimized deep learning framework that supports efficient 8-bit low precision inference and model optimization techniques of convolutional neural networks on Intel(R) Xeon(R) Scalable Processors. The 8-bit optimized model is automatically generated with a calibration process from FP32 model without the need of fine-tuning or retraining. We show that the inference throughput and latency with ResNet-50, Inception-v3 and SSD are improved by 1.38X-2.9X and 1.35X-3X respectively with neglectable accuracy loss from IntelCaffe FP32 baseline and by 56X-75X and 26X-37X from BVLC Caffe. All these techniques have been open-sourced on IntelCaffe GitHub (https://github.com/intel/caffe), and the artifact is provided to reproduce the result on Amazon AWS Cloud.
TL;DR: The proposed clfB-tree—a B-tree structure whose tree node fits in a single cache line— achieves atomicity and consistency via in-place update, which requires maximum four cache line flushes.
Abstract: Emerging byte-addressable non-volatile memory (NVRAM) is expected to replace block device storages as an alternative low-latency persistent storage device. If NVRAM is used as a persistent storage device, a cache line instead of a disk page will be the unit of data transfer, consistency, and durability.In this work, we design and develop clfB-tree—a B-tree structure whose tree node fits in a single cache line. We employ existing write combining store buffer and restricted transactional memory to provide a failure-atomic cache line write operation. Using the failure-atomic cache line write operations, we atomically update a clfB-tree node via a single cache line flush instruction without major changes in hardware. However, there exist many processors that do not provide SW interface for transactional memory. For those processors, our proposed clfB-tree achieves atomicity and consistency via in-place update, which requires maximum four cache line flushes. We evaluate the performance of clfB-tree on an NVRAM emulation board with ARM Cortex A-9 processor and a workstation that has Intel Xeon E7-4809 v3 processor. Our experimental results show clfB-tree outperforms wB-tree and CDDS B-tree by a large margin in terms of both insertion and search performance.
TL;DR: CompASS (COMputing Platform for Adaptive opticS System) is an end-to-end AO simulation soft¬ware based upon Nvidia's Compute Unified Device Architecture (CUDA) toolkit, which shows a speed up factor of up to 362 as compared to YAO, a feature-equivalent software with multi¬threaded CPU architecture.
Abstract: For ground-based astronomical telescopes, Image resolution is directly dependent on aperture diameter. Practically however, atmospheric turbulence disturbs incoming wavefronts and sets dramatic limitations on imaging quality, which issue led to using Adaptive Optics (AO) systems to compensate wavefront disturbance and achieve optimal resolution. With the arrival of Extremely Large Telescopes, with aperture diameter above 20 meters, High Performance Computing (HPC) techniques are required to simulate and operate AO systems. With the advent of General Purpose Graphics Processing Units (GPGPU), many-core architectures of GPUs can be leveraged to accelerate AO simulation software. COMPASS (COMputing Platform for Adaptive opticS System) is an end-to-end AO simulation soft¬ware based upon Nvidia's Compute Unified Device Architecture (CUDA) toolkit. Simulations performed on state-of-the-art Nvidia DGX-1 servers equipped with 8 Tesla V100-SXM2 and Dual 20- core Intel Xeon E5-2698 v4 show a speed up factor of up to 362 as compared to YAO, a feature-equivalent software with multi¬threaded CPU architecture. The COMPASS implementation ensures compatibility with any Nvidia architecture. A speed up by 5 was measured between a DGX-1 and a comparable server with 6-year old GPUs, with no line of code changed.
TL;DR: This initial evaluation demonstrates that the Emu Chick uses available memory bandwidth more efficiently than a more traditional, cache-based architecture and provides stable, predictable performance with 80% bandwidth utilization on a random-access pointer chasing benchmark with weak locality.
Abstract: The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each memory read. The current prototype hardware uses FPGAs to implement cache-less "Gossamer" cores for doing computational work and a stationary core to run basic operating system functions and migrate threads between nodes. In this initial characterization of the Emu Chick, we study the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix vector multiply. We compare the Emu Chick hardware to architectural simulation and Intel Xeon-based platforms. While it is difficult to accurately compare prototype hardware with existing systems, our initial evaluation demonstrates that the Emu Chick uses available memory bandwidth more efficiently than a more traditional, cache-based architecture. Moreover, the Emu Chick provides stable, predictable performance with 80% bandwidth utilization on a random-access pointer chasing benchmark with weak locality.
TL;DR: This paper describes the experiences with hardening the king of middleboxes - Intrusion Detection Systems (IDS) - using Intel Software Guard Extensions (Intel SGX) technology, and develops SEC-IDS, an unmodified Snort 3 with a DPDK network layer that achieves 10Gbps line rate.
Abstract: Network Function Virtualization (NFV) promises the benefits of reduced infrastructure, personnel, and management costs by outsourcing network middleboxes to the public or private cloud Unfortunately, running network functions in the cloud entails security challenges, especially for complex stateful services In this paper, we describe our experiences with hardening the king of middleboxes - Intrusion Detection Systems (IDS) - using Intel Software Guard Extensions (Intel SGX) technology Our IDS secured using Intel SGX, called SEC-IDS, is an unmodified Snort 3 with a DPDK network layer that achieves 10Gbps line rate SEC-IDS guarantees computational integrity by running all Snort code inside an Intel SGX enclave At the same time, SEC-IDS achieves near-native performance, with throughput close to 100 percent of vanilla Snort 3, by retaining network I/O outside of the enclave Our experiments indicate that performance is only constrained by the modest Enclave Page Cache size available on current Intel SGX Skylake based E3 Xeon platforms Finally, we kept the porting effort minimal by using the Graphene-SGX library OS Only 27 Lines of Code (LoC) were modified in Snort and 178 LoC in Graphene-SGX itself
TL;DR: It is demonstrated that bricks are performance-portable across CPU and GPU architectures and afford performance advantages over various tiling strategies, particularly for modern multi-stencil and high-order stencil computations.
Abstract: Achieving high performance on stencil computations poses a number of challenges on modern architectures. The optimization strategy varies significantly across architectures, types of stencils, and types of applications. The standard approach to adapting stencil computations to different architectures, used by both compilers and application programmers, is through the use of iteration space tiling, whereby the data footprint of the computation and its computation partitioning are adjusted to match the memory hierarchy and available parallelism of different platforms. In this paper, we explore an alternative performance portability strategy for stencils, a data layout library for stencils called bricks, that adapts data footprint and parallelism through fine-grained data blocking. Bricks are designed to exploit the inherent multi-dimensional spatial locality of stencils, facilitating improved code generation that can adapt to CPUs or GPUs, and reducing pressure on the memory system. We demonstrate that bricks are performance-portable across CPU and GPU architectures and afford performance advantages over various tiling strategies, particularly for modern multi-stencil and high-order stencil computations. For a range of stencil computations, we achieve high performance on both the Intel Knights Landing (Xeon Phi) and Skylake (Xeon) CPUs as well as the NVIDIA P100 (Pascal) GPU delivering up to a 5x speedup against tiled code.
TL;DR: In this paper, the authors present IntelCaffe, the first Intel optimized deep learning framework that supports efficient 8-bit low precision inference and model optimization techniques of convolutional neural networks on Intel Xeon Scalable Processors.
Abstract: High throughput and low latency inference of deep neural networks are critical for the deployment of deep learning applications. This paper presents the efficient inference techniques of IntelCaffe, the first Intel optimized deep learning framework that supports efficient 8-bit low precision inference and model optimization techniques of convolutional neural networks on Intel Xeon Scalable Processors. The 8-bit optimized model is automatically generated with a calibration process from FP32 model without the need of fine-tuning or retraining. We show that the inference throughput and latency with ResNet-50, Inception-v3 and SSD are improved by 1.38X-2.9X and 1.35X-3X respectively with neglectable accuracy loss from IntelCaffe FP32 baseline and by 56X-75X and 26X-37X from BVLC Caffe. All these techniques have been open-sourced on IntelCaffe GitHub1, and the artifact is provided to reproduce the result on Amazon AWS Cloud.
TL;DR: This work characterization results across a wide range of real-world big data applications, and various software stacks demonstrate how the choice of big- versus little-core-based server for energy-efficiency is significantly influenced by the size of data, performance constraints, and presence of accelerator.
Abstract: The rapid growth in data yields challenges to process data efficiently using current high-performance server architectures such as big Xeon cores. Furthermore, physical design constraints, such as power and density, have become the dominant limiting factor for scaling out servers. Low-power embedded cores in servers such as little Atom have emerged as a promising solution to enhance energy-efficiency to address these challenges. Therefore, the question of whether to process the big data applications on big Xeon- or Little Atom-based servers becomes important. In this work, through methodical investigation of power and performance measurements, and comprehensive application-level, system-level, and micro-architectural level analysis, we characterize dominant big data applications on big Xeon- and little Atom-based server architectures. The characterization results across a wide range of real-world big data applications, and various software stacks demonstrate how the choice of big- versus little-core-based server for energy-efficiency is significantly influenced by the size of data, performance constraints, and presence of accelerator. In addition, we analyze processor resource utilization of this important class of applications, such as memory footprints, CPU utilization, and disk bandwidth, to understand their run-time behavior. Furthermore, we perform micro-architecture-level analysis to highlight where improvement is needed in big- and little-core microarchitectures to address their performance bottlenecks.
TL;DR: A systematic series of experiments are described, showing that the increased run times were associated with increased DRAM traffic, that this was caused by increased L2 cache miss rates, and that these were caused by snoop filter evictions.
Abstract: Initial testing of a cluster equipped with Intel Xeon Platinum 8160 processors showed occasional slow single-node HPL benchmark performance — approximately 0.4% of single- node results were more than 10% slower than expected. We describe a systematic series of experiments using simplified benchmarks and hardware performance counters, showing that the increased run times were associated with increased DRAM traffic, that this was caused by increased L2 cache miss rates, and that these were caused by snoop filter evictions. These evictions resulted from associativity conflicts in the snoop filters, and were traced to pathological interactions of data physical addresses with the hash function that distributes addresses across the coherence agents on the processor. For the HPL benchmark, switching from 2MiB hugepages to 1GiB hugepages eliminated the conflict and the associated slowdowns. The snoop filter conflict was later reproduced using a simple array summation kernel, suggesting that other applications could be impacted.
TL;DR: GraphPhi, a new approach to graph processing on emerging Intel Xeon Phi-like architectures, is presented by addressing the restrictions of migrating existing graph processing frameworks on shared-memory multi-core CPUs to this new architecture.
Abstract: Modern parallel architecture design has increasingly turned to throughput-oriented devices to address concerns about energy efficiency and power consumption. However, graph applications cannot tap into the full potential of such architectures because of highly unstructured computations and irregular memory accesses. In this paper, we present GraphPhi, a new approach to graph processing on emerging Intel Xeon Phi-like architectures, by addressing the restrictions of migrating existing graph processing frameworks on shared-memory multi-core CPUs to this new architecture. Specifically, GraphPhi consists of 1) an optimized hierarchically blocked graph representation to enhance the data locality for both edges and vertices within and among threads, 2) a hybrid vertex-centric and edge-centric execution to efficiently find and process active edges, and 3) a uniform MIMD-SIMD scheduler integrated with a lock-free update support to achieve both good thread-level load balance and SIMD-level utilization. Besides, our efficient MIMD-SIMD execution is capable of hiding memory latency by increasing the number of concurrent memory access requests, thus benefiting more from the latest High-Bandwidth Memory technique. We evaluate our GraphPhi on six graph processing applications. Compared to two state-of-the-art shared-memory graph processing frameworks, it results in speedups up to 4X and 35X, respectively.
TL;DR: In this paper, a shared-memory version of the Space-Saving algorithm for the k-majority problem is presented, and a hybrid MPI/OpenMP version is compared against a pure MPI-based version.
Abstract: Summary
Given an array A of n elements and a value 2≤k≤n, a frequent item or k-majority element is an element occurring in A more than n/k times. The k-majority problem requires finding all of the k-majority elements. In this paper, we deal with parallel shared-memory algorithms for frequent items; we present a shared-memory version of the Space Saving algorithm, and we study its behavior with regard to accuracy and performance on many and multi-core processors, including the Intel Phi accelerator. We also investigate a hybrid MPI/OpenMP version against a pure MPI-based version. Through extensive experimental results, we prove that the MPI/OpenMP parallel version of the algorithm significantly enhances the performance of the earlier pure MPI version of the same algorithm. Results also prove that for this algorithm the Intel Phi accelerator does not introduce any improvement with respect to the Xeon octa–core processor.
TL;DR: This work presents a compact and configurable LSTM neural network hardware architecture, and adopts the second-order polynomial to approximate the activation functions in L STM, which balances the computational accuracy and resource utilization.
Abstract: Neural network has been one of the most useful techniques in the area of image analysis and speech recognition in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural networks (RNNs), has widely been implemented on CPUs and GPUs. However, software implementation cannot offer large parallelism for the complicated computation of LSTM, and most of the LSTM hardware implementations proposed yet are intensive and non-configurable. In order to accelerate the computation and reduce the resources consumption, in this work, we present a compact and configurable LSTM neural network hardware architecture. To meet the requirements of different networks, we set a wide array of hardware parameters that can be configured to balance area, power and performance. And we adopt the second-order polynomial to approximate the activation functions in LSTM, which balances the computational accuracy and resource utilization, Implemented on XCZU6EG FPGA running at 238 MHz, our work has a performance of 7.64 GOP/s. Compared to the implementation on Intel Xeon E5-2620 CPU at 2.10 GHz, our parallel hardware architecture achieves $90\times$ speedup for a small network and 25 x speed-up for a large one. The total consumption of resources is 77% less than the state-of-the-art works', which implies the compactness of our work.
TL;DR: The VOLNA-OP2 tsunami model and implementation is presented; a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations.
Abstract: . In this paper, we present the VOLNA-OP2 tsunami model and implementation; a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved in a way that the scientific code is kept separate from various parallel implementations, enabling easy maintainability. It has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 shows an ability to deliver productivity as well as performance and portability to its users across a number of platforms.
TL;DR: Together, the new hardware functions provide up to 1.8x speedup on HPC benchmark codes when compared with the previous generation Haswell processor core, providing much greater utility to a broader range of HPC applications that rely on this class of compute node.
Abstract: Despite significant advances in the porting of scientific applications to novel architectures such as compute-optimized graphics processors, many-core processor/accelerators and, even special-purpose function units, the vast majority of scientific calculations are still performed on high-performance, commodity server processors. Even in the cases of applications which have been ported to new architectures, frequent serial sections still require strong server-class processor cores to compute as fast as possible. In this paper we report on a set of benchmark studies which evaluate Intel's latest Skylake Xeon server processor. Skylake represents a significant change in the Xeon product line with wider SIMD vector units, a redesigned cache architecture, and, an increased number of memory channels. The wider vector units provide 2x improvement for some compute-intensive applications and the combined memory changes can provide close to 2x the memory bandwidth. We evaluate these new hardware features on several HPC-relevant mini-applications and benchmarks, including, STREAM, LULESH, XSBench, HPCG and SW4Lite. Together, the new hardware functions provide up to 1.8x speedup on HPC benchmark codes when compared with the previous generation Haswell processor core, providing much greater utility to a broader range of HPC applications that rely on this class of compute node.
TL;DR: This is the first work that explores the combination of GPU and cluster-based parallel computing with the population-based metaheuristics in association rule mining and shows that the proposed solution outperforms the HPC-based ARM approaches when exploring Webdocs instance.
Abstract: The application of population-based metaheuristics approaches to the association rules mining problem is explored in this paper. The combination of GPU and cluster-based parallel computing techniques is investigated for the purpose of accelerating the process of extracting the correlations between items in sizeable data instances. We propose four parallel-based approaches that benefit from the cluster intensive computing in the generation process and the massively GPU threading. This is by evaluating the association rules in parallel on GPU. To validate the proposed approaches, the most used population-based metaheuristics (GA, PSO, and BSO) have been executed on a cluster of GPUs to solve benchmarks of large and big ARM instances. We used Intel Xeon 64bit quad-core processor E5520 coupled to an Nvidia Tesla C2075 GPU device. The results show that the BSO outperforms GA and PSO. They also show that the proposed solution outperforms the HPC-based ARM approaches when exploring Webdocs instance (the largest instance existing on the web). To our knowledge, this is the first work that explores the combination of GPU and cluster-based parallel computing with the population-based metaheuristics in association rule mining.
TL;DR: Analysis of energy/performance trade-offs with power capping observed on four different modern CPUs for three different parallel applications such as 2D heat distribution, numerical integration and Fast Fourier Transform shows that using enforced power caps the authors can find points of lower than default energy consumption but mostly for desktop and mobile solutions at the cost of increased execution time.
Abstract: In the paper we present extensive results from analyzing energy/performance trade-offs with power capping observed on four different modern CPUs, for three different parallel applications such as 2D heat distribution, numerical integration and Fast Fourier Transform. The CPU tested represent both multi-core type CPUs such as Intel® Xeon®E5, desktop and mobile i7 as well as many-core Intel® Xeon PhiTMx200 but also server, desktop and mobile solutions used widely nowadays. We show that using enforced power caps we can find points of lower than default energy consumption but mostly for desktop and mobile solutions at the cost of increased execution time. We show with particular numbers how energy consumed, power consumption and execution time change for the point of minimum energy used versus the default configuration with no power limit, for each application and each tested CPU.
TL;DR: A library containing interfaces (HCLOOC) that addresses the programming challenges to execution of large instances of data-parallel applications using these accelerators in clouds by employing optimal software pipelines to overlap data transfers between host CPU and the accelerator and computations on the accelerator.
Abstract: Cloud environments today are increasingly featuring hybrid nodes containing multicore CPU processors and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to facilitate easier migration to them of HPC workloads. While virtualization of accelerators in clouds is a leading research challenge, we address the programming challenges that assail execution of large instances of data-parallel applications using these accelerators in this paper. In a typical hybrid node in a cloud, the tight integration of accelerators with multicore CPUs via PCI-E communication links contains inherent limitations such as limited main memory of accelerators and limited bandwidth of the PCI-E communication links. These limitations poses formidable programming challenges to execution of large problem sizes on these accelerators. In this paper, we describe a library containing interfaces (HCLOOC) that addresses these challenges. It employs optimal software pipelines to overlap data transfers between host CPU and the accelerator and computations on the accelerator. It is designed using the fundamental building blocks, which are OpenCL command queues for FPGAs, Intel offload streams for Intel Xeon Phis, and CUDA streams and events that allow concurrent utilization of the copy and execution engines provided in NVidia GPUs. We elucidate the key features of our library using an out-of-core implementation of matrix multiplication of large dense matrices on a hybrid node, an Intel Haswell multicore CPU server hosting three accelerators that includes NVidia K40c GPU, Intel Xeon Phi 3120P, and a Xilinx FPGA. Based on experiments with the GPU, we show that our out-of-core implementation achieves 82% of peak double-precision floating performance of the GPU and a speedup of 2.7 times over the NVidia’s out-of-core matrix multiplication implementation (CUBLAS-XT). We also demonstrate that our implementation exhibits 0% drop in performance when the problem size exceeds the main memory of the GPU. We observe this 0% drop also for our implementation for Intel Xeon Phi and Xilinx FPGA.