Top 1453 papers published in the topic of Central processing unit in 2017

Showing papers on "Central processing unit" published in 2017

Posted Content

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Albert T. Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Christopher Aaron Clark, Jeremy Coriell, Michael J. Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William John Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, D. Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Khaitan Harshit, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andrew Everett Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Michael Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay K. Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon

16 Apr 2017-arXiv: Hardware Architecture

TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X to 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X to 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
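
As a rough sanity check on the peak figure quoted in this abstract, the sketch below multiplies out the MAC count. The 700 MHz clock rate is not stated in the abstract and is an assumption taken from the full paper; counting each MAC as two operations (one multiply and one add) is likewise an assumption about how the TOPS figure is tallied.

```python
# Back-of-the-envelope check of the TPU's quoted peak throughput.
MAC_UNITS = 256 * 256   # 65,536 8-bit MACs (figure from the abstract)
OPS_PER_MAC = 2         # assumption: one multiply plus one add per cycle
CLOCK_HZ = 700e6        # assumption: 700 MHz clock, not given in the abstract


def peak_tera_ops(mac_units, ops_per_mac, clock_hz):
    """Return peak throughput in TeraOps/second."""
    return mac_units * ops_per_mac * clock_hz / 1e12


peak = peak_tera_ops(MAC_UNITS, OPS_PER_MAC, CLOCK_HZ)
print(f"{peak:.2f} TOPS")  # rounds to the 92 TOPS quoted in the abstract
```

Under these assumptions the product lands just under 92 TOPS, which is consistent with the rounded peak the abstract reports.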

4,178 citations

Proceedings Article•10.1145/3079856.3080246•

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi¹, Cliff Young¹, Nishant Patil¹, David A. Patterson¹, Gaurav Agrawal¹, Raminder Bajwa¹, Sarah Bates¹, Suresh Bhatia¹, Nan Boden¹, Albert T. Borchers¹, Rick Boyle¹, Pierre-luc Cantin¹, Clifford Chao¹, Christopher Aaron Clark¹, Jeremy Coriell¹, Michael J. Daley¹, Matt Dau¹, Jeffrey Dean¹, Ben Gelb¹, Tara Vazir Ghaemmaghami¹, Rajendra Gottipati¹, William John Gulland¹, Robert Hagmann¹, C. Richard Ho¹, Doug Hogberg¹, John Hu¹, Robert Hundt¹, D. Hurt¹, Julian Ibarz¹, Aaron Jaffey¹, Alek Jaworski¹, Alexander Kaplan¹, Khaitan Harshit¹, Daniel Killebrew¹, Andy Koch¹, Naveen Kumar¹, Steve Lacy¹, James Laudon¹, James Law¹, Diemthu Le¹, Chris Leary¹, Zhuyuan Liu¹, Kyle Lucke¹, Alan Lundin¹, Gordon MacKean¹, Adriana Maggiore¹, Maire Mahony¹, Kieran Miller¹, Rahul Nagarajan¹, Ravi Narayanaswami¹, Ray Ni¹, Kathy Nix¹, Thomas Norrie¹, Mark Omernick¹, Narayana Penukonda¹, Andrew Everett Phelps¹, Jonathan Ross¹, Matt Ross¹, Amir Salek¹, Emad Samadiani¹, Chris Severn¹, Gregory Sizikov¹, Matthew Snelham¹, Jed Souter¹, Dan Steinberg¹, Andy Swing¹, Mercedes Tan¹, Gregory Michael Thorson¹, Bo Tian¹, Horia Toma¹, Erick Tuttle¹, Vijay K. Vasudevan¹, Richard Walter¹, Walter Wang¹, Eric Wilcox¹, Doe Hyun Yoon¹

Google¹

24 Jun 2017

TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X to 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X to 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

3,848 citations

Journal Article•10.1038/NCOMMS15199•

Face classification using electronic synapses

Peng Yao¹, Huaqiang Wu¹, Bin Gao¹, Sukru Burc Eryilmaz², Xueyao Huang¹, Wenqiang Zhang¹, Qingtian Zhang¹, Ning Deng¹, Luping Shi¹, H-S Philip Wong², He Qian¹

Tsinghua University¹, Stanford University²

12 May 2017-Nature Communications

TL;DR: An analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials is presented and shows bidirectional continuous weight modulation behaviour, consolidating the feasibility of analogue synaptic array and paving the way toward building an energy efficient and large-scale neuromorphic system.

Abstract: Conventional hardware platforms consume a huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow cognitive tasks to be completed more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.

870 citations

Proceedings Article•10.1145/3132747.3132756•

KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC

Bojie Li¹, Zhenyuan Ruan¹, Wencong Xiao¹, Yuanwei Lu¹, Yongqiang Xiong¹, Andrew Putnam¹, Enhong Chen, Lintao Zhang¹

Microsoft¹

14 Oct 2017

TL;DR: KV-Direct is presented, a high performance KVS that leverages programmable NIC to extend RDMA primitives and enable remote direct key-value access to the main host memory, and can achieve near linear scalability with multiple NICs.

Abstract: Performance of in-memory key-value store (KVS) continues to be of great importance as modern KVS goes beyond the traditional object-caching workload and becomes a key infrastructure to support distributed main-memory computation in data centers. Recent years have witnessed a rapid increase of network bandwidth in data centers, shifting the bottleneck of most KVS from the network to the CPU. RDMA-capable NIC partly alleviates the problem, but the primitives provided by RDMA abstraction are rather limited. Meanwhile, programmable NICs become available in data centers, enabling in-network processing. In this paper, we present KV-Direct, a high performance KVS that leverages programmable NIC to extend RDMA primitives and enable remote direct key-value access to the main host memory. We develop several novel techniques to maximize the throughput and hide the latency of the PCIe connection between the NIC and the host memory, which becomes the new bottleneck. Combined, these mechanisms allow a single NIC KV-Direct to achieve up to 180 M key-value operations per second, equivalent to the throughput of tens of CPU cores. Compared with CPU based KVS implementation, KV-Direct improves power efficiency by 3x, while keeping tail latency below 10 μs. Moreover, KV-Direct can achieve near linear scalability with multiple NICs. With 10 programmable NIC cards in a commodity server, we achieve 1.22 billion KV operations per second, which is almost an order-of-magnitude improvement over existing systems, setting a new milestone for a general-purpose in-memory key-value store.

243 citations

Journal Article•10.1515/ITIT-2016-0040•

Cross-architecture bug search in binary executables

Jannik Pewny, Behrad Garmany, Robert Gawlik, Christian Rossow, Thorsten Holz

20 Jan 2017-Information Technology

TL;DR: This paper proposes a system to derive bug signatures for known bugs, and compute semantic hashes for the basic blocks of the binary to find code parts in the binary that behave similarly to the bug signature, effectively revealing code parts that contain the bug.

Abstract: With the general availability of closed-source software for various CPU architectures, there is a need to identify security-critical vulnerabilities at the binary level to perform a vulnerability assessment. Unfortunately, existing bug finding methods fall short in that they i) require source code, ii) only work on a single architecture (typically x86), or iii) rely on dynamic analysis, which is inherently difficult for embedded devices. In this paper, we propose a system to derive bug signatures for known bugs. We then use these signatures to find bugs in binaries that have been deployed on different CPU architectures (e.g., x86 vs. MIPS). The variety of CPU architectures imposes many challenges, such as the incomparability of instruction set architectures between the CPU models. We solve this by first translating the binary code to an intermediate representation, resulting in assignment formulas with input and output variables. We then sample concrete inputs to observe the I/O behavior of basic blocks, which grasps their semantics. Finally, we use the I/O behavior to find code parts that behave similarly to the bug signature, effectively revealing code parts that contain the bug. We have designed and implemented a tool for cross-architecture bug search in executables. Our prototype currently supports three instruction set architectures (x86, ARM, and MIPS) and can find vulnerabilities in buggy binary code for any of these architectures. We show that we can find Heartbleed vulnerabilities, regardless of the underlying software instruction set. Similarly, we apply our method to find backdoors in closed source firmware images of MIPS- and ARM-based routers.
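
The I/O-sampling idea in this abstract can be illustrated with a minimal sketch. It is not the authors' tool: the three "basic blocks" below are hypothetical Python functions standing in for code already lifted to an intermediate representation, and a plain match ratio stands in for the semantic hashing the paper describes.

```python
import random

MASK32 = 0xFFFFFFFF  # model 32-bit registers


# Hypothetical basic blocks, assumed to be lifted from different architectures.
def block_a(a, b):
    # multiply-and-add form: a + 2*b
    return (a + 2 * b) & MASK32


def block_b(a, b):
    # same semantics expressed with a shift, as another ISA might do
    return (a + (b << 1)) & MASK32


def block_c(a, b):
    # unrelated semantics
    return (a ^ b) & MASK32


def io_signature(block, n_samples=64, seed=0):
    """Run the block on a fixed pseudo-random input set and record the outputs."""
    rng = random.Random(seed)  # same seed gives the same inputs for every block
    return tuple(block(rng.getrandbits(32), rng.getrandbits(32))
                 for _ in range(n_samples))


def similarity(sig1, sig2):
    """Fraction of sampled inputs on which two blocks produced equal outputs."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


sig_a, sig_b, sig_c = (io_signature(f) for f in (block_a, block_b, block_c))
print(similarity(sig_a, sig_b))  # syntactically different, semantically equal
print(similarity(sig_a, sig_c))  # semantically different
```

Because every block sees the same sampled inputs, two blocks that compute the same function match on all samples even when their instruction sequences differ, which is the property a cross-architecture search relies on.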

211 citations

Proceedings Article•10.1109/ASAP.2017.7995254•

Parallel Multi Channel convolution using General Matrix Multiplication

Aravind Vasudevan¹, Andrew Anderson¹, David Gregg¹

Trinity College, Dublin¹

10 Jul 2017

TL;DR: In this article, the authors proposed a new approach to MCMK convolution that is based on General Matrix Multiplication (GEMM), but not on im2col, which eliminates the need for data replication on the input.

Abstract: Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally-intensive parts of CNNs are the convolutional layers, which convolve multi-channel images with multiple kernels. A common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. This im2col conversion greatly increases the memory footprint of the input matrix and reduces data locality. In this paper we propose a new approach to MCMK convolution that is based on General Matrix Multiplication (GEMM), but not on im2col. Our algorithm eliminates the need for data replication on the input thereby enabling us to apply the convolution kernels on the input images directly. We have implemented several variants of our algorithm on a CPU processor and an embedded ARM processor. On the CPU, our algorithm is faster than im2col in most cases.
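
For context, the baseline that this abstract argues against can be sketched in a few lines. The code below is the conventional im2col-plus-GEMM approach, not the authors' new algorithm; it assumes stride 1 and no padding, uses pure-Python lists instead of an optimized GEMM library, and all function names are illustrative.

```python
def im2col(image, kh, kw):
    """Expand image[c][y][x] into a (C*kh*kw) x (out_h*out_w) matrix."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    out_h, out_w = H - kh + 1, W - kw + 1
    return [[image[c][y + dy][x + dx]
             for y in range(out_h) for x in range(out_w)]
            for c in range(C) for dy in range(kh) for dx in range(kw)]


def matmul(a, b):
    """Naive GEMM stand-in: (M x K) times (K x N)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conv_im2col(image, kernels):
    """MCMK convolution (cross-correlation form) via im2col and one GEMM."""
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    # Flatten each kernel in the same (c, dy, dx) order used by im2col rows.
    kmat = [[k[c][dy][dx] for c in range(len(k))
             for dy in range(kh) for dx in range(kw)] for k in kernels]
    return matmul(kmat, im2col(image, kh, kw))


def conv_direct(image, kernels):
    """Reference: apply each kernel to the image directly, no replication."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    return [[sum(k[c][dy][dx] * image[c][y + dy][x + dx]
                 for c in range(C) for dy in range(kh) for dx in range(kw))
             for y in range(H - kh + 1) for x in range(W - kw + 1)]
            for k in kernels]


# 2-channel 4x4 image and three 2-channel 3x3 kernels with integer entries.
image = [[[c * 16 + y * 4 + x for x in range(4)] for y in range(4)]
         for c in range(2)]
kernels = [[[[m + c + dy - dx for dx in range(3)] for dy in range(3)]
            for c in range(2)] for m in range(3)]

cols = im2col(image, 3, 3)
print(len(cols) * len(cols[0]), "im2col elements vs", 2 * 4 * 4, "input elements")
print(conv_im2col(image, kernels) == conv_direct(image, kernels))
```

Even in this tiny case the im2col matrix holds 72 values for a 32-value input, which is the memory-footprint growth the abstract refers to.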

181 citations

Journal Article•10.1109/MM.2017.38•

Inside 6th-Generation Intel Core: New Microarchitecture Code-Named Skylake

Jack Doweck¹, Wen-fu Kao¹, Allen Kuan-yu Lu¹, Julius Mandelblat¹, Anirudha Rahatekar¹, Lihu Rappoport¹, Efraim Rotem¹, Ahmad Yasin¹, Adi Yoaz¹

Intel¹

01 Mar 2017-IEEE Micro

TL;DR: The Intel Architecture core delivers higher power efficiency, higher frequency, and a wider dynamic power range, supporting smaller form factors, and offers a rich performance monitoring unit that enhances software developers' ability to optimize their applications.

Abstract: Skylake's core, processor graphics, and system on chip were designed to meet a demanding set of requirements for a wide range of power-performance points. Its coherent fabric was designed to provide high-memory bandwidth from multiple memory sources. Skylake's power management, which includes Intel Speed Shift technology, was designed to provide the largest dynamic power range among prior Intel processors. The Intel Architecture core delivers higher power efficiency, higher frequency, and a wider dynamic power range, supporting smaller form factors. Skylake's Gen9 graphics provides new features designed to maximize energy efficiency and bring the best visual experience for gaming and media. Skylake offers a rich performance monitoring unit that enhances software developers' ability to optimize their applications.

173 citations

Journal Article•10.1109/TCAD.2016.2562920•

Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs

Matthew J. Walker¹, Stephan Diestelhorst, Andreas Hansson, Anup Das¹, Sheng Yang¹, Bashir M. Al-Hashimi¹, Geoff V. Merrett¹

University of Southampton¹

01 Jan 2017-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: A statistically rigorous and novel methodology for building accurate run-time power models using performance monitoring counters (PMCs) for mobile and embedded devices, and how these models make more efficient use of limited training data and better adapt to unseen scenarios by uniquely considering stability is presented.

Abstract: Modern mobile and embedded devices are required to be increasingly energy-efficient while running more sophisticated tasks, causing the CPU design to become more complex and employ more energy-saving techniques. This has created a greater need for fast and accurate power estimation frameworks for both run-time CPU energy management and design-space exploration. We present a statistically rigorous and novel methodology for building accurate run-time power models using performance monitoring counters (PMCs) for mobile and embedded devices, and demonstrate how our models make more efficient use of limited training data and better adapt to unseen scenarios by uniquely considering stability. Our robust model formulation reduces multicollinearity, allows separation of static and dynamic power, and allows a 100× reduction in experiment time while sacrificing only 0.6% accuracy. We present a statistically detailed evaluation of our model, highlighting and addressing the problem of heteroscedasticity in power modeling. We present software implementing our methodology and build power models for ARM Cortex-A7 and Cortex-A15 CPUs, with 3.8% and 2.8% average error, respectively. We model the behavior of the nonideal CPU voltage regulator under dynamic CPU activity to improve modeling accuracy by up to 5.5% in situations where the voltage cannot be measured. To address the lack of research utilizing PMC data from real mobile devices, we also present our data acquisition method and experimental platform software. We support this paper with online resources including software tools, documentation, raw data and further results.

119 citations

Journal Article•10.1145/3079758•

Throughput-Optimized FPGA Accelerator for Deep Convolutional Neural Networks

Zhiqiang Liu¹, Yong Dou¹, Jingfei Jiang¹, Jinwei Xu¹, Shijie Li¹, Yongmei Zhou¹, Yingnan Xu¹

National University of Defense Technology¹

19 Jul 2017-ACM Transactions on Reconfigurable Technology and Systems

TL;DR: A scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration and a systematic design space exploration methodology is put forward to search for the optimal solution that maximizes accelerator throughput under the FPGA constraints.

Abstract: Deep convolutional neural networks (CNNs) have gained great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computation intensive and memory expensive and, hence, are mainly processed on high-performance processors like server CPUs and GPUs. However, there is an increasing demand for high-accuracy or real-time object detection tasks in large-scale clusters or embedded systems, which requires energy-efficient accelerators because of the green computation requirement or the limited battery restriction. Due to the advantages of energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this article, we present an in-depth analysis of computation complexity and the memory footprint of each CNN layer type. Then a scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration. We further put forward a systematic design space exploration methodology to search for the optimal solution that maximizes accelerator throughput under the FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency. Finally, we demonstrate the methodology by optimizing three representative CNNs (LeNet, AlexNet, and VGG-S) on a Xilinx VC709 board. The average performance of the three accelerators is 424.7, 445.6, and 473.4 GOP/s under 100 MHz working frequency, which outperforms the CPU and previous work significantly.

110 citations

Proceedings Article•10.1109/FCCM.2017.37•

Centaur: A Framework for Hybrid CPU-FPGA Databases

Muhsen Owaida¹, David Sidler¹, Kaan Kara¹, Gustavo Alonso²

ETH Zurich¹, Instituto Politécnico Nacional²

30 Jun 2017

TL;DR: Centaur is presented, a framework running on the FPGA that allows the dynamic allocation of FPGA operators to query plans, pipelining these operators among themselves when needed, and the hybrid execution of operator pipelines running on the CPU and the FPGA.

Abstract: Accelerating relational databases in general and SQL in particular has become an important topic given the challenges arising from large data collections and increasingly complex workloads. Most existing work, however, has been focused on either accelerating a single operator (e.g., a join) or in data reduction along the data path (e.g., from disk to CPU). In this paper we focus instead on the system aspects of accelerating a relational engine in hybrid CPU-FPGA architectures. In particular, we present Centaur, a framework running on the FPGA that allows the dynamic allocation of FPGA operators to query plans, pipelining these operators among themselves when needed, and the hybrid execution of operator pipelines running on the CPU and the FPGA. Centaur is fully compatible with relational engines as we demonstrate through its seamless integration with MonetDB, a popular column store database. In the paper, we describe how this integration is achieved, and empirically demonstrate the advantages of such an approach. The main contribution of the paper is to provide a realistic solution for accelerating SQL that is compatible with existing database architectures, thereby opening up the possibilities for further exploration of FPGA based data processing.

90 citations

Proceedings Article•10.1109/HPCA.2017.42•

Design and Analysis of an APU for Exascale Computing

Thiruvengadam Vijayaraghavan, Yasuko Eckert¹, Gabriel H. Loh¹, Michael J. Schulte¹, Mike Ignatowski¹, Bradford M. Beckmann¹, William C. Brantley¹, Joseph L. Greathouse¹, Wei Huang¹, Arun Karunanithi, Onur Kayiran¹, Mitesh R. Meswani¹, Indrani Paul¹, Matthew Poremba¹, Steven Raasch¹, Steven K. Reinhardt¹, Greg Sadowski¹, Vilas Sridharan¹

Advanced Micro Devices¹

6 Feb 2017

TL;DR: This paper describes a conceptual Exascale Node Architecture (ENA), which is the computational building block for an exascale supercomputer, and presents initial experimental analysis to demonstrate the promise of this approach.

Abstract: The challenges to push computing to exaflop levels are difficult given desired targets for memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper presents a vision for an architecture that can be used to construct exascale systems. We describe a conceptual Exascale Node Architecture (ENA), which is the computational building block for an exascale supercomputer. The ENA consists of an Exascale Heterogeneous Processor (EHP) coupled with an advanced memory system. The EHP provides a high-performance accelerated processing unit (CPU+GPU), in-package high-bandwidth 3D memory, and aggressive use of die-stacking and chiplet technologies to meet the requirements for exascale computing in a balanced manner. We present initial experimental analysis to demonstrate the promise of our approach, and we discuss remaining open research challenges for the community.

Proceedings Article•10.1109/ETFA.2017.8247615•

Memory interference characterization between CPU cores and integrated GPUs in mixed-criticality platforms

Roberto Cavicchioli¹, Nicola Capodieci¹, Marko Bertogna¹

University of Modena and Reggio Emilia¹

1 Sep 2017

TL;DR: The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and iGPU, so to motivate the need of novel paradigms for memory centric scheduling mechanisms.

Abstract: Most of today's mixed criticality platforms feature Systems on Chip (SoC) where a multi-core CPU complex (the host) competes with an integrated Graphic Processor Unit (iGPU, the device) for accessing central memory. The multi-core host and the iGPU share the same memory controller, which has to arbitrate data access to both clients through often undisclosed or non-priority driven mechanisms. Such aspect becomes critical when the iGPU is a high performance massively parallel computing complex potentially able to saturate the available DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and iGPU, so to motivate the need of novel paradigms for memory centric scheduling mechanisms. We analyzed different well known and commercially available platforms in order to estimate variations in throughput and latencies within various memory access patterns, both at host and device side.

Proceedings Article•10.1145/3035918.3058746•

doppioDB: A Hardware Accelerated Database

David Sidler¹, Zsolt István¹, Muhsen Owaida¹, Kaan Kara¹, Gustavo Alonso¹

ETH Zurich¹

9 May 2017

TL;DR: This work presents doppioDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs), and evaluates it on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables.

Abstract: Relational databases provide a wealth of functionality to a wide range of applications. Yet, there are tasks for which they are less than optimal, for instance when processing becomes more complex (e.g., matching regular expressions) or the data is less structured (e.g., text or long strings). In this demonstration we show the benefit of using specialized hardware for such tasks and highlight the importance of a flexible, reusable mechanism for extending database engines with hardware-based operators. We present doppioDB which consists of MonetDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs). In our demonstration the HUDFs are used to provide seamless acceleration of two string operators, LIKE and REGEXP_LIKE, and two analytics operators, SKYLINE and SGD (stochastic gradient descent). We evaluate doppioDB on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables. For integration we rely on HUDFs as a unit of scheduling and management on the FPGA. In the demonstration we show the acceleration benefits of hardware operators, as well as their flexibility in accommodating changing workloads.

Proceedings Article•10.1109/PACT.2017.41•

Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU

Wei Han¹, Daniel Mawhirter¹, Bo Wu¹, Matthew Buland

Colorado School of Mines¹

1 Sep 2017

TL;DR: Graphie, a system to efficiently traverse large-scale graphs on a single GPU that stores the vertex attribute data in the GPU memory and streams edge data asynchronously to the GPU for processing, and relies on two renaming algorithms for high performance.

Abstract: Most GPU-based graph systems cannot handle large-scale graphs that do not fit in the GPU memory. The ever-increasing graph size demands a scale-up graph system, which can run on a single GPU with optimized memory access efficiency and well-controlled data transfer overhead. However, existing systems either incur redundant data transfers or fail to use shared memory. In this paper we present Graphie, a system to efficiently traverse large-scale graphs on a single GPU. Graphie stores the vertex attribute data in the GPU memory and streams edge data asynchronously to the GPU for processing. Graphie's high performance relies on two renaming algorithms. The first algorithm renames the vertices so that the source vertices can be easily loaded to the shared memory to reduce global memory accesses. The second algorithm inserts virtual vertices into the vertex set to rename real vertices, which enables the use of a small boolean array to track active partitions. The boolean array also resides in shared memory and can be updated in constant time. The renaming algorithms do not introduce any extra overhead in the GPU memory or graph storage on disk. Graphie's runtime overlaps data transfer with kernel execution and reuses transferred data in the GPU memory. The evaluation of Graphie on 7 real-world graphs with up to 1.8 billion edges demonstrates substantial speedups over X-Stream, a state-of-the-art edge-centric graph processing framework on the CPU, and GraphReduce, an out-of-memory graph processing system on GPUs.

Journal Article•10.1109/TASC.2016.2642049•

High-Speed Operation of Random-Access-Memory-Embedded Microprocessor With Minimal Instruction Set Architecture Based on Rapid Single-Flux-Quantum Logic

Ryo Sato¹, Yuki Hatanaka¹, Yuki Ando², Masamitsu Tanaka¹, Akira Fujimaki¹, Kazuyoshi Takagi², Naofumi Takagi²

Nagoya University¹, Kyoto University²

01 Jun 2017-IEEE Transactions on Applied Superconductivity

TL;DR: The design and experimental results of a rapid single-flux-quantum (RSFQ) bit-serial microprocessor with reduced-size embedded random access memories (RAMs) and with a minimal instruction set, called CORE e2h are presented.

Abstract: We present design and experimental results of a rapid single-flux-quantum (RSFQ) bit-serial microprocessor with reduced-size embedded random access memories (RAMs) and with a minimal instruction set, called CORE e2h. The microprocessors called CORE e series have been developed for demonstrating small-scale program execution, such as loop calculation and sorting, in order to show the first prototype of a stored-program computer using the RSFQ technology. The CORE e2h is the most simplified variation of the CORE e series, which is equipped with only two registers, and can execute 13 instructions. The target clock frequency for bit-serial operation is 50 GHz, while the designed system clock cycle is 2 GHz. We carefully designed every component, implementing functionality using a small number of Josephson junctions with a small footprint. We fabricated several chips of the CORE e2h microprocessor integrated with two 128-bit shift-register-based RAMs on the same die. We experimentally obtained correct operations for all the instructions, and confirmed high-speed transfer between the instruction memory and controller unit and between the data memory and datapath at around 50 GHz.

Proceedings Article

Garaph: efficient GPU-accelerated graph processing on a single machine with balanced replication

Lingxiao Ma¹, Zhi Yang¹, Chen Han¹, Jilong Xue², Yafei Dai¹

Peking University¹, Microsoft²

12 Jul 2017

TL;DR: The evaluation with six widely used graph applications on seven real-world graphs shows that Garaph significantly outperforms existing state-of-art CPU-based and GPU-based graph processing systems, getting up to 5.36× speedup over the fastest among them.

Abstract: Recent advances in storage (e.g., DDR4, SSD, NVM) and accelerators (e.g., GPU, Xeon-Phi, FPGA) provide the opportunity to efficiently process large-scale graphs on a single machine. In this paper, we present Garaph, a GPU-accelerated graph processing system on a single machine with secondary storage as memory extension. Garaph is novel in three ways. First, Garaph proposes a vertex replication degree customization scheme that maximizes the GPU utilization given vertices' degrees and space constraints. Second, Garaph adopts a balanced edge-based partition ensuring work balance over CPU threads, and also a hybrid of notify-pull and pull computation models optimized for fast graph processing on the CPU. Third, Garaph uses a dynamic workload assignment scheme which takes into account both characteristics of processing elements and graph algorithms. Our evaluation with six widely used graph applications on seven real-world graphs shows that Garaph significantly outperforms existing state-of-art CPU-based and GPU-based graph processing systems, getting up to 5.36× speedup over the fastest among them.
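
As a rough illustration of the balanced edge-based partition idea mentioned in the abstract, the sketch below cuts a vertex range into contiguous blocks that hold roughly equal numbers of edges, so that each CPU thread gets a similar amount of work. The function name, the per-vertex degree list, and the greedy cut rule are assumptions made for this example; they are not taken from Garaph itself.

```python
def edge_balanced_partition(degrees, num_parts):
    """Split vertices 0..n-1 into contiguous blocks with roughly equal edge counts.

    degrees[v] is the number of edges owned by vertex v (hypothetical input).
    Returns a list of (start, end) half-open vertex ranges, one per block.
    Blocks can be empty when a few vertices own most of the edges.
    """
    total_edges = sum(degrees)
    parts = []
    start = 0
    edges_seen = 0
    for v, deg in enumerate(degrees):
        edges_seen += deg
        # Cumulative edge count at which the current block should be closed.
        target = total_edges * (len(parts) + 1) / num_parts
        if edges_seen >= target and len(parts) < num_parts - 1:
            parts.append((start, v + 1))
            start = v + 1
    # The last block takes whatever vertices remain.
    parts.append((start, len(degrees)))
    return parts


blocks = edge_balanced_partition([4, 1, 1, 1, 1, 4, 2, 2], num_parts=4)
```

Cutting by edge count rather than vertex count keeps a block of a few high-degree vertices from taking much longer than a block of many low-degree ones.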

Journal Article•10.1016/J.JPDC.2016.12.023•

A hybrid computing method of SpMV on CPU–GPU heterogeneous computing systems

Yang Wangdong¹, Yang Wangdong², Kenli Li¹, Keqin Li¹, Keqin Li³•Institutions (3)

Hunan University¹, Hunan City University², State University of New York System³

01 Jun 2017-Journal of Parallel and Distributed Computing

TL;DR: An optimization strategy of sparse matrix partitioning using a distribution function is proposed to improve the computing performance of SpMV on the heterogeneous computing platform and the experimental results on two test machines demonstrate noticeable performance improvement.

Journal Article•10.1088/1742-6596/837/1/012017•

AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics

Shahab Fatemi¹, Shahab Fatemi², Andrew R. Poppe¹, Andrew R. Poppe², Gregory T. Delory², Gregory T. Delory¹, William M. Farrell³, William M. Farrell²•Institutions (3)

University of California, Berkeley¹, NASA Lunar Science Institute², Goddard Space Flight Center³

1 May 2017

TL;DR: The AMITIS energy conservation is examined and it is shown that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.

Abstract: We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations, are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ~10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday's Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.

Proceedings Article•10.23919/FPL.2017.8056784•

Scalable inference of decision tree ensembles: Flexible design for CPU-FPGA platforms

Muhsen Owaida¹, Hantian Zhang¹, Ce Zhang¹, Gustavo Alonso¹•Institutions (1)

ETH Zurich¹

1 Jan 2017

TL;DR: This paper presents an FPGA tree ensemble classifier together with a software driver to efficiently manage the FPGA's memory resources, delivering up to 20× speedup over a 10-threaded CPU implementation when fully processing the tree ensemble on the FPGA.

Abstract: Decision tree ensembles are commonly used in a wide range of applications and becoming the de facto algorithm for decision tree based classifiers. Different trees in an ensemble can be processed in parallel during tree inference, making them a suitable use case for FPGAs. Large tree ensembles, however, require careful mapping of trees to on-chip memory and management of memory accesses. As a result, existing FPGA solutions suffer from the inability to scale beyond tens of trees and lack the flexibility to support different tree ensembles. In this paper we present an FPGA tree ensemble classifier together with a software driver to efficiently manage the FPGA's memory resources. The classifier architecture efficiently utilizes the FPGA's resources to fit half a million tree nodes in on-chip memory, delivering up to 20× speedup over a 10-threaded CPU implementation when fully processing the tree ensemble on the FPGA. It can also combine the CPU and FPGA to scale to tree ensembles that do not fit in on-chip memory, achieving up to an order of magnitude speedup compared to a pure CPU implementation. In addition, the classifier architecture can be programmed at runtime to process varying tree ensemble sizes.
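
To make the parallelism claim concrete, here is a minimal software sketch of ensemble inference over trees stored as flat node arrays. The node layout, the function names, and the choice to sum per-tree outputs are assumptions for illustration; the paper's on-chip memory mapping is not described in enough detail here to reproduce.

```python
def predict_tree(nodes, sample):
    """Walk one tree stored as a flat list of node tuples (hypothetical layout).

    Internal node: (feature_index, threshold, left_index, right_index)
    Leaf node:     (None, value, None, None)
    """
    i = 0
    while True:
        feature, value, left, right = nodes[i]
        if feature is None:
            return value  # reached a leaf
        i = left if sample[feature] <= value else right


def predict_ensemble(trees, sample):
    """Combine per-tree outputs by summation.

    Each tree is traversed independently of the others, which is why
    different trees can be evaluated in parallel.
    """
    return sum(predict_tree(nodes, sample) for nodes in trees)


tree_a = [(0, 0.5, 1, 2), (None, 1.0, None, None), (None, -1.0, None, None)]
tree_b = [(1, 2.0, 1, 2), (None, 0.5, None, None), (None, 2.0, None, None)]
score = predict_ensemble([tree_a, tree_b], [0.3, 3.0])
```

A flat array per tree is one common way to keep node lookups as simple indexed memory reads, which is the kind of access pattern that maps naturally onto on-chip memory.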

Posted Content•10.5194/GMD-2016-307•

Accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) model on Intel Xeon Phi processors

Hui Wang, Huansheng Chen¹, Qizhong Wu², Junming Lin³, Xueshun Chen¹, Xinwei Xie³, Rongrong Wang², Xiao Tang¹, Zifa Wang¹•Institutions (3)

Chinese Academy of Sciences¹, Beijing Normal University², Intel³

22 Feb 2017-Geoscientific Model Development Discussions

TL;DR: This study presented the work of porting and optimizing the GNAQPMS model on the second generation Intel Xeon Phi processor codename “Knights Landing” (KNL), and described the five optimizations applied to the key modules of GNAQPMS – CBM-Z gas chemistry, advection, convection and wet deposition.

Abstract: The GNAQPMS model is the global version of the Nested Air Quality Prediction Modelling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present our work of porting and optimizing the GNAQPMS model on the second generation Intel Xeon Phi processor codename “Knights Landing” (KNL). Compared with the first generation Xeon Phi coprocessor, KNL introduced many new hardware features such as a bootable processor, high performance in-package memory and ISA compatibility with Intel Xeon processor. In particular, we described the five optimizations we applied to the key modules of GNAQPMS – CBM-Z gas chemistry, advection, convection and wet deposition. These optimizations work well on both the KNL 7250 processor and the Intel Xeon processor E5-2697 V4. They include: 1) updating the pure MPI parallel mode to a hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and chemistry modules; 2) fully employing the 512-bit wide vector processing units (VPU) on the KNL platform; 3) reducing unnecessary memory access to improve cache efficiency; 4) reducing thread local storage (TLS) in the CBM-Z gas phase chemistry module to improve its OpenMP performance; 5) changing global communication from interface-file writing/reading to Message Passing Interface (MPI) functions to improve performance and parallel scalability. These optimizations greatly improved GNAQPMS performance. The same optimizations also work well for the Intel Xeon Broadwell processor, specifically, E5-2697v4. Compared with the baseline version of GNAQPMS, the optimized version is 3.34x faster on KNL and 2.39x faster on CPU. Furthermore, the optimized version on KNL runs at 26% lower average power compared to the CPU. Combining the performance and energy improvement, the KNL platform is 47% more efficient compared to the CPU platform. The optimizations also enable much further parallel scalability on both the CPU cluster and the KNL cluster, scaling to 40 CPU nodes and 30 KNL nodes with parallel efficiencies of 70.4% and 42.2%, respectively.

Corrfunc: Blazing fast correlation functions on the CPU

Manodeep Sinha, Lehman H. Garrison

1 Mar 2017

Proceedings Article•10.1109/CIACT.2017.7977382•

A novel approach for CPU utilization on a multicore paradigm using parallel quicksort

Tinku Singh, Durgesh Kumar Srivastava, Alok Aggarwal

1 Feb 2017

TL;DR: The results show that the parallel version of quicksort utilizes the individual CPU cores better than its sequential version, because it exploits more parallelism.

Abstract: Multicore CPU architectures are popular because of their performance. The challenges in a multicore environment are writing effective code that can exploit the parallelism and measuring performance in terms of individual CPU core utilization. Effective multithreaded (parallel) code leads to performance speedup. Various multithreaded applications are being developed nowadays to utilize the CPU cores. In this paper, two tools are developed. The first is a C# console application that measures the performance of the CPU cores individually; performance is measured as the percentage load on each core. The second is a Windows C# application that plots CPU load in percent against time. With both tools, performance is measured while quicksort executes serially and in parallel on a large number of data elements. Experiments are run on dual-core and quad-core CPUs, and the results are tabulated. Comparison graphs are drawn for the running time of quicksort as well as individual CPU core utilization. The results show that the parallel version of quicksort utilizes the individual CPU cores better than its sequential version: it exploits more parallelism, which leads to better CPU utilization.
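
For readers unfamiliar with the technique, a minimal sketch of a parallel quicksort follows. It is written in Python rather than the paper's C#, and the function names, the bucket depth, and the use of a thread pool are assumptions for illustration. The input is first split into ordered buckets with quicksort-style pivots, and the buckets are then sorted concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def quicksort(data):
    """Plain sequential quicksort using a three-way partition."""
    if len(data) < 2:
        return list(data)
    pivot = data[len(data) // 2]
    less = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    greater = [x for x in data if x > pivot]
    return quicksort(less) + equal + quicksort(greater)


def partition_buckets(data, depth):
    """Split data into ordered buckets: every item in a bucket is <= every item in the next."""
    if depth == 0 or len(data) < 2:
        return [data]
    pivot = data[len(data) // 2]
    less = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    greater = [x for x in data if x > pivot]
    return (partition_buckets(less, depth - 1)
            + [equal]
            + partition_buckets(greater, depth - 1))


def parallel_quicksort(data, workers=4, depth=2):
    """Sort the buckets concurrently, then concatenate them in order."""
    buckets = partition_buckets(list(data), depth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_buckets = list(pool.map(quicksort, buckets))
    return [x for bucket in sorted_buckets for x in bucket]


result = parallel_quicksort([5, 3, 8, 1, 9, 2, 7, 3])
```

A thread pool keeps the sketch short, but CPython threads do not execute Python bytecode in parallel, so loading several cores in practice would need a process pool or a runtime with true parallel threads, as in the C# setting the abstract describes.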

Journal Article•10.1145/3005448•

SPARCNet: A Hardware Accelerator for Efficient Deployment of Sparse Convolutional Networks

Adam Page¹, Ali Jafari¹, Colin Shea¹, Tinoosh Mohsenin¹•Institutions (1)

University of Maryland, Baltimore County¹

12 May 2017-ACM Journal on Emerging Technologies in Computing Systems

TL;DR: The proposed SPARCNet, a hardware accelerator for efficient deployment of SPARse Convolutional NETworks, looks to enable deploying networks in embedded, resource-bound settings by both exploiting efficient forms of parallelism inherent in convolutional layers and by exploiting the sparsification and approximation techniques proposed.

Abstract: Deep neural networks have been shown to outperform prior state-of-the-art solutions that often relied heavily on hand-engineered feature extraction techniques coupled with simple classification algorithms. In particular, deep convolutional neural networks have been shown to dominate on several popular public benchmarks such as the ImageNet database. Unfortunately, the benefits of deep networks have yet to be fully exploited in embedded, resource-bound settings that have strict power and area budgets. Graphics processing units (GPUs) have been shown to improve throughput and energy efficiency over central processing units (CPUs) due to their highly parallel architecture, yet they still impose a significant power burden. In a similar fashion, field-programmable gate arrays (FPGAs) can be used to improve performance while further allowing more fine-grained control over implementation to improve efficiency. In order to reduce power and area while still achieving required throughput, classification-efficient network architectures are required in addition to optimal deployment on efficient hardware. In this work, we target both of these enterprises. For the first objective, we analyze simple, biologically inspired reduction strategies that are applied both before and after training. The central theme of the techniques is the introduction of sparsification to help dissolve away the dense connectivity that is often found at different levels in convolutional neural networks. The sparsification techniques include feature compression partition, structured filter pruning, and dynamic feature pruning. Additionally, we explore filter factorization and filter quantization approximation techniques to further reduce the complexity of convolutional layers. In the second contribution, we propose SPARCNet, a hardware accelerator for efficient deployment of SPARse Convolutional NETworks. The accelerator looks to enable deploying networks in such resource-bound settings by both exploiting efficient forms of parallelism inherent in convolutional layers and by exploiting the sparsification and approximation techniques proposed. To demonstrate both contributions, modern deep convolutional network architectures containing millions of parameters are explored within the context of the computer vision dataset CIFAR. Utilizing the reduction techniques, we demonstrate the ability to reduce computation and memory by 60% and 93% with less than 0.03% impact on accuracy when compared to the best baseline network with 93.47% accuracy. The SPARCNet accelerator with different numbers of processing engines is implemented on a low-power Artix-7 FPGA platform. Additionally, the same networks are optimally implemented on a number of embedded commercial-off-the-shelf platforms including NVIDIA's CPU+GPU SoCs TK1 and TX1 and Intel Edison. Compared to NVIDIA's TK1 and TX1, the FPGA-based accelerator obtains 11.8× and 7.5× improvement in energy efficiency while maintaining a classification throughput of 72 images/s. When further compared to a number of recent FPGA-based accelerators, SPARCNet is able to achieve up to 15× improvement in energy efficiency while consuming less than 2 W of total board power at 100 MHz. In addition to improving efficiency, the accelerator has built-in support for sparsification techniques and the ability to perform in-place rectified linear unit (ReLU) activation function, max-pooling, and batch normalization.

Journal Article•10.1175/BAMS-D-15-00278.1•

Parallelization and Performance of the NIM Weather Model on CPU, GPU, and MIC Processors

Mark Govett¹, Jim Rosinski², Jacques Middlecoff², Tom Henderson², Jin Lee¹, Alexander E. MacDonald¹, Ning Wang², Paul Madden³, Julie Schramm², Antonio Duarte³•Institutions (3)

Earth System Research Laboratory¹, Colorado State University², Cooperative Institute for Research in Environmental Sciences³

01 Oct 2017-Bulletin of the American Meteorological Society

TL;DR: The code structure and parallelization of NIM is described using standards-compliant open multiprocessing (OpenMP) and open accelerator (OpenACC) directives to support a single, performance-portable code that runs on CPU, GPU, and MIC systems.

Abstract: The design and performance of the Non-Hydrostatic Icosahedral Model (NIM) global weather prediction model is described. NIM is a dynamical core designed to run on central processing unit (CPU), graphics processing unit (GPU), and Many Integrated Core (MIC) processors. It demonstrates efficient parallel performance and scalability to tens of thousands of compute nodes and has been an effective way to make comparisons between traditional CPU and emerging fine-grain processors. The design of the NIM also serves as a useful guide in the fine-grain parallelization of the finite volume cubed (FV3) model recently chosen by the National Weather Service (NWS) to become its next operational global weather prediction model. This paper describes the code structure and parallelization of NIM using standards-compliant open multiprocessing (OpenMP) and open accelerator (OpenACC) directives. NIM uses the directives to support a single, performance-portable code that runs on CPU, GPU, and MIC systems. Performance r...

Patent•

Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing

Pieter Van Rooyen, Michael Ruehle, Rami Mehio, Gavin Stone, Mark David Hahm, Eric Ojard, Amnon Ptashek

7 Jun 2017

TL;DR: In this article, a system, method and apparatus for executing a bioinformatics analysis on genetic sequence data is provided, which includes one or more of a first integrated circuit, where each first circuit forms a central processing unit (CPU) that is responsive to software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline.

Abstract: A system, method and apparatus for executing a bioinformatics analysis on genetic sequence data is provided. Particularly, a genomics analysis platform for executing a sequence analysis pipeline is provided. The genomics analysis platform includes one or more of a first integrated circuit, where each first integrated circuit forms a central processing unit (CPU) that is responsive to one or more software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline. Additionally, a second integrated circuit is also provided, where each second integrated circuit forms a field programmable gate array (FPGA), the FPGA being configured by firmware to arrange a set of hardwired digital logic circuits that are interconnected by a plurality of physical interconnects to perform a second set of genomic processing steps of the sequence analysis pipeline, the set of hardwired digital logic circuits of each FPGA being arranged as a set of processing engines to perform the second set of genomic processing steps. A shared memory is also provided.

Proceedings Article•10.1109/HOTI.2017.13•

An FPGA Platform for Hyperscalers

Francois Abel¹, Jagath Weerasinghe¹, Christoph Hagleitner¹, Beat Weiss¹, Stephan Paredes•Institutions (1)

IBM¹

1 Aug 2017

TL;DR: An infrastructure which integrates 64 FPGAs (Kintex* UltraScale* XCKU060) from Xilinx* in a 19" × 2U chassis, and provides a bi-sectional bandwidth of 640 Gb/s is described, which turns the FPGA into a disaggregated standalone computing resource that can be deployed at large scale into emerging hyperscale data centers.

Abstract: FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC). They are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency. Meanwhile, DC infrastructures are being redesigned to pack ever more compute capacity into the same volume and power envelopes. This redesign leads to the disaggregation of the server and its resources into a collection of standalone computing, memory, and storage modules. To embrace this evolution, we developed a platform that decouples the FPGA from the CPU of the server by connecting the FPGA directly to the DC network. This proposal turns the FPGA into a disaggregated standalone computing resource that can be deployed at large scale into emerging hyperscale data centers. This paper describes an infrastructure which integrates 64 FPGAs (Kintex* UltraScale* XCKU060) from Xilinx* in a 19" × 2U chassis, and provides a bi-sectional bandwidth of 640 Gb/s. The platform is designed for cost effectiveness and makes use of hot-water cooling for optimized energy efficiency. As a result, a DC rack can fit 16 platforms, for a total of 1024 FPGAs + 16 TB of DDR4 memory.
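
The rack-level totals in the abstract can be checked with a few lines of arithmetic. The per-FPGA memory figure below is derived, not quoted: it assumes the memory is spread evenly across FPGAs and that 1 TB is counted as 1024 GB.

```python
FPGAS_PER_PLATFORM = 64   # from the abstract
PLATFORMS_PER_RACK = 16   # from the abstract
RACK_MEMORY_TB = 16       # from the abstract

fpgas_per_rack = FPGAS_PER_PLATFORM * PLATFORMS_PER_RACK

# Assumption: memory is divided evenly and 1 TB = 1024 GB.
memory_per_fpga_gb = RACK_MEMORY_TB * 1024 / fpgas_per_rack
```

Under those assumptions the figures work out to 1024 FPGAs per rack, matching the abstract, and 16 GB of memory per FPGA.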

Proceedings Article•10.18653/V1/D17-1300•

Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU

Jacob Devlin¹•Institutions (1)

Microsoft¹

1 Sep 2017

TL;DR: This article proposed a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at bottom, followed by a series of stacked fully-connected layers applied at every timestep.

Abstract: Attentional sequence-to-sequence models have become the new standard for machine translation, but one challenge of such models is a significant increase in training and decoding cost compared to phrase-based systems. In this work we focus on efficient decoding, with a goal of achieving accuracy close to the state-of-the-art in neural machine translation (NMT), while achieving CPU decoding speed/throughput close to that of a phrasal decoder. We approach this problem from two angles: First, we describe several techniques for speeding up an NMT beam search decoder, which obtain a 4.4x speedup over a very efficient baseline decoder without changing the decoder output. Second, we propose a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at bottom, followed by a series of stacked fully-connected layers applied at every timestep. This architecture achieves similar accuracy to a deep recurrent model, at a small fraction of the training and decoding cost. By combining these techniques, our best system achieves a very competitive accuracy of 38.3 BLEU on WMT English-French NewsTest2014, while decoding at 100 words/sec on single-threaded CPU. We believe this is the best published accuracy/speed trade-off of an NMT system.

Proceedings Article•10.23919/DATE.2017.7927008•

GPUguard: Towards supporting a predictable execution model for heterogeneous SoC

Björn Forsberg¹, Andrea Marongiu¹, Luca Benini¹•Institutions (1)

ETH Zurich¹

27 Mar 2017

TL;DR: This work presents the ongoing work on GPUguard, a software technique that predictably arbitrates main memory usage in heterogeneous SoCs, and shows that GPUguard is able to reduce the adverse effects of memory sharing, while retaining a high throughput on both the CPU and the accelerator.

Abstract: The deployment of real-time workloads on commercial off-the-shelf (COTS) hardware is attractive, as it reduces the cost and time-to-market of new products. Most modern high-end embedded SoCs rely on a heterogeneous design, coupling a general-purpose multi-core CPU to a massively parallel accelerator, typically a programmable GPU, sharing a single global DRAM. However, because of non-predictable hardware arbiters designed to maximize average or peak performance, it is very difficult to provide timing guarantees on such systems. In this work we present our ongoing work on GPUguard, a software technique that predictably arbitrates main memory usage in heterogeneous SoCs. A prototype implementation for the NVIDIA Tegra TX1 SoC shows that GPUguard is able to reduce the adverse effects of memory sharing, while retaining a high throughput on both the CPU and the accelerator.

Proceedings Article•

DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices

Dawei Li¹, Xiaolong Wang¹, Deguang Kong¹•Institutions (1)

Samsung¹

16 Aug 2017

TL;DR: DeepRebirth is an acceleration framework that reduces the execution time of non-tensor layers (such as pooling and normalization, which have no tensor-like trainable parameters) by merging them with neighboring tensor layers.

Abstract: Deploying deep neural networks on mobile devices is a challenging task. Current model compression methods such as matrix decomposition effectively reduce the deployed model size, but still cannot satisfy real-time processing requirements. This paper first discovers that the major obstacle is the excessive execution time of non-tensor layers such as pooling and normalization without tensor-like trainable parameters. This motivates us to design a novel acceleration framework: DeepRebirth through "slimming" existing consecutive and parallel non-tensor and tensor layers. The layer slimming is executed at different substructures: (a) streamline slimming by merging the consecutive non-tensor and tensor layer vertically; (b) branch slimming by merging non-tensor and tensor branches horizontally. The proposed optimization operations significantly accelerate the model execution and also greatly reduce the run-time memory cost since the slimmed model architecture contains fewer hidden layers. To maximally avoid accuracy loss, the parameters in newly generated layers are learned with layer-wise fine-tuning based on both theoretical analysis and empirical verification. As observed in the experiment, DeepRebirth achieves more than 3x speed-up and 2.5x run-time memory saving on GoogLeNet with only 0.4% drop on top-5 accuracy in ImageNet. Furthermore, by combining with other model compression techniques, DeepRebirth offers an average of 106.3ms inference time on the CPU of Samsung Galaxy S5 with 86.5% top-5 accuracy, 14% faster than SqueezeNet which only has a top-5 accuracy of 80.5%.
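
To show why a normalization layer that follows a tensor layer can collapse into a single layer at all, the sketch below folds an inference-time batch normalization into the preceding linear layer. This is the well-known analytic folding trick, substituted here for illustration; it is not DeepRebirth's method, which according to the abstract learns the merged layer's parameters by layer-wise fine-tuning. All function and parameter names are hypothetical.

```python
import math


def linear(weights, bias, x):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return weights and bias of one linear layer equal to linear followed by batch_norm."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in row])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b


weights = [[1.0, 2.0], [0.5, -1.0]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -0.1], [4.0, 0.25]
x = [3.0, -1.0]

two_layers = batch_norm(linear(weights, bias, x), gamma, beta, mean, var)
folded_w, folded_b = fold_batch_norm(weights, bias, gamma, beta, mean, var)
one_layer = linear(folded_w, folded_b, x)
```

The two results agree up to floating-point rounding, so the merged model runs one layer where the original ran two. Layers such as pooling have no exact fold of this kind, which is presumably where a learned merge becomes necessary.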

Journal Article•10.1109/TC.2017.2710317•

Enhancing Energy Efficiency of Multimedia Applications in Heterogeneous Mobile Multi-Core Processors

Young Geun Kim¹, Minyong Kim¹, Sung Woo Chung¹•Institutions (1)

Korea University¹

01 Nov 2017-IEEE Transactions on Computers

TL;DR: An advanced task scheduler for heterogeneous multi-core processors is proposed, which provides appropriate amount of CPU resources for multimedia applications and saves system-wide energy consumption and improves performance of non-multimedia applications.

Abstract: Recent smart devices have adopted heterogeneous multi-core processors which have high-performance big cores and low-power small cores. Unfortunately, the conventional task scheduler for heterogeneous multi-core processors does not provide an appropriate amount of CPU resources for multimedia applications (whose QoS is important to users), resulting in energy waste; it often executes multimedia applications and non-multimedia applications on the same core. In this paper, we propose an advanced task scheduler for heterogeneous multi-core processors, which provides an appropriate amount of CPU resources for multimedia applications. Our proposed task scheduler isolates multimedia applications from non-multimedia applications at runtime, exploiting the fact that multimedia applications have a specific thread for video/audio playback (to play video/audio, a multimedia application should use a function that generates the specific thread). Since multimedia applications usually require a smaller amount of CPU resources than non-multimedia applications due to dedicated hardware decoders, our proposed task scheduler allocates the former to the small cores and the latter to the big cores. In our experiments on an Android-based development board, our proposed task scheduler saves system-wide (not just CPU) energy consumption by 8.9 percent, on average, compared to the conventional task scheduler, preserving QoS of multimedia applications. In addition, it improves performance of non-multimedia applications by 13.7 percent, on average, compared to the conventional task scheduler.
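
The placement rule described in the abstract can be sketched in a few lines. The `Task` type, the `has_playback_thread` flag, and the cluster names are hypothetical stand-ins; the real scheduler works inside the operating system and detects the playback thread at runtime.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    # Hypothetical flag: the application created a video/audio playback thread.
    has_playback_thread: bool


def assign_cluster(task):
    """Multimedia tasks go to the small cores, everything else to the big cores."""
    return "small" if task.has_playback_thread else "big"


def schedule(tasks):
    """Group task names by the core cluster they are assigned to."""
    placement = {"small": [], "big": []}
    for task in tasks:
        placement[assign_cluster(task)].append(task.name)
    return placement


placement = schedule([
    Task("video_player", True),
    Task("web_browser", False),
    Task("music_player", True),
])
```

Keeping the two kinds of task on separate clusters is what lets the small cores serve playback at low power while the big cores stay available for everything else.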
