Scispace (Formerly Typeset)
Showing papers on "Central processing unit" published in 2017
Posted Content•
In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Albert T. Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Christopher Aaron Clark, Jeremy Coriell, Michael J. Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William John Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, D. Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Khaitan Harshit, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andrew Everett Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Michael Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay K. Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon 
16 Apr 2017-arXiv: Hardware Architecture
TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
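The peak-throughput figure in the abstract above can be sanity-checked with a short calculation. This is a rough sketch, not the authors' derivation: it assumes the 65,536 MACs are arranged as a 256 x 256 array, counts each multiply-accumulate as two operations, and assumes a 700 MHz clock. The clock rate and the two-operations-per-MAC convention are not stated in the abstract and are assumptions here.

```python
# Back-of-the-envelope check of the TPU peak-throughput figure.
# Assumptions (not stated in the abstract): a 700 MHz clock, and one
# multiply-accumulate (MAC) counted as two operations.

MAC_ROWS = 256
MAC_COLS = 256
OPS_PER_MAC = 2      # assumption: one multiply plus one add
CLOCK_HZ = 700e6     # assumption: 700 MHz


def peak_tops(rows: int, cols: int, ops_per_mac: int, clock_hz: float) -> float:
    """Return peak throughput in tera-operations per second."""
    macs = rows * cols
    return macs * ops_per_mac * clock_hz / 1e12


if __name__ == "__main__":
    print(f"MAC units: {MAC_ROWS * MAC_COLS}")
    print(f"peak: {peak_tops(MAC_ROWS, MAC_COLS, OPS_PER_MAC, CLOCK_HZ):.2f} TOPS")
```

Under these assumptions the result is about 91.75 TOPS, which rounds to the 92 TOPS quoted in the abstract.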

4,178 citations

Proceedings Article•10.1145/3079856.3080246•
In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi1, Cliff Young1, Nishant Patil1, David A. Patterson1, Gaurav Agrawal1, Raminder Bajwa1, Sarah Bates1, Suresh Bhatia1, Nan Boden1, Albert T. Borchers1, Rick Boyle1, Pierre-luc Cantin1, Clifford Chao1, Christopher Aaron Clark1, Jeremy Coriell1, Michael J. Daley1, Matt Dau1, Jeffrey Dean1, Ben Gelb1, Tara Vazir Ghaemmaghami1, Rajendra Gottipati1, William John Gulland1, Robert Hagmann1, C. Richard Ho1, Doug Hogberg1, John Hu1, Robert Hundt1, D. Hurt1, Julian Ibarz1, Aaron Jaffey1, Alek Jaworski1, Alexander Kaplan1, Khaitan Harshit1, Daniel Killebrew1, Andy Koch1, Naveen Kumar1, Steve Lacy1, James Laudon1, James Law1, Diemthu Le1, Chris Leary1, Zhuyuan Liu1, Kyle Lucke1, Alan Lundin1, Gordon MacKean1, Adriana Maggiore1, Maire Mahony1, Kieran Miller1, Rahul Nagarajan1, Ravi Narayanaswami1, Ray Ni1, Kathy Nix1, Thomas Norrie1, Mark Omernick1, Narayana Penukonda1, Andrew Everett Phelps1, Jonathan Ross1, Matt Ross1, Amir Salek1, Emad Samadiani1, Chris Severn1, Gregory Sizikov1, Matthew Snelham1, Jed Souter1, Dan Steinberg1, Andy Swing1, Mercedes Tan1, Gregory Michael Thorson1, Bo Tian1, Horia Toma1, Erick Tuttle1, Vijay K. Vasudevan1, Richard Walter1, Walter Wang1, Eric Wilcox1, Doe Hyun Yoon1 •
Google1
24 Jun 2017
TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

3,848 citations

Journal Article•10.1038/NCOMMS15199•
Face classification using electronic synapses

Peng Yao1, Huaqiang Wu1, Bin Gao1, Sukru Burc Eryilmaz2, Xueyao Huang1, Wenqiang Zhang1, Qingtian Zhang1, Ning Deng1, Luping Shi1, H-S Philip Wong2, He Qian1 •
Tsinghua University1, Stanford University2
12 May 2017-Nature Communications
TL;DR: An analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials is presented and shows bidirectional continuous weight modulation behaviour, consolidating the feasibility of analogue synaptic array and paving the way toward building an energy efficient and large-scale neuromorphic system.
Abstract: Conventional hardware platforms consume huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow to complete cognitive tasks more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.

870 citations

Proceedings Article•10.1145/3132747.3132756•
KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC

Bojie Li1, Zhenyuan Ruan1, Wencong Xiao1, Yuanwei Lu1, Yongqiang Xiong1, Andrew Putnam1, Enhong Chen, Lintao Zhang1 •
Microsoft1
14 Oct 2017
TL;DR: KV-Direct is presented, a high performance KVS that leverages programmable NIC to extend RDMA primitives and enable remote direct key-value access to the main host memory, and can achieve near linear scalability with multiple NICs.
Abstract: Performance of in-memory key-value store (KVS) continues to be of great importance as modern KVS goes beyond the traditional object-caching workload and becomes a key infrastructure to support distributed main-memory computation in data centers. Recent years have witnessed a rapid increase of network bandwidth in data centers, shifting the bottleneck of most KVS from the network to the CPU. RDMA-capable NIC partly alleviates the problem, but the primitives provided by RDMA abstraction are rather limited. Meanwhile, programmable NICs become available in data centers, enabling in-network processing. In this paper, we present KV-Direct, a high performance KVS that leverages programmable NIC to extend RDMA primitives and enable remote direct key-value access to the main host memory. We develop several novel techniques to maximize the throughput and hide the latency of the PCIe connection between the NIC and the host memory, which becomes the new bottleneck. Combined, these mechanisms allow a single NIC KV-Direct to achieve up to 180 M key-value operations per second, equivalent to the throughput of tens of CPU cores. Compared with CPU based KVS implementation, KV-Direct improves power efficiency by 3x, while keeping tail latency below 10 μs. Moreover, KV-Direct can achieve near linear scalability with multiple NICs. With 10 programmable NIC cards in a commodity server, we achieve 1.22 billion KV operations per second, which is almost an order-of-magnitude improvement over existing systems, setting a new milestone for a general-purpose in-memory key-value store.

243 citations

Journal Article•10.1515/ITIT-2016-0040•
Cross-architecture bug search in binary executables

Jannik Pewny, Behrad Garmany, Robert Gawlik, Christian Rossow, Thorsten Holz 
20 Jan 2017-Information Technology
TL;DR: This paper proposes a system to derive bug signatures for known bugs, and compute semantic hashes for the basic blocks of the binary to find code parts in the binary that behave similarly to the bug signature, effectively revealing code parts that contain the bug.
Abstract: With the general availability of closed-source software for various CPU architectures, there is a need to identify security-critical vulnerabilities at the binary level to perform a vulnerability assessment. Unfortunately, existing bug finding methods fall short in that they i) require source code, ii) only work on a single architecture (typically x86), or iii) rely on dynamic analysis, which is inherently difficult for embedded devices. In this paper, we propose a system to derive bug signatures for known bugs. We then use these signatures to find bugs in binaries that have been deployed on different CPU architectures (e.g., x86 vs. MIPS). The variety of CPU architectures imposes many challenges, such as the incomparability of instruction set architectures between the CPU models. We solve this by first translating the binary code to an intermediate representation, resulting in assignment formulas with input and output variables. We then sample concrete inputs to observe the I/O behavior of basic blocks, which grasps their semantics. Finally, we use the I/O behavior to find code parts that behave similarly to the bug signature, effectively revealing code parts that contain the bug. We have designed and implemented a tool for cross-architecture bug search in executables. Our prototype currently supports three instruction set architectures (x86, ARM, and MIPS) and can find vulnerabilities in buggy binary code for any of these architectures. We show that we can find Heartbleed vulnerabilities, regardless of the underlying software instruction set. Similarly, we apply our method to find backdoors in closed source firmware images of MIPS- and ARM-based routers.
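The I/O-sampling idea described in the abstract above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the authors' tool: the "lifted" basic blocks are hypothetical Python functions standing in for assignment formulas, registers are modelled as 32-bit values, and blocks are compared by matching sampled outputs directly rather than through semantic hashes.

```python
import random

MASK = 0xFFFFFFFF  # model 32-bit registers


def io_signature(block, n_inputs, samples=64, seed=0):
    """Run `block` on pseudo-random inputs and record the observed outputs."""
    rng = random.Random(seed)  # same seed gives the same inputs for every block
    observed = []
    for _ in range(samples):
        args = [rng.getrandbits(32) for _ in range(n_inputs)]
        observed.append(block(*args) & MASK)
    return tuple(observed)


def similarity(sig_a, sig_b):
    """Fraction of sampled inputs on which two blocks produced equal outputs."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical lifted blocks: the same computation as it might be emitted
# for two different instruction sets, plus an unrelated block.
def block_x86(a, b):
    return (a + b * 4) & MASK      # scaled-index addition


def block_mips(a, b):
    return (a + (b << 2)) & MASK   # shift followed by addition


def block_other(a, b):
    return (a ^ b) & MASK


if __name__ == "__main__":
    signature = io_signature(block_x86, 2)
    print(similarity(signature, io_signature(block_mips, 2)))
    print(similarity(signature, io_signature(block_other, 2)))
```

Because both candidate blocks are evaluated on the same sampled inputs, syntactically different but semantically equal blocks produce identical signatures, which is what makes the comparison independent of the instruction set.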

211 citations

Proceedings Article•10.1109/ASAP.2017.7995254•
Parallel Multi Channel convolution using General Matrix Multiplication

Aravind Vasudevan1, Andrew Anderson1, David Gregg1•
Trinity College, Dublin1
10 Jul 2017
TL;DR: In this article, the authors proposed a new approach to MCMK convolution that is based on General Matrix Multiplication (GEMM), but not on im2col, which eliminates the need for data replication on the input.
Abstract: Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally-intensive parts of CNNs are the convolutional layers, which convolve multi-channel images with multiple kernels. A common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. This im2col conversion greatly increases the memory footprint of the input matrix and reduces data locality. In this paper we propose a new approach to MCMK convolution that is based on General Matrix Multiplication (GEMM), but not on im2col. Our algorithm eliminates the need for data replication on the input thereby enabling us to apply the convolution kernels on the input images directly. We have implemented several variants of our algorithm on a CPU processor and an embedded ARM processor. On the CPU, our algorithm is faster than im2col in most cases.
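To make the im2col baseline mentioned in the abstract above concrete, here is a minimal pure-Python sketch. It shows only the conventional im2col-plus-GEMM approach and its data replication, not the authors' replacement algorithm. It assumes stride 1, no padding, and nested lists as tensors; the function names are illustrative.

```python
def im2col(image, k):
    """Expand a C x H x W image into a (C*k*k) x (OH*OW) column matrix."""
    c, h, w = len(image), len(image[0]), len(image[0][0])
    oh, ow = h - k + 1, w - k + 1
    rows = []
    for ch in range(c):
        for dy in range(k):
            for dx in range(k):
                rows.append([image[ch][y + dy][x + dx]
                             for y in range(oh) for x in range(ow)])
    return rows


def matmul(a, b):
    """Naive GEMM standing in for an optimized library call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv_im2col(image, kernels):
    """MCMK convolution (stride 1, no padding) as one GEMM over im2col output."""
    k = len(kernels[0][0])
    # Flatten each kernel in the same (channel, dy, dx) order used by im2col.
    flat = [[v for ch in kern for row in ch for v in row] for kern in kernels]
    return matmul(flat, im2col(image, k))


def conv_direct(image, kernels):
    """Reference: direct nested-loop convolution with the same output layout."""
    c, h, w = len(image), len(image[0]), len(image[0][0])
    k = len(kernels[0][0])
    oh, ow = h - k + 1, w - k + 1
    return [[sum(kern[ch][dy][dx] * image[ch][y + dy][x + dx]
                 for ch in range(c) for dy in range(k) for dx in range(k))
             for y in range(oh) for x in range(ow)]
            for kern in kernels]


if __name__ == "__main__":
    image = [[[ch * 16 + y * 4 + x for x in range(4)] for y in range(4)]
             for ch in range(2)]
    kernels = [[[[1, 0, -1]] * 3] * 2, [[[0, 1, 0]] * 3] * 2]
    cols = im2col(image, 3)
    print("input elements:", 2 * 4 * 4)
    print("im2col elements:", len(cols) * len(cols[0]))
    print("results match:", conv_im2col(image, kernels) == conv_direct(image, kernels))
```

Even in this tiny case the column matrix holds more elements than the input image, which is the memory-footprint growth the abstract refers to.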

181 citations

Journal Article•10.1109/MM.2017.38•
Inside 6th-Generation Intel Core: New Microarchitecture Code-Named Skylake

Jack Doweck1, Wen-fu Kao1, Allen Kuan-yu Lu1, Julius Mandelblat1, Anirudha Rahatekar1, Lihu Rappoport1, Efraim Rotem1, Ahmad Yasin1, Adi Yoaz1 •
Intel1
01 Mar 2017-IEEE Micro
TL;DR: The Intel Architecture core delivers higher power efficiency, higher frequency, and a wider dynamic power range, supporting smaller form factors, and offers a rich performance monitoring unit that enhances software developers' ability to optimize their applications.
Abstract: Skylake's core, processor graphics, and system on chip were designed to meet a demanding set of requirements for a wide range of power-performance points. Its coherent fabric was designed to provide high-memory bandwidth from multiple memory sources. Skylake's power management, which includes Intel Speed Shift technology, was designed to provide the largest dynamic power range among prior Intel processors. The Intel Architecture core delivers higher power efficiency, higher frequency, and a wider dynamic power range, supporting smaller form factors. Skylake's Gen9 graphics provides new features designed to maximize energy efficiency and bring the best visual experience for gaming and media. Skylake offers a rich performance monitoring unit that enhances software developers' ability to optimize their applications.

173 citations

Journal Article•10.1109/TCAD.2016.2562920•
Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs

Matthew J. Walker1, Stephan Diestelhorst, Andreas Hansson, Anup Das1, Sheng Yang1, Bashir M. Al-Hashimi1, Geoff V. Merrett1 •
University of Southampton1
01 Jan 2017-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
TL;DR: A statistically rigorous and novel methodology for building accurate run-time power models using performance monitoring counters (PMCs) for mobile and embedded devices, and how these models make more efficient use of limited training data and better adapt to unseen scenarios by uniquely considering stability is presented.
Abstract: Modern mobile and embedded devices are required to be increasingly energy-efficient while running more sophisticated tasks, causing the CPU design to become more complex and employ more energy-saving techniques. This has created a greater need for fast and accurate power estimation frameworks for both run-time CPU energy management and design-space exploration. We present a statistically rigorous and novel methodology for building accurate run-time power models using performance monitoring counters (PMCs) for mobile and embedded devices, and demonstrate how our models make more efficient use of limited training data and better adapt to unseen scenarios by uniquely considering stability. Our robust model formulation reduces multicollinearity, allows separation of static and dynamic power, and allows a 100× reduction in experiment time while sacrificing only 0.6% accuracy. We present a statistically detailed evaluation of our model, highlighting and addressing the problem of heteroscedasticity in power modeling. We present software implementing our methodology and build power models for ARM Cortex-A7 and Cortex-A15 CPUs, with 3.8% and 2.8% average error, respectively. We model the behavior of the nonideal CPU voltage regulator under dynamic CPU activity to improve modeling accuracy by up to 5.5% in situations where the voltage cannot be measured. To address the lack of research utilizing PMC data from real mobile devices, we also present our data acquisition method and experimental platform software. We support this paper with online resources including software tools, documentation, raw data and further results.
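As a rough illustration of what a PMC-based power model looks like, the sketch below fits power as a linear function of a single counter rate, so that the intercept plays the role of static power and the slope the dynamic cost per event. This is a deliberate simplification and not the authors' methodology, which uses multiple counters and addresses multicollinearity and heteroscedasticity. The sample values are synthetic and chosen only to make the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_abs_pct_error(predicted, measured):
    """Average of |predicted - measured| / measured, as a percentage."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


if __name__ == "__main__":
    # Synthetic data (not from the paper): counter rate in 1e9 events/s
    # and measured power in watts.
    rates = [0.0, 1.0, 2.0, 3.0, 4.0]
    power = [0.25, 0.75, 1.25, 1.75, 2.25]

    static_w, watts_per_unit = fit_line(rates, power)
    predicted = [static_w + watts_per_unit * r for r in rates]
    print(f"static power: {static_w:.2f} W")
    print(f"dynamic cost: {watts_per_unit:.2f} W per 1e9 events/s")
    print(f"average error: {mean_abs_pct_error(predicted, power):.2f}%")
```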

119 citations

Journal Article•10.1145/3079758•
Throughput-Optimized FPGA Accelerator for Deep Convolutional Neural Networks

Zhiqiang Liu1, Yong Dou1, Jingfei Jiang1, Jinwei Xu1, Shijie Li1, Yongmei Zhou1, Yingnan Xu1 •
National University of Defense Technology1
19 Jul 2017-ACM Transactions on Reconfigurable Technology and Systems
TL;DR: A scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration and a systematic design space exploration methodology is put forward to search for the optimal solution that maximizes accelerator throughput under the FPGA constraints.
Abstract: Deep convolutional neural networks (CNNs) have gained great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computation intensive and memory expensive and, hence, are mainly processed on high-performance processors like server CPUs and GPUs. However, there is an increasing demand of high-accuracy or real-time object detection tasks in large-scale clusters or embedded systems, which requires energy-efficient accelerators because of the green computation requirement or the limited battery restriction. Due to the advantages of energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this article, we present an in-depth analysis of computation complexity and the memory footprint of each CNN layer type. Then a scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration. We further put forward a systematic design space exploration methodology to search for the optimal solution that maximizes accelerator throughput under the FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency. Finally, we demonstrate the methodology by optimizing three representative CNNs (LeNet, AlexNet, and VGG-S) on a Xilinx VC709 board. The average performance of the three accelerators is 424.7, 445.6, and 473.4GOP/s under 100MHz working frequency, which outperforms the CPU and previous work significantly.
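The per-layer analysis of computation and memory footprint mentioned in the abstract above can be sketched for a single convolutional layer. The formulas below are the standard operation and parameter counts for a dense convolution (biases ignored), not the authors' exact model. The example uses the commonly cited dimensions of AlexNet's first convolutional layer, and 32-bit values are an assumption.

```python
def conv_layer_cost(out_channels, in_channels, kernel, out_h, out_w,
                    bytes_per_value=4):
    """Operation count and memory footprint of one dense convolutional layer."""
    weights = out_channels * in_channels * kernel * kernel
    macs = weights * out_h * out_w          # one MAC per weight per output pixel
    return {
        "macs": macs,
        "ops": 2 * macs,                    # multiply and add counted separately
        "weight_bytes": weights * bytes_per_value,
        "output_bytes": out_channels * out_h * out_w * bytes_per_value,
    }


def ideal_seconds(ops, gops_per_second):
    """Lower bound on run time if the accelerator sustained the given rate."""
    return ops / (gops_per_second * 1e9)


if __name__ == "__main__":
    # AlexNet conv1: 96 kernels of 3 x 11 x 11 producing 55 x 55 outputs.
    cost = conv_layer_cost(96, 3, 11, 55, 55)
    for name, value in cost.items():
        print(f"{name}: {value:,}")
    # Assumption: the whole-network average rate applies to this single layer.
    print(f"ideal time at 445.6 GOP/s: {ideal_seconds(cost['ops'], 445.6):.6f} s")
```

Applying a whole-network average throughput to one layer is only a rough bound, since achieved throughput typically varies from layer to layer.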

110 citations

Proceedings Article•10.1109/FCCM.2017.37•
Centaur: A Framework for Hybrid CPU-FPGA Databases

Muhsen Owaida1, David Sidler1, Kaan Kara1, Gustavo Alonso2•
ETH Zurich1, Instituto Politécnico Nacional2
30 Jun 2017
TL;DR: Centaur is presented, a framework running on the FPGA that allows the dynamic allocation of FPGA operators to query plans, pipelining these operators among themselves when needed, and the hybrid execution of operator pipelines running on the CPU and the FPGA.
Abstract: Accelerating relational databases in general and SQL in particular has become an important topic given the challenges arising from large data collections and increasingly complex workloads. Most existing work, however, has been focused on either accelerating a single operator (e.g., a join) or in data reduction along the data path (e.g., from disk to CPU). In this paper we focus instead on the system aspects of accelerating a relational engine in hybrid CPU-FPGA architectures. In particular, we present Centaur, a framework running on the FPGA that allows the dynamic allocation of FPGA operators to query plans, pipelining these operators among themselves when needed, and the hybrid execution of operator pipelines running on the CPU and the FPGA. Centaur is fully compatible with relational engines as we demonstrate through its seamless integration with MonetDB, a popular column store database. In the paper, we describe how this integration is achieved, and empirically demonstrate the advantages of such an approach. The main contribution of the paper is to provide a realistic solution for accelerating SQL that is compatible with existing database architectures, thereby opening up the possibilities for further exploration of FPGA based data processing.

90 citations

Proceedings Article•10.1109/HPCA.2017.42•
Design and Analysis of an APU for Exascale Computing

Thiruvengadam Vijayaraghavan, Yasuko Eckert1, Gabriel H. Loh1, Michael J. Schulte1, Mike Ignatowski1, Bradford M. Beckmann1, William C. Brantley1, Joseph L. Greathouse1, Wei Huang1, Arun Karunanithi, Onur Kayiran1, Mitesh R. Meswani1, Indrani Paul1, Matthew Poremba1, Steven Raasch1, Steven K. Reinhardt1, Greg Sadowski1, Vilas Sridharan1 •
Advanced Micro Devices1
6 Feb 2017
TL;DR: This paper describes a conceptual Exascale Node Architecture (ENA), which is the computational building block for an exascale supercomputer, and presents initial experimental analysis to demonstrate the promise of this approach.
Abstract: The challenges to push computing to exaflop levels are difficult given desired targets for memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper presents a vision for an architecture that can be used to construct exascale systems. We describe a conceptual Exascale Node Architecture (ENA), which is the computational building block for an exascale supercomputer. The ENA consists of an Exascale Heterogeneous Processor (EHP) coupled with an advanced memory system. The EHP provides a high-performance accelerated processing unit (CPU+GPU), in-package high-bandwidth 3D memory, and aggressive use of die-stacking and chiplet technologies to meet the requirements for exascale computing in a balanced manner. We present initial experimental analysis to demonstrate the promise of our approach, and we discuss remaining open research challenges for the community.
Proceedings Article•10.1109/ETFA.2017.8247615•
Memory interference characterization between CPU cores and integrated GPUs in mixed-criticality platforms

Roberto Cavicchioli1, Nicola Capodieci1, Marko Bertogna1•
University of Modena and Reggio Emilia1
1 Sep 2017
TL;DR: The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and iGPU, so to motivate the need of novel paradigms for memory centric scheduling mechanisms.
Abstract: Most of today's mixed criticality platforms feature Systems on Chip (SoC) where a multi-core CPU complex (the host) competes with an integrated Graphic Processor Unit (iGPU, the device) for accessing central memory. The multi-core host and the iGPU share the same memory controller, which has to arbitrate data access to both clients through often undisclosed or non-priority driven mechanisms. Such aspect becomes critical when the iGPU is a high performance massively parallel computing complex potentially able to saturate the available DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and iGPU, so to motivate the need of novel paradigms for memory centric scheduling mechanisms. We analyzed different well known and commercially available platforms in order to estimate variations in throughput and latencies within various memory access patterns, both at host and device side.
Proceedings Article•10.1145/3035918.3058746•
doppioDB: A Hardware Accelerated Database

David Sidler1, Zsolt István1, Muhsen Owaida1, Kaan Kara1, Gustavo Alonso1 •
ETH Zurich1
9 May 2017
TL;DR: This work presents doppioDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs), and evaluates it on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables.
Abstract: Relational databases provide a wealth of functionality to a wide range of applications. Yet, there are tasks for which they are less than optimal, for instance when processing becomes more complex (e.g., matching regular expressions) or the data is less structured (e.g., text or long strings). In this demonstration we show the benefit of using specialized hardware for such tasks and highlight the importance of a flexible, reusable mechanism for extending database engines with hardware-based operators. We present doppioDB which consists of MonetDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs). In our demonstration the HUDFs are used to provide seamless acceleration of two string operators, LIKE and REGEXP_LIKE, and two analytics operators, SKYLINE and SGD (stochastic gradient descent). We evaluate doppioDB on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables. For integration we rely on HUDFs as a unit of scheduling and management on the FPGA. In the demonstration we show the acceleration benefits of hardware operators, as well as their flexibility in accommodating changing workloads.
Proceedings Article•10.1109/PACT.2017.41•
Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU

Wei Han1, Daniel Mawhirter1, Bo Wu1, Matthew Buland•
Colorado School of Mines1
1 Sep 2017
TL;DR: Graphie, a system to efficiently traverse large-scale graphs on a single GPU that stores the vertex attribute data in the GPU memory and streams edge data asynchronously to the GPU for processing, and relies on two renaming algorithms for high performance.
Abstract: Most GPU-based graph systems cannot handle large-scale graphs that do not fit in the GPU memory. The ever-increasing graph size demands a scale-up graph system, which can run on a single GPU with optimized memory access efficiency and well-controlled data transfer overhead. However, existing systems either incur redundant data transfers or fail to use shared memory. In this paper we present Graphie, a system to efficiently traverse large-scale graphs on a single GPU. Graphie stores the vertex attribute data in the GPU memory and streams edge data asynchronously to the GPU for processing. Graphie's high performance relies on two renaming algorithms. The first algorithm renames the vertices so that the source vertices can be easily loaded to the shared memory to reduce global memory accesses. The second algorithm inserts virtual vertices into the vertex set to rename real vertices, which enables the use of a small boolean array to track active partitions. The boolean array also resides in shared memory and can be updated in constant time. The renaming algorithms do not introduce any extra overhead in the GPU memory or graph storage on disk. Graphie's runtime overlaps data transfer with kernel execution and reuses transferred data in the GPU memory. The evaluation of Graphie on 7 real-world graphs with up to 1.8 billion edges demonstrates substantial speedups over X-Stream, a state-of-the-art edge-centric graph processing framework on the CPU, and GraphReduce, an out-of-memory graph processing system on GPUs.
Journal Article•10.1109/TASC.2016.2642049•
High-Speed Operation of Random-Access-Memory-Embedded Microprocessor With Minimal Instruction Set Architecture Based on Rapid Single-Flux-Quantum Logic

Ryo Sato1, Yuki Hatanaka1, Yuki Ando2, Masamitsu Tanaka1, Akira Fujimaki1, Kazuyoshi Takagi2, Naofumi Takagi2 •
Nagoya University1, Kyoto University2
01 Jun 2017-IEEE Transactions on Applied Superconductivity
TL;DR: The design and experimental results of a rapid single-flux-quantum (RSFQ) bit-serial microprocessor with reduced-size embedded random access memories (RAMs) and with a minimal instruction set, called CORE e2h are presented.
Abstract: We present design and experimental results of a rapid single-flux-quantum (RSFQ) bit-serial microprocessor with reduced-size embedded random access memories (RAMs) and with a minimal instruction set, called CORE e2h. The microprocessors called CORE e series have been developed for demonstrating small-scale program execution, such as loop calculation and sorting, in order to show the first prototype of a stored-program computer using the RSFQ technology. The CORE e2h is the most simplified variation of the CORE e series, which is equipped with only two registers, and can execute 13 instructions. The target clock frequency for bit-serial operation is 50 GHz, while the designed system clock cycle is 2 GHz. We carefully designed every component, implementing functionality using a small number of Josephson junctions with a small footprint. We fabricated several chips of the CORE e2h microprocessor integrated with two 128-bit shift-register-based RAMs on the same die. We experimentally obtained correct operations for all the instructions, and confirmed high-speed transfer between the instruction memory and controller unit and between the data memory and datapath at around 50 GHz.
Proceedings Article•
Garaph: efficient GPU-accelerated graph processing on a single machine with balanced replication

Lingxiao Ma1, Zhi Yang1, Chen Han1, Jilong Xue2, Yafei Dai1 •
Peking University1, Microsoft2
12 Jul 2017
TL;DR: The evaluation with six widely used graph applications on seven real-world graphs shows that Garaph significantly outperforms existing state-of-art CPU-based and GPU-based graph processing systems, getting up to 5.36× speedup over the fastest among them.
Abstract: Recent advances in storage (e.g., DDR4, SSD, NVM) and accelerators (e.g., GPU, Xeon-Phi, FPGA) provide the opportunity to efficiently process large-scale graphs on a single machine. In this paper, we present Garaph, a GPU-accelerated graph processing system on a single machine with secondary storage as memory extension. Garaph is novel in three ways. First, Garaph proposes a vertex replication degree customization scheme that maximizes the GPU utilization given vertices' degrees and space constraints. Second, Garaph adopts a balanced edge-based partition ensuring work balance over CPU threads, and also a hybrid of notify-pull and pull computation models optimized for fast graph processing on the CPU. Third, Garaph uses a dynamic workload assignment scheme which takes into account both characteristics of processing elements and graph algorithms. Our evaluation with six widely used graph applications on seven real-world graphs shows that Garaph significantly outperforms existing state-of-art CPU-based and GPU-based graph processing systems, getting up to 5.36× speedup over the fastest among them.
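The idea of an edge-balanced partition, as opposed to a vertex-balanced one, can be sketched as follows. This is an illustrative greedy scheme, not Garaph's actual algorithm: it assumes vertices are kept in their original order and split into contiguous ranges, and the function name and degree list are hypothetical.

```python
def balanced_edge_partition(degrees, parts):
    """Split vertices 0..n-1 into `parts` contiguous ranges with similar edge counts."""
    target = sum(degrees) / parts
    ranges, start, acc = [], 0, 0
    for v, d in enumerate(degrees):
        acc += d
        # Close the current range once the running total reaches its share.
        if len(ranges) < parts - 1 and acc >= target * (len(ranges) + 1):
            ranges.append((start, v + 1))
            start = v + 1
    ranges.append((start, len(degrees)))
    return ranges


def edges_per_range(degrees, ranges):
    """Number of edges owned by each vertex range."""
    return [sum(degrees[a:b]) for a, b in ranges]


if __name__ == "__main__":
    # Hypothetical skewed degree distribution: vertex 0 is a hub.
    degrees = [8, 1, 1, 1, 1, 4, 4, 2, 1, 1]
    ranges = balanced_edge_partition(degrees, 3)
    print(ranges)
    print(edges_per_range(degrees, ranges))
```

With a skewed degree distribution, splitting by vertex count would leave the range holding the hub vertex with far more edges than the others; splitting by cumulative edge count gives each worker thread a similar amount of work.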
Journal Article•10.1016/J.JPDC.2016.12.023•
A hybrid computing method of SpMV on CPU–GPU heterogeneous computing systems

Yang Wangdong1, Yang Wangdong2, Kenli Li1, Keqin Li1, Keqin Li3 •
Hunan University1, Hunan City University2, State University of New York System3
01 Jun 2017-Journal of Parallel and Distributed Computing
TL;DR: An optimization strategy of sparse matrix partitioning using a distribution function is proposed to improve the computing performance of SpMV on the heterogeneous computing platform and the experimental results on two test machines demonstrate noticeable performance improvement.
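To make the idea of splitting SpMV work across two devices concrete, here is a minimal sketch that divides the rows of a CSR matrix by nonzero count and computes each block separately. It does not reproduce the paper's distribution function: the fixed `gpu_share` ratio, the function names, and the use of plain Python for both blocks (standing in for the GPU and CPU parts) are assumptions.

```python
def spmv_csr(indptr, indices, data, x, row_start, row_end):
    """Multiply rows [row_start, row_end) of a CSR matrix by the vector x."""
    y = []
    for r in range(row_start, row_end):
        acc = 0.0
        for k in range(indptr[r], indptr[r + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


def split_row_by_nnz(indptr, gpu_share):
    """Return the first row index at which the leading block holds >= gpu_share of the nonzeros."""
    target = gpu_share * indptr[-1]
    for r in range(len(indptr)):
        if indptr[r] >= target:
            return r
    return len(indptr) - 1


def hybrid_spmv(indptr, indices, data, x, gpu_share=0.75):
    """Compute y = A @ x in two row blocks (hypothetical GPU block first, CPU block second)."""
    num_rows = len(indptr) - 1
    split = split_row_by_nnz(indptr, gpu_share)
    y_gpu = spmv_csr(indptr, indices, data, x, 0, split)         # stand-in for the GPU part
    y_cpu = spmv_csr(indptr, indices, data, x, split, num_rows)  # stand-in for the CPU part
    return y_gpu + y_cpu


# A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form.
indptr, indices, data = [0, 2, 3, 5], [0, 2, 1, 0, 2], [1.0, 2.0, 3.0, 4.0, 5.0]
y = hybrid_spmv(indptr, indices, data, [1.0, 1.0, 1.0], gpu_share=0.5)
```

Splitting by nonzeros rather than by row count keeps the two blocks closer in work when row lengths vary; a real system would also have to account for the differing throughput of the two devices.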
Journal Article•10.1088/1742-6596/837/1/012017•
AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics

Shahab Fatemi1, Shahab Fatemi2, Andrew R. Poppe1, Andrew R. Poppe2, Gregory T. Delory2, Gregory T. Delory1, William M. Farrell3, William M. Farrell2 •
University of California, Berkeley1, NASA Lunar Science Institute2, Goddard Space Flight Center3
1 May 2017
TL;DR: The AMITIS energy conservation is examined and it is shown that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
Abstract: We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ~10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday's Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
Proceedings Article•10.23919/FPL.2017.8056784•
Scalable inference of decision tree ensembles: Flexible design for CPU-FPGA platforms

Muhsen Owaida1, Hantian Zhang1, Ce Zhang1, Gustavo Alonso1•
ETH Zurich1
1 Jan 2017
TL;DR: This paper presents an FPGA tree ensemble classifier together with a software driver to efficiently manage the FPGA's memory resources, delivering up to 20× speedup over a 10-threaded CPU implementation when fully processing the tree ensemble on the FPGA.
Abstract: Decision tree ensembles are commonly used in a wide range of applications and becoming the de facto algorithm for decision tree based classifiers. Different trees in an ensemble can be processed in parallel during tree inference, making them a suitable use case for FPGAs. Large tree ensembles, however, require careful mapping of trees to on-chip memory and management of memory accesses. As a result, existing FPGA solutions suffer from the inability to scale beyond tens of trees and lack the flexibility to support different tree ensembles. In this paper we present an FPGA tree ensemble classifier together with a software driver to efficiently manage the FPGA's memory resources. The classifier architecture efficiently utilizes the FPGA's resources to fit half a million tree nodes in on-chip memory, delivering up to 20× speedup over a 10-threaded CPU implementation when fully processing the tree ensemble on the FPGA. It can also combine the CPU and FPGA to scale to tree ensembles that do not fit in on-chip memory, achieving up to an order of magnitude speedup compared to a pure CPU implementation. In addition, the classifier architecture can be programmed at runtime to process varying tree ensemble sizes.
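The abstract's point that the trees of an ensemble can be evaluated independently can be sketched in software as follows. This is only an illustrative CPU-side model, not the paper's FPGA design: the flat node layout `(feature, threshold, left, right, label)`, the use of `feature == -1` to mark leaves, and majority voting are assumptions.

```python
from collections import Counter


def predict_tree(nodes, sample):
    """Walk one tree stored as a flat list of (feature, threshold, left, right, label) nodes."""
    i = 0
    while nodes[i][0] != -1:  # feature == -1 marks a leaf in this hypothetical layout
        feature, threshold, left, right, _ = nodes[i]
        i = left if sample[feature] <= threshold else right
    return nodes[i][4]


def predict_ensemble(trees, sample):
    """Evaluate every tree independently and return the majority label."""
    votes = Counter(predict_tree(tree, sample) for tree in trees)
    return votes.most_common(1)[0][0]


tree_a = [(0, 0.5, 1, 2, None), (-1, 0.0, 0, 0, "A"), (-1, 0.0, 0, 0, "B")]
tree_b = [(1, 2.0, 1, 2, None), (-1, 0.0, 0, 0, "A"), (-1, 0.0, 0, 0, "B")]
tree_c = [(-1, 0.0, 0, 0, "B")]
label = predict_ensemble([tree_a, tree_b, tree_c], [0.2, 3.0])
```

Because each call to `predict_tree` touches only its own node list, the per-tree walks have no data dependencies on one another, which is the property that makes the workload a natural fit for parallel hardware.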
Posted Content•10.5194/GMD-2016-307•
Accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) model on Intel Xeon Phi processors

Hui Wang, Huansheng Chen1, Qizhong Wu2, Junming Lin3, Xueshun Chen1, Xinwei Xie3, Rongrong Wang2, Xiao Tang1, Zifa Wang1 •
Chinese Academy of Sciences1, Beijing Normal University2, Intel3
22 Feb 2017-Geoscientific Model Development Discussions
TL;DR: This study presented the work of porting and optimizing the GNAQPMS model on the second generation Intel Xeon Phi processor codename “Knights Landing” (KNL), and described the five optimizations applied to the key modules of GNAQPMS – CBM-Z gas chemistry, advection, convection and wet deposition.
Abstract: The GNAQPMS model is the global version of the Nested Air Quality Prediction Modelling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present our work of porting and optimizing the GNAQPMS model on the second generation Intel Xeon Phi processor codename “Knights Landing” (KNL). Compared with the first generation Xeon Phi coprocessor, KNL introduced many new hardware features such as a bootable processor, high performance in-package memory and ISA compatibility with Intel Xeon processor. In particular, we described the five optimizations we applied to the key modules of GNAQPMS – CBM-Z gas chemistry, advection, convection and wet deposition. These optimizations work well on both the KNL 7250 processor and the Intel Xeon processor E5-2697 V4. They include: 1) updating the pure MPI parallel mode to hybrid parallel mode with MPI and OpenMP in emission, advection, convection and chemistry modules; 2) fully employing the 512-bit wide vector processing units (VPU) on the KNL platform; 3) reducing unnecessary memory access to improve cache efficiency; 4) reducing thread local storage (TLS) in CBM-Z gas phase chemistry module to improve its OpenMP performance; 5) changing global communication from interface-files writing/reading to using Message Passing Interface (MPI) functions to improve the performance and the parallel scalability. These optimizations improved GNAQPMS performance greatly. The same optimizations also work well for the Intel Xeon Broadwell processor, specifically, E5-2697v4. Compared with the baseline version of GNAQPMS, the optimized version is 3.34x faster on KNL and 2.39x faster on CPU. Furthermore, the optimized version on KNL runs at 26% lower average power compared to CPU. Combining the performance and energy improvement, the KNL platform is 47% more efficient compared to the CPU platform. The optimizations also enable much further parallel scalability on both the CPU cluster and KNL cluster – scaling to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4% and 42.2%, respectively.
Corrfunc: Blazing fast correlation functions on the CPU

Manodeep Sinha, Lehman H. Garrison
1 Mar 2017
Proceedings Article•10.1109/CIACT.2017.7977382•
A novel approach for CPU utilization on a multicore paradigm using parallel quicksort

Tinku Singh, Durgesh Kumar Srivastava, Alok Aggarwal
1 Feb 2017
TL;DR: The results show that the parallel version of quicksort utilizes the individual CPU cores better than its sequential version, exploiting more parallelism, which leads to better CPU utilization.
Abstract: Multicore CPU architectures are popular because of their performance; the challenges in a multicore environment are writing effective code that can exploit the parallelism and measuring performance in terms of individual CPU core utilization. Effective multithreaded (parallel) code leads to performance speedup. Various multithreaded applications are being developed nowadays to utilize the CPU cores. In this paper, two tools are developed. The first, a C# console application, measures the performance of the CPU cores individually; performance is measured as the load on each core in percent. The second, a Windows C# application, plots the CPU load in percent against time. With both tools, performance is measured while quicksort executes serially and in parallel on a large number of data elements. Experiments are run on dual-core and quad-core CPUs and the results are tabulated. Comparison graphs are drawn for the running time of quicksort as well as for individual CPU core utilization. The results show that the parallel version of quicksort utilizes the individual CPU cores better than its sequential version: it exploits more parallelism, which leads to better CPU utilization.
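For readers unfamiliar with the technique, a parallel quicksort can be sketched as one partition step followed by sorting the two sides concurrently. The paper's tools are written in C#; the Python below is a separate illustration, not the authors' code, and the function names and the `workers` parameter are assumptions. Note also that CPython threads share a global interpreter lock, so this version shows the structure of the algorithm rather than a real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items):
    """Three-way partition around the middle element."""
    pivot = items[len(items) // 2]
    less = [v for v in items if v < pivot]
    equal = [v for v in items if v == pivot]
    greater = [v for v in items if v > pivot]
    return less, equal, greater


def quicksort(items):
    """Plain sequential quicksort returning a new sorted list."""
    if len(items) <= 1:
        return list(items)
    less, equal, greater = partition(items)
    return quicksort(less) + equal + quicksort(greater)


def parallel_quicksort(items, workers=2):
    """Partition once, then sort the two sides as independent tasks."""
    if len(items) <= 1:
        return list(items)
    less, equal, greater = partition(items)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        left = pool.submit(quicksort, less)
        right = pool.submit(quicksort, greater)
        return left.result() + equal + right.result()


result = parallel_quicksort([5, 3, 8, 1, 9, 2, 7, 3])
```

How evenly the two tasks load the cores depends on the pivot: a skewed partition leaves one task with most of the work, which is one reason per-core utilization is worth measuring separately from total running time.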
Journal Article•10.1145/3005448•
SPARCNet: A Hardware Accelerator for Efficient Deployment of Sparse Convolutional Networks

Adam Page1, Ali Jafari1, Colin Shea1, Tinoosh Mohsenin1•
University of Maryland, Baltimore County1
12 May 2017-ACM Journal on Emerging Technologies in Computing Systems
TL;DR: The proposed SPARCNet, a hardware accelerator for efficient deployment of SPARse Convolutional NETworks, looks to enable deploying networks in embedded, resource-bound settings by both exploiting efficient forms of parallelism inherent in convolutional layers and by exploiting the sparsification and approximation techniques proposed.
Abstract: Deep neural networks have been shown to outperform prior state-of-the-art solutions that often relied heavily on hand-engineered feature extraction techniques coupled with simple classification algorithms. In particular, deep convolutional neural networks have been shown to dominate on several popular public benchmarks such as the ImageNet database. Unfortunately, the benefits of deep networks have yet to be fully exploited in embedded, resource-bound settings that have strict power and area budgets. Graphics processing units (GPUs) have been shown to improve throughput and energy efficiency over central processing units (CPUs) due to their highly parallel architecture, yet still impose a significant power burden. In a similar fashion, field-programmable gate arrays (FPGAs) can be used to improve performance while further allowing more fine-grained control over implementation to improve efficiency. In order to reduce power and area while still achieving required throughput, classification-efficient network architectures are required in addition to optimal deployment on efficient hardware. In this work, we target both of these enterprises. For the first objective, we analyze simple, biologically inspired reduction strategies that are applied both before and after training. The central theme of the techniques is the introduction of sparsification to help dissolve away the dense connectivity that is often found at different levels in convolutional neural networks. The sparsification techniques include feature compression partition, structured filter pruning, and dynamic feature pruning. Additionally, we explore filter factorization and filter quantization approximation techniques to further reduce the complexity of convolutional layers. In the second contribution, we propose SPARCNet, a hardware accelerator for efficient deployment of SPARse Convolutional NETworks. The accelerator looks to enable deploying networks in such resource-bound settings by both exploiting efficient forms of parallelism inherent in convolutional layers and by exploiting the sparsification and approximation techniques proposed. To demonstrate both contributions, modern deep convolutional network architectures containing millions of parameters are explored within the context of the computer vision dataset CIFAR. Utilizing the reduction techniques, we demonstrate the ability to reduce computation and memory by 60% and 93% with less than 0.03% impact on accuracy when compared to the best baseline network with 93.47% accuracy. The SPARCNet accelerator with different numbers of processing engines is implemented on a low-power Artix-7 FPGA platform. Additionally, the same networks are optimally implemented on a number of embedded commercial-off-the-shelf platforms including NVIDIA's CPU+GPU SoCs TK1 and TX1 and Intel Edison. Compared to NVIDIA's TK1 and TX1, the FPGA-based accelerator obtains 11.8× and 7.5× improvement in energy efficiency while maintaining a classification throughput of 72 images/s. When further compared to a number of recent FPGA-based accelerators, SPARCNet is able to achieve up to 15× improvement in energy efficiency while consuming less than 2 W of total board power at 100 MHz. In addition to improving efficiency, the accelerator has built-in support for sparsification techniques and the ability to perform in-place rectified linear unit (ReLU) activation function, max-pooling, and batch normalization.
Journal Article•10.1175/BAMS-D-15-00278.1•
Parallelization and Performance of the NIM Weather Model on CPU, GPU, and MIC Processors

Mark Govett1, Jim Rosinski2, Jacques Middlecoff2, Tom Henderson2, Jin Lee1, Alexander E. MacDonald1, Ning Wang2, Paul Madden3, Julie Schramm2, Antonio Duarte3 •
Earth System Research Laboratory1, Colorado State University2, Cooperative Institute for Research in Environmental Sciences3
01 Oct 2017-Bulletin of the American Meteorological Society
TL;DR: The code structure and parallelization of NIM is described using standards-compliant open multiprocessing (OpenMP) and open accelerator (OpenACC) directives to support a single, performance-portable code that runs on CPU, GPU, and MIC systems.
Abstract: The design and performance of the Non-Hydrostatic Icosahedral Model (NIM) global weather prediction model is described. NIM is a dynamical core designed to run on central processing unit (CPU), graphics processing unit (GPU), and Many Integrated Core (MIC) processors. It demonstrates efficient parallel performance and scalability to tens of thousands of compute nodes and has been an effective way to make comparisons between traditional CPU and emerging fine-grain processors. The design of the NIM also serves as a useful guide in the fine-grain parallelization of the finite volume cubed (FV3) model recently chosen by the National Weather Service (NWS) to become its next operational global weather prediction model. This paper describes the code structure and parallelization of NIM using standards-compliant open multiprocessing (OpenMP) and open accelerator (OpenACC) directives. NIM uses the directives to support a single, performance-portable code that runs on CPU, GPU, and MIC systems. Performance r...
Patent•
Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing

Pieter Van Rooyen, Michael Ruehle, Rami Mehio, Gavin Stone, Mark David Hahm, Eric Ojard, Amnon Ptashek 
7 Jun 2017
TL;DR: In this article, a system, method and apparatus for executing a bioinformatics analysis on genetic sequence data is provided, which includes one or more of a first integrated circuit, where each first circuit forms a central processing unit (CPU) that is responsive to software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline.
Abstract: A system, method and apparatus for executing a bioinformatics analysis on genetic sequence data is provided. Particularly, a genomics analysis platform for executing a sequence analysis pipeline is provided. The genomics analysis platform includes one or more of a first integrated circuit, where each first integrated circuit forms a central processing unit (CPU) that is responsive to one or more software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline. Additionally, a second integrated circuit is also provided, where each second integrated circuit forms a field programmable gate array (FPGA), the FPGA being configured by firmware to arrange a set of hardwired digital logic circuits that are interconnected by a plurality of physical interconnects to perform a second set of genomic processing steps of the sequence analysis pipeline, the set of hardwired digital logic circuits of each FPGA being arranged as a set of processing engines to perform the second set of genomic processing steps. A shared memory is also provided.
Proceedings Article•10.1109/HOTI.2017.13•
An FPGA Platform for Hyperscalers

Francois Abel1, Jagath Weerasinghe1, Christoph Hagleitner1, Beat Weiss1, Stephan Paredes •
IBM1
1 Aug 2017
TL;DR: An infrastructure which integrates 64 FPGAs (Kintex* UltraScale* XCKU060) from Xilinx* in a 19" × 2U chassis, and provides a bi-sectional bandwidth of 640 Gb/s is described, which turns the FPGA into a disaggregated standalone computing resource that can be deployed at large scale into emerging hyperscale data centers.
Abstract: FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC). They are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency. Meanwhile, DC infrastructures are being redesigned to pack ever more compute capacity into the same volume and power envelopes. This redesign leads to the disaggregation of the server and its resources into a collection of standalone computing, memory, and storage modules. To embrace this evolution, we developed a platform that decouples the FPGA from the CPU of the server by connecting the FPGA directly to the DC network. This proposal turns the FPGA into a disaggregated standalone computing resource that can be deployed at large scale into emerging hyperscale data centers. This paper describes an infrastructure which integrates 64 FPGAs (Kintex* UltraScale* XCKU060) from Xilinx* in a 19" × 2U chassis, and provides a bi-sectional bandwidth of 640 Gb/s. The platform is designed for cost effectiveness and makes use of hot-water cooling for optimized energy efficiency. As a result, a DC rack can fit 16 platforms, for a total of 1024 FPGAs + 16 TB of DDR4 memory.
Proceedings Article•10.18653/V1/D17-1300•
Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU

Jacob Devlin1•
Microsoft1
1 Sep 2017
TL;DR: This article proposed a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at bottom, followed by a series of stacked fully-connected layers applied at every timestep.
Abstract: Attentional sequence-to-sequence models have become the new standard for machine translation, but one challenge of such models is a significant increase in training and decoding cost compared to phrase-based systems. In this work we focus on efficient decoding, with a goal of achieving accuracy close to the state-of-the-art in neural machine translation (NMT), while achieving CPU decoding speed/throughput close to that of a phrasal decoder. We approach this problem from two angles: First, we describe several techniques for speeding up an NMT beam search decoder, which obtain a 4.4x speedup over a very efficient baseline decoder without changing the decoder output. Second, we propose a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at bottom, followed by a series of stacked fully-connected layers applied at every timestep. This architecture achieves similar accuracy to a deep recurrent model, at a small fraction of the training and decoding cost. By combining these techniques, our best system achieves a very competitive accuracy of 38.3 BLEU on WMT English-French NewsTest2014, while decoding at 100 words/sec on single-threaded CPU. We believe this is the best published accuracy/speed trade-off of an NMT system.
Proceedings Article•10.23919/DATE.2017.7927008•
GPUguard: Towards supporting a predictable execution model for heterogeneous SoC

Björn Forsberg1, Andrea Marongiu1, Luca Benini1•
ETH Zurich1
27 Mar 2017
TL;DR: This work presents the ongoing work on GPUguard, a software technique that predictably arbitrates main memory usage in heterogeneous SoCs, and shows that GPUguard is able to reduce the adverse effects of memory sharing, while retaining a high throughput on both the CPU and the accelerator.
Abstract: The deployment of real-time workloads on commercial off-the-shelf (COTS) hardware is attractive, as it reduces the cost and time-to-market of new products. Most modern high-end embedded SoCs rely on a heterogeneous design, coupling a general-purpose multi-core CPU to a massively parallel accelerator, typically a programmable GPU, sharing a single global DRAM. However, because of non-predictable hardware arbiters designed to maximize average or peak performance, it is very difficult to provide timing guarantees on such systems. In this work we present our ongoing work on GPUguard, a software technique that predictably arbitrates main memory usage in heterogeneous SoCs. A prototype implementation for the NVIDIA Tegra TX1 SoC shows that GPUguard is able to reduce the adverse effects of memory sharing, while retaining a high throughput on both the CPU and the accelerator.
Proceedings Article•
DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices

Dawei Li1, Xiaolong Wang1, Deguang Kong1•
Samsung1
16 Aug 2017
TL;DR: This paper proposes DeepRebirth, a novel acceleration framework that reduces the execution time of non-tensor layers, such as pooling and normalization, which have no tensor-like trainable parameters.
Abstract: Deploying deep neural networks on mobile devices is a challenging task. Current model compression methods such as matrix decomposition effectively reduce the deployed model size, but still cannot satisfy real-time processing requirements. This paper first discovers that the major obstacle is the excessive execution time of non-tensor layers such as pooling and normalization without tensor-like trainable parameters. This motivates us to design a novel acceleration framework: DeepRebirth through "slimming" existing consecutive and parallel non-tensor and tensor layers. The layer slimming is executed at different substructures: (a) streamline slimming by merging the consecutive non-tensor and tensor layer vertically; (b) branch slimming by merging non-tensor and tensor branches horizontally. The proposed optimization operations significantly accelerate the model execution and also greatly reduce the run-time memory cost since the slimmed model architecture contains fewer hidden layers. To maximally avoid accuracy loss, the parameters in newly generated layers are learned with layer-wise fine-tuning based on both theoretical analysis and empirical verification. As observed in the experiment, DeepRebirth achieves more than 3x speed-up and 2.5x run-time memory saving on GoogLeNet with only 0.4% drop on top-5 accuracy in ImageNet. Furthermore, by combining with other model compression techniques, DeepRebirth offers an average of 106.3ms inference time on the CPU of Samsung Galaxy S5 with 86.5% top-5 accuracy, 14% faster than SqueezeNet which only has a top-5 accuracy of 80.5%.
Journal Article•10.1109/TC.2017.2710317•
Enhancing Energy Efficiency of Multimedia Applications in Heterogeneous Mobile Multi-Core Processors

Young Geun Kim1, Minyong Kim1, Sung Woo Chung1•
Korea University1
01 Nov 2017-IEEE Transactions on Computers
TL;DR: An advanced task scheduler for heterogeneous multi-core processors is proposed, which provides appropriate amount of CPU resources for multimedia applications and saves system-wide energy consumption and improves performance of non-multimedia applications.
Abstract: Recent smart devices have adopted heterogeneous multi-core processors which have high-performance big cores and low-power small cores. Unfortunately, the conventional task scheduler for heterogeneous multi-core processors does not provide appropriate amount of CPU resources for multimedia applications (whose QoS is important to users), resulting in energy waste; it often executes multimedia applications and non-multimedia applications on the same core. In this paper, we propose an advanced task scheduler for heterogeneous multi-core processors, which provides appropriate amount of CPU resources for multimedia applications. Our proposed task scheduler isolates multimedia applications from non-multimedia applications at runtime, exploiting the fact that multimedia applications have a specific thread for video/audio playback (to play video/audio, a multimedia application should use a function that generates the specific thread). Since multimedia applications usually require a smaller amount of CPU resources than non-multimedia applications due to dedicated hardware decoders, our proposed task scheduler allocates the former to the small cores and the latter to the big cores. In our experiments on an Android-based development board, our proposed task scheduler saves system-wide (not just CPU) energy consumption by 89 percent, on average, compared to the conventional task scheduler, preserving QoS of multimedia applications. In addition, it improves performance of non-multimedia applications by 137 percent, on average, compared to the conventional task scheduler.
...
