TL;DR: This paper proposes an optimization framework for offloading from a single mobile device (MD) to multiple edge devices. It develops a linear relaxation-based approach and a semidefinite relaxation (SDR)-based approach for the fixed CPU frequency case, and an exhaustive search-based approach and an SDR-based approach for the elastic CPU frequency case.
Abstract: In this paper, we propose an optimization framework for offloading from a single mobile device (MD) to multiple edge devices. We aim to minimize both total tasks’ execution latency and the MD’s energy consumption by jointly optimizing the task allocation decision and the MD’s central processing unit (CPU) frequency. This paper considers two cases for the MD, i.e., fixed CPU frequency and elastic CPU frequency. Since these problems are NP-hard, we propose a linear relaxation-based approach and a semidefinite relaxation (SDR)-based approach for the fixed CPU frequency case, and an exhaustive search-based approach and an SDR-based approach for the elastic CPU frequency case. Our simulation results show that the SDR-based algorithms achieve near-optimal performance. The proposed scheme reduces both energy consumption and tasks’ execution latency when multiple edge devices and elastic CPU frequency are considered. Finally, we show that the MD’s flexible CPU range can have an impact on the task allocation.
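The exhaustive search idea mentioned in the abstract above can be illustrated with a minimal sketch. The cost model here is a toy assumption rather than the paper's formulation: devices run in parallel, latency is the busiest device's total time, and energy combines a `kappa * f_local**2` per-cycle term with transmit energy. The function name `exhaustive_allocation` and all of its parameters are hypothetical.

```python
from itertools import product


def exhaustive_allocation(cycles, data, f_local, f_edge, rate,
                          kappa=0.0, tx_power=0.0, weight=1.0):
    """Try every assignment of tasks to {MD, edge 1..K} and keep the cheapest.

    Toy cost model (an assumption, not the paper's formulation):
      latency = busiest device's total time (devices work in parallel)
      energy  = kappa * f_local**2 per local cycle + tx_power * upload time
      cost    = latency + weight * energy
    """
    n_edges = len(f_edge)
    best_cost, best_assign = float("inf"), None
    # assign[i] == 0 runs task i on the MD; assign[i] == j + 1 offloads it to edge j.
    for assign in product(range(n_edges + 1), repeat=len(cycles)):
        busy = [0.0] * (n_edges + 1)   # accumulated time per device
        energy = 0.0                   # energy spent by the MD only
        for i, target in enumerate(assign):
            if target == 0:
                busy[0] += cycles[i] / f_local
                energy += kappa * f_local ** 2 * cycles[i]
            else:
                j = target - 1
                upload = data[i] / rate[j]
                busy[target] += upload + cycles[i] / f_edge[j]
                energy += tx_power * upload
        cost = max(busy) + weight * energy
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_cost, best_assign
```

With N tasks and K edge devices the loop visits (K + 1)^N assignments, so a search of this kind is only practical for small instances; this is the usual motivation for relaxation-based alternatives.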
TL;DR: Compared to deterministic scheduling and voltage scaling, GRACE-OS saves energy by 7% to 72% while delivering statistical performance guarantees, and delivers soft performance guarantees by bounding the deadline miss ratio under application-specific requirements.
Abstract: This paper presents GRACE-OS, an energy-efficient soft real-time CPU scheduler for mobile devices that primarily run multimedia applications. The major goal of GRACE-OS is to support application quality of service and save energy. To achieve this goal, GRACE-OS integrates dynamic voltage scaling into soft real-time scheduling and decides how fast to execute applications in addition to when and how long to execute them. GRACE-OS makes such scheduling decisions based on the probability distribution of application cycle demands, and obtains the demand distribution via online profiling and estimation. We have implemented GRACE-OS in the Linux kernel and evaluated it on an HP laptop with a variable-speed CPU and multimedia codecs. Our experimental results show that (1) the demand distribution of the studied codecs is stable or changes smoothly. This stability implies that it is feasible to perform stochastic scheduling and voltage scaling with low overhead; (2) GRACE-OS delivers soft performance guarantees by bounding the deadline miss ratio under application-specific requirements; and (3) GRACE-OS reduces CPU idle time and spends more busy time in lower-power speeds. Our measurement indicates that compared to deterministic scheduling and voltage scaling, GRACE-OS saves energy by 7% to 72% while delivering statistical performance guarantees.
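As a rough illustration of how a demand distribution can bound the deadline miss ratio, the sketch below picks a per-job cycle budget from profiled samples and then the slowest CPU speed that fits that budget into the period. This is a simplified stand-in, not GRACE-OS's actual algorithm; the function names, the empirical-quantile rule, and the single-speed-per-job choice are all assumptions.

```python
import math


def cycle_budget(samples, max_miss_ratio):
    """Smallest budget such that at most max_miss_ratio of the profiled
    jobs demanded more cycles than the budget (empirical quantile)."""
    ordered = sorted(samples)
    allowed_misses = math.floor(max_miss_ratio * len(ordered))
    return ordered[len(ordered) - 1 - allowed_misses]


def lowest_feasible_speed(budget, period, speeds):
    """Slowest speed (cycles per second) that finishes `budget` cycles
    within `period` seconds; falls back to the fastest speed."""
    for speed in sorted(speeds):
        if budget / speed <= period:
            return speed
    return max(speeds)
```

For example, with eight profiled demands of 10, 20, ..., 80 cycles and a tolerated miss ratio of 0.25, the budget covers all but the two largest samples; the speed is then chosen as the slowest one that can execute that budget within the job's period.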
TL;DR: This paper explores the design of a distributed in-memory key-value store called Pilaf that takes advantage of Remote Direct Memory Access to achieve high performance with low CPU overhead and introduces the notion of self-verifying data structures that can detect read-write races without client-server coordination.
Abstract: Recent technological trends indicate that future datacenter networks will incorporate High Performance Computing network features, such as ultra-low latency and CPU bypassing. How can these features be exploited in datacenter-scale systems infrastructure? In this paper, we explore the design of a distributed in-memory key-value store called Pilaf that takes advantage of Remote Direct Memory Access to achieve high performance with low CPU overhead.
In Pilaf, clients directly read from the server's memory via RDMA to perform gets, which commonly dominate key-value store workloads. By contrast, put operations are serviced by the server to simplify the task of synchronizing memory accesses. To detect inconsistent RDMA reads with concurrent CPU memory modifications, we introduce the notion of self-verifying data structures that can detect read-write races without client-server coordination. Our experiments show that Pilaf achieves low latency and high throughput while consuming few CPU resources. Specifically, Pilaf can surpass 1.3 million ops/sec (90% gets) using a single CPU core compared with 55K for Memcached and 59K for Redis.
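The self-verifying idea described above can be sketched as follows: each entry carries a checksum over its own contents, so a client that fetched raw bytes can detect a torn read by itself and simply retry. This is a minimal illustration under stated assumptions, not Pilaf's actual layout: the entry format, the use of `zlib.crc32` as the checksum, and the `read_remote` callable standing in for an RDMA read are all hypothetical.

```python
import struct
import zlib


def pack_entry(key: bytes, value: bytes) -> bytes:
    """Server side: serialize lengths, key, value, then append a CRC of all of it."""
    body = struct.pack("<HH", len(key), len(value)) + key + value
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_entry(raw: bytes):
    """Client side: return (key, value), or None if the bytes fail verification."""
    if len(raw) < 8:  # 4-byte header plus 4-byte checksum at minimum
        return None
    body = raw[:-4]
    (stored,) = struct.unpack("<I", raw[-4:])
    if zlib.crc32(body) != stored:
        return None  # likely read while the server was modifying the entry
    klen, vlen = struct.unpack("<HH", body[:4])
    if len(body) != 4 + klen + vlen:
        return None
    return body[4:4 + klen], body[4 + klen:]


def get_with_retry(read_remote, max_attempts=3):
    """Re-issue the (hypothetical) remote read until an entry verifies."""
    for _ in range(max_attempts):
        entry = unpack_entry(read_remote())
        if entry is not None:
            return entry
    return None
```

The point of the design is that verification needs no message to the server: a failed check only costs the client another read.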
TL;DR: In this paper, the multiprocessor logic controller is interposed between the CPU and the remainder of the computer system to intercept and replace control signals that are passed over certain critical control signal lines associated with the CPU.
Abstract: A security enhanced computer system arrangement includes a coprocessor and a multiprocessor logic controller inserted into the architecture of a conventional computer system. The coprocessor and multiprocessor logic controller are interposed between the CPU of the conventional computer system and the remainder of that system to intercept and replace control signals that are passed over certain of the critical control signal lines associated with the CPU. The multiprocessor logic controller arrangement thereby isolates the CPU of the conventional computer system from the remainder of the conventional computer system, permitting separate control over the CPU and separate control over the remainder of the computer system. By controlling the control signals that are normally passed between the CPU and the remainder of the computer system, the multiprocessor logic controller permits the coprocessor to perform highly secure operations. These secure operations, selectable by a trusted operator or built into a cooperating operating system, verify that the computer system is a trusted computing base which can be relied upon to perform its operations properly and without compromise.
TL;DR: Gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems, able to simulate many system configurations, ranging from a system with coherent caches and a single virtual address space across the CPU and GPU to a system that maintains separate GPU and CPU physical address spaces.
Abstract: gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems. It builds on gem5, a modular full-system CPU simulator, and GPGPU-Sim, a detailed GPGPU simulator. gem5-gpu routes most memory accesses through Ruby, which is a highly configurable memory system in gem5. By doing this, it is able to simulate many system configurations, ranging from a system with coherent caches and a single virtual address space across the CPU and GPU to a system that maintains separate GPU and CPU physical address spaces. gem5-gpu can run most unmodified CUDA 3.2 source code. Applications can launch non-blocking kernels, allowing the CPU and GPU to execute simultaneously. We present gem5-gpu’s software architecture and a brief performance validation. We also discuss possible extensions to the simulator. gem5-gpu is open source and available at gem5-gpu.cs.wisc.edu.