TL;DR: This paper describes practical attacks that combine methodology from side channel attacks, fault attacks, and return-oriented programming to read arbitrary memory from the victim's process, and shows that speculative execution implementations violate the security assumptions underpinning numerous software security mechanisms.
Abstract: Modern processors use branch prediction and speculative execution to maximize performance. For example, if the destination of a branch depends on a memory value that is in the process of being read, CPUs will try to guess the destination and attempt to execute ahead. When the memory value finally arrives, the CPU either discards or commits the speculative computation. Speculative logic is unfaithful in how it executes, can access the victim's memory and registers, and can perform operations with measurable side effects.
Spectre attacks involve inducing a victim to speculatively perform operations that would not occur during correct program execution and which leak the victim's confidential information via a side channel to the adversary. This paper describes practical attacks that combine methodology from side channel attacks, fault attacks, and return-oriented programming that can read arbitrary memory from the victim's process. More broadly, the paper shows that speculative execution implementations violate the security assumptions underpinning numerous software security mechanisms, including operating system process separation, static analysis, containerization, just-in-time (JIT) compilation, and countermeasures to cache timing/side-channel attacks. These attacks represent a serious threat to actual systems, since vulnerable speculative execution capabilities are found in microprocessors from Intel, AMD, and ARM that are used in billions of devices.
While makeshift processor-specific countermeasures are possible in some cases, sound solutions will require fixes to processor designs as well as updates to instruction set architectures (ISAs) to give hardware architects and software developers a common understanding as to what computation state CPU implementations are (and are not) permitted to leak.
TL;DR: In this paper, the authors proposed a simple and fast algorithm that runs on the CPU and relies only on basic image processing operations to perform depth completion of sparse LIDAR depth data.
Abstract: With the rise of data driven deep neural networks as a realization of universal function approximators, most research on computer vision problems has moved away from handcrafted classical image processing algorithms. This paper shows that with a well designed algorithm, we are capable of outperforming neural network based methods on the task of depth completion. The proposed algorithm is simple and fast, runs on the CPU, and relies only on basic image processing operations to perform depth completion of sparse LIDAR depth data. We evaluate our algorithm on the challenging KITTI depth completion benchmark, and at the time of submission, our method ranks first on the KITTI test server among all published methods. Furthermore, our algorithm is data independent, requiring no training data to perform the task at hand. The code written in Python is publicly available at https://github.com/kujason/ip_basic
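To make the idea of depth completion from basic image processing operations concrete, the following is a minimal sketch and not the authors' pipeline. It assumes that empty pixels are encoded as 0, that all measured depths are strictly below a hypothetical `max_depth`, and that holes are filled by repeated 3x3 max-dilation of the inverted depth so that nearer points win. The function names `dilate_max` and `complete_depth` are made up for this illustration.

```python
import numpy as np

def dilate_max(img, k=3):
    """Grayscale dilation: each pixel becomes the max over its k x k neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="constant", constant_values=0.0)
    out = np.zeros_like(img)
    h, w = img.shape
    for dy in range(k):
        for dx in range(k):
            out = np.maximum(out, padded[dy:dy + h, dx:dx + w])
    return out

def complete_depth(sparse, max_depth=100.0, iterations=5):
    """Fill empty (zero) pixels of a sparse depth map by repeated dilation.

    Assumption: valid depths satisfy 0 < depth < max_depth.
    """
    depth = sparse.astype(np.float64)
    valid = depth > 0
    # Invert so that nearer points get larger values and win the max-dilation.
    inv = np.where(valid, max_depth - depth, 0.0)
    for _ in range(iterations):
        dilated = dilate_max(inv)
        empty = inv == 0
        inv[empty] = dilated[empty]  # fill holes only, keep measured values
    return np.where(inv > 0, max_depth - inv, 0.0)
```

Everything here is plain array arithmetic, so it runs on the CPU and needs no training data. A usable pipeline would need further steps such as hole closing and smoothing, which this sketch omits.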
TL;DR: This paper shows that a well designed classical image processing algorithm can outperform neural network based methods on the task of depth completion; at the time of submission, the method ranks first on the KITTI test server among all published methods.
Abstract: With the rise of data driven deep neural networks as a realization of universal function approximators, most research on computer vision problems has moved away from hand crafted classical image processing algorithms. This paper shows that with a well designed algorithm, we are capable of outperforming neural network based methods on the task of depth completion. The proposed algorithm is simple and fast, runs on the CPU, and relies only on basic image processing operations to perform depth completion of sparse LIDAR depth data. We evaluate our algorithm on the challenging KITTI depth completion benchmark, and at the time of submission, our method ranks first on the KITTI test server among all published methods. Furthermore, our algorithm is data independent, requiring no training data to perform the task at hand. The code written in Python will be made publicly available at this https URL.
TL;DR: Nemesis is presented, a previously overlooked side-channel attack vector that abuses the CPU's interrupt mechanism to leak microarchitectural instruction timings from enclaved execution environments such as Intel SGX, Sancus, and TrustLite.
Abstract: Recent research on transient execution vulnerabilities shows that current processors exceed our levels of understanding. The prominent Meltdown and Spectre attacks abruptly revealed fundamental design flaws in CPU pipeline behavior and exception handling logic, urging the research community to systematically study attack surface from microarchitectural interactions. We present Nemesis, a previously overlooked side-channel attack vector that abuses the CPU's interrupt mechanism to leak microarchitectural instruction timings from enclaved execution environments such as Intel SGX, Sancus, and TrustLite. At its core, Nemesis abuses the same subtle microarchitectural behavior that enables Meltdown, i.e., exceptions and interrupts are delayed until instruction retirement. We show that by measuring the latency of a carefully timed interrupt, an attacker controlling the system software is able to infer instruction-granular execution state from hardware-enforced enclaves. In contrast to speculative execution vulnerabilities, our novel attack vector is applicable to the whole computing spectrum, from small embedded sensor nodes to high-end commodity x86 hardware. We present practical interrupt timing attacks against the open-source Sancus embedded research processor, and we show that interrupt latency reveals microarchitectural instruction timings from off-the-shelf Intel SGX enclaves. Finally, we discuss challenges for mitigating Nemesis-type attacks at the hardware and software levels.
TL;DR: A co-processing approach is adopted with the control loop of FDEM executed serially on the CPU and compute-intensive tasks off-loaded to the GPU, with results indicating speedups of up to 100× compared to sequential CPU execution.
TL;DR: This paper focuses on analyzing and organizing the extensive body of literature on near-memory computing across various dimensions: starting from the memory level where this paradigm is applied, to the granularity of the application that could be executed on the near-memory units.
Abstract: The conventional approach of moving stored data to the CPU for computation has become a major performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, advances in integration technologies have made the decades-old concept of coupling compute units close to the memory (called Near-Memory Computing) more viable. Processing right at the "home" of data can greatly diminish the data movement problem of data-intensive applications. This paper focuses on analyzing and organizing the extensive body of literature on near-memory computing across various dimensions: starting from the memory level where this paradigm is applied, to the granularity of the application that could be executed on the near-memory units. We highlight the challenges as well as the critical need for evaluation methodologies that can be employed in designing these special architectures. Using a case study, we present our methodology and also identify topics for future research to unlock the full potential of near-memory computing.
TL;DR: HyperLoop is presented, a new framework that removes CPU from the critical path of replicated transactions in storage systems by offloading them to commodity RDMA NICs, with non-volatile memory as the storage medium and demonstrates that popular storage applications can be easily optimized using these primitives.
Abstract: Storage systems in data centers are an important component of large-scale online services. They typically perform replicated transactional operations for high data availability and integrity. Today, however, such operations suffer from high tail latency even with recent kernel bypass and storage optimizations, and thus affect the predictability of end-to-end performance of these services. We observe that the root cause of the problem is the involvement of the CPU, a precious commodity in multi-tenant settings, in the critical path of replicated transactions. In this paper, we present HyperLoop, a new framework that removes CPU from the critical path of replicated transactions in storage systems by offloading them to commodity RDMA NICs, with non-volatile memory as the storage medium. To achieve this, we develop new and general NIC offloading primitives that can perform memory operations on all nodes in a replication group while guaranteeing ACID properties without CPU involvement. We demonstrate that popular storage applications can be easily optimized using our primitives. Our evaluation results with microbenchmarks and application benchmarks show that HyperLoop can reduce 99th percentile latency ≈ 800X with close to 0% CPU consumption on replicas.
TL;DR: All the requirements and key performance indicators of a network to disaggregate IT resources are identified, the progress and importance of optical interconnects are summarized, and it is shown that the more diverse the VM requests are, the higher the net financial gain is.
Abstract: Disaggregated rack-scale data centers have been proposed as the only promising avenue to break the barrier of the fixed CPU-to-memory proportionality caused by main-tray direct-attached conventional/traditional server-centric systems. However, memory disaggregation has stringent network requirements in terms of latency, energy efficiency, bandwidth, and bandwidth density. This paper identifies all the requirements and key performance indicators of a network to disaggregate IT resources while summarizing the progress and importance of optical interconnects. Crucially, it proposes a rack-and-cluster scale architecture, which supports the disaggregation of CPU, memory, storage, and/or accelerator blocks. Optical circuit switching forms the core of this architecture, whereas the end-points (IT resources) are equipped with on-chip programmable hybrid electrical packet/circuit switches. This architecture offers dynamically reconfigurable physical topology to form virtual ones, each embedded with a set of functions. The paper analyzes the latency overhead of disaggregated DDR4 (parallel) and the proposed hybrid memory cube (serial) memory elements on the conventional and the proposed architecture. A set of resource allocation algorithms is introduced to (1) optimally select disaggregated IT resources with the lowest possible latency, (2) pool them together by means of a virtual network interconnect, and (3) compose virtual disaggregated servers. Simulation findings show up to a 34% resource utilization increase over traditional data centers while highlighting the importance of the placement and locality among compute, memory, and storage resources. In particular, the network-aware locality-based resource allocation algorithm achieves as low as 15 ns, 95 ns, and 315 ns memory transaction round-trip latency on 63%, 22%, and 15% of the allocated virtual machines (VMs), respectively, while utilizing 100% of the CPU resources. Furthermore, a formulation to parameterize and evaluate the additional financial costs endured by disaggregation is reported. It is shown that the more diverse the VM requests are, the higher the net financial gain is. Finally, an experiment was carried out using silicon photonic midboard optics and an optical circuit switch, which demonstrates forward error correction free $10^{-12}$ bit error rate performance on up to five-tier scale-out networks.
TL;DR: It has been observed that the GPU runs faster than the CPU in all tests performed, and in some cases, GPU is 4-5 times faster than CPU, according to the tests performed on GPU server and CPU server.
Abstract: Deep learning approaches are machine learning methods used in many application fields today. Some core mathematical operations performed in deep learning are suitable to be parallelized. Parallel processing increases the operating speed. Graphical Processing Units (GPUs) are used frequently for parallel processing. The parallelization capacity of GPUs is higher than that of Central Processing Units (CPUs), because GPUs have far more cores. In this study, benchmarking tests were performed between CPU and GPU. A Tesla K80 GPU and an Intel Xeon Gold 6126 CPU were used during the tests. A system for classifying Web pages with a Recurrent Neural Network (RNN) architecture was used to compare performance during testing. CPUs and GPUs running on the cloud were used in the tests because the amount of hardware needed for the tests was high. During the tests, some hyperparameters were adjusted and the performance values were compared between CPU and GPU. It has been observed that the GPU runs faster than the CPU in all tests performed. In some cases, the GPU is 4-5 times faster than the CPU, according to the tests performed on the GPU server and the CPU server. These values can be further increased by using a GPU server with more features.
TL;DR: This paper describes two existing implementations of memory tagging, one is the full hardware implementation in SPARC; the other is a partially hardware-assisted compiler-based tool for AArch64.
Abstract: Memory safety in C and C++ remains largely unresolved. A technique usually called "memory tagging" may dramatically improve the situation if implemented in hardware with reasonable overhead. This paper describes two existing implementations of memory tagging: one is the full hardware implementation in SPARC; the other is a partially hardware-assisted compiler-based tool for AArch64. We describe the basic idea, evaluate the two implementations, and explain how they improve memory safety. This paper is intended to initiate a wider discussion of memory tagging and to motivate the CPU and OS vendors to add support for it in the near future.
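The basic idea behind memory tagging can be illustrated with a toy software model. This is a sketch under stated assumptions, not a description of either implementation evaluated in the paper: the 16-byte granule size, the representation of a pointer as a `(tag, address)` pair, and the names `TaggedMemory`, `set_tags`, and `TagMismatch` are all hypothetical choices made for illustration. Each memory granule carries a small tag, each pointer carries a tag, and an access is allowed only when the two match.

```python
GRANULE = 16  # bytes covered by one memory tag (assumed for illustration)

class TagMismatch(Exception):
    """Raised when a pointer's tag differs from the tag of the memory it touches."""

class TaggedMemory:
    def __init__(self, size):
        self.data = bytearray(size)
        self.tags = [0] * ((size + GRANULE - 1) // GRANULE)

    def set_tags(self, addr, size, tag):
        """Tag every granule of [addr, addr + size) and return a tagged pointer."""
        first = addr // GRANULE
        last = (addr + size - 1) // GRANULE
        for g in range(first, last + 1):
            self.tags[g] = tag
        return (tag, addr)

    def _check(self, ptr, offset):
        tag, addr = ptr
        target = addr + offset
        if self.tags[target // GRANULE] != tag:
            raise TagMismatch(f"pointer tag {tag} vs memory tag "
                              f"{self.tags[target // GRANULE]} at {target}")
        return target

    def load(self, ptr, offset=0):
        return self.data[self._check(ptr, offset)]

    def store(self, ptr, offset, value):
        self.data[self._check(ptr, offset)] = value
```

In this model, a buffer overflow into a neighbouring allocation with a different tag raises `TagMismatch`. So does a use-after-free, provided the region was retagged when it was freed or reallocated. Detection depends on the tags differing, so with few tag bits it is probabilistic unless the allocator deliberately avoids reusing a tag.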
TL;DR: This work proposes and evaluates two general-purpose solutions that minimize unnecessary off-chip communication for PIM architectures and shows that both mechanisms improve the performance and energy consumption of many important memory-intensive applications.
Abstract: Poor DRAM technology scaling over the course of many years has caused DRAM-based main memory to increasingly become a larger system bottleneck. A major reason for the bottleneck is that data stored within DRAM must be moved across a pin-limited memory channel to the CPU before any computation can take place. This requires a high latency and energy overhead, and the data often cannot benefit from caching in the CPU, making it difficult to amortize the overhead.
Modern 3D-stacked DRAM architectures include a logic layer, where compute logic can be integrated underneath multiple layers of DRAM cell arrays within the same chip. Architects can take advantage of the logic layer to perform processing-in-memory (PIM), or near-data processing. In a PIM architecture, the logic layer within DRAM has access to the high internal bandwidth available within 3D-stacked DRAM (which is much greater than the bandwidth available between DRAM and the CPU). Thus, PIM architectures can effectively free up valuable memory channel bandwidth while reducing system energy consumption.
A number of important issues arise when we add compute logic to DRAM. In particular, the logic does not have low-latency access to common CPU structures that are essential for modern application execution, such as the virtual memory and cache coherence mechanisms. To ease the widespread adoption of PIM, we ideally would like to maintain traditional virtual memory abstractions and the shared memory programming model. This requires efficient mechanisms that can provide logic in DRAM with access to CPU structures without having to communicate frequently with the CPU. To this end, we propose and evaluate two general-purpose solutions that minimize unnecessary off-chip communication for PIM architectures. We show that both mechanisms improve the performance and energy consumption of many important memory-intensive applications.
TL;DR: A scheduling method called interlacing peak that can balance loads and improve the effects of resource allocation and utilization effectively is proposed and shows advantage over other similar standard algorithms.
Abstract: In cloud computing, resources are dynamic, and the demands placed on the resources allocated to a particular task are diverse. These factors could lead to load imbalances, which affect scheduling efficiency and resource utilization. A scheduling method called interlacing peak is proposed. First, the resource load information, such as CPU, I/O, and memory usage, is periodically collected and updated, and the task information regarding CPU, I/O, and memory is collected. Second, resources are sorted into three queues according to the loads of the CPU, I/O, and memory, and tasks are classified as CPU intensive, I/O intensive, or memory intensive according to their demands for resources. Finally, when tasks are scheduled, they need to interlace the resource load peak: each type of task is matched with resources that are lightly loaded in the corresponding dimension. In other words, CPU-intensive tasks should be matched with resources with low CPU utilization; I/O-intensive tasks should be matched with resources with shorter I/O wait times; and memory-intensive tasks should be matched with resources that have low memory usage. The effectiveness of this method is proved from the theoretical point of view. It has also been proven to have low time and space complexity. Four experiments were designed to verify the performance of this method. The experiments use four metrics: 1) average response time; 2) load balancing; 3) deadline violation rates; and 4) resource utilization. The experimental results show that this method can balance loads and improve the effects of resource allocation and utilization effectively. This is especially true when resources are limited, in which case many tasks compete for the same resources. Even then, this method shows an advantage over other similar standard algorithms.
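The matching rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm. It assumes loads and demands are given as fractions per dimension under the hypothetical keys `"cpu"`, `"io"`, and `"mem"`, and it replaces the three sorted queues with a direct search for the least-loaded resource. The function names are made up for this example.

```python
DIMENSIONS = ("cpu", "io", "mem")

def classify(task):
    """Label a task by its dominant resource demand (first dimension wins ties)."""
    return max(DIMENSIONS, key=lambda d: task[d])

def schedule(tasks, resources):
    """Place each task on the resource least loaded in the task's dominant dimension.

    tasks:     {task_name: {"cpu": x, "io": y, "mem": z}}  demands
    resources: {res_name:  {"cpu": x, "io": y, "mem": z}}  current loads
    Returns {task_name: res_name}; the caller's dictionaries are not modified.
    """
    loads = {name: dict(load) for name, load in resources.items()}
    placement = {}
    for name, task in tasks.items():
        kind = classify(task)
        target = min(loads, key=lambda r: loads[r][kind])
        placement[name] = target
        # Account for the new task so later placements see the updated load.
        for d in DIMENSIONS:
            loads[target][d] += task[d]
    return placement
```

With this rule, a CPU-intensive task lands on the resource with the lowest CPU utilization, so the task's demand peak interlaces with the resource's load valley instead of stacking on an existing peak.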
TL;DR: A cloud-based real-time modeling system for supporting decision makers in assessing flood risk, built using Amazon Web Services, that automates access and pre-processing of forecast data, execution of a computationally expensive and high-resolution 2D hydrodynamic model, Two-dimensional Unsteady Flow (TUFLOW), and map-based visualization of model outputs.
Abstract: The ability to quickly and accurately forecast flooding is increasingly important as extreme weather events become more common. This work focuses on designing a cloud-based real-time modeling system for supporting decision makers in assessing flood risk. The system, built using Amazon Web Services (AWS), automates access and pre-processing of forecast data, execution of a computationally expensive and high-resolution 2D hydrodynamic model, Two-dimensional Unsteady Flow (TUFLOW), and map-based visualization of model outputs. A graphical processing unit (GPU) version of TUFLOW was used, resulting in an 80x execution time speed-up compared to the central processing unit (CPU) version. The system is designed to run automatically to produce near real-time results and consume minimal computational resources until triggered by an extreme weather event. The system is demonstrated for a case study in the coastal plain of Virginia to forecast flooding vulnerability of transportation infrastructure during extreme weather events.
TL;DR: This work applies dynamic parallelism for synaptic updating in SNN simulations on a GPU, which eliminates the need to start many parallel applications at each time-step, and the associated lags of data transfer between CPU and GPU memories.
TL;DR: A new model-based data partitioning algorithm is proposed that minimizes the execution time of computations in the parallel execution of the application; its correctness is proved, along with its complexity of $O(m^3 \times p^3)$, where $m$ is the cardinality of the input discrete speed functions and $p$ is the number of processors.
Abstract: Modern HPC platforms have become highly heterogeneous owing to tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to maximize the dominant objectives of performance and energy efficiency. Due to this inherent characteristic, processing elements contend for shared on-chip resources such as Last Level Cache (LLC), interconnect, etc. and shared nodal resources such as DRAM, PCI-E links, etc. This has resulted in severe resource contention and Non-Uniform Memory Access (NUMA) that have posed serious challenges to model and algorithm developers. Moreover, the accelerators feature limited main memory compared to the multicore CPU host and are connected to it via limited bandwidth PCI-E links thereby requiring support for efficient out-of-card execution. To summarize, the complexities (resource contention, NUMA, accelerator-specific limitations, etc.) have introduced new challenges to optimization of data-parallel applications on these platforms for performance. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we formulate the problem of optimization of data-parallel applications on modern heterogeneous HPC platforms for performance. We then propose a new model-based data partitioning algorithm, which minimizes the execution time of computations in the parallel execution of the application. This algorithm takes as input a set of $p$ discrete speed functions corresponding to $p$ available heterogeneous processors. It does not make any assumptions about the shapes of these functions. We prove the correctness of the algorithm and its complexity of $O(m^3 \times p^3)$ , where $m$ is the cardinality of the input discrete speed functions. 
We experimentally demonstrate the optimality and efficiency of our algorithm using two data-parallel applications, matrix multiplication and fast Fourier transform, on a heterogeneous cluster of nodes where each node contains an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor.
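The partitioning problem stated above can be illustrated with a small exhaustive dynamic program. This is not the paper's algorithm and does not have its $O(m^3 \times p^3)$ complexity; it is a plain sketch under these assumptions: the workload is `n` indivisible units, and each processor is described by a hypothetical table `times[i][x]`, the measured time for processor `i` to execute `x` units (which follows from a discrete speed function as `x / speed_i(x)`). No shape is assumed for these tables.

```python
def partition(times, n):
    """Split n work units across processors to minimize the parallel execution time.

    times[i][x] is the time processor i needs for x units, for x = 0..n.
    Returns (makespan, alloc) where alloc[i] is the number of units given to i.
    """
    inf = float("inf")
    p = len(times)
    # best[w]: minimal makespan for w units using the processors seen so far.
    best = [0.0] + [inf] * n
    choices = []
    for i in range(p):
        new_best = [inf] * (n + 1)
        pick = [0] * (n + 1)
        for w in range(n + 1):
            for x in range(w + 1):  # x units go to processor i
                cand = max(best[w - x], times[i][x])
                if cand < new_best[w]:
                    new_best[w], pick[w] = cand, x
        best = new_best
        choices.append(pick)
    # Backtrack to recover the distribution.
    alloc = [0] * p
    w = n
    for i in reversed(range(p)):
        alloc[i] = choices[i][w]
        w -= alloc[i]
    return best[n], alloc
```

Because the search examines every table entry, a non-smooth profile, such as a sudden slowdown once a problem no longer fits in accelerator memory, is handled without any special casing. The cost is $O(p \cdot n^2)$ time.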
TL;DR: In this paper, image processing algorithms capable of executing in parallel on several platforms (CPU and GPU) were evaluated; all algorithms were tested in TensorFlow, which is a novel framework for deep learning, but also for image processing.
Abstract: Signal, image, and Synthetic Aperture Radar imagery algorithms are now used routinely. Because of the large data volumes and algorithmic complexity, processing them in real time is almost impossible. Image processing algorithms are often inherently parallel, so they fit nicely onto parallel architectures such as multicore Central Processing Units (CPUs) and Graphics Processing Units (GPUs). In this paper, image processing algorithms capable of executing in parallel on several platforms (CPU and GPU) were evaluated. All algorithms were tested in TensorFlow, which is a novel framework for deep learning, but also for image processing. Relative speedups compared to CPU were given for all algorithms. The TensorFlow GPU implementation can outperform multi-core CPUs for the tested algorithms; the obtained speedups range from 3.6 to 15 times.
TL;DR: DHL is presented, a novel CPU-FPGA co-design framework for NFV with both high performance and flexibility and a prototype of DHL with Intel DPDK, which greatly reduces the programming efforts to access FPGA, brings significantly higher throughput and lower latency over CPU-only implementation, and minimizes the CPU resources.
Abstract: Network function virtualization (NFV) aims to run software network functions (NFs) in commodity servers. As CPU is general-purpose hardware, one has to use many CPU cores to handle complex packet processing at line rate. Owing to its performance and programmability, FPGA has emerged as a promising platform for NFV. However, the programmable logic blocks on an FPGA board are limited and expensive. Implementing the entire NFs on FPGA is thus resource-demanding. Further, FPGA needs to be reprogrammed when the NF logic changes which can take hours to synthesize the code. It is thus inflexible to use FPGA to implement the entire NFV service chain. We present dynamic hardware library (DHL), a novel CPU-FPGA co-design framework for NFV with both high performance and flexibility. DHL employs FPGA as accelerators only for complex packet processing. It abstracts accelerator modules in FPGA as a hardware function library, and provides a set of transparent APIs for developers. DHL supports running multiple concurrent software NFs with distinct accelerator functions on the same FPGA and provides data isolation among them. We implement a prototype of DHL with Intel DPDK. Experimental results demonstrate that DHL greatly reduces the programming efforts to access FPGA, brings significantly higher throughput and lower latency over CPU-only implementation, and minimizes the CPU resources.
TL;DR: The experiments undertaken in this study demonstrate that the vectorized algorithm can greatly reduce the computation time, especially in the environment of a vector programming language, and it is possible to parallelize the algorithm as the data volume increases.
Abstract: Cellular automata (CA) models can simulate complex urban systems through simple rules and have become important tools for studying the spatio-temporal evolution of urban land use. However, the multiple and large-volume data layers, massive geospatial processing and complicated algorithms for automatic calibration in the urban CA models require a high level of computational capability. Unfortunately, the limited performance of sequential computation on a single computing unit (i.e. a central processing unit (CPU) or a graphics processing unit (GPU)) and the high cost of parallel design and programming make it difficult to establish a high-performance urban CA model. As a result of its powerful computational ability and scalability, the vectorization paradigm is becoming increasingly important and has received wide attention with regard to this kind of computational problem. This paper presents a high-performance CA model using vectorization and parallel computing technology for the computation-inte...
TL;DR: It is shown that PRINS-based processing-in-storage architecture may outperform existing in-storage designs and accelerator-based designs and achieve an order-of-magnitude speedup and improved power efficiency relative to all compared platforms.
Abstract: Machine learning algorithms have become a major tool in various applications. The high-performance requirements on large-scale datasets pose a challenge for traditional von Neumann architectures. We present two machine learning implementations and evaluations on PRINS, a novel processing-in-storage system based on resistive content addressable memory (ReCAM). PRINS functions simultaneously as a storage and a massively parallel associative processor. PRINS processing-in-storage resolves the bandwidth wall faced by near-data von Neumann architectures, such as three-dimensional DRAM and CPU stack or SSD with embedded CPU, by keeping the computing inside the storage arrays, thus implementing in-data, rather than near-data, processing. We show that PRINS-based processing-in-storage architecture may outperform existing in-storage designs and accelerator-based designs. Multiple performance comparisons for the ReCAM processing-in-storage implementations of $K$-means and $K$-nearest neighbors are performed. Compared platforms include CPU, GPU, FPGA, and Automata Processor. We show that PRINS may achieve an order-of-magnitude speedup and improved power efficiency relative to all compared platforms.
TL;DR: The results demonstrate that the learning-directed DVFS method can accurately predict the suitable central processing unit (CPU) frequency, given the runtime statistical information of a running program, and achieve an energy savings rate up to 42%.
Abstract: Dynamic voltage and frequency scaling (DVFS) is a well-known method for saving energy consumption. Several DVFS studies have applied learning-based methods to implement the DVFS prediction model instead of complicated mathematical models. This paper proposes a lightweight learning-directed DVFS method that involves using counter propagation networks to sense and classify the task behavior and predict the best voltage/frequency setting for the system. An intelligent adjustment mechanism for performance is also provided to users under various performance requirements. The comparative experimental results of the proposed algorithms and other competitive techniques are evaluated on the NVIDIA JETSON Tegra K1 multicore platform and Intel PXA270 embedded platforms. The results demonstrate that the learning-directed DVFS method can accurately predict the suitable central processing unit (CPU) frequency, given the runtime statistical information of a running program, and achieve an energy savings rate up to 42%. Through this method, users can easily achieve effective energy consumption and performance by specifying the factors of performance loss.
TL;DR: A novel parallel strategy by leveraging the parallel power of a multi-core CPU and that of a many-core GPU for HW/SW partitioning on CPU-GPU is presented and the solution quality obtained is competitive with existing heuristic methods in reasonable time.
Abstract: In hardware/software (HW/SW) co-design, hardware/software partitioning is an essential step in that it determines which components are to be implemented in hardware and which ones in software. Most HW/SW partitioning problems are NP-hard. For large-size problems, heuristic methods have to be utilized. This paper presents a parallel genetic algorithm with dispersion correction for HW/SW partitioning on CPU-GPU. First, an enhanced genetic algorithm with dispersion correction is presented. The under-constraint individuals are marched to the feasible region step by step. In this way, intensification is enhanced and the constraint problem is handled. Second, the cost computation and dispersion correction of the individuals are run in parallel. For a given problem size, the overall run-time can be reduced while the diversity of the genetic algorithm is kept. Third, especially when a number of under-constraint individuals should be corrected in an irregular way, the computation process is complicated and the computation overhead is large. Therefore, we present a novel parallel strategy by leveraging the parallel power of a multi-core CPU and that of a many-core GPU. The proposed strategy computes the costs of each individual in parallel on the GPU and corrects the under-constraint individuals in parallel on the multi-core CPU. In this way, highly efficient parallel computing can be achieved in which dozens of irregular correction computing steps are mapped to the multi-core CPU and thousands of regular cost computing steps are mapped to the many-core GPU. Fourth, at each iteration of the hybrid parallel strategy, the solution vectors of individuals are transferred to the GPU and their costs are transferred back to the CPU.
In order to further improve the efficiency of proposed algorithm, we propose an asynchronous transfer pattern (stream concurrency pattern) for CPU-GPU, in which the transfer process and computation process are overlapped and eventually the overall run-time can be reduced further. Finally, the experiments show that the solution quality obtained by our method is competitive with existing heuristic methods in reasonable time. Furthermore, by combining with the multi-core CPU and many-core GPU, the running time of the proposed method is efficiently reduced.
TL;DR: Performance and accuracy of the MMC model and the hybrid CPU/FPGA-based architecture are evaluated based on a set of case studies on a 401-level HB-MMC-based HVDC station and verified based on offline simulation results in the PSCAD/EMTDC environment.
Abstract: This paper presents i) an equivalent model of the half-bridge modular multilevel converter (HB-MMC) which is suitable for real-time applications, ii) a hybrid central-processing unit/field-programmable gate array (CPU/FPGA)-based architecture for real-time simulation of electromagnetic transients of systems which include HB-MMC, and iii) a novel arrangement for sorting results referred to as the “sub-module (SM) rank list”, which tackles the bottleneck for parallel implementation of the MMC arm model solver on the FPGA. The Adams–Bashforth (AB) method is used for numerical integration of the HB-SM capacitor model. The second-order AB method provides a constant admittance matrix of the HB-MMC and, thus, reduces computational burden while offering the same accuracy as that of the widely used Trapezoidal method. The CPU/FPGA-based architecture is optimized to obtain maximum parallelism of the HB-MMC model implementation, adopting a standard, single-precision, floating-point computational engine. The proposed sorting arrangement is independent of the utilized sorting algorithm and its application to the odd–even bubble sorting scheme is presented in this paper. The proposed architecture offers a simulation time-step of 825 ns while including the sorting module as the SM capacitor voltage-balancing control unit. This enables accurate analysis of MMC controls based on either software-in-the-loop or hardware-in-the-loop approaches. Performance and accuracy of the MMC model and the hybrid CPU/FPGA-based architecture are evaluated based on a set of case studies on a 401-level HB-MMC-based HVDC station and verified based on offline simulation results in the PSCAD/EMTDC environment.
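For readers unfamiliar with the odd–even bubble sorting scheme mentioned above, the following is a minimal sequential sketch of the textbook odd–even transposition sort. It does not reproduce the paper's SM rank list or its FPGA mapping; the function name `odd_even_sort` and the sample voltages are illustrative assumptions. The scheme suits parallel hardware because all compare-exchange pairs within one phase are disjoint and could run at the same time.

```python
def odd_even_sort(values):
    """Odd-even transposition sort (textbook form, run sequentially here).

    Phases alternate between pairs starting at even and at odd indices.
    n phases are sufficient to sort n elements.
    """
    a = list(values)
    n = len(a)
    for phase in range(n):
        start = phase % 2  # 0: pairs (0,1),(2,3),...  1: pairs (1,2),(3,4),...
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


# Hypothetical capacitor voltages of five sub-modules.
voltages = [1.62, 1.58, 1.65, 1.60, 1.59]
ranked = odd_even_sort(voltages)
```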
TL;DR: This work proposes a new ReRAM-based processing-in-memory architecture called RPBFS, in which graph data can be persistently stored and processed in place and shows a significant performance improvement compared with both the CPU-based and the GPU-based BFS implementations.
Abstract: Graph algorithms such as graph traversal have been gaining ever-increasing importance in the era of big data. However, graph processing on traditional architectures issues many random and irregular memory accesses, leading to a huge number of data movements and the consumption of very large amounts of energy. To minimize the waste of memory bandwidth, we investigate utilizing processing-in-memory (PIM), combined with non-volatile metal-oxide resistive random access memory (ReRAM), to improve both computation and I/O performance. We propose a new ReRAM-based processing-in-memory architecture called RPBFS, in which graph data can be persistently stored and processed in place. We study the problem of graph traversal, and we design an efficient graph traversal algorithm in RPBFS. Benefiting from low data movement overhead and high bank-level parallel computation, RPBFS shows a significant performance improvement compared with both the CPU-based and the GPU-based BFS implementations. On a suite of real-world graphs, our architecture yields a speedup in graph traversal performance of up to 33.8×, and achieves a reduction in energy over conventional systems of up to 142.8×.
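As a point of reference for the traversal being accelerated, here is a minimal sketch of level-by-level breadth-first search on a conventional CPU. It is not the RPBFS algorithm; the adjacency-dictionary representation and the name `bfs_levels` are assumptions made for illustration. The neighbor lookups in the inner loop are the random, irregular memory accesses the abstract refers to.

```python
def bfs_levels(adj, source):
    """Return a dict mapping each reachable vertex to its BFS depth.

    `adj` maps a vertex to an iterable of its neighbors (assumed layout).
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):  # irregular accesses to neighbor lists
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels = bfs_levels(graph, 0)
```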
TL;DR: A novel toolkit is proposed that uses a configurable/interchangeable learning technique to automatically learn the power model of a CPU, independently of the features and the complexity it exhibits, and is shown to estimate the power consumption of the whole CPU or individual processes with an accuracy of 98.5% on average.
TL;DR: A loadable kernel module framework, called real-time ROS extension on transparent CPU/GPU coordination mechanism (ROSCH-G), for scheduling ROS in a heterogeneous environment without modifying the OS kernel and device drivers and then evaluates it experimentally.
Abstract: Robot Operating System (ROS) promotes fault isolation, faster development, modularity, and core reusability and is therefore widely studied and used as the de facto standard for autonomous driving systems. Graphics processing units (GPUs) also facilitate high-performance computing and are therefore used for autonomous driving. As the requirements for real-time processing increase, methods for satisfying real-time constraints for ROS and GPUs are being developed. Unfortunately, scheduling algorithms specifying ROS's transportation (publish/subscribe) model, which can have execution order restrictions, are not being investigated, leading to the introduction of waiting time and degrading the responsiveness of the entire system. Furthermore, GPU tasks on ROS are also affected by the ROS transportation model, because central processing unit (CPU) time is occupied when GPU functions are launched. This paper proposes a loadable kernel module framework, called real-time ROS extension on transparent CPU/GPU coordination mechanism (ROSCH-G), for scheduling ROS in a heterogeneous environment without modifying the OS kernel and device drivers and then evaluates it experimentally. ROSCH-G provides a scheduling algorithm that considers ROS's execution order restrictions and a CPU/GPU coordination mechanism. Experimental results demonstrate that the proposed algorithm reduces the deadline miss rate and, compared with previous studies, makes effective use of the benefits of parallel processing. In addition, the results for the coordination mechanism demonstrate that ROSCH-G can schedule multiple GPU applications successfully.
TL;DR: This paper is devoted to the optimizations carried out in the TermoFluids CFD code to efficiently run it on the Mont-Blanc system, showing time reductions of up to 2 .
TL;DR: A general reconfigurable embedded system design of convolution neural networks (CNN) based on FPGA and soft-core CPU that is able to accomplish image identification and other CNN works with high speed and low power.
Abstract: To satisfy the demand of mobile computing and low-power application scenes, we propose a general reconfigurable embedded system design of convolution neural networks (CNN) based on FPGA and soft-core CPU. The basic computing modules are located in the hardware circuits, including convolution, pooling and active layers. Several controlling logic and serial data processing units are executed in the soft-core CPU. Based on the cooperation and interaction between hardware and software, the system is able to accomplish image identification and other CNN works with high speed and low power. As a case study to demonstrate the system's performance and validity, experiments with a CNN project are implemented on an Altera Cyclone-IV FPGA board. Our system achieves a peak performance of 59.52 GOPS at a 120 MHz working frequency with a power of 1.35 W, achieving a power efficiency of 44.1 GOPS/W.
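The reported power efficiency follows directly from the other two figures in the abstract. A short check, using only the numbers stated above (the variable names are illustrative):

```python
peak_gops = 59.52  # peak performance reported in the abstract, in GOPS
power_w = 1.35     # power reported in the abstract, in watts

efficiency = peak_gops / power_w  # GOPS per watt
print(f"{efficiency:.1f} GOPS/W")  # prints "44.1 GOPS/W"
```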
TL;DR: In this paper, the authors propose a multientry row-based write buffer to increase the buffer hit rate and reduce the number of memory core accesses for 3D-stacked memory architectures.
Abstract: Memory bandwidth is one of the major performance bottlenecks for chip multiprocessors (CMPs), which continue to integrate an increasing number of cores with the help of Moore's Law. The growing disparity between the CPU clock rate and off-chip memory access speed is known as the Memory Wall. This problem has been actively studied in the past two decades. It is addressed by placing memory closer to the processor, such as stacking the memory directly on top of a CMP, thereby significantly reducing the interconnect latency between them. However, previous 3D-stacked memory architectures use through-silicon via (TSV)-based three-dimensional (3-D) integration, which bonds multiple dies with TSVs that have diameters in the 1-5 μm range. Unlike TSV-based 3-D integration, monolithic 3-D integration builds device tiers sequentially on a single substrate. Different tiers are connected using monolithic inter-tier vias (MIVs), which have a diameter (around 50 nm) that is the same as that of a local via. Main memory typically consists of DRAM, which is volatile and thus requires periodic refresh to maintain the stored data. This increases both the energy consumption and access latency. However, various nonvolatile RAMs (NVRAMs) have emerged as possible universal memory technologies, which promise low power, fast read access, high density, and nonvolatility. In this paper, we present an efficient memory interface for monolithic 3D-stacked RAM (both DRAM and NVRAMs such as resistive RAM and nanotube RAM). It takes advantage of the tremendous bandwidth made available by MIVs to implement an on-chip memory bus in order to hide the latency of large data transfers. We propose a multientry row-based write buffer to increase the buffer hit rate and reduce the number of memory core accesses. We decouple read and write accesses using extra interconnects available through MIVs to increase memory throughput. We also present an adaptive power-down policy to maintain balance between energy efficiency and performance. Simulation results show that the proposed architecture can achieve both high performance and energy efficiency, and is thus attractive for low-power/high-performance computing.
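The effect of a multientry row-based write buffer on hit rate and memory core accesses can be illustrated with a toy software model. This is a sketch under stated assumptions, not the paper's hardware design: the least-recently-used replacement policy, the class name `RowWriteBuffer`, and the counters are all hypothetical. Writes to a row that is already buffered are merged (a hit); a row reaches the memory core only when it is evicted or flushed.

```python
from collections import OrderedDict


class RowWriteBuffer:
    """Toy model of a write buffer with one entry per memory row.

    Assumption: least-recently-used replacement. The abstract does not
    state the actual policy.
    """

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.rows = OrderedDict()  # row index -> {column: data}
        self.hits = 0
        self.core_writes = 0

    def write(self, row, column, data):
        if row in self.rows:
            self.hits += 1              # merged into a buffered row
            self.rows.move_to_end(row)  # mark as most recently used
        else:
            if len(self.rows) >= self.num_entries:
                self.rows.popitem(last=False)  # evict least recently used row
                self.core_writes += 1          # evicted row is written to the core
            self.rows[row] = {}
        self.rows[row][column] = data

    def flush(self):
        """Write every buffered row to the memory core."""
        self.core_writes += len(self.rows)
        self.rows.clear()


buf = RowWriteBuffer(num_entries=2)
for row, col in [(0, 0), (0, 1), (1, 0), (0, 2), (2, 0)]:
    buf.write(row, col, data=0xFF)
buf.flush()
```

In this trace five writes result in three row-level core accesses, because the three writes to row 0 are merged into one buffered entry.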
TL;DR: The proposed parallel model provides a fast and reliable tool with which to quickly assess flood hazards in large-scale areas and, thus, has a bright application prospect for dynamic inundation risk identification and disaster assessment.
Abstract: Computing speed is a significant issue of large-scale flood simulations for real-time response to disaster prevention and mitigation. Even today, most of the large-scale flood simulations are generally run on supercomputers due to the massive amounts of data and computations necessary. In this work, a two-dimensional shallow water model based on an unstructured Godunov-type finite volume scheme was proposed for flood simulation. To realize a fast simulation of large-scale floods on a personal computer, a Graphics Processing Unit (GPU)-based, high-performance computing method using the OpenACC application was adopted to parallelize the shallow water model. An unstructured data management method was presented to control the data transportation between the GPU and CPU (Central Processing Unit) with minimum overhead, and then both computation and data were offloaded from the CPU to the GPU, which exploited the computational capability of the GPU as much as possible. The parallel model was validated using various benchmarks and real-world case studies. The results demonstrate that speed-ups of up to one order of magnitude can be achieved in comparison with the serial model. The proposed parallel model provides a fast and reliable tool with which to quickly assess flood hazards in large-scale areas and, thus, has a bright application prospect for dynamic inundation risk identification and disaster assessment.
TL;DR: To maximize energy savings for given approximation constraints, a hybrid approach is presented combining both voltage and precision scaling, which can be applied to an associative memory-based architecture that can be implemented today using SRAM but can be seamlessly scaled to emerging ReRAM-based memory technology later with minimal effort.
Abstract: The complexity of computational problems is rising faster than the capabilities of computational platforms, which are also becoming increasingly costly to operate due to their increased need for energy. This forces researchers to find alternative paradigms and methods for efficient computing. One promising paradigm is accelerating compute-intensive kernels using in-memory computing accelerators, where data movements are significantly reduced. Another increasingly popular method for improving energy efficiency is approximate computing. In this paper, we propose a methodology for efficient approximate in-memory computing. To maximize energy savings for given approximation constraints, a hybrid approach is presented combining both voltage and precision scaling. This can be applied to an associative memory-based architecture that can be implemented today using CMOS memories (SRAM) but can be seamlessly scaled to emerging ReRAM-based memory technology later with minimal effort. For the evaluation of the proposed methodology, a diverse set of domains is covered, such as image processing, machine learning, machine vision, and digital signal processing. When compared to full-precision, unscaled implementations, average energy savings of 5.17× and 59.11×, and speedups of 2.1× and 3.24× in SRAM-based and ReRAM-based architectures, respectively, are reported.