Journal Article · DOI: 10.1109/LES.2021.3087707
Deep Learning Inference Parallelization on Heterogeneous Processors with TensorRT
TL;DR: This letter proposes a parallelization methodology to maximize the throughput of a single DL application using both GPU and NPU by exploiting various types of parallelism on TensorRT.
Abstract: As deep learning inference applications become more common, embedded devices increasingly include neural processing units (NPUs) in addition to a CPU and a GPU. For fast and efficient development of deep learning applications, NVIDIA provides TensorRT as the SDK for its hardware platforms; it includes an optimizer and a runtime that deliver low latency and high throughput for deep learning inference. Like most deep learning frameworks, TensorRT assumes that inference is executed on a single processing element, either the GPU or the NPU, not both. In this paper, we propose a parallelization methodology to maximize the throughput of a single deep learning application using both the GPU and the NPU by exploiting various types of parallelism on TensorRT. With six real-life benchmarks, we achieved 81% to 391% throughput improvement over the baseline inference using the GPU only.
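The abstract does not spell out which kinds of parallelism are used, so the following is only a minimal sketch of one possible form: two processing elements pulling inputs from a shared queue so that both stay busy. It makes no TensorRT calls; `run_parallel`, `fake_gpu_infer`, and `fake_npu_infer` are hypothetical names introduced here for illustration, and plain Python threads stand in for the GPU and the NPU.

```python
import queue
import threading


def run_parallel(frames, workers):
    """Dispatch frames to several workers that pull from one shared queue.

    `workers` maps a device label to a callable. Both are placeholders for
    real inference engines; this sketch does not touch TensorRT.
    """
    tasks = queue.Queue()
    for index, frame in enumerate(frames):
        tasks.put((index, frame))

    results = [None] * len(frames)     # outputs, kept in input order
    served_by = [None] * len(frames)   # which worker handled each frame

    def worker(label, infer):
        while True:
            try:
                index, frame = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            results[index] = infer(frame)
            served_by[index] = label

    threads = [
        threading.Thread(target=worker, args=(label, infer))
        for label, infer in workers.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, served_by


# Hypothetical stand-ins for a GPU engine and an NPU engine.
def fake_gpu_infer(x):
    return x * 2


def fake_npu_infer(x):
    return x * 2


outputs, served_by = run_parallel(
    list(range(8)), {"gpu": fake_gpu_infer, "npu": fake_npu_infer}
)
```

Because each worker takes the next frame as soon as it is free, a faster device naturally processes more frames. The split between the two workers depends on thread scheduling, so it can differ from run to run.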
Citations
TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards
Eunji Jeong, Jangryul Kim, Soonhoi Ha +2 more
TL;DR: This paper presents a TensorRT-based framework that supports various optimization parameters to accelerate a deep learning application on an NVIDIA Jetson embedded platform with heterogeneous processors, including multi-threading, pipelining, buffer assignment, and network duplication.
Light-YOLOv4: An Edge-Device Oriented Target Detection Method for Remote Sensing Images
TL;DR: Light-YOLOv4 performs sparsity training by applying L1 regularization to the channel scaling factors, so that the less important channels and layers can be identified and pruned, reducing the network's width and depth.
YOLO-GD: A Deep Learning-Based Object Detection Algorithm for Empty-Dish Recycling Robots
TL;DR: This paper proposes a deep learning-based object detection algorithm, built on the lightweight object detection model YOLO-GD, that lets empty-dish recycling robots automatically collect dishes in restaurants, canteens, and similar settings.
Automatic defogging, deblurring, and real-time segmentation system for sewer pipeline defects
TL;DR: Wang et al. used an attention-based algorithm for defogging and a generative adversarial network (GAN) for deblurring to improve the sharpness of pipeline images, and reported the highest mean average precision (mAP) of 92.65% and the fastest speed of 41.23 frames per second (fps).
An architecture-level analysis on deep learning models for low-impact computations
TL;DR: In this paper, the authors conduct a series of experiments to study the inference workload of prominent state-of-the-art DNN architectures on a single-instruction-multiple-data (SIMD) CPU platform, with findings intended to apply to multiple hardware platforms.
References
DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices.
Nicholas D. Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, Fahim Kawsar +6 more
TL;DR: DeepX, a software accelerator, optimizes deep learning inference on mobile devices by decomposing models into unit-blocks and scaling resources, reducing memory, computation, and energy usage, enabling efficient execution of large-scale models on modern mobile processors.