Journal Article10.48550/arxiv.2409.04390
Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences
Rui Yu,Runkai Zhao,Cong Nie,Heng Wang,Hui Yan,Meng Wang +5 more
- 06 Sep 2024
TL;DR: This paper proposes LiSTM, a novel LiDAR 3D object detection framework that incorporates temporal motion estimation to enhance spatial-temporal feature learning, achieving superior detection performance on Waymo and nuScenes datasets through motion-guided feature aggregation and dual correlation weighting.
read more
Abstract: Accurate and robust LiDAR 3D object detection is essential for comprehensive scene understanding in autonomous driving. Despite its importance, LiDAR detection performance is limited by inherent constraints of point cloud data, particularly under conditions of extended distances and occlusions. Recently, temporal aggregation has been proven to significantly enhance detection accuracy by fusing multi-frame viewpoint information and enriching the spatial representation of objects. In this work, we introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information. We aim to improve the spatial-temporal interpretation capabilities of the LiDAR detector by incorporating a dynamic prior, generated from a non-learnable motion estimation model. Specifically, Motion-Guided Feature Aggregation (MGFA) is proposed to utilize the object trajectory from previous and future motion states to model spatial-temporal correlations into gaussian heatmap over a driving sequence. This motion-based heatmap then guides the temporal feature fusion, enriching the proposed object features. Moreover, we design a Dual Correlation Weighting Module (DCWM) that effectively facilitates the interaction between past and prospective frames through scene- and channel-wise feature abstraction. In the end, a cascade cross-attention-based decoder is employed to refine the 3D prediction. We have conducted experiments on the Waymo and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance with effective spatial-temporal feature learning.
read more
Chat with Paper
AI Agents for this Paper
Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps
References
SSD: Single Shot MultiBox Detector
Wei Liu,Dragomir Anguelov,Dumitru Erhan,Christian Szegedy,Scott Reed,Cheng-Yang Fu,Alexander C. Berg +6 more
- 08 Oct 2016
TL;DR: The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
R. Qi Charles,Hao Su,Mo Kaichun,Leonidas J. Guibas +3 more
- 21 Jul 2017
TL;DR: This paper designs a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input and provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing.
SECOND: Sparsely Embedded Convolutional Detection
Yan Yan,Yuxing Mao,Bo Li +2 more
TL;DR: An improved sparse convolution method for Voxel-based 3D convolutional networks is investigated, which significantly increases the speed of both training and inference and introduces a new form of angle loss regression to improve the orientation estimation performance.
3.2K
•Posted Content
Objects as Points
TL;DR: The center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors and performs competitively with sophisticated multi-stage methods and runs in real-time.
2.7K
•Posted Content
PointPillars: Fast Encoders for Object Detection from Point Clouds
TL;DR: PointPillars as mentioned in this paper utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars), which can be used with any standard 2D convolutional detection architecture.
2.3K