Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences

doi:10.48550/arxiv.2409.04390


Rui Yu, +5 more

- 06 Sep 2024

TL;DR: This paper proposes LiSTM, a novel LiDAR 3D object detection framework that incorporates temporal motion estimation to enhance spatial-temporal feature learning, achieving superior detection performance on Waymo and nuScenes datasets through motion-guided feature aggregation and dual correlation weighting.

Abstract: Accurate and robust LiDAR 3D object detection is essential for comprehensive scene understanding in autonomous driving. Despite its importance, LiDAR detection performance is limited by inherent constraints of point cloud data, particularly at long range and under occlusion. Recently, temporal aggregation has been shown to significantly enhance detection accuracy by fusing multi-frame viewpoint information and enriching the spatial representation of objects. In this work, we introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information. We aim to improve the spatial-temporal interpretation capabilities of the LiDAR detector by incorporating a dynamic prior generated from a non-learnable motion estimation model. Specifically, Motion-Guided Feature Aggregation (MGFA) is proposed to use the object trajectory from previous and future motion states to model spatial-temporal correlations into a Gaussian heatmap over a driving sequence. This motion-based heatmap then guides the temporal feature fusion, enriching the proposed object features. Moreover, we design a Dual Correlation Weighting Module (DCWM) that facilitates the interaction between past and prospective frames through scene- and channel-wise feature abstraction. Finally, a cascade cross-attention-based decoder is employed to refine the 3D prediction. We have conducted experiments on the Waymo and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance with effective spatial-temporal feature learning.
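The idea behind motion-guided aggregation can be illustrated with a toy sketch. This is not the paper's implementation. The abstract does not say which non-learnable motion model is used, so a constant-velocity model is assumed here. The function names (`constant_velocity_centers`, `gaussian_heatmap`, `fuse_frames`), the bird's-eye-view grid layout, and the weighted-average fusion rule are likewise hypothetical choices made for illustration only.

```python
import math


def constant_velocity_centers(center, velocity, frame_offsets, dt):
    """Extrapolate an object center to past (negative offset) and future
    (positive offset) frames. ASSUMPTION: constant-velocity motion."""
    cx, cy = center
    vx, vy = velocity
    return [(cx + vx * k * dt, cy + vy * k * dt) for k in frame_offsets]


def gaussian_heatmap(center, grid_size, cell, sigma):
    """Render an unnormalized 2D Gaussian peaked at `center` on a square
    grid. heatmap[row][col] is evaluated at the cell center
    ((col + 0.5) * cell, (row + 0.5) * cell)."""
    cx, cy = center
    heatmap = []
    for row in range(grid_size):
        y = (row + 0.5) * cell
        heatmap.append([
            math.exp(-(((col + 0.5) * cell - cx) ** 2 + (y - cy) ** 2)
                     / (2.0 * sigma ** 2))
            for col in range(grid_size)
        ])
    return heatmap


def fuse_frames(feature_maps, heatmaps, eps=1e-9):
    """Fuse single-channel per-frame feature maps with a per-cell
    weighted average, using each frame's heatmap as the weight.
    ASSUMPTION: a plain weighted average stands in for the fusion step."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weight_sum = sum(h[r][c] for h in heatmaps)
            weighted = sum(h[r][c] * f[r][c]
                           for f, h in zip(feature_maps, heatmaps))
            fused[r][c] = weighted / (weight_sum + eps)
    return fused


# Toy usage: one object moving along +x, seen in a past, current, and future frame.
offsets = [-1, 0, 1]
centers = constant_velocity_centers((2.0, 2.0), (1.0, 0.0), offsets, dt=0.5)
heatmaps = [gaussian_heatmap(c, grid_size=4, cell=1.0, sigma=1.0) for c in centers]
features = [[[float(t)] * 4 for _ in range(4)] for t in range(len(offsets))]
fused = fuse_frames(features, heatmaps)
```

In this sketch each frame contributes most strongly near the location where the motion model places the object in that frame, which is one simple way a motion-based heatmap could guide temporal fusion.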

