Journal Article, DOI: 10.48550/arXiv.2303.03366
Referring Multi-Object Tracking
37
TL;DR: The authors propose referring multi-object tracking (RMOT), which uses a language expression as a semantic cue to guide multi-object tracking in videos and can predict an arbitrary number of referent objects.
Abstract: Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos. To push forward RMOT, we construct one benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture TransRMOT to tackle the new task in an online manner, which achieves impressive detection performance and outperforms other counterparts. The dataset and code will be available at https://github.com/wudongming97/RMOT.
Citations
DriveLM: Driving with Graph Visual Question Answering
Chonghao Sima,Katrin Renz,Kashyap Chitta,Li Chen,Hanxue Zhang,Chengen Xie,Ping Luo,Andreas Geiger,Hongyang Li +8 more
TL;DR: This work instantiates datasets built upon nuScenes and CARLA and proposes a VLM-based baseline approach for jointly performing Graph VQA and end-to-end driving, demonstrating that Graph VQA provides a simple, principled framework for reasoning about a driving scene and that DriveLM-Data provides a challenging benchmark for this task.
82
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario
TL;DR: This paper proposes NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34k visual scenes and 460k question-answer pairs.
Language Prompt for Autonomous Driving
Dongming Wu,Wenmin Han,Tiancai Wang,Ying Hao Liu,Xiangyu Zhang,Jianbing Shen +5 more
TL;DR: NuPrompt, the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, is proposed, and a new prompt-based driving task is formulated that employs a language prompt to predict the described object trajectory across views and frames.
Vision Language Models in Autonomous Driving and Intelligent Transportation Systems
Xingcheng Zhou,Mingyu Liu,Bare Luka Žagar,Ekim Yurtsever,Alois C. Knoll +4 more
TL;DR: A comprehensive survey of the advances in language models in this domain is presented, encompassing current models and datasets, and the potential applications and emerging research directions are explored.
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Xiaoyu Tian,Junru Gu,Bailin Li,Yicheng Liu,Chenxu Hu,Yang Wang,Kun Zhan,Peng Jia,Xianpeng Lang,Hang Zhao +9 more
TL;DR: This work introduces DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities and proposes DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline.
References
•Posted Content
Deep Residual Learning for Image Recognition
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
117.9K
•Proceedings Article
Attention is All you Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Lukasz Kaiser,Illia Polosukhin +7 more
- 12 Jun 2017
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
81.7K
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin,Michael Maire,Serge Belongie,James Hays,Pietro Perona,Deva Ramanan,Piotr Dollár,C. Lawrence Zitnick +7 more
- 06 Sep 2014
TL;DR: A new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.
GloVe: Global Vectors for Word Representation
Jeffrey Pennington,Richard Socher,Christopher D. Manning +2 more
- 01 Oct 2014
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.