Referring Multi-Object Tracking

doi:10.48550/arXiv.2303.03366

Journal Article•10.48550/arXiv.2303.03366

Dongming Wu, +5 more

- 06 Mar 2023

- arXiv.org

- Vol. abs/2303.03366

37

TL;DR: The authors propose referring multi-object tracking (RMOT), a task that uses a language expression as a semantic cue to guide multi-object tracking in videos and can predict an arbitrary number of referent objects.

Citations

Journal Article•10.48550/arxiv.2312.14150

DriveLM: Driving with Graph Visual Question Answering

Chonghao Sima, +8 more

- 21 Dec 2023

- arXiv.org

TL;DR: This work instantiates datasets built upon nuScenes and CARLA and proposes a VLM-based baseline approach for jointly performing Graph VQA and end-to-end driving. It demonstrates that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and that DriveLM-Data provides a challenging benchmark for this task.

82

Journal Article•10.48550/arXiv.2305.14836

NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Tian-Bai Qian, +4 more

- 24 May 2023

- arXiv.org

TL;DR: This paper proposes NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34k visual scenes and 460k question-answer pairs.

63

Journal Article•10.48550/arxiv.2309.04379

Language Prompt for Autonomous Driving

Dongming Wu, +5 more

- 08 Sep 2023

- arXiv.org

TL;DR: This work proposes NuPrompt, the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, and formulates a new prompt-based driving task that uses a language prompt to predict the described object trajectory across views and frames.

50

Journal Article•10.48550/arxiv.2310.14414

Vision Language Models in Autonomous Driving and Intelligent Transportation Systems

Xingcheng Zhou, +4 more

- 22 Oct 2023

- arXiv.org

TL;DR: A comprehensive survey of the advances in language models in this domain is presented, encompassing current models and datasets, and the potential applications and emerging research directions are explored.

39

Journal Article•10.48550/arxiv.2402.12289

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, +9 more

- 19 Feb 2024

- arXiv.org

TL;DR: This work introduces DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities and proposes DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline.

38

References

Posted Content

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 10 Dec 2015

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

117.9K

Proceedings Article

Attention Is All You Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: This work introduces BERT, a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

Book Chapter•10.1007/978-3-319-10602-1_48

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

- 06 Sep 2014

TL;DR: This work presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding. The dataset gathers images of complex everyday scenes containing common objects in their natural context.

51.7K

Proceedings Article•10.3115/V1/D14-1162

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

- 01 Oct 2014

TL;DR: This work proposes a new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

41.6K
