Pix2seq: A Language Modeling Framework for Object Detection

Open AccessPosted Content

Pix2seq: A Language Modeling Framework for Object Detection

- 22 Sep 2021

- arXiv: Computer Vision and Pattern Recog...

108

TL;DR: Pix2Seq as mentioned in this paper cast object detection as a language modeling task conditioned on the observed pixel inputs, where object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens and train a neural network to perceive the image and generate the desired sequence.

Chat with Paper

AI Agents for this Paper

Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps

Citations

•Journal Article•10.1145/3505244

Transformers in Vision: A Survey

31 Jan 2022

- ACM Computing Surveys

TL;DR: Transformer networks as mentioned in this paper enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM).

...read moreread less

1.7K

•Proceedings Article•10.1109/cvpr52688.2022.01325

DN-DETR: Accelerate DETR Training by Introducing Query DeNoising

01 Jun 2022

TL;DR: DN-DETR as discussed by the authors proposes to feed ground-truth bounding boxes with noises into Transformer decoder and train the model to reconstruct the original boxes, which effectively reduces the bipartite graph matching difficulty and leads to faster convergence.

...read moreread less

398

Journal Article•10.1109/cvpr52729.2023.00935

Autoregressive Visual Tracking

Xing Wei, +4 more

- 01 Jun 2023

TL;DR: This time-autoregressive approach models the sequential evolution of trajectories to keep tracing the object across frames, making it superior to existing template matching based trackers that only consider the per-frame localization accuracy.

...read moreread less

116

Journal Article•10.1109/cvpr52729.2023.01032

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Antoine Yang, +7 more

- 01 Jun 2023

TL;DR: Vid2Seq is a large-scale pre-trained visual language model for dense video captioning that augments a language model with special time tokens to predict event boundaries and textual descriptions in the same sequence.

...read moreread less

89

Journal Article•10.1109/cvpr52729.2023.00660

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Xinlong Wang, +4 more

- 01 Jun 2023

TL;DR: Images Speak in Images: A generalist painter for in-context visual learning is a novel approach that redefines the output of core vision tasks as images and uses image-centric prompts to enable rapid task adaptation.

...read moreread less

67

...

Expand

References

•Proceedings Article•10.1109/CVPR.2016.90

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 27 Jun 2016

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.

...read moreread less

198.7K

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

...read moreread less

138.5K

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.

...read moreread less

94.2K

•Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

- 03 Dec 2012

TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

...read moreread less

88.4K

•Book Chapter•10.1007/978-3-319-10602-1_48

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

- 06 Sep 2014

TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.

...read moreread less

51.7K