Open Access | Posted Content
Pix2seq: A Language Modeling Framework for Object Detection
TL;DR: Pix2Seq casts object detection as a language modeling task conditioned on the observed pixel inputs: object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and a neural network is trained to perceive the image and generate the desired sequence.
Abstract: This paper presents Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we simply cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural net to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural net knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
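To make the idea of expressing object descriptions as sequences of discrete tokens concrete, here is a minimal sketch of one way a set of boxes and class labels could be flattened into integer tokens. The abstract does not specify the scheme, so everything below is an assumption for illustration: the helper names `quantize` and `objects_to_tokens`, the (ymin, xmin, ymax, xmax) coordinate order, the default of 1000 bins, and the choice to place class tokens after the coordinate bins in a shared vocabulary.

```python
from typing import List, Sequence, Tuple

# Assumed box layout for this sketch: (ymin, xmin, ymax, xmax) in pixels.
Box = Tuple[float, float, float, float]


def quantize(value: float, size: float, num_bins: int) -> int:
    """Map a coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    frac = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return int(round(frac * (num_bins - 1)))


def objects_to_tokens(
    boxes: Sequence[Box],
    labels: Sequence[int],
    height: float,
    width: float,
    num_bins: int = 1000,  # hypothetical default, not taken from the abstract
) -> List[int]:
    """Flatten (box, label) pairs into one sequence of discrete tokens.

    Each object contributes five tokens: four quantized corner coordinates
    followed by a class token. Class tokens are offset by num_bins so that
    they do not collide with coordinate tokens.
    """
    tokens: List[int] = []
    for (ymin, xmin, ymax, xmax), label in zip(boxes, labels):
        tokens.extend(
            [
                quantize(ymin, height, num_bins),
                quantize(xmin, width, num_bins),
                quantize(ymax, height, num_bins),
                quantize(xmax, width, num_bins),
                num_bins + label,
            ]
        )
    return tokens
```

Under these assumptions, a sequence model conditioned on the image would be trained to emit such a token sequence, and detections would be recovered by reading the tokens back in groups of five.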
Citations
Transformers in Vision: A Survey
31 Jan 2022
TL;DR: Transformer networks enable modeling of long-range dependencies between input sequence elements and, unlike recurrent networks such as long short-term memory (LSTM), support parallel processing of sequences.
1.7K
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising
01 Jun 2022
TL;DR: DN-DETR feeds noised ground-truth bounding boxes into the Transformer decoder and trains the model to reconstruct the original boxes, which reduces the difficulty of bipartite graph matching and leads to faster convergence.
Autoregressive Visual Tracking
Xing Wei, Yifan Bai, Yongchao Zheng, Dahu Shi, Yihong Gong
01 Jun 2023
TL;DR: This time-autoregressive approach models the sequential evolution of trajectories to keep tracing the object across frames, making it superior to existing template matching based trackers that only consider the per-frame localization accuracy.
116
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Šivic, Cordelia Schmid
01 Jun 2023
TL;DR: Vid2Seq is a large-scale pre-trained visual language model for dense video captioning that augments a language model with special time tokens to predict event boundaries and textual descriptions in the same sequence.
89
Images Speak in Images: A Generalist Painter for In-Context Visual Learning
Xinlong Wang, Wen Wang, Yong Cao, Chunhua Shen, Tiejun Huang
01 Jun 2023
TL;DR: This paper redefines the output of core vision tasks as images and uses image-centric prompts to enable rapid task adaptation.
67
References
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
27 Jun 2016
TL;DR: This paper proposes a residual learning framework that eases the training of networks substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
138.5K
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
12 Jun 2017
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
03 Dec 2012
TL;DR: State-of-the-art performance was achieved by a deep convolutional neural network (DCNN) that consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick
06 Sep 2014
TL;DR: This paper presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding; it gathers images of complex everyday scenes containing common objects in their natural context.