Generative Prompt Model for Weakly Supervised Object Localization
Yuzhong Zhao, Qixiang Ye, Weijia Wu, Chunhua Shen, Fang Wan
- 19 Jul 2023
TL;DR: GenPromp is a generative pipeline that localizes less discriminative object parts by formulating weakly supervised object localization (WSOL) as a conditional image denoising procedure.
Abstract: Weakly supervised object localization (WSOL) remains challenging when learning object localization models from image category labels. Conventional methods that discriminatively train activation models ignore representative yet less discriminative object parts. In this study, we propose a generative prompt model (GenPromp), defining the first generative pipeline to localize less discriminative object parts by formulating WSOL as a conditional image denoising procedure. During training, GenPromp converts image category labels to learnable prompt embeddings which are fed to a generative model to conditionally recover the input image with noise and learn representative embeddings. During inference, GenPromp combines the representative embeddings with discriminative embeddings (queried from an off-the-shelf vision-language model) for both representative and discriminative capacity. The combined embeddings are finally used to generate multi-scale high-quality attention maps, which facilitate localizing full object extent. Experiments on CUB-200-2011 and ILSVRC show that GenPromp respectively outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), setting a solid baseline for WSOL with the generative model. Code is available at https://github.com/callsys/GenPromp.
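The inference step described in the abstract (combine two embeddings, build attention maps at several scales, localize the object) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the linear blend weight `w`, the dot-product attention, the nearest-neighbour upsampling, the threshold-based box extraction, and every function name below are assumptions chosen for illustration, and the feature grids are synthetic stand-ins for the generative model's internal features.

```python
import numpy as np

def combine_embeddings(f_rep, f_dis, w=0.5):
    """Blend representative and discriminative embeddings (assumed linear form)."""
    return w * f_rep + (1.0 - w) * f_dis

def attention_map(features, embedding):
    """Scaled dot-product similarity between an (H, W, D) grid and a (D,) embedding."""
    h, w, d = features.shape
    scores = features.reshape(-1, d) @ embedding / np.sqrt(d)
    return scores.reshape(h, w)

def upsample_nearest(amap, size):
    """Nearest-neighbour upsampling of a square map to (size, size)."""
    factor = size // amap.shape[0]
    return np.kron(amap, np.ones((factor, factor)))

def aggregate_maps(feature_pyramid, embedding, size):
    """Average the per-scale attention maps and rescale the result to [0, 1]."""
    maps = [upsample_nearest(attention_map(f, embedding), size)
            for f in feature_pyramid]
    mean_map = np.mean(maps, axis=0)
    lo, hi = mean_map.min(), mean_map.max()
    return (mean_map - lo) / (hi - lo + 1e-8)

def box_from_map(amap, threshold=0.5):
    """Tight box (x_min, y_min, x_max, y_max) around pixels above the threshold."""
    ys, xs = np.nonzero(amap >= threshold)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def make_toy_features(size, embedding):
    """Synthetic grid: the central region carries the embedding, the rest is zero."""
    feats = np.zeros((size, size, embedding.shape[0]))
    lo, hi = size // 4, 3 * size // 4
    feats[lo:hi, lo:hi] = embedding
    return feats

# Hypothetical embeddings standing in for the learned and the queried ones.
f_rep = np.linspace(0.1, 1.0, 8)
f_dis = np.ones(8)
embedding = combine_embeddings(f_rep, f_dis, w=0.5)

# Three synthetic feature scales (4x4, 8x8, 16x16) with the same central object.
pyramid = [make_toy_features(s, embedding) for s in (4, 8, 16)]
heatmap = aggregate_maps(pyramid, embedding, size=16)
box = box_from_map(heatmap, threshold=0.5)
print(box)
```

Averaging maps from several resolutions is one simple way to trade off coarse, semantically strong responses against fine spatial detail; the actual aggregation used by GenPromp may differ.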
Citations
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models
Wei Wu, Yuzhong Zhao, Hao Chen, Yu-Chao Gu, Rui-Wei Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen
TL;DR: This paper builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation, and shows that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
SD-FSOD: Self-Distillation Paradigm via Distribution Calibration for Few-Shot Object Detection
Han Chen, Qi Wang, Kailin Xie, Liang Lei, M. Lin, Tian Lv, Yong-Jin Liu, Jiebo Luo
TL;DR: A novel self-distillation paradigm exclusively for the fine-tuning stage (SD-FSOD) is proposed, which enhances the fine-tuning process and shows the superiority of the FSOD self-distillation methodologies.
Misclassification in Weakly Supervised Object Detection.
Yonghua Xu, Jian Yang, Xuelong Li
TL;DR: In weakly supervised object detection (WSOD), some proposals are misclassified because they are semantically similar to objects from other categories, owing to viewing perspective and background interference. The MCC and MCT methods alleviate this problem by summarizing misclassification cases and decreasing the loss weights of misclassified classes.
DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution
Yuzhong Zhao, Feng Li, Yue Liu, Mingxiang Liao, Chen Gong, Qixiang Ye, Fang Wan
- 25 May 2024
TL;DR: DynRefer is a novel approach to region-level multi-modality tasks that improves the representational adaptability of existing models by mimicking human visual cognition.
Adaptive Segmentation Network for Scene Text Detection
Genming Zhao
- 01 Jan 2023
TL;DR: The proposed Adaptive Segmentation Network (ASNet) achieves state-of-the-art performance on scene text detection by automatically learning the discriminate segmentation threshold and designing a Global-information Enhanced Feature Pyramid Network (GE-FPN) for capturing text instances with macro size and extreme aspect ratios.