Generative Prompt Model for Weakly Supervised Object Localization

- 19 Jul 2023

13

TL;DR: The authors propose GenPromp, a generative pipeline that localizes less discriminative object parts by formulating weakly supervised object localization (WSOL) as a conditional image denoising procedure.

Abstract: Weakly supervised object localization (WSOL) remains challenging when learning object localization models from image category labels. Conventional methods that discriminatively train activation models ignore representative yet less discriminative object parts. In this study, we propose a generative prompt model (GenPromp), defining the first generative pipeline to localize less discriminative object parts by formulating WSOL as a conditional image denoising procedure. During training, GenPromp converts image category labels to learnable prompt embeddings which are fed to a generative model to conditionally recover the input image with noise and learn representative embeddings. During inference, GenPromp combines the representative embeddings with discriminative embeddings (queried from an off-the-shelf vision-language model) for both representative and discriminative capacity. The combined embeddings are finally used to generate multi-scale high-quality attention maps, which facilitate localizing full object extent. Experiments on CUB-200-2011 and ILSVRC show that GenPromp respectively outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), setting a solid baseline for WSOL with the generative model. Code is available at https://github.com/callsys/GenPromp.
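
The inference step described in the abstract (combine two embeddings, derive an attention map, localize the object) can be illustrated with a toy sketch. This is not the authors' implementation and involves no diffusion model: the linear blend, the dot-product attention, the threshold, and every name below (`combine_embeddings`, `attention_map`, `bounding_box`, `weight`) are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

Vector = List[float]
Grid = List[List[float]]


def combine_embeddings(representative: Vector, discriminative: Vector,
                       weight: float = 0.5) -> Vector:
    """Blend two embeddings linearly (the blend rule is an assumption)."""
    if len(representative) != len(discriminative):
        raise ValueError("embeddings must have the same length")
    return [weight * r + (1.0 - weight) * d
            for r, d in zip(representative, discriminative)]


def attention_map(features: List[List[Vector]], prompt: Vector) -> Grid:
    """Score each pixel feature by its dot product with the prompt,
    then min-max normalize the scores to [0, 1]."""
    scores = [[sum(f * p for f, p in zip(pixel, prompt)) for pixel in row]
              for row in features]
    flat = [s for row in scores for s in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(s - lo) / span for s in row] for row in scores]


def bounding_box(attn: Grid,
                 threshold: float = 0.5) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) around cells at or above threshold."""
    hits = [(x, y) for y, row in enumerate(attn)
            for x, value in enumerate(row) if value >= threshold]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 3x3 feature grid: the centre pixel responds to both embedding
# dimensions, while its right neighbour responds only to the first one.
zero = [0.0, 0.0]
features = [
    [zero, zero, zero],
    [zero, [1.0, 1.0], [1.0, 0.0]],
    [zero, zero, zero],
]
representative = [1.0, 0.0]   # hypothetical learned embedding
discriminative = [0.0, 1.0]   # hypothetical vision-language embedding

combined = combine_embeddings(representative, discriminative, weight=0.5)
box_discriminative = bounding_box(attention_map(features, discriminative))
box_combined = bounding_box(attention_map(features, combined))
print(box_discriminative, box_combined)
```

In this toy grid the discriminative embedding alone highlights a single pixel, whereas the blended embedding also covers the neighbouring pixel, which mirrors, in a very simplified way, the abstract's claim that combining both kinds of embedding helps localize the full object extent.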

Citations

Journal Article•10.48550/arxiv.2308.06160

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

Wei Wu, +8 more

- 11 Aug 2023

- arXiv.org

TL;DR: This paper builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation, and shows that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.

55

Journal Article•10.1109/tcsvt.2023.3343397

SD-FSOD: Self-Distillation Paradigm via Distribution Calibration for Few-Shot Object Detection

Han Chen, +7 more

TL;DR: A novel self-distillation paradigm exclusively for the fine-tuning stage (SD-FSOD) is proposed, which enhances the fine-tuning process and shows the superiority of the FSOD self-distillation methodologies.

6

Journal Article•10.1109/tip.2024.3402981

Misclassification in Weakly Supervised Object Detection

Yonghua Xu, +2 more

- 24 May 2024

- IEEE Transactions on Image Processing

TL;DR: Misclassification in weakly supervised object detection (WSOD) is a problem where some proposals exhibit semantic similarities with objects from other categories due to viewing perspective and background interference. MCC and MCT methods alleviate this problem by summarizing misclassification cases and decreasing loss weights of misclassified classes.

4

Preprint•10.48550/arxiv.2405.16071

DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution

Yuzhong Zhao, +6 more

- 25 May 2024

TL;DR: DynRefer is a novel approach to region-level multi-modality tasks that improves the representational adaptability of existing models by mimicking human visual cognition.

3

Preprint•10.48550/arxiv.2307.15029

Adaptive Segmentation Network for Scene Text Detection

Genming Zhao

- 01 Jan 2023

TL;DR: The proposed Adaptive Segmentation Network (ASNet) achieves state-of-the-art performance on scene text detection by automatically learning the discriminate segmentation threshold and designing a Global-information Enhanced Feature Pyramid Network (GE-FPN) for capturing text instances with macro size and extreme aspect ratios.

1
