Weakly Supervised Open-Vocabulary Object Detection

doi:10.1609/aaai.v38i4.28127

Proceedings Article

Jianghang Lin, +5 more

- 24 Mar 2024

- Proceedings of the ... AAAI Conference o...

- Vol. 38, Iss: 4, pp 3404-3412

TL;DR: WSOVOD extends traditional weakly supervised object detection to open-vocabulary and cross-dataset learning, achieving state-of-the-art performance.

Abstract: Despite weakly supervised object detection (WSOD) being a promising step toward evading strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is leveraged into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, a customized location-oriented weakly supervised region proposal network is proposed to utilize high-level semantic layouts from the category-agnostic segment anything model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art results compared with previous WSOD methods in both closed-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning to achieve on-par or even better performance than well-established fully-supervised open-vocabulary object detection (FSOVOD).
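Illustrative sketch (not from the paper): the region-level vision-language alignment idea in the abstract can be pictured as a multiple-instance head that scores each region proposal against concept text embeddings and pools those scores into image-level predictions, so that only image-level labels are needed for the loss. The code below is a minimal sketch under loud assumptions: it uses a generic two-stream pooling in the style of classic WSOD heads, shares one similarity matrix between both streams for brevity, and every name in it (`mil_image_scores`, `image_level_bce`, `temperature`) is hypothetical. It is not the authors' implementation of WSOVOD.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def mil_image_scores(region_feats, text_embeds, temperature=0.1):
    """Pool region-to-concept similarities into image-level scores.

    region_feats: R proposal features, each a list of D floats.
    text_embeds:  C concept text embeddings, each a list of D floats.
    Returns (image_scores, region_scores): C image-level scores in (0, 1)
    and an R x C table of per-region scores.
    Assumption: both streams reuse the same cosine-similarity logits.
    """
    regions = [l2_normalize(r) for r in region_feats]
    texts = [l2_normalize(t) for t in text_embeds]
    num_r, num_c = len(regions), len(texts)

    # Cosine similarity between every region and every concept, R x C.
    logits = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]

    # Classification stream: which concept does each region look like?
    cls = [softmax(row) for row in logits]
    # Detection stream: which region best explains each concept?
    det = [softmax([logits[i][c] for i in range(num_r)]) for c in range(num_c)]

    region_scores = [
        [cls[i][c] * det[c][i] for c in range(num_c)] for i in range(num_r)
    ]
    # Summing over regions gives one score per concept for the whole image.
    image_scores = [
        sum(region_scores[i][c] for i in range(num_r)) for c in range(num_c)
    ]
    return image_scores, region_scores


def image_level_bce(image_scores, labels, eps=1e-7):
    """Binary cross-entropy between pooled scores and image-level labels."""
    total = 0.0
    for p, y in zip(image_scores, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)
```

Because the concept set enters only through `text_embeds`, swapping in embeddings for unseen category names changes what the head can score without changing its parameters, which is the intuition behind open-vocabulary use. In this sketch the highest entry in each column of `region_scores` marks the proposal most associated with that concept.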


