Proceedings Article, DOI: 10.1609/aaai.v38i4.28127
Weakly Supervised Open-Vocabulary Object Detection
Jianghang Lin,Yunhang Shen,Bingquan Wang,Shaohui Lin,Huanlai Xing,Liujuan Cao +5 more
4
TL;DR: WSOVOD extends traditional weakly supervised object detection to open-vocabulary and cross-dataset learning, achieving state-of-the-art performance.
Abstract: Despite weakly supervised object detection (WSOD) being a promising step toward evading strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is leveraged into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, a customized location-oriented weakly supervised region proposal network is proposed to utilize high-level semantic layouts from the category-agnostic segment anything model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art results compared with previous WSOD methods in both closed-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning to achieve on-par or even better performance than well-established fully-supervised open-vocabulary object detection (FSOVOD).
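The region-level vision-language alignment mentioned in the abstract can be illustrated with a generic multiple-instance head that scores proposals against concept text embeddings and is trained from image-level labels only. The sketch below is not the WSOVOD implementation: it assumes a two-softmax aggregation in the style of classic weakly supervised detectors, and the names `mil_image_scores`, `image_level_bce`, the `temperature` value, and the input arrays are all hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mil_image_scores(region_feats, text_embeds, temperature=0.07):
    """Aggregate region-concept similarities into image-level scores.

    region_feats: (R, D) proposal features (hypothetical input).
    text_embeds:  (C, D) concept text embeddings (hypothetical input).
    Returns region_scores of shape (R, C) and image_scores of shape (C,).
    """
    # L2-normalise so the dot product is a cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = r @ t.T / temperature            # (R, C)
    cls = softmax(logits, axis=1)             # which concept fits each region
    det = softmax(logits, axis=0)             # which region best shows each concept
    region_scores = cls * det                 # (R, C)
    image_scores = region_scores.sum(axis=0)  # (C,), each value lies in [0, 1]
    return region_scores, image_scores

def image_level_bce(image_scores, labels, eps=1e-6):
    # Binary cross-entropy against image-level labels; no boxes are needed.
    p = np.clip(image_scores, eps, 1 - eps)
    return float(-(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean())
```

Because the classifier weights are text embeddings rather than a fixed layer, a new concept can be scored at test time by appending its embedding as an extra row of `text_embeds`, which is the property that makes this kind of head open-vocabulary.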
Citations
GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane
Yansong Qu,Shaohui Dai,Xinyang Li,Jianghang Lin,Liujuan Cao,Shengchuan Zhang,Rongrong Ji +6 more
- 26 Oct 2024
2
Adaptive Selection based Referring Image Segmentation
Pengfei Yue,Jianghang Lin,Shengchuan Zhang,Jie Hu,Yilin Lu,Hai Ming Niu,Hui Ding,Yan Zhang,Guannan Jiang,Liujuan Cao,Rongrong Ji +10 more
- 26 Oct 2024
TL;DR: This paper introduces Adaptive Selection with Dual Alignment (ASDA), a novel framework for Referring Image Segmentation that adaptively aligns vision and language features, outperforming state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref benchmarks with improved robustness and adaptability.
CSDN: CLIP-Driven Similarity-Aligned Distillation Network for Weakly-Supervised Object Localization
Sifan Zuo,Youfa Liu,Du Bo +2 more
- 25 Oct 2025
TL;DR: This paper proposes CSDN, a CLIP-driven similarity-aligned distillation network for weakly-supervised object localization, which integrates CAM and FPM to generate a semantic-enhanced FPM, improving completeness and boundary accuracy of target localization under weak supervision.
Generating Detection Labels from Class-Level Explanations for Deep Learning-Based Eye Disease Diagnosis
Ali Abdulazeez Mohammed Baqer Qazzaz,Yousif Samer Mudhafar +1 more
Abstract: Lack of good pixel-level expert annotations has traditionally impaired the development of robust object detection models for medical diagnosis. This article proposes a weakly supervised approach that generates accurate bounding box labels with minimal user interaction through image-level classification. The weakly supervised nature of the proposed approach tackles the annotation bottleneck by converting cheaper and more available class-level labels into spatial annotations of high value. The proposed two-stage method first trains a classifier on diagnostic labels and then applies Class Activation Mapping (Grad-CAM) to generate high-quality pseudo-labels. These machine-generated annotations are then used to train a state-of-the-art YOLOv8s detector for the final diagnosis task. The system performed cataract detection from fundus images with a mean Average Precision (mAP@50) of 99% and a stricter mAP@50-95 of 96.9%. An important recall rate of 97.1% was achieved in the cataract class, making the possibility of a missed diagnosis almost negligible. These results hold competitive status when compared with fully supervised methods that require extensive manual annotation, reaffirming our method as data-efficient, highly scalable, and a robust collaborator in fast-tracking the development of medical AI tools.
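The pseudo-labelling step described in the abstract above, turning a class activation heatmap into a bounding box, can be sketched as follows. The abstract does not give the exact procedure, so this is only an assumed variant: threshold the heatmap relative to its peak and take the tightest box around the surviving pixels. The function name `heatmap_to_box` and the `rel_threshold` default are hypothetical.

```python
import numpy as np

def heatmap_to_box(heatmap, rel_threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) around strongly activated pixels.

    heatmap: 2D array of non-negative activations (e.g. a Grad-CAM map).
    Returns None when the map has no positive activation.
    """
    h = np.asarray(heatmap, dtype=float)
    peak = h.max()
    if peak <= 0:
        return None
    # Keep pixels whose activation is at least a fraction of the peak.
    mask = h >= rel_threshold * peak
    ys, xs = np.nonzero(mask)
    # Tightest axis-aligned box, with inclusive pixel coordinates.
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

A single box around all above-threshold pixels suits images with one dominant finding; images with several separate activations would need a connected-component step first.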
References
•Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan,Andrew Zisserman +1 more
- 04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
102.6K
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin,Michael Maire,Serge Belongie,James Hays,Pietro Perona,Deva Ramanan,Piotr Dollár,C. Lawrence Zitnick +7 more
- 06 Sep 2014
TL;DR: This work introduces a new dataset that aims to advance the state of the art in object recognition by placing it in the broader context of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.
You Only Look Once: Unified, Real-Time Object Detection
Joseph Redmon,Santosh K. Divvala,Ross Girshick,Ali Farhadi +3 more
- 27 Jun 2016
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky,Jia Deng,Hao Su,Jonathan Krause,Sanjeev Satheesh,Sean Ma,Zhiheng Huang,Andrej Karpathy,Aditya Khosla,Michael S. Bernstein,Alexander C. Berg,Li Fei-Fei +11 more
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
•Posted Content
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TL;DR: Faster R-CNN introduces a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.
25.3K