Attention Decoupling for Query-Based Object Detection

doi:10.1109/icassp48485.2024.10447669

Proceedings Article•10.1109/icassp48485.2024.10447669

- 14 Apr 2024

TL;DR: This work introduces attention decoupling (AD) for query-based detectors to explicitly align multi-task features, and proposes a task consistency loss (TCL) that integrates a novel task alignment metric into the classification loss to further improve task consistency across multiple decoding stages.

Abstract: Benefiting from attention mechanisms, query-based detectors have strong model capacity. They predict classification and regression outputs from queries and features that are shared in the decoder. Inter-task biases produce gradients in conflicting directions that interfere with each other and limit model optimization. In this work, we introduce attention decoupling (AD) for query-based detectors to explicitly align multi-task features. Specifically, AD consists of a Dense-to-Sparse Query Generator (DSQG) and a Split Cross-Attention (SCA), which enable query decoupling and feature decoupling, respectively, in the decoding phase. We then propose a task consistency loss (TCL) that integrates a novel task alignment metric into the classification loss to further improve task consistency across multiple decoding stages. AD thus effectively mitigates the task misalignment problem of query-based detectors and inspires subsequent multi-task paradigms. Moreover, extensive experiments on the COCO dataset demonstrate that the proposed AD can enhance a variety of representative detectors. Remarkably, AD-DINO achieves state-of-the-art performance.

