Efficient Inductive Vision Transformer for Oriented Object Detection in Remote Sensing Imagery

doi:10.1109/tgrs.2023.3292418

Journal Article•10.1109/tgrs.2023.3292418

Cong Zhang, +4 more

- 01 Jan 2023

- IEEE Transactions on Geoscience and Remo...

- pp 1-1

45

TL;DR: Zhang et al. propose an efficient inductive vision Transformer framework for oriented object detection in remote sensing imagery. The framework follows a hierarchical feature pyramid structure and makes three contributions: token sparsity through adaptive routing, a dual-path encoder that strengthens inductive bias, and angle tokenization for oriented objects.

Abstract: Object detection is a fundamental task in remote sensing image analysis and scene understanding. Previous remote sensing object detectors are typically based on convolutional neural networks (CNNs), whose performance is significantly limited by the intrinsic locality of convolution operations. The emergence of vision Transformers offers a potential solution to this problem, and Transformers have the capability to be a solid alternative to CNNs. However, three crucial obstacles hinder the application and performance of Transformers in remote sensing object detection: 1) high computational complexity, especially for high-resolution remote sensing images, 2) training and sample inefficiency caused by a lack of inductive bias, and 3) difficulty in learning arbitrary orientation knowledge of geospatial objects. To address these issues, this paper proposes a novel efficient inductive vision Transformer framework for oriented object detection in remote sensing imagery. The framework follows the hierarchical feature pyramid structure and makes three contributions. 1) Spatial redundancy in remote sensing images is fully explored, and an adaptive multi-grained routing mechanism is proposed to facilitate token sparsity in Transformers, which can dramatically reduce the computational cost without compromising accuracy. 2) A compact dual-path encoding architecture, in which global long-range dependencies and local semantic relations are jointly and complementarily captured, is proposed to enhance inductive bias in Transformers. 3) An angle tokenization technique is proposed to promote the encoding, embedding, and learning of direction knowledge for oriented objects in remote sensing scenarios. The three contributions are instantiated in an advanced Transformer-based object detector, namely EIA-PVT. Comprehensive experiments on two publicly available datasets demonstrate its effectiveness and superiority for oriented object detection in remote sensing images.
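To make the first obstacle concrete, the following is a minimal arithmetic sketch of why token count dominates the cost of global self-attention and why token sparsity helps. It is not the paper's routing mechanism: the helper names, the 16-pixel patch size, and the 50% keep ratio are all illustrative assumptions, and the actual EIA-PVT backbone may scale differently.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens (hypothetical helper)."""
    return (height // patch) * (width // patch)


def attention_pairs(n_tokens):
    """Token-to-token interactions in one global self-attention layer: N * N."""
    return n_tokens * n_tokens


def pairs_after_pruning(n_tokens, keep_ratio):
    """Interactions left if only a fraction of tokens is kept (assumed uniform pruning)."""
    kept = int(n_tokens * keep_ratio)
    return attention_pairs(kept)


# Assumed example: a 1024 x 1024 image split into 16 x 16 patches.
n = num_tokens(1024, 1024, 16)
full = attention_pairs(n)
sparse = pairs_after_pruning(n, 0.5)

print(n)              # 4096
print(full)           # 16777216
print(sparse)         # 4194304
print(sparse / full)  # 0.25
```

Under these assumptions, keeping half of the tokens leaves a quarter of the pairwise interactions, which is the general reason sparsifying tokens in spatially redundant imagery can pay off.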

Citations

Journal Article•10.3390/rs15204974

Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images

Jiarui Zhang, +4 more

- 15 Oct 2023

- Remote sensing

TL;DR: A novel lightweight object detection algorithm based on YOLOv5s is introduced to enhance detection performance while ensuring rapid processing and broad applicability, offering an efficient, lightweight solution for remote sensing applications.

31

Journal Article•10.1109/tgrs.2023.3298661

Transcending Pixels: Boosting Saliency Detection via Scene Understanding From Aerial Imagery

Yanfeng Liu, +3 more

- IEEE Transactions on Geoscience and Remo...

TL;DR: A novel scene-guided dual-branch network (SDNet) performs cross-task knowledge distillation from scene classification to facilitate accurate saliency detection. The framework is shown to be model-agnostic, and its extension to six baselines brings significant performance benefits.

29

Journal Article•10.1109/tim.2023.3315392

Boosting Object Detectors via Strong-Classification Weak-Localization Pretraining in Remote Sensing Imagery

Cong Zhang, +4 more

- IEEE Transactions on Instrumentation and...

TL;DR: The proposed RS SCWL pretraining paradigm is able to significantly improve downstream detection performance and outperforms classification pretraining methods, including ImageNet pretraining.

16

Journal Article•10.2139/ssrn.4609914

Transformers in Intelligent Architecture, Engineering, and Construction (AEC) Industry: Applications, Challenges, and Future Scope

Nitin Rane

- Social Science Research Network

TL;DR: This study explores the applications, challenges, and future prospects of Vision Transformers in the Architecture, Engineering, and Construction (AEC) industry, highlighting their potential to enhance design simulations, project management, and construction monitoring, while addressing concerns about data privacy and security.

13

Journal Article•10.1109/tii.2024.3431044

EF-DETR: A Lightweight Transformer-Based Object Detector With an Encoder-Free Neck

Mingliang Zhou, +4 more

- IEEE Transactions on Industrial Informat...

TL;DR: This article introduces EF-DETR, a lightweight transformer-based object detector with an encoder-free neck, enhancing DETR's accuracy and efficiency through a redesigned structure, multiscale feature extractor, and high-efficiency feature fusion module.

10

References

•Proceedings Article•10.1109/CVPR.2016.90

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 27 Jun 2016

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.

198.7K

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.

94.2K

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

•Journal Article•10.1109/TPAMI.2016.2577031

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, +3 more

- 01 Jun 2017

- IEEE Transactions on Pattern Analysis an...

TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

64.4K

Preprint•10.48550/arxiv.1706.03762

Attention Is All You Need

Ashish Vaswani, +7 more

- 01 Jan 2017

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
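The abstract above does not spell out the attention operation itself. As a minimal sketch, the scaled dot-product attention at the core of the Transformer, softmax(QK^T / sqrt(d_k)) V, can be written in plain Python. This is a single-head version with no masking, learned projections, or batching; the function names and list-of-lists layout are illustrative choices, not code from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors; returns one output row per query."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


# A zero query scores every key equally, so the output is the mean of the values.
out = scaled_dot_product_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
print(out)  # [[2.0, 3.0]]
```

Every query attends to every key, so the work grows with the product of the two sequence lengths.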

51.8K

Related Papers (5)

Appearance and motion based deep learning architecture for moving object detection in moving camera

Moving Object Detection based on Deep Atrous Spatial Features for Moving Camera

Convolutional neural network based object detection: a review

Detection and tracking an object in omni-directional images using particle filter

Intelligent Detection of Missing and Unattended Objects in Complex Scene of Surveillance Videos