Journal Article
DOI: 10.1109/tgrs.2023.3292418
Efficient Inductive Vision Transformer for Oriented Object Detection in Remote Sensing Imagery
45
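TL;DR: Zhang et al. propose an efficient inductive vision Transformer framework for oriented object detection in remote sensing imagery. The framework follows the hierarchical feature pyramid structure and makes three contributions: an adaptive multi-grained routing mechanism for token sparsity, a compact dual-path encoding architecture that enhances inductive bias, and an angle tokenization technique for learning object orientation.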
Abstract: Object detection is a fundamental task in remote sensing image analysis and scene understanding. Previous remote sensing object detectors are typically based on convolutional neural networks (CNNs), whose performance is significantly limited by the intrinsic locality of convolution operations. The emergence of vision Transformers, which can serve as a solid alternative to CNNs, offers a potential solution to this problem. However, three crucial obstacles hinder the application and performance of Transformers in remote sensing object detection: 1) high computational complexity, especially for high-resolution remote sensing images; 2) training and sample inefficiency caused by a lack of inductive bias; and 3) difficulty in learning arbitrary-orientation knowledge of geospatial objects. To address these issues, this paper proposes a novel efficient inductive vision Transformer framework for oriented object detection in remote sensing imagery. The framework follows the hierarchical feature pyramid structure and makes three contributions. 1) Spatial redundancy in remote sensing images is fully explored, and an adaptive multi-grained routing mechanism is proposed to facilitate token sparsity in Transformers, which can dramatically reduce the computational cost without compromising accuracy. 2) A compact dual-path encoding architecture, in which global long-range dependencies and local semantic relations are captured jointly and complementarily, is proposed to enhance inductive bias in Transformers. 3) An angle tokenization technique is proposed to promote the encoding, embedding, and learning of direction knowledge for oriented objects in remote sensing scenarios. These three contributions are instantiated in an advanced Transformer-based object detector, EIA-PVT. Comprehensive experiments on two publicly available datasets demonstrate its effectiveness and superiority for oriented object detection in remote sensing images.
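To make the first contribution concrete, the sketch below illustrates why token sparsity lowers the cost of self-attention: full self-attention scores every query-key pair, so the pair count grows quadratically with the number of tokens. The abstract does not describe how the adaptive multi-grained routing mechanism works, so plain top-k selection is used here purely as a stand-in. The function names, the image and patch sizes, and the saliency scores are all illustrative assumptions, not values from the paper.

```python
import math


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def keep_top_tokens(scores: list[float], keep_ratio: float) -> list[int]:
    """Return the indices (in original order) of the highest-scoring tokens.

    Plain top-k selection: an assumed stand-in, NOT the paper's routing mechanism.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, math.ceil(len(scores) * keep_ratio))
    # sorted() is stable, so ties keep their original relative order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


# Hypothetical setup: a 1024x1024 image split into 16x16 patches.
tokens_per_side = 1024 // 16
num_tokens = tokens_per_side ** 2
full_cost = attention_pairs(num_tokens)

# Made-up saliency scores: pretend only every fourth patch is informative.
scores = [1.0 if i % 4 == 0 else 0.0 for i in range(num_tokens)]
kept = keep_top_tokens(scores, keep_ratio=0.25)
sparse_cost = attention_pairs(len(kept))

print(num_tokens, len(kept), full_cost // sparse_cost)
```

In this toy setting, keeping a quarter of the tokens cuts the number of scored pairs by a factor of 16, which is the quadratic effect that makes token sparsity attractive for high-resolution imagery. Whether accuracy is preserved depends on how the tokens are chosen, which is the part the paper's routing mechanism addresses and this sketch does not.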
Citations
Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images
Jiarui Zhang, Zhihua Chen, Guo-Liang Yan, Yi Wang, Bo Hu
TL;DR: A novel lightweight object detection algorithm based on YOLOv5s is introduced to enhance detection performance while ensuring rapid processing and broad applicability, offering an efficient, lightweight solution for remote sensing applications.
31
Transcending Pixels: Boosting Saliency Detection via Scene Understanding From Aerial Imagery
TL;DR: This work proposes a novel scene-guided dual-branch network (SDNet) that performs cross-task knowledge distillation from scene classification to facilitate accurate saliency detection. The framework is shown to be model-agnostic, and extending it to six baselines brings significant performance benefits.
29
Boosting Object Detectors via Strong-Classification Weak-Localization Pretraining in Remote Sensing Imagery
Cong Zhang, Tianshang Liu, Jun Xiao, Kin-Man Lam, Qi Wang
TL;DR: The proposed RS SCWL pretraining paradigm is able to significantly improve downstream detection performance and outperforms classification pretraining methods, including ImageNet pretraining.
16
Transformers in Intelligent Architecture, Engineering, and Construction (AEC) Industry: Applications, Challenges, and Future Scope
Nitin Rane
TL;DR: This study explores the applications, challenges, and future prospects of Vision Transformers in the Architecture, Engineering, and Construction (AEC) industry, highlighting their potential to enhance design simulations, project management, and construction monitoring, while addressing concerns about data privacy and security.
13
EF-DETR: A Lightweight Transformer-Based Object Detector With an Encoder-Free Neck
Mingliang Zhou, Xue Wei, Huayan Pu, Jun Luo, Weijia Jia
TL;DR: This article introduces EF-DETR, a lightweight transformer-based object detector with an encoder-free neck, enhancing DETR's accuracy and efficiency through a redesigned structure, multiscale feature extractor, and high-efficiency feature fusion module.
10
References
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- 27 Jun 2016
TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place on the ILSVRC 2015 classification task.
•Proceedings Article
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- 12 Jun 2017
TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
81.7K
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- 01 Jan 2017
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
51.8K