TL;DR: A Laplacian pyramid pansharpening network architecture for accurately fusing a high spatial resolution panchromatic image and a low spatial resolution multispectral image, which outperforms state-of-the-art pansharpening methods.
TL;DR: Wang et al. as mentioned in this paper proposed a class feature attention mechanism fused with an improved Deeplabv3+ network called CFAMNet for semantic segmentation of common features in remote sensing images.
TL;DR: In this article, the authors used a Gaussian pyramid to improve the basic ORB algorithm, and showed through theoretical analysis and experimental verification that the improved algorithm is more suitable for mosaicking endoscopic images from minimally invasive surgery.
TL;DR: A method to quantitatively calculate the length of the exposed bolt for detecting loosening using vision-based deep learning and geometric imaging theory is proposed, outperforming other measurement methods and the state-of-the-art networks of human pose estimation.
TL;DR: It was shown that an intelligent health assistant for knee injuries could be developed using the proposed exemplar pyramid LBP method, which was demonstrated to be general and highly successful.
TL;DR: Zhang et al. as discussed by the authors proposed a novel attentive and context-aware network, named ACSalNet, for saliency prediction on omnidirectional images, and further designed a Context-aware Feature Pyramid Module (CFPM) to reduce the semantic gap between features of different levels.
TL;DR: Zhang et al. as mentioned in this paper proposed a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms for image captioning, which first extracts multi-level features by using Feature Pyramid Network (FPN), then utilizes the scaled-dot-product to fuse these features, which enables the model to detect objects of different scales in the image more effectively without increasing parameters.
Abstract: Image captioning aims to generate a corresponding description of an image. In recent years, neural encoder-decoder models have been the dominant approaches, in which a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) are used to translate an image into a natural language description. Among these approaches, visual attention mechanisms are widely used to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. However, most conventional visual attention mechanisms are based on high-level image features, ignoring the effects of other image features and giving insufficient consideration to the relative positions between image features. In this work, we propose a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms to address these problems. The image-feature attention first extracts multi-level features using a Feature Pyramid Network (FPN), then uses the scaled dot-product to fuse these features, which enables our model to detect objects of different scales in the image more effectively without increasing parameters. In the position-aware attention mechanism, the relative positions between image features are obtained first and then incorporated into the original image features to generate captions more accurately. Experiments on the MSCOCO dataset show that our approach achieves competitive BLEU-4, METEOR, ROUGE-L, and CIDEr scores compared with some state-of-the-art approaches, demonstrating its effectiveness.
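The fusion step described above can be illustrated with a minimal NumPy sketch of scaled dot-product attention. This is not the authors' implementation: the feature shapes, the choice of which pyramid level acts as query, and the names `high_level` and `low_level` are assumptions made only for illustration. The sketch does show why such a fusion adds no learned parameters: it uses only a matrix product, a scaling, and a softmax.

```python
import numpy as np

def scaled_dot_product_attention(query, key, value):
    """Return softmax(Q K^T / sqrt(d)) V and the attention weights.

    query: (n_q, d), key: (n_k, d), value: (n_k, d_v).
    """
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights

rng = np.random.default_rng(0)
# Hypothetical flattened feature maps from two pyramid levels (assumed shapes).
high_level = rng.normal(size=(4, 8))    # 4 coarse regions, 8 channels
low_level = rng.normal(size=(16, 8))    # 16 finer regions, 8 channels

# Each coarse region gathers information from all finer regions.
fused, attn = scaled_dot_product_attention(high_level, low_level, low_level)
print(fused.shape, attn.shape)
```

Each row of `attn` is a probability distribution over the finer-level regions, so `fused` is a weighted average of finer-level features for every coarse region.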
TL;DR: Wang et al. as mentioned in this paper proposed a novel feature pyramid fashion to produce semantic features at all levels of the network for specially addressing the problem of face detection, where a Semantic Convolutional Box (SCBox) is presented by merging the features from different layers in a bottom-up fashion.
Abstract: Convolutional neural networks have played a key role in many computer vision applications. Traditionally, learning convolutional features is performed in a hierarchical manner along the dimension of network depth to create multi-scale feature maps. As a result, strong semantic features are derived at the top-level layers only. This paper proposes a novel feature pyramid fashion to produce semantic features at all levels of the network, specifically to address the problem of face detection. In particular, a Semantic Convolutional Box (SCBox) is presented by merging the features from different layers in a bottom-up fashion. The proposed lightweight detector is a stack of alternating SCBox and Inception residual modules that learns visual features along both the depth and width dimensions of the network. In addition, newly introduced objective functions (e.g., focal and CIoU losses) are incorporated to effectively address the problem of unbalanced data, resulting in stable training. The proposed model has been validated on the standard benchmarks FDDB and WIDER FACES, in comparison with state-of-the-art methods. Experiments showed promising results in terms of both processing time and detection accuracy. For instance, the proposed network achieves an average precision of 96.8% on FDDB and 82.4% on WIDER FACES, and reaches an inference speed of 106 FPS on a moderate GPU configuration or 20 FPS on a CPU machine.
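The abstract names focal loss as one of the objectives used against unbalanced data. A minimal NumPy sketch of the standard binary focal loss follows. The values `gamma=2.0` and `alpha=0.25` are commonly used defaults and are assumptions here, not settings reported in this abstract, and the sample probabilities are made up for illustration.

```python
import numpy as np

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-sample focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    probs:   predicted probability of the positive class, shape (n,)
    targets: ground-truth labels in {0, 1}, shape (n,)
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # p_t is the probability assigned to the true class of each sample.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# Hypothetical predictions: an easy positive, a hard positive, a hard negative.
probs = np.array([0.9, 0.1, 0.6])
targets = np.array([1, 1, 0])
losses = binary_focal_loss(probs, targets)
print(losses)
```

The factor `(1 - p_t)**gamma` shrinks the loss of well-classified samples, so the many easy background examples in a detection dataset contribute little and training concentrates on the hard ones.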
TL;DR: Zhang et al. as discussed by the authors proposed a salient object detection approach using global context and multi-scale feature representation to estimate saliency maps in a pixel-wise manner, which could help the network effectively locate salient objects and suppress background noises.
Abstract: Currently, fully convolutional network based salient object detection approaches have some challenging problems. This paper proposes a novel salient object detection approach using global context and multi-scale feature representation to estimate saliency maps in a pixel-wise manner. Firstly, we explore and design a multi-scale feature enhancement module to improve the capability of feature representation and learning of multi-level side-output features. Moreover, we use global features to guide side-output multi-scale features to focus on the useful information, which could help the network effectively locate salient objects and suppress background noise. Finally, the feature pyramid network structure is utilized to refine the estimated results in a coarse-to-fine manner and obtain the final predicted results. Comparisons of our approach with 15 state-of-the-art methods demonstrate the effectiveness and robustness of the proposed approach in various scenarios.
TL;DR: Wang et al. as discussed by the authors proposed a complementary model of GANs for missing traffic video frames, which uses the Feature Pyramid Network (FPN) to obtain feature maps of multiple scales on the input video frame.
Abstract: Aiming at the problem of missing traffic video frames, this paper proposes a complementary model of generative adversarial networks. The model uses the Feature Pyramid Network (FPN) to obtain feature maps of multiple scales on the input video frame. By fusing feature maps of different scales, it can better integrate the semantic information on the frame. The local patch discriminator added to the discriminator model effectively ensures the accuracy and continuity of the completed frame. Experimental results on Caltech pedestrian dataset and KITTI dataset show the good performance of the proposed model.
TL;DR: Zhang et al. as mentioned in this paper proposed an efficient and accurate method for aerial image object detection in which oriented bounding boxes of objects are predicted by utilizing a simple network in the first stage, and the rotated bounding box predictions are then sent to non-maximum suppression (NMS) to produce final detection results.
Abstract: Object detection in aerial images is a hot and challenging task in computer vision, due to the bird's-eye-view perspective, complex backgrounds, varying scales and appearances of objects, and extremely dense object distributions. It has previously been observed that existing methods cannot meet the application requirements of accuracy and speed at the same time. In this paper, we propose an efficient and accurate method for aerial image object detection. The pipeline of our method has only two stages. The oriented bounding boxes of objects are predicted by a simple network in the first stage, and the rotated bounding box predictions are then sent to non-maximum suppression (NMS) to produce the final detection results. In addition, an atrous spatial pyramid pooling (ASPP) network is added to the pipeline to extract multi-scale features, and a bi-directional long short-term memory network (BiLSTM) is adopted to improve detection performance on long and slender instances. Experiments on the challenging DOTA dataset show that the proposed method outperforms existing methods in terms of detection rate and speed.
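The second stage described above is non-maximum suppression. The sketch below shows greedy NMS on axis-aligned boxes, which is a deliberate simplification: the paper applies NMS to rotated boxes, whose overlap requires polygon intersection. The box coordinates, scores, and the `iou_threshold=0.5` default are illustrative assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS for axis-aligned boxes given as rows of (x1, y1, x2, y2).

    Returns the indices of the kept boxes in descending score order.
    """
    order = np.argsort(scores)[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection rectangle between the best box and all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[iou <= iou_threshold]
    return keep

# Hypothetical detections: two heavily overlapping boxes and one separate box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The first two boxes have an IoU of 81/119 (about 0.68), so the lower-scoring one is suppressed, while the third box does not overlap and is kept.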
TL;DR: Zhang et al. as mentioned in this paper started from the depthwise separable convolution-joint feature pyramid (DSC-JFP) model and removed the ASPP model and auxiliary network to improve the real-time performance of semantic segmentation.
Abstract: Image semantic segmentation is an important research direction in image processing, computer vision and deep learning. Semantic segmentation classifies an image pixel by pixel, so that the original image is divided into semantically segmented regions with specific pixel labels, and it is among the most challenging tasks in image processing. Starting from the DSC-JFP (depthwise separable convolution-joint feature pyramid) model, the ASPP model and auxiliary network are removed to improve the real-time performance of semantic segmentation. By combining batch normalization and instance normalization, parallel batch and instance normalization (PBIN) and cascaded batch and instance normalization (CBIN) methods are proposed to improve segmentation quality. The experimental results show that the proposed method improves the real-time performance of semantic segmentation while maintaining segmentation quality.
TL;DR: In this article, the authors proposed an automatic framework for detecting COVID-19 at an early stage using chest X-ray images and achieved 99.6% accuracy in detecting the virus at its early stage.
TL;DR: In this article, a multi-scale conditional GAN is proposed for high-resolution, large-scale histopathology image generation and segmentation, which consists of a pyramid of GAN structures, each responsible for generating and segmenting images at a different scale.
TL;DR: In this paper, a pyramid phase correlation algorithm (pyramid PCA) is proposed for image registration that combines the phase correlation algorithm (PCA) and the normalized cross correlation-pyramid (NCCP) algorithm, which are frequency-domain and spatial-domain methods, respectively.
Abstract: Image registration is an important process for applications in various fields, such as remote sensing and medical imaging; thus, its accuracy significantly affects the efficacy as well as efficiency of those applications. Phase correlation algorithm (PCA) and normalized cross correlation-pyramid (NCCP) algorithm are the state-of-the-art frequency domain and spatial domain methods for image registration, respectively. However, these algorithms have some limitations. In particular, the registration speed of PCA needs to be improved, while the NCCP algorithm leads to errors if the image to be registered is partially occluded. Thus, to overcome these limitations, we propose a pyramid PCA that combines both algorithms. To verify the performance of our proposed algorithm, its results are compared with those obtained using the traditional PCA and NCCP algorithm. Our simulation results for partially occluded images indicate that the proposed algorithm outperforms the NCCP algorithm in terms of accuracy; in addition, it outperforms PCA in terms of speed. Furthermore, to test the feasibility of the proposed algorithm for real-time applications, a panoramic target detection system was set up, and the results obtained using the system proved that our method for image registration was both feasible and effective.
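For orientation, the plain phase correlation algorithm that the proposed method builds on can be sketched in NumPy as follows. This is the textbook single-scale version for integer circular shifts, not the pyramid PCA proposed in the paper. The function name `phase_correlation_shift`, the 64x64 test image, and the applied shift are illustrative assumptions.

```python
import numpy as np

def phase_correlation_shift(reference, moved):
    """Estimate the integer (dy, dx) with moved ~ np.roll(reference, (dy, dx), axis=(0, 1))."""
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moved)
    # The normalized cross-power spectrum keeps only the phase difference.
    cross_power = f_mov * np.conj(f_ref)
    cross_power /= np.abs(cross_power) + 1e-12
    # Its inverse transform is (ideally) a single peak at the shift.
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Peaks past the midpoint of an axis correspond to negative shifts.
    shift = [p if p <= s // 2 else p - s for p, s in zip(peak, correlation.shape)]
    return tuple(int(v) for v in shift)

rng = np.random.default_rng(0)
image = rng.random((64, 64))                     # synthetic test image
moved = np.roll(image, (5, -3), axis=(0, 1))     # circular shift by (5, -3)
estimated = phase_correlation_shift(image, moved)
print(estimated)  # (5, -3)
```

Because the whole estimate comes from full-size FFTs, the cost grows with image size, which is consistent with the abstract's remark that the speed of PCA needs improvement and motivates running it on a coarser pyramid level first.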