Global-Attention-Based Neural Networks for Vision Language Intelligence

doi:10.1109/JAS.2020.1003402

Journal Article•10.1109/JAS.2020.1003402


Pei Liu, +3 more

- 01 Jul 2021

- IEEE/CAA Journal of Automatica Sinica

- Vol. 8, Iss: 7, pp 1243-1252

22

TL;DR: This paper develops a novel global-attention-based neural network (GANN) for image captioning, in which the encoder encodes the region proposal features and extracts a global caption feature using a specially designed module that predicts the caption objects.

Abstract: In this paper, we develop a novel global-attention-based neural network (GANN) for vision language intelligence, specifically image captioning (language description of a given image). As in many previous works, our proposed model adopts the encoder-decoder framework. The encoder is responsible for encoding the region proposal features and extracting a global caption feature based on a specially designed module that predicts the caption objects. The decoder generates captions by taking the obtained global caption feature, along with the encoded visual features, as inputs for each attention head of the decoder layer. The global caption feature is introduced to explore the latent contributions of region proposals to image captioning, and to help the decoder focus on the most relevant proposals so that it extracts more accurate visual features at each time step of caption generation. Our GANN is implemented by incorporating the global caption feature into the attention weight calculation phase of the word prediction process in each head of the decoder layer. In our experiments, we qualitatively analyzed the proposed model, and quantitatively evaluated several state-of-the-art schemes with GANN on the MS-COCO dataset. Experimental results demonstrate the effectiveness of the proposed global attention mechanism for image captioning.
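The abstract says the global caption feature enters the attention weight calculation in each decoder head, but it does not give the formula. The sketch below shows one plausible way this could work, and it is not the authors' implementation. It assumes the global feature is simply added to the query before a scaled dot product. The function name `global_attention_head` and all inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def global_attention_head(query, keys, values, global_feat):
    """One attention head whose weights are biased by a global feature.

    ASSUMPTION: the global caption feature is added to the query before
    the scaled dot product. The paper's actual formulation may differ.
    """
    d = len(query)
    # Combine the per-step query with the global caption feature.
    biased_query = [q + g for q, g in zip(query, global_feat)]
    # Scaled dot-product score for each region proposal.
    scores = [dot(biased_query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Attention-weighted sum of the value vectors.
    dim_v = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(dim_v)]
    return weights, context


# Two toy region proposals; the global feature points toward the second.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, context = global_attention_head(query, keys, values, [0.0, 2.0])
```

In this toy setup the query alone favors the first proposal, while the added global feature shifts most of the attention weight to the second. That mirrors the abstract's stated goal of helping the decoder focus on the most relevant proposals.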


Citations

Journal Article•10.1016/j.knosys.2022.109512

ConvUNeXt: An efficient convolution neural network for medical image segmentation

Zhimeng Han, +2 more

- 01 Jul 2022

- Knowledge Based Systems

TL;DR: This paper improves the convolution block of UNet by using large convolution kernels and depth-wise separable convolution to considerably decrease the number of parameters. Residual connections are added in both the encoder and the decoder, and pooling is replaced by convolution for down-sampling. In the skip connections, a lightweight attention mechanism filters out noise in low-level semantic information and suppresses irrelevant features, so that the network can pay more attention to the target area.


186

Journal Article•10.1109/jas.2022.105743

Complex-Valued Neural Networks: A Comprehensive Survey

01 Aug 2022

- IEEE/CAA Journal of Automatica Sinica

TL;DR: Complex-valued neural networks (CVNNs) have shown excellent efficiency compared to their real-valued counterparts in speech enhancement and in image and signal processing, which motivates a comprehensive survey of the advancement of CVNNs.


96

Journal Article•10.1109/tnnls.2022.3208967

Learning Transactional Behavioral Representations for Credit Card Fraud Detection

01 Jan 2022

- IEEE transactions on neural networks and...

TL;DR: The authors propose a novel model that improves long short-term memory with a time-aware gate, which can capture the behavioral changes caused by users' consecutive transactions. The model achieves better fraud detection performance than state-of-the-art methods.


43

Journal Article•10.1109/jas.2022.106103

DeCASA in AgriVerse: Parallel Agriculture for Smart Villages in Metaverses

Xiujuan Wang, +4 more

- 01 Dec 2022

- IEEE/CAA Journal of Automatica Sinica

TL;DR: In this article, the authors develop metaverses for agriculture, referred to as AgriVerse, under the Decentralized Complex Adaptive Systems in Agriculture (DeCASA) project.


32

Journal Article•10.1109/jas.2022.105734

Visuals to Text: A Comprehensive Review on Automatic Image Captioning

01 Aug 2022

- IEEE/CAA Journal of Automatica Sinica

TL;DR: Image captioning refers to the automatic generation of descriptive text from the visual content of images. It is a technique that integrates multiple disciplines, including computer vision (CV), natural language processing (NLP), and artificial intelligence.


31


References

•Proceedings Article

Multimodal Neural Language Models

Ryan Kiros, +5 more

- 21 Jun 2014

TL;DR: This work introduces two multimodal neural language models: models of natural language that can be conditioned on other modalities. For image-text modelling, the approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees.


•Posted Content

Semantic Compositional Networks for Visual Captioning

Zhe Gan, +7 more

- 23 Nov 2016

- arXiv: Computer Vision and Pattern Recog...

TL;DR: Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches, across multiple evaluation metrics.


•Posted Content

Adaptively Aligned Image Captioning via Adaptive Attention Time

Lun Huang, +3 more

- 19 Sep 2019

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes a novel attention model, namely Adaptive Attention Time (AAT), to align the source and the target adaptively for image captioning, and empirically shows that AAT improves over state-of-the-art methods on the task of image captioning.
