Journal Article. DOI: 10.1109/JAS.2020.1003402
Global-Attention-Based Neural Networks for Vision Language Intelligence
22
TL;DR: This paper develops a novel global-attention-based neural network (GANN) for image captioning, in which the encoder encodes the region proposal features and extracts a global caption feature using a specially designed module that predicts the caption objects.
Abstract: In this paper, we develop a novel global-attention-based neural network (GANN) for vision language intelligence, specifically image captioning (language description of a given image). As in many previous works, our proposed model adopts the encoder-decoder framework. The encoder is responsible for encoding the region proposal features and extracting a global caption feature based on a specially designed module for predicting the caption objects. The decoder generates captions by taking the obtained global caption feature, along with the encoded visual features, as inputs to each attention head of the decoder layer. The global caption feature is introduced to explore the latent contributions of region proposals to image captioning, and to help the decoder focus on the most relevant proposals so as to extract more accurate visual features at each time step of caption generation. Our GANN is implemented by incorporating the global caption feature into the attention weight calculation phase of the word prediction process in each head of the decoder layer. In our experiments, we qualitatively analyzed the proposed model and quantitatively evaluated several state-of-the-art schemes with GANN on the MS-COCO dataset. Experimental results demonstrate the effectiveness of the proposed global attention mechanism for image captioning.
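The abstract says the global caption feature enters the attention weight calculation in each decoder head, but it does not give the formula. The following is a minimal single-head sketch under one loudly stated assumption: the global caption feature is simply added to the query before scaled dot-product scores over the region features are computed. The function name `global_attention_head` and this additive form are hypothetical illustrations, not the paper's actual method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def global_attention_head(query, regions, global_feat):
    """One attention head over region proposal features.

    ASSUMPTION (not from the paper): the global caption feature is
    added to the query, so regions aligned with it receive larger
    attention weights. The paper's real formulation may differ.
    """
    d = len(query)
    biased_query = [q + g for q, g in zip(query, global_feat)]
    # Scaled dot-product score for each region proposal.
    scores = [dot(biased_query, r) / math.sqrt(d) for r in regions]
    weights = softmax(scores)
    # Attended visual feature: weighted sum of region features.
    context = [sum(w * r[j] for w, r in zip(weights, regions)) for j in range(d)]
    return weights, context


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]  # two toy region features
    query = [0.0, 0.0]                  # uninformative decoder query
    weights, context = global_attention_head(query, regions, [0.0, 2.0])
    print(weights, context)
```

In this toy setup, a global feature pointing toward the second region shifts attention weight onto it even though the query alone carries no preference, which is the kind of guidance the abstract attributes to the global caption feature.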
Citations
ConvUNeXt: An efficient convolution neural network for medical image segmentation
TL;DR: This paper improves the convolution block of UNet by using large convolution kernels and depth-wise separable convolution, which considerably decreases the number of parameters. Residual connections are added in both the encoder and the decoder, and pooling is replaced by convolution for down-sampling. In the skip connections, a lightweight attention mechanism filters out noise in low-level semantic information and suppresses irrelevant features, so that the network pays more attention to the target area.
186
Complex-Valued Neural Networks: A Comprehensive Survey
TL;DR: Complex-valued neural networks (CVNNs) have shown excellent efficiency compared to their real-valued counterparts in speech enhancement and in image and signal processing, which motivates a comprehensive survey of advances in CVNNs.
96
Learning Transactional Behavioral Representations for Credit Card Fraud Detection
TL;DR: This paper proposes a novel model that improves long short-term memory with a time-aware gate to capture the behavioral changes caused by users' consecutive transactions, achieving better fraud detection performance than state-of-the-art methods.
43
DeCASA in AgriVerse: Parallel Agriculture for Smart Villages in Metaverses
TL;DR: In this article, the authors develop Metaverses for agriculture, referred to as AgriVerse, under the Decentralized Complex Adaptive Systems in Agriculture (DeCASA) project.
Visuals to Text: A Comprehensive Review on Automatic Image Captioning
TL;DR: Image captioning refers to the automatic generation of descriptive text from the visual content of images. It integrates multiple disciplines, including computer vision (CV), natural language processing (NLP), and artificial intelligence.
31