Open Access • Posted Content
Semantic Compositional Networks for Visual Captioning
TL;DR: Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches, across multiple evaluation metrics.
Abstract: A Semantic Compositional Network (SCN) is developed for image captioning, in which semantic concepts (i.e., tags) are detected from the image, and the probability of each tag is used to compose the parameters in a long short-term memory (LSTM) network. The SCN extends each weight matrix of the LSTM to an ensemble of tag-dependent weight matrices. The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag. In addition to captioning images, we also extend the SCN to generate captions for video clips. We qualitatively analyze semantic composition in SCNs, and quantitatively evaluate the algorithm on three benchmark datasets: COCO, Flickr30k, and Youtube2Text. Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches, across multiple evaluation metrics.
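The composition step described in the abstract can be sketched as a probability-weighted sum of tag-dependent weight matrices. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`compose_weights`, `matvec`, `input_gate`), the plain weighted-sum form, and the toy matrices are all hypothetical, and any parameter-sharing or factorization used in the actual model is omitted.

```python
import math


def compose_weights(tag_probs, tag_matrices):
    """Return sum_k tag_probs[k] * tag_matrices[k] as a list-of-lists matrix.

    Assumption: one weight matrix per tag, all with the same shape.
    """
    rows = len(tag_matrices[0])
    cols = len(tag_matrices[0][0])
    composed = [[0.0] * cols for _ in range(rows)]
    for prob, matrix in zip(tag_probs, tag_matrices):
        for i in range(rows):
            for j in range(cols):
                composed[i][j] += prob * matrix[i][j]
    return composed


def matvec(matrix, vector):
    """Multiply a list-of-lists matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def input_gate(tag_probs, tag_matrices, x):
    """Hypothetical LSTM-style gate: sigmoid(W(s) x), with W(s) composed from tags."""
    composed = compose_weights(tag_probs, tag_matrices)
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(composed, x)]


# Toy example: two tags, each with its own 2x2 weight matrix.
tag_matrices = [
    [[2.0, 0.0], [0.0, 2.0]],  # matrix for tag 0
    [[0.0, 1.0], [1.0, 0.0]],  # matrix for tag 1
]
tag_probs = [0.5, 1.0]  # image-dependent tag probabilities (made up)
composed = compose_weights(tag_probs, tag_matrices)
```

With these toy values, a tag with probability near zero contributes almost nothing to the composed matrix, which is one way to read the abstract's statement that each ensemble member is used in proportion to the probability of its tag.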
Figures

Figure 1: Model architecture and illustration of semantic composition. Each triangle symbol represents an ensemble of tag-dependent weight matrices. The number next to a semantic concept (i.e., a tag) is the probability that the corresponding semantic concept is present in the input image. 
Table 3: Results on BLEU-4 (B-4), METEOR (M) and CIDEr-D (C) metrics compared to other state-of-the-art results and baselines on Youtube2Text.
![Table 2: Comparison to published state-of-the-art image captioning models on the blind test set as reported by the COCO test server. SCN-LSTM is our model. ATT refers to ATT VC [47], OV refers to OriolVinyals [41], and MSR Cap refers to MSR Captivator [8].](/figures/table2-1-uow52e71ocfb.png)
Table 2: Comparison to published state-of-the-art image captioning models on the blind test set as reported by the COCO test server. SCN-LSTM is our model. ATT refers to ATT VC [47], OV refers to OriolVinyals [41], and MSR Cap refers to MSR Captivator [8]. 
Figure 6: Detected tags and sentence generation results on COCO. The output captions are generated by: 1) LSTM-R, 2) LSTM-RT2, and 3) our SCN-LSTM. 
Figure 4: Detected tags and sentence generation results on COCO. The output captions are generated by: 1) LSTM-R, 2) LSTM-RT2, and 3) our SCN-LSTM. 
Figure 3: Illustration of semantic composition. Our model can adjust the caption smoothly as the semantic concepts are modified.
Citations
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
Tao Xu,Pengchuan Zhang,Qiuyuan Huang,Han Zhang,Zhe Gan,Xiaolei Huang,Xiaodong He +6 more
- 18 Jun 2018
TL;DR: This paper proposes AttnGAN, an attentional generative network that synthesizes fine-grained details in different sub-regions of the image by paying attention to the relevant words in the natural language description.
Attention on Attention for Image Captioning
Lun Huang,Wenmin Wang,Jie Chen,Xiao-Yong Wei +3 more
- 01 Oct 2019
TL;DR: This paper proposes an Attention on Attention (AoA) module, which extends conventional attention mechanisms to determine the relevance between attention results and queries; the resulting model, AoANet, achieves state-of-the-art performance.
A Comprehensive Survey of Deep Learning for Image Captioning
TL;DR: This article presents a comprehensive review of deep learning-based image captioning techniques, discussing their foundations and analyzing their performance, strengths, and limitations.
Video Captioning With Attention-Based LSTM and Semantic Consistency
TL;DR: This paper proposes aLSTMs, a novel end-to-end attention-based LSTM framework with semantic consistency that translates videos into natural sentences, with competitive or even better results than the state-of-the-art video captioning baselines on both BLEU and METEOR.
TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-Rays
Xiaosong Wang,Yifan Peng,Le Lu,Zhiyong Lu,Ronald M. Summers +4 more
- 18 Jun 2018
TL;DR: A novel Text-Image Embedding network (TieNet) is proposed for extracting distinctive image and text representations of chest X-rays, and multi-level attention models are integrated into an end-to-end trainable CNN-RNN architecture to highlight the meaningful text words and image regions.
References
•Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma,Jimmy Ba +1 more
- 01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
•Posted Content
Deep Residual Learning for Image Recognition
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Long short-term memory
TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
•Posted Content
Adam: A Method for Stochastic Optimization
Diederik P. Kingma,Jimmy Ba +1 more
TL;DR: This article introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin,Michael Maire,Serge Belongie,James Hays,Pietro Perona,Deva Ramanan,Piotr Dollár,C. Lawrence Zitnick +7 more
- 06 Sep 2014
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.
Related Papers (5)
Oriol Vinyals,Alexander Toshev,Samy Bengio,Dumitru Erhan +3 more
- 07 Jun 2015
Ramakrishna Vedantam,C. Lawrence Zitnick,Devi Parikh +2 more
- 07 Jun 2015
David L. Chen,William B. Dolan +1 more
- 19 Jun 2011