Topic-Oriented Image Captioning Based on Order-Embedding

doi:10.1109/TIP.2018.2889922

Journal Article

Niange Yu, +4 more

- 01 Jun 2019

- IEEE Transactions on Image Processing

- Vol. 28, Iss: 6, pp 2743-2754

98

TL;DR: Experiments on the image captioning task on the MS-COCO and Flickr30K datasets validate the usefulness of this framework by showing that different given topics can lead to different captions describing specific aspects of the given image and that the quality of the generated captions is higher than that of the control model without a topic as input.

Abstract: We present an image captioning framework that generates captions under a given topic. The topic candidates are extracted from the caption corpus. A given image’s topics are then selected from these candidates by a CNN-based multi-label classifier. The input to the caption generation model is an image-topic pair, and the output is a caption of the image. For this purpose, a cross-modal embedding method is learned for the images, topics, and captions. In the proposed framework, the topic, caption, and image are organized in a hierarchical structure, which is preserved in the embedding space by using the order-embedding method. The caption embedding is upper bounded by the corresponding image embedding and lower bounded by the topic embedding. The lower bound pushes the images and captions about the same topic closer together in the embedding space. A bidirectional caption-image retrieval task is conducted on the learned embedding space and achieves state-of-the-art performance on the MS-COCO and Flickr30K datasets, demonstrating the effectiveness of the embedding method. To generate a caption for an image, an embedding vector is sampled from the region bounded by the embeddings of the image and the topic, and a language model then decodes it into a sentence as the output. The lower bound set by the topic shrinks the output space of the language model, which may help the model learn to match images and captions better. Experiments on the image captioning task on the MS-COCO and Flickr30K datasets validate the usefulness of this framework by showing that different given topics can lead to different captions describing specific aspects of the given image and that the quality of the generated captions is higher than that of the control model without a topic as input. In addition, the proposed method is competitive with many state-of-the-art methods in terms of standard evaluation metrics.
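
The bounding relation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a coordinate-wise order (topic ≤ caption ≤ image in every dimension), a squared hinge penalty of the kind commonly used with order-embeddings, and uniform sampling inside the bounded region. The function names (`order_violation`, `caption_penalty`, `sample_between`) and the toy vectors are hypothetical.

```python
import random


def order_violation(lower, upper):
    """Squared penalty over coordinates where `lower` exceeds `upper`.

    Zero exactly when lower[i] <= upper[i] for every coordinate
    (assumed coordinate-wise order; the paper may use another convention).
    """
    return sum(max(0.0, l - u) ** 2 for l, u in zip(lower, upper))


def caption_penalty(topic, caption, image):
    """Penalty for a caption that leaves the region [topic, image]."""
    return order_violation(topic, caption) + order_violation(caption, image)


def sample_between(topic, image, rng):
    """Draw a vector uniformly from the box spanned by topic and image.

    Uniform sampling is an assumption made for illustration only.
    """
    bounds = ((min(t, v), max(t, v)) for t, v in zip(topic, image))
    return [rng.uniform(lo, hi) for lo, hi in bounds]


# Toy 3-dimensional embeddings (made-up values, not from the paper).
topic = [0.1, 0.2, 0.0]
image = [0.9, 0.8, 0.5]
caption = [0.4, 0.5, 0.6]  # third coordinate exceeds the image bound

penalty = caption_penalty(topic, caption, image)
sampled = sample_between(topic, image, random.Random(0))
```

Here `penalty` is positive only because the third coordinate of `caption` lies above the image embedding, while a vector returned by `sample_between` lies inside the region and incurs no penalty. In the framework described above, a language model would then decode such a sampled vector into a sentence.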

Citations

•Journal Article•10.1109/TCSVT.2019.2947482

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Jun Yu, +3 more

- 01 Dec 2020

- IEEE Transactions on Circuits and System...

TL;DR: Inspired by the success of the Transformer model in machine translation, this work extends it to a Multimodal Transformer (MT) model for image captioning that significantly outperforms the previous state-of-the-art methods.

390

•Posted Content

Multimodal Transformer with Multi-View Visual Representation for Image Captioning

Jun Yu, +3 more

- 20 May 2019

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This work proposes a multimodal transformer model that captures intra- and inter-modal interactions in a unified attention block, which can perform complex multimodal reasoning and output accurate captions.

287

•Journal Article•10.1109/TIP.2020.3004729

Spatio-Temporal Memory Attention for Image Captioning

Junzhong Ji, +4 more

- 30 Jun 2020

- IEEE Transactions on Image Processing

TL;DR: The proposed STMA model is flexible to combine with attention-based image captioning frameworks and builds strong temporal connections of attentions and learns the spatio-temporal relationship of attended areas simultaneously.

78

•Journal Article•10.1109/TCSVT.2020.2995959

Long-Term Video Question Answering via Multimodal Hierarchical Memory Attentive Networks

Ting Yu, +4 more

- 01 Mar 2021

- IEEE Transactions on Circuits and System...

TL;DR: Experimental results demonstrate that the proposed approach significantly outperforms other state-of-the-art methods for long-term video question answering, and extensive ablation studies are carried out to explore the reasons behind the proposed model’s effectiveness.

63

•Journal Article•10.3390/APP9102024

A Systematic Literature Review on Image Captioning

Raimonda Staniūtė, +1 more

- 16 May 2019

- Applied Sciences

TL;DR: This study presents a Systematic Literature Review (SLR) that gives a brief overview of improvements in image captioning over the last four years and summarizes the results from the newest papers.

53

References

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 04 Sep 2014

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

102.6K

•Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

- 03 Dec 2012

TL;DR: The authors train a deep convolutional neural network for ImageNet classification that consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

88.4K

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 01 Jan 2015

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

51.9K

•Journal Article•10.5555/944919.944937

Latent dirichlet allocation

David M. Blei, +2 more

- 01 Mar 2003

- Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

36.2K


Related Papers (5)

Deep Residual Learning for Image Recognition

Long short-term memory

Bleu: a Method for Automatic Evaluation of Machine Translation

Glove: Global Vectors for Word Representation

Faster R-CNN: towards real-time object detection with region proposal networks