Understanding Image Captioning Models beyond Visualizing Attention

Open Access • Posted Content

- 04 Jan 2020

- arXiv: Computer Vision and Pattern Recog...

9

TL;DR: Variants of layer-wise relevance propagation (LRP) and gradient backpropagation, tailored to image captioning models with attention mechanisms, are developed and shown to correlate with object locations with higher precision than attention.

Abstract: This paper interprets the predictions of image captioning models with attention mechanisms beyond visualizing the attention itself. We develop variants of layer-wise relevance propagation (LRP) and gradient-based explanation methods tailored to image captioning models with attention mechanisms. We systematically compare the interpretability of attention heatmaps against the explanations computed with methods such as LRP, Grad-CAM, and Guided Grad-CAM. We show that explanation methods simultaneously provide a pixel-wise image explanation (supporting and opposing pixels of the input image) and a linguistic explanation (supporting and opposing words of the preceding sequence) for each word in the predicted captions. Extensive experiments demonstrate that explanation methods 1) reveal more related evidence used by the model to make decisions than attention does; 2) correlate with object locations with high precision; and 3) help to `debug' the model, for example by analyzing the reasons for hallucinated object words. Building on these observed properties, we further design an LRP-inference fine-tuning strategy that alleviates object hallucination in image captioning models while maintaining sentence fluency. We conduct experiments with two widely used attention mechanisms: the adaptive attention mechanism, computed with additive attention, and the multi-head attention mechanism, computed with the scaled dot product.
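
To make the idea of "supporting and opposing" evidence concrete, the following is a minimal sketch of the generic LRP epsilon rule applied to a single linear layer. It is not the attention-specific variant developed in the paper; the function name `lrp_epsilon_linear`, the toy weights, and the choice of `eps` are assumptions made purely for illustration.

```python
import numpy as np


def lrp_epsilon_linear(a, W, b, relevance_out, eps=1e-6):
    """Redistribute relevance from the outputs of z = a @ W + b back to the inputs a.

    Hypothetical helper for illustration only (generic epsilon rule,
    not the paper's attention-tailored propagation).
    """
    z = a @ W + b                                    # pre-activations, shape (n_out,)
    stabilizer = eps * np.where(z >= 0, 1.0, -1.0)   # keeps the division away from zero
    s = relevance_out / (z + stabilizer)             # relevance per unit of pre-activation
    return a * (W @ s)                               # R_i = a_i * sum_j W[i, j] * s_j


# Toy layer: 3 inputs, 2 outputs, no bias (assumed values).
a = np.array([1.0, 2.0, -1.0])
W = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [1.0,  2.0]])
b = np.zeros(2)

# Explain output 0 only: all relevance starts on that output.
relevance_out = np.array([1.0, 0.0])
relevance_in = lrp_epsilon_linear(a, W, b, relevance_out)
```

In this toy case the first two inputs receive positive relevance (they support output 0) and the third receives negative relevance (it opposes it), while the total relevance is approximately conserved because the bias is zero and `eps` is small.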

Citations

Proceedings Article•10.1109/ICCCIS51004.2021.9397159

An Approach to identify Captioning Keywords in an Image using LIME

Siddharth Sahay, +2 more

- 19 Feb 2021

TL;DR: In this paper, explainable AI techniques such as LIME (Local Interpretable Model-Agnostic Explanations) are employed to explain the predictions of complex image captioning models.

27

•Proceedings Article•10.1109/ipta54936.2022.9784146

Explainability for Medical Image Captioning

19 Apr 2022

TL;DR: In this article, an explainable module for medical image captioning is presented, which provides a sound interpretation of the attention-based encoder-decoder model by explaining the correspondence between visual features and semantic features.

8

•Posted Content

Challenges for cognitive decoding using deep learning methods

Armin W. Thomas, +2 more

- 16 Aug 2021

- arXiv: Learning

TL;DR: In this article, explainable artificial intelligence and transfer learning are used to improve the reproducibility and robustness of deep learning models for cognitive decoding, while also providing specific recommendations on how to improve robustness.

3

Proceedings Article•10.1109/icaibd57115.2023.10206230

M-Rule: An Enhanced Deep Taylor Decomposition for Multi-model Interpretability

Runan Wang, +3 more

- 26 May 2023

TL;DR: The results demonstrate that the model trained in this paper using interpretation-based augmented data has a significant performance improvement compared to other classification models.

1

Journal Article•10.1109/icde60146.2024.00032

RA³: A Human-in-the-loop Framework for Interpreting and Improving Image Captioning with Relation-Aware Attribution Analysis

Lei Chai, +3 more

- 13 May 2024

TL;DR: A human-in-the-loop framework for improving the interpretability and further boosting the performance of image captioning models, with an explanation loss that penalizes the difference between model attribution and human rationale to optimize the model's behavior and improve caption quality.

References

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

•Journal Article•10.1109/TPAMI.2016.2577031

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, +3 more

- 01 Jun 2017

- IEEE Transactions on Pattern Analysis an...

TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

64.4K

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 01 Jan 2015

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

51.9K

•Book Chapter•10.1007/978-3-319-10602-1_48

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

- 06 Sep 2014

TL;DR: A new dataset with the goal of advancing the state of the art in object recognition by placing it in the context of the broader question of scene understanding, built by gathering images of complex everyday scenes containing common objects in their natural context.

51.7K

•Proceedings Article•10.3115/1073083.1073135

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

- 06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

28.9K