Open Access • Posted Content
Understanding Image Captioning Models beyond Visualizing Attention
TL;DR: Variants of layer-wise relevance propagation (LRP) and gradient-based explanation methods, tailored to image captioning models with attention mechanisms, are developed and shown to correlate with object locations with high precision and to reveal more of the evidence used by the model than attention does.
Abstract: This paper interprets the predictions of image captioning models with attention mechanisms beyond visualizing the attention itself. We develop variants of layer-wise relevance propagation (LRP) and gradient-based explanation methods, tailored to image captioning models with attention mechanisms. We systematically compare the interpretability of attention heatmaps against the explanations computed with explanation methods such as LRP, Grad-CAM, and Guided Grad-CAM. We show that explanation methods simultaneously provide a pixel-wise image explanation (supporting and opposing pixels of the input image) and a linguistic explanation (supporting and opposing words of the preceding sequence) for each word in the predicted captions. We demonstrate with extensive experiments that explanation methods 1) reveal more related evidence used by the model to make decisions than attention does; 2) correlate with object locations with high precision; and 3) help to "debug" the model, for example by analyzing the reasons for hallucinated object words. Building on the observed properties of the explanations, we further design an LRP-inference fine-tuning strategy that alleviates the object hallucination of image captioning models while maintaining sentence fluency. We conduct experiments with two widely used attention mechanisms: the adaptive attention mechanism calculated with additive attention, and multi-head attention calculated with the scaled dot product.
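To make the idea of "supporting and opposing" evidence concrete, the following is a minimal sketch of the generic LRP-epsilon rule applied to a single linear layer. It is not the tailored variant developed in the paper; the function name `lrp_epsilon_linear`, the toy weights, and the stabilizer value `eps` are all assumptions chosen for illustration.

```python
def lrp_epsilon_linear(x, weights, bias, relevance_out, eps=1e-6):
    """Redistribute output relevance onto the inputs of y = W x + b.

    Generic LRP-epsilon rule (illustrative only, not the paper's variant):
    R_i = sum_j (x_i * w_ji) / (z_j + eps * sign(z_j)) * R_j
    """
    n_in = len(x)
    relevance_in = [0.0] * n_in
    for j, row in enumerate(weights):
        # Pre-activation of output neuron j.
        z_j = sum(row[i] * x[i] for i in range(n_in)) + bias[j]
        # The stabilizer keeps the sign of z_j and avoids division by zero.
        z_j += eps if z_j >= 0 else -eps
        for i in range(n_in):
            relevance_in[i] += x[i] * row[i] / z_j * relevance_out[j]
    return relevance_in


# Hypothetical toy layer: three inputs, one output neuron.
x = [1.0, 2.0, 0.5]
weights = [[0.5, -0.25, 1.0]]
bias = [0.0]
scores = lrp_epsilon_linear(x, weights, bias, relevance_out=[1.0])
```

In this toy setting, inputs with a positive score support the output and inputs with a negative score oppose it, and with a zero bias the scores approximately sum to the relevance assigned to the output.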
Citations
An Approach to identify Captioning Keywords in an Image using LIME
Siddharth Sahay,Nikita Omare,K. K. Shukla +2 more
- 19 Feb 2021
TL;DR: In this paper, explainable AI techniques such as LIME (Local Interpretable Model-Agnostic Explanations) are employed to explain the predictions of complex image captioning models.
27
Explainability for Medical Image Captioning
19 Apr 2022
TL;DR: In this article, an explainable module for medical image captioning is presented, which provides a sound interpretation of the attention-based encoder-decoder model by explaining the correspondence between visual features and semantic features.
•Posted Content
Challenges for cognitive decoding using deep learning methods
TL;DR: In this article, explainable artificial intelligence and transfer learning are used to improve the reproducibility and robustness of deep learning models for cognitive decoding, while also providing specific recommendations on how to improve robustness.
3
M-Rule: An Enhanced Deep Taylor Decomposition for Multi-model Interpretability
Runan Wang,Yifan Wang,Yiwen Huang,Tuo Leng +3 more
- 26 May 2023
TL;DR: The results demonstrate that the model trained in this paper using interpretation-based augmented data has a significant performance improvement compared to other classification models.
1
RA³: A Human-in-the-loop Framework for Interpreting and Improving Image Captioning with Relation-Aware Attribution Analysis
Lei Chai,Lu Qi,Hailong Sun,Jingzheng Li +3 more
- 13 May 2024
TL;DR: A human-in-the-loop framework for improving the interpretability and further boosting the performance of image captioning models, with an explanation loss that penalizes the difference between model attribution and human rationale in order to improve caption quality.