Open Access • Posted Content
Bottom-Up and Top-Down Attention for Image Captioning and VQA.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
- 25 Jul 2017
409
TL;DR: A combined bottom-up and top-down attention mechanism is proposed that enables attention to be calculated at the level of objects and other salient image regions; applying the same approach to VQA demonstrates the method's broad applicability.
Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
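The division of labour described in the abstract (bottom-up regions with feature vectors, top-down weightings over them) can be illustrated with a minimal soft-attention sketch. This is not the authors' implementation: the dot-product scoring, the function names `softmax` and `attend`, and the toy two-dimensional features are all assumptions made for illustration. In the paper the region features come from Faster R-CNN and the weightings are produced by a learned top-down model conditioned on the task context.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    region_features: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Weight region features by their relevance to a query vector.

    region_features: one feature vector per proposed image region
        (stand-in for the bottom-up output; values here are hypothetical).
    query: task context vector, e.g. a question or decoder state
        (stand-in for the top-down signal).

    Scoring by dot product is an assumption chosen for brevity,
    not the scoring function used in the paper.
    """
    # One relevance score per region.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    # Normalise scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Attended feature: weighted sum of region features.
    dim = len(region_features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy region features
    context = [2.0, 0.0]                            # toy query vector
    w, v = attend(regions, context)
    print("weights:", w)
    print("attended feature:", v)
```

Regions whose features align with the query receive larger weights, so the attended feature is dominated by the most relevant regions rather than by a uniform grid of image locations.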
Citations
•Posted Content
Image Chat: Engaging Grounded Conversations.
TL;DR: The authors collect a dataset of grounded human-human conversations in which speakers are asked to play roles given a provided emotional mood or style, since the use of such traits is also a key factor in engagingness.
86
Making History Matter: History-Advantage Sequence Training for Visual Dialog
Tianhao Yang, Zheng-Jun Zha, Hanwang Zhang
- 01 Oct 2019
TL;DR: This work intentionally imposes wrong answers in the history, obtaining an adverse critic, and sees how the historic error impacts the codec’s future behavior by History Advantage — a quantity obtained by subtracting the adverse critic from the gold reward of ground-truth history.
Contrastive Attention for Automatic Chest X-ray Report Generation
Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Ping Zhang, Xu Sun
- 01 Aug 2021
TL;DR: The authors propose the Contrastive Attention (CA) model, which compares the current image with normal images to distill contrastive information that better represents the visual features of abnormal regions.
•Posted Content
Unshuffling Data for Improved Generalization.
TL;DR: This work describes a training procedure to capture the patterns that are stable across environments while discarding spurious ones, and demonstrates multiple use cases with the task of visual question answering, which is notorious for dataset biases.
83
Image-Chat: Engaging Grounded Conversations
Kurt Shuster, Samuel Humeau, Antoine Bordes, Jason Weston
- 01 Jul 2020
TL;DR: Automatic metrics and human evaluations of engagingness show the efficacy of this approach, and state-of-the-art performance on the existing IGC task is obtained, and the best performing model is almost on par with humans on the Image-Chat test set.
References
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- 27 Jun 2016
TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the approach won 1st place in the ILSVRC 2015 classification task.
•Posted Content
Deep Residual Learning for Image Recognition
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
117.9K
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
99K
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick
- 06 Sep 2014
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.