Bottom-Up and Top-Down Attention for Image Captioning and VQA.

Open Access • Posted Content

- 25 Jul 2017

409

TL;DR: The paper proposes a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions, and demonstrates the method's broad applicability by applying it to VQA.

Citations

•Posted Content

Image Chat: Engaging Grounded Conversations.

Kurt Shuster, +3 more

- 02 Nov 2018

- arXiv: Computation and Language

TL;DR: The authors collected a dataset of grounded human-human conversations, where speakers are asked to play roles given a provided emotional mood or style, as the use of such traits is also a key factor in engagingness.

86

•Proceedings Article•10.1109/ICCV.2019.00265

Making History Matter: History-Advantage Sequence Training for Visual Dialog

Tianhao Yang, +2 more

- 01 Oct 2019

TL;DR: This work intentionally imposes wrong answers in the history, obtaining an adverse critic, and sees how the historic error impacts the codec’s future behavior by History Advantage — a quantity obtained by subtracting the adverse critic from the gold reward of ground-truth history.

86

•Proceedings Article•10.18653/V1/2021.FINDINGS-ACL.23

Contrastive Attention for Automatic Chest X-ray Report Generation

Fenglin Liu, +5 more

- 01 Aug 2021

TL;DR: The authors propose the Contrastive Attention (CA) model, which compares the current image with normal images to distill contrastive information that better represents the visual features of abnormal regions.

84

•Posted Content

Unshuffling Data for Improved Generalization.

Damien Teney, +2 more

- 27 Feb 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This work describes a training procedure to capture the patterns that are stable across environments while discarding spurious ones, and demonstrates multiple use cases with the task of visual question answering, which is notorious for dataset biases.

83

•Proceedings Article•10.18653/V1/2020.ACL-MAIN.219

Image-Chat: Engaging Grounded Conversations

Kurt Shuster, +3 more

- 01 Jul 2020

TL;DR: Automatic metrics and human evaluations of engagingness show the efficacy of this approach: state-of-the-art performance is obtained on the existing IGC task, and the best-performing model is almost on par with humans on the Image-Chat test set.

81

References

•Proceedings Article•10.1109/CVPR.2016.90

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 27 Jun 2016

TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting networks won 1st place on the ILSVRC 2015 classification task.

198.7K

•Posted Content

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 10 Dec 2015

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

117.9K

•Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

99K

•Journal Article•10.1109/TPAMI.2016.2577031

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, +3 more

- 01 Jun 2017

- IEEE Transactions on Pattern Analysis an...

TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

64.4K

•Book Chapter•10.1007/978-3-319-10602-1_48

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

- 06 Sep 2014

TL;DR: A new dataset is presented with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding; this is achieved by gathering images of complex everyday scenes containing common objects in their natural context.

51.7K