Open Access • Posted Content
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
TL;DR: This article proposes an adaptive attention model with a visual sentinel that decides, at each time step, whether to attend to the image and, if so, where, in order to extract meaningful information for sequential word generation. The authors report a new state of the art on COCO and Flickr30K by a significant margin.
Abstract: Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably just from the language model, e.g., "sign" after "behind a red stop" or "phone" following "talking on a cell". In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.
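The gating idea in the abstract can be illustrated with a minimal Python sketch. It assumes one plausible way to realize such a gate: the sentinel is scored alongside the image regions in a single softmax, and the weight it receives (called `beta` here) sets how much the context relies on the sentinel rather than on the image. The function names, argument names, and hand-supplied scores are illustrative assumptions; in a trained model the scores and the sentinel vector would come from learned layers, which are not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_feats, region_scores, sentinel, sentinel_score):
    """Blend image regions and a sentinel vector using one softmax.

    region_feats:   k feature vectors, one per image region (assumed input)
    region_scores:  k unnormalized attention scores (assumed input)
    sentinel:       vector with the same length as a region feature
    sentinel_score: unnormalized score for the sentinel
    Returns (context, beta); beta is the weight placed on the sentinel.
    """
    weights = softmax(list(region_scores) + [sentinel_score])
    beta = weights[-1]
    dim = len(sentinel)
    # Start from the sentinel's share, then add each region's share.
    context = [beta * sentinel[d] for d in range(dim)]
    for w, feat in zip(weights[:-1], region_feats):
        for d in range(dim):
            context[d] += w * feat[d]
    return context, beta


regions = [[1.0, 0.0], [0.0, 1.0]]
sentinel = [0.5, 0.5]

# Equal scores: regions and sentinel share the weight evenly.
ctx_even, beta_even = adaptive_context(regions, [0.0, 0.0], sentinel, 0.0)

# A high sentinel score: the model mostly ignores the image.
ctx_lang, beta_lang = adaptive_context(regions, [0.0, 0.0], sentinel, 10.0)
```

A `beta` close to 1 corresponds to the "non-visual word" case described in the abstract, where the decoder leans on its language context; a `beta` close to 0 means the context is built almost entirely from image regions.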
Citations
•Posted Content
simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions
TL;DR: This paper proposes the Stepwise Image-Topic Merging Network (simNet), which makes use of two kinds of attention at the same time. The decoder adaptively merges the attentive information in the extracted topics and the image according to the generated context, so that the visual information and the semantic information can be effectively combined.
28
Natural Language Generation for Spoken Dialogue System using RNN Encoder-Decoder Networks
Van-Khanh Tran, Le-Minh Nguyen
TL;DR: In this paper, a Recurrent Neural Network (RNN) based encoder-decoder architecture is proposed to generate natural language sentences in spoken dialogue systems. The proposed generator can be jointly trained on both sentence planning and surface realization.
•Proceedings Article
h-detach: Modifying the LSTM Gradient Towards Better Optimization
Devansh Arpit, Bhargav Kanuparthi, Giancarlo Kerg, Nan Rosemary Ke, Ioannis Mitliagkas, Yoshua Bengio
- 27 Sep 2018
TL;DR: In this paper, a stochastic algorithm called h-detach is proposed for LSTM optimization. It targets the suppression of gradient components through the linear path (cell state) in the computational graph, a suppression that can prevent LSTMs from capturing long-term dependencies.
A New Attention-Based LSTM for Image Captioning
Fen Xiao, Wenfeng Xue, Xieping Gao
TL;DR: An attentional LSTM (ALSTM) is proposed, and the paper shows how to integrate it within a state-of-the-art automatic image captioning framework and how to obtain effective visual/context attention to update the input vector.
27
•Posted Content
Interpretable Counting for Visual Question Answering
TL;DR: The model sequentially selects from detected objects and learns interactions between objects that influence subsequent selections, and it outperforms the state-of-the-art architecture for VQA on multiple metrics that evaluate counting.
27
References
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- 27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place on the ILSVRC 2015 classification task.
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick
- 06 Sep 2014
TL;DR: A new dataset is presented with the goal of advancing the state of the art in object recognition by placing it in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context.
Bleu: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
- 06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio
- 01 Jan 2014
TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
•Proceedings Article
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
- 01 Jan 2015
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend this architecture by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
25.7K
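The "(soft-)search" described in this summary can be sketched in a few lines of Python. This is a simplified illustration, not the paper's model: a plain dot product stands in for the learned alignment function, and the names `soft_attention`, `query`, and `annotations` are hypothetical. The point it shows is that every source position contributes to the context with a weight between 0 and 1, rather than one position being selected as a hard segment.

```python
import math


def soft_attention(query, annotations):
    """Return a weighted average of annotations and the weights used.

    query:       decoder-side vector (assumed input)
    annotations: one vector per source position (assumed input)
    Scoring uses a dot product as a stand-in for a learned alignment model.
    """
    scores = [sum(q * a for q, a in zip(query, ann)) for ann in annotations]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(annotations[0])
    context = [
        sum(w * ann[d] for w, ann in zip(weights, annotations))
        for d in range(dim)
    ]
    return context, weights


# The query matches the first annotation far better than the second,
# so most (but not all) of the weight goes to the first position.
context, weights = soft_attention([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]])
```

Because the weights are a softmax rather than a hard choice, the whole computation stays differentiable, which is what allows alignment and translation to be learned jointly.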
Related Papers (5)
Karen Simonyan, Andrew Zisserman
- 04 Sep 2014
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- 27 Jun 2016
Diederik P. Kingma, Jimmy Ba