Open Access • Posted Content
Interpretable Counting for Visual Question Answering
TL;DR: The model sequentially selects from detected objects and learns interactions between objects that influence subsequent selections; it outperforms the state-of-the-art architecture for VQA on multiple metrics that evaluate counting.
Abstract: Questions that require counting a variety of objects in images remain a major challenge in visual question answering (VQA). The most common approaches to VQA involve either classifying answers based on fixed-length representations of both the image and question, or summing fractional counts estimated from each section of the image. In contrast, we treat counting as a sequential decision process and force our model to make discrete choices of what to count. Specifically, the model sequentially selects from detected objects and learns interactions between objects that influence subsequent selections. A distinguishing feature of our approach is its intuitive and interpretable output, as discrete counts are automatically grounded in the image. Furthermore, our method outperforms the state-of-the-art architecture for VQA on multiple metrics that evaluate counting.
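The idea of counting as a sequence of discrete selections can be illustrated with a toy sketch. This is not the authors' model: the function name `sequential_count`, the hand-set `scores`, the `interactions` matrix, and the `stop_score` threshold are all hypothetical stand-ins for quantities a trained model would learn. The sketch only shows how picking one detected object can change the scores of the remaining ones, and how the final count stays tied to specific objects.

```python
def sequential_count(scores, interactions, stop_score=0.0):
    """Greedy toy version of counting by sequential selection.

    scores[i]          -- hypothetical relevance of detected object i
    interactions[i][j] -- hypothetical adjustment applied to object j's
                          score once object i has been selected
    stop_score         -- selection stops when no remaining object
                          scores above this (assumed) threshold
    """
    scores = list(scores)                  # work on a copy
    remaining = set(range(len(scores)))
    selected = []
    while remaining:
        # Pick the highest-scoring object that has not been chosen yet.
        best = max(sorted(remaining), key=lambda i: scores[i])
        if scores[best] <= stop_score:
            break
        selected.append(best)
        remaining.remove(best)
        # The selection influences the scores of later candidates.
        for j in remaining:
            scores[j] += interactions[best][j]
    return len(selected), selected


# Objects 0 and 1 are assumed to be overlapping detections of one thing,
# so selecting either one suppresses the other; object 2 is independent.
scores = [2.0, 1.5, 1.0]
interactions = [
    [0.0, -3.0, 0.0],
    [-3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
print(sequential_count(scores, interactions))  # prints (2, [0, 2])
```

The returned index list is what makes the count interpretable in this sketch: each unit of the count corresponds to one selected detection.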
Citations
In Defense of Grid Features for Visual Question Answering
Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen
- 14 Jun 2020
TL;DR: This paper revisits grid features for VQA, and finds they can work surprisingly well -- running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion).
KVQA: Knowledge-Aware Visual Question Answering
Sanket Shah, Anand Mishra, Naganand Yadati, Partha Pratim Talukdar
- 17 Jul 2019
TL;DR: KVQA is introduced as the first dataset for the task of (world) knowledge-aware VQA and the largest dataset for exploring VQA over large Knowledge Graphs (KG); it consists of 183K question-answer pairs involving more than 18K named entities and 24K images.
Hypergraph Attention Networks for Multimodal Learning
Eun-Sol Kim, Woo-Young Kang, Kyoung-Woon On, Yu-Jung Heo, Byoung-Tak Zhang
- 14 Jun 2020
TL;DR: From the qualitative analysis with two Visual Question and Answering datasets, it is discovered that the alignment of the information levels between the modalities is important, and the symbolic graphs are very powerful ways to represent the information of the low-level signals in alignment.
Interpretable Visual Question Answering by Reasoning on Dependency Trees
TL;DR: A novel neural network model that performs global reasoning on a dependency tree parsed from the question and is capable of building an interpretable visual question answering (VQA) system that gradually derives image cues following question-driven parse-tree reasoning.
69
Multimodal research in vision and language: A review of current and emerging trends
01 Jan 2022
TL;DR: A detailed overview of the latest trends in research pertaining to visual and language modalities is presented in this paper, where the authors look at their applications in their task formulations and how to solve various problems related to semantic perception and content generation.
References
•Posted Content
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.
82.5K
Glove: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
- 01 Oct 2014
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
- 23 Jun 2014
TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
•Posted Content
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TL;DR: Faster R-CNN introduces a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.
25.3K
•Posted Content
Fast R-CNN
TL;DR: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection that builds on previous work to efficiently classify object proposals using deep convolutional networks.
20.3K