Visual Commonsense R-CNN
Tan Wang, Jianqiang Huang, Hanwang Zhang, Qianru Sun
- 14 Jun 2020
- pp 10760-10770
TL;DR: Presents Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), a novel unsupervised feature representation learning method that serves as an improved visual region encoder for high-level tasks such as captioning and VQA, and reports consistent performance boosts across those tasks.
Abstract: We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), the proxy training objective of VC R-CNN is to predict the contextual objects of a region, as in other unsupervised feature learning methods (e.g., word2vec). However, the two are fundamentally different: VC R-CNN makes its prediction through causal intervention, P(Y|do(X)), while the others use the conventional likelihood, P(Y|X). This is also the core reason why VC R-CNN can learn “sense-making” knowledge, such as that a chair can be sat on, rather than just “common” co-occurrences, such as that a chair is likely to exist if a table is observed. We extensively apply VC R-CNN features in prevailing models of three popular tasks (Image Captioning, VQA, and VCR) and observe consistent performance boosts across them, achieving many new state-of-the-art results.
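The difference between P(Y|X) and P(Y|do(X)) can be illustrated with a toy calculation. The sketch below is not the VC R-CNN implementation; it assumes a single observed, discrete confounder Z (a hypothetical "scene type") and applies the standard backdoor adjustment, P(Y|do(X)) = Σ_z P(Y|X,z) P(z), in contrast with the likelihood P(Y|X) = Σ_z P(Y|X,z) P(z|X). All probabilities and names (`P_Z`, `P_X_GIVEN_Z`, `P_Y_GIVEN_XZ`, `likelihood`, `intervention`) are invented for illustration. Here X = "table observed" and Y = "chair present", and Y is made to depend only on Z, so X has no causal effect on Y.

```python
# Toy comparison of P(Y|X) and P(Y|do(X)) with one discrete confounder Z.
# All numbers are invented for illustration; they do not come from the paper.

# Prior over the hypothetical confounder Z (scene type).
P_Z = {"dining": 0.5, "other": 0.5}

# P(X = x | Z = z): a table is more likely to be observed in a dining scene.
P_X_GIVEN_Z = {
    "dining": {1: 0.8, 0: 0.2},
    "other": {1: 0.2, 0: 0.8},
}

# P(Y = y | X = x, Z = z): in this toy model the chair depends only on Z.
P_Y_GIVEN_XZ = {
    (x, z): {1: p, 0: 1.0 - p}
    for x in (0, 1)
    for z, p in (("dining", 0.9), ("other", 0.1))
}


def posterior_z(x):
    """P(Z | X = x) via Bayes' rule."""
    joint = {z: P_X_GIVEN_Z[z][x] * P_Z[z] for z in P_Z}
    total = sum(joint.values())
    return {z: v / total for z, v in joint.items()}


def likelihood(y, x):
    """Conventional likelihood: P(Y=y | X=x) = sum_z P(y | x, z) P(z | x)."""
    post = posterior_z(x)
    return sum(P_Y_GIVEN_XZ[(x, z)][y] * post[z] for z in P_Z)


def intervention(y, x):
    """Backdoor adjustment: P(Y=y | do(X=x)) = sum_z P(y | x, z) P(z)."""
    return sum(P_Y_GIVEN_XZ[(x, z)][y] * P_Z[z] for z in P_Z)


print(f"P(chair | table)     = {likelihood(1, 1):.2f}")    # 0.74
print(f"P(chair | do(table)) = {intervention(1, 1):.2f}")  # 0.50
```

In this toy model, observing a table raises the probability of a chair to 0.74 only because both are common in dining scenes; under intervention the probability stays at the base rate of 0.50, which reflects the absence of any causal effect of the table on the chair.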
Citations
Unbiased Scene Graph Generation From Biased Training
Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, Hanwang Zhang
- 14 Jun 2020
TL;DR: The authors propose using counterfactual causality on the trained graph to infer the effect of the harmful bias, which should be removed, and use Total Direct Effect (TDE) as the final predicate score for unbiased SGG.
•Posted Content
Unbiased Scene Graph Generation from Biased Training
TL;DR: Presents a novel SGG framework based on causal inference rather than the conventional likelihood, which uses Total Direct Effect (TDE) as the final predicate score for unbiased SGG and can be widely applied by those in the community who seek unbiased predictions.
Counterfactual VQA: A Cause-Effect Look at Language Bias
Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, Ji-Rong Wen
- 01 Jun 2021
TL;DR: The authors propose a counterfactual inference framework to mitigate language bias in VQA models; it captures the language bias as the direct causal effect of questions on answers and reduces it by subtracting the direct language effect from the total causal effect.
•Posted Content
Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification
TL;DR: This paper analyzes the effect of the learned visual attention on network prediction through counterfactual intervention and maximizes that effect to encourage the network to learn more useful attention for fine-grained image recognition.