Visual Commonsense R-CNN

doi:10.1109/CVPR42600.2020.01077

Open Access • Proceedings Article

Tan Wang, +3 more

- 14 Jun 2020

- pp 10760-10770

323

TL;DR: Presents Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), a novel unsupervised feature representation learning method that serves as an improved visual region encoder for high-level tasks such as captioning and VQA, with consistent performance boosts observed across them.

Citations

The theory of affordances

博之三嶋

- 01 Nov 2008

3.2K

•Proceedings Article•10.1109/CVPR42600.2020.00377

Unbiased Scene Graph Generation From Biased Training

Kaihua Tang, +4 more

- 14 Jun 2020

TL;DR: The authors propose drawing counterfactual causality from the trained graph to infer the effect of the bad bias, which should be removed, and use Total Direct Effect (TDE) as the final predicate score for unbiased SGG.

595

•Posted Content

Unbiased Scene Graph Generation from Biased Training

Kaihua Tang, +4 more

- 27 Feb 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: Presents a novel SGG framework based on causal inference rather than the conventional likelihood; it uses Total Direct Effect (TDE) as the final predicate score for unbiased SGG and can be applied widely by those who seek unbiased predictions.

459

•Proceedings Article•10.1109/CVPR46437.2021.01251

Counterfactual VQA: A Cause-Effect Look at Language Bias

Yulei Niu, +5 more

- 01 Jun 2021

TL;DR: The authors propose a counterfactual inference framework to mitigate language bias in VQA models; it captures language bias as the direct causal effect of questions on answers and reduces it by subtracting the direct language effect from the total causal effect.

329

•Posted Content

Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

Yongming Rao, +3 more

- 19 Aug 2021

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper analyzes the effect of the learned visual attention on network prediction through counterfactual intervention and maximizes that effect to encourage the network to learn more useful attention for fine-grained image recognition.

302

References

•Proceedings Article•10.1109/CVPR.2016.90

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 27 Jun 2016

TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; residual networks built with it won 1st place on the ILSVRC 2015 classification task.

198.7K

•Posted Content

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 10 Dec 2015

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

117.9K

•Journal Article•10.1145/3065386

ImageNet classification with deep convolutional neural networks

Alex Krizhevsky, +2 more

- 24 May 2017

- Communications of The ACM

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes; it employed a recently developed regularization method called "dropout" that proved to be very effective.

98.2K

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

•Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

- 03 Dec 2012

TL;DR: The authors describe a deep convolutional neural network that achieved state-of-the-art ImageNet classification performance; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

88.4K