Graph-RISE: Graph-Regularized Image Semantic Embedding

Open Access • Posted Content

- 14 Feb 2019

- arXiv: Computer Vision and Pattern Recog...

54

TL;DR: Graph-RISE is a large-scale neural graph learning framework that allows embeddings to discriminate among an unprecedented O(40M) ultra-fine-grained semantic labels. It outperforms state-of-the-art image embedding algorithms on several evaluation tasks, including image classification and triplet ranking.

Figures

Figure 1: Spectrum of image semantic similarity. We provide six image examples (two for each granularity) to illustrate the difference from coarser (left) to ultra-fine granularity (right). We refer to ultra-fine-grained as “instance-level” to contrast with category-level and fine-grained semantics.

Figure 4: An illustration of the Graph-RISE framework. The flow shown in red is added to enable graph regularization and is required only during training. In the input layer, a labeled image is associated with one of its neighbor images, which can be either labeled or unlabeled, and is then fed into the ResNet together with that neighbor image. The image embeddings generated by the ResNet are then used to compute both (a) the cross-entropy loss and (b) the graph regularization term.
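The training objective described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the squared L2 distance between embeddings, the `edge_weight` factor, and the `alpha` multiplier are all assumptions made for the example, and the embeddings and logits are passed in directly instead of being produced by a ResNet.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def graph_regularized_loss(logits, label, emb, neighbor_emb,
                           edge_weight=1.0, alpha=0.1):
    """Supervised loss on the labeled image plus a neighbor-distance penalty.

    `alpha` and `edge_weight` are hypothetical knobs for this sketch; the
    distance metric (squared L2) is likewise an assumption.
    """
    supervised = softmax_cross_entropy(logits, label)
    regularizer = edge_weight * squared_l2(emb, neighbor_emb)
    return supervised + alpha * regularizer


# Example: one labeled image and one neighbor image (toy values).
loss = graph_regularized_loss(
    logits=[2.0, 0.5, -1.0], label=0,
    emb=[1.0, 0.0], neighbor_emb=[0.8, 0.1],
)
```

Because the neighbor term only appears in the loss, it matches the caption's note that the extra flow is needed during training only; at inference time the embedding of a single image is used on its own.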

Figure 5: PIT triplet evaluation on Recall vs. Margin.

Figure 6: GIT triplet evaluation on Recall vs. Margin.

Table 3: Performance comparisons (in %) via triplet accuracy (η = 0) on the internal evaluation datasets.

Table 2: Performance comparisons (in %) via kNN search accuracy on publicly available datasets.
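The triplet evaluations referenced in Figures 5 and 6 and in Table 3 (margin η) can be illustrated with a small sketch. The page does not give the exact definition, so the following is an assumption: a triplet (anchor, positive, negative) counts as correct when the anchor-to-positive distance plus the margin is smaller than the anchor-to-negative distance, with Euclidean distance between embeddings. The names `euclidean` and `triplet_recall` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_recall(triplets, margin=0.0):
    """Fraction of (anchor, positive, negative) triplets ranked correctly.

    A triplet is correct when d(anchor, positive) + margin < d(anchor, negative).
    With margin = 0 this reduces to plain triplet accuracy.
    """
    if not triplets:
        raise ValueError("triplets must be non-empty")
    hits = sum(
        1 for a, p, n in triplets
        if euclidean(a, p) + margin < euclidean(a, n)
    )
    return hits / len(triplets)


# Toy embeddings: the first triplet is ranked correctly, the second is not.
triplets = [
    ([0.0, 0.0], [1.0, 0.0], [3.0, 0.0]),
    ([0.0, 0.0], [2.0, 0.0], [1.0, 0.0]),
]
accuracy_at_zero = triplet_recall(triplets, margin=0.0)
recall_curve = [triplet_recall(triplets, margin=m) for m in (0.0, 1.0, 2.5)]
```

Sweeping the margin, as in `recall_curve`, gives the kind of Recall vs. Margin curve the figures plot: a larger margin demands a wider gap between positive and negative, so recall can only stay flat or fall.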

Citations

•Proceedings Article•10.1109/CVPR46437.2021.00356

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Soravit Changpinyo, +3 more

- 17 Feb 2021

TL;DR: Conceptual 12M (CC12M) is a dataset of 12 million image-text pairs intended specifically for vision-and-language pre-training.

1K

•Posted Content

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Chao Jia, +9 more

- 11 Feb 2021

- arXiv: Computer Vision and Pattern Recog...

TL;DR: A simple dual-encoder architecture is proposed to align visual and language representations of image and text pairs using a contrastive loss. The authors show that the scale of their corpus can make up for its noise and lead to state-of-the-art representations even with such a simple learning scheme.

690

•Proceedings Article•10.1145/3404835.3463257

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

Krishna Srinivasan, +4 more

- 11 Jul 2021

TL;DR: The Wikipedia-based Image Text (WIT) dataset is a curated set of 37.5 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.

294

References

•Proceedings Article•10.1109/CVPR.2016.90

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 27 Jun 2016

TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously. The resulting networks won 1st place on the ILSVRC 2015 classification task.

198.7K

•Posted Content

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 10 Dec 2015

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

117.9K

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 04 Sep 2014

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small convolution filters. It shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

102.6K

•Journal Article•10.1145/3065386

ImageNet classification with deep convolutional neural networks

Alex Krizhevsky, +2 more

- 24 May 2017

- Communications of The ACM

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes. Training employed a recently developed regularization method called "dropout" that proved to be very effective.

98.2K

•Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

- 03 Dec 2012

TL;DR: A deep convolutional neural network achieved state-of-the-art performance. It consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

88.4K