Open Access • Posted Content
Graph-RISE: Graph-Regularized Image Semantic Embedding
Aleksei Timofeev, Andrew Tomkins, Chun-Ta Lu, Da-Cheng Juan, Futang Peng, Krishnamurthy Viswanathan, Lucy Gao, Sujith Ravi, Tom Duerig, Yi-Ting Chen, Zhen Li
TL;DR: Graph-RISE is a large-scale neural graph learning framework that trains embeddings to discriminate an unprecedented O(40M) ultra-fine-grained semantic labels. It outperforms state-of-the-art image embedding algorithms on several evaluation tasks, including image classification and triplet ranking.
Abstract: Learning image representations that capture fine-grained semantics is a challenging and important task that enables many applications, such as image search and clustering. In this paper, we present Graph-Regularized Image Semantic Embedding (Graph-RISE), a large-scale neural graph learning framework that allows us to train embeddings to discriminate an unprecedented O(40M) ultra-fine-grained semantic labels. Graph-RISE outperforms state-of-the-art image embedding algorithms on several evaluation tasks, including image classification and triplet ranking. We provide case studies to demonstrate that, qualitatively, image retrieval based on Graph-RISE effectively captures semantics and, compared to the state of the art, differentiates nuances at levels that are closer to human perception.
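The caption of Figure 4 describes the training objective as a cross-entropy loss combined with a graph regularization term computed from the embeddings of an image and one of its neighbors. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The squared L2 distance, the per-edge weights `edge_weight`, the multiplier `alpha`, and both function names are assumptions made for illustration; this page does not specify them.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels (numerically stable)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def graph_regularized_loss(emb, nbr_emb, logits, labels, edge_weight, alpha=0.1):
    """Cross-entropy plus a penalty pulling neighbor embeddings together.

    emb, nbr_emb : (batch, dim) embeddings of each image and its neighbor
    logits       : (batch, classes) classifier outputs for the labeled images
    edge_weight  : (batch,) hypothetical similarity weight of each graph edge
    alpha        : hypothetical multiplier balancing the two terms
    """
    ce = softmax_cross_entropy(logits, labels)
    # Assumed distance: squared L2 between an image and its neighbor.
    sq_dist = ((emb - nbr_emb) ** 2).sum(axis=1)
    reg = (edge_weight * sq_dist).mean()
    return ce + alpha * reg

# Toy usage with random data.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
nbr_emb = emb + 0.1 * rng.normal(size=(4, 8))
logits = rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 0])
weights = np.ones(4)
loss = graph_regularized_loss(emb, nbr_emb, logits, labels, weights)
```

In this sketch the neighbor needs no label of its own: it only enters through the distance term, which matches the caption's note that a neighbor can be labeled or unlabeled and that the extra flow is needed only during training.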
Figures

Figure 1: Spectrum of image semantic similarity. We provide six image examples (two for each granularity) to illustrate the difference from coarser (left) to ultra-fine granularity (right). We refer to ultra-fine-grained as “instance-level” to contrast with category-level and fine-grained semantics.
Figure 4: An illustration of the Graph-RISE framework. The flow in red is added to enable graph regularization and is required only during training. In the input layer, a labeled image is paired with one of its neighbor images, which can be either labeled or unlabeled, and both are fed into the ResNet. The image embeddings generated by the ResNet are then used to compute both (a) the cross-entropy loss and (b) the graph regularization term.
Figure 5: PIT triplet evaluation on Recall vs. Margin.
Figure 6: GIT triplet evaluation on Recall vs. Margin.
Table 2: Performance comparisons (in %) via kNN search accuracy on publicly available datasets.
Table 3: Performance comparisons (in %) via triplet accuracy (η = 0) on the internal evaluation datasets.
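The captions above refer to triplet accuracy at margin η = 0 and to recall as a function of the margin. As a rough illustration, the sketch below assumes that a triplet (anchor, positive, negative) counts as correct when the negative is farther from the anchor than the positive by more than η. This definition, the use of L2 distance, and the function name `triplet_accuracy` are assumptions; the exact protocol is not given on this page.

```python
import numpy as np

def triplet_accuracy(anchor, positive, negative, margin=0.0):
    """Fraction of triplets where d(anchor, negative) - d(anchor, positive) > margin."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(d_neg - d_pos > margin))

# Two toy triplets in 2-D: the first is ranked correctly, the second is not.
anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[1.0, 0.0], [3.0, 0.0]])
negative = np.array([[2.0, 0.0], [1.0, 0.0]])

print(triplet_accuracy(anchor, positive, negative))              # prints 0.5
print(triplet_accuracy(anchor, positive, negative, margin=1.5))  # prints 0.0
```

Sweeping `margin` over a range of values and plotting the result would give a curve of the kind Figures 5 and 6 describe, under the same assumed definition.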