Fine-Tuning CNN Image Retrieval with No Human Annotation
TL;DR: Hard-positive and hard-negative training examples, selected using the geometry and camera positions available from reconstructed 3D models, improve the performance of particular-object retrieval.
Abstract: Image descriptors based on activations of Convolutional Neural Networks (CNNs) have become dominant in image retrieval due to their discriminative power, compactness of representation, and search efficiency. Training of CNNs, either from scratch or fine-tuning, requires a large amount of annotated data, where a high quality of annotation is often crucial. In this work, we propose to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner. Reconstructed 3D models obtained by the state-of-the-art retrieval and structure-from-motion methods guide the selection of the training data. We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval. CNN descriptor whitening discriminatively learned from the same training data outperforms commonly used PCA whitening. We propose a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling and show that it boosts retrieval performance. Applying the proposed method to the VGG network achieves state-of-the-art performance on the standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.
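The abstract states that GeM pooling generalizes max and average pooling. A minimal sketch of that idea follows, assuming the usual generalized-mean form, in which each channel's activations are raised to a power p, averaged, and the p-th root is taken. The function names `gem_pool` and `gem_descriptor`, the default `p=3.0`, and the `eps` clamp are illustrative assumptions, not taken from the paper's code. In the paper the exponent is trainable; here it is a fixed argument for clarity.

```python
import math


def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling of one channel's spatial activations.

    feature_map: list of rows of non-negative floats (e.g. post-ReLU values).
    p: pooling exponent (assumed fixed here; learned in the paper).
       p = 1 gives average pooling; large p approaches max pooling.
    eps: small clamp so zero activations stay well-defined under powers.
    """
    values = [max(v, eps) for row in feature_map for v in row]
    mean_of_powers = sum(v ** p for v in values) / len(values)
    return mean_of_powers ** (1.0 / p)


def gem_descriptor(channels, p=3.0):
    """Pool every channel with GeM, then L2-normalize the resulting vector."""
    pooled = [gem_pool(channel, p) for channel in channels]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]


if __name__ == "__main__":
    channel = [[1.0, 2.0], [3.0, 4.0]]
    print("p=1   (average-like):", gem_pool(channel, p=1.0))
    print("p=3   (in between): ", gem_pool(channel, p=3.0))
    print("p=200 (max-like):   ", gem_pool(channel, p=200.0))
```

With p = 1 the result equals the arithmetic mean of the activations, and as p grows the result moves toward the largest activation, which is the sense in which a single exponent interpolates between the two standard pooling operators.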
Citations
•Posted Content
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
TL;DR: This paper asks whether self-supervised learning gives Vision Transformers (ViT) new properties that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, the authors make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets.
•Posted Content
Deep Learning for Person Re-identification: A Survey and Outlook
TL;DR: A powerful AGW baseline is designed, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks, and a new evaluation metric (mINP) is introduced, indicating the cost of finding all the correct matches, which provides an additional criterion for evaluating Re-ID systems in real applications.
D2-Net: A Trainable CNN for Joint Description and Detection of Local Features
Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, Torsten Sattler
- 15 Jun 2019
TL;DR: This work proposes an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector, and shows that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations.
Label Propagation for Deep Semi-Supervised Learning
Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Ondrej Chum
- 15 Jun 2019
TL;DR: This work employs a transductive label propagation method that is based on the manifold assumption to make predictions on the entire dataset and use these predictions to generate pseudo-labels for the unlabeled data and train a deep neural network.
Learning With Average Precision: Training Image Retrieval With a Listwise Loss
Jerome Revaud, Jon Almazan, Rafael Sampaio de Rezende, César Roberto de Souza
- 01 Oct 2019
TL;DR: This work proposes to directly optimize the global mAP by leveraging recent advances in listwise loss formulations: a histogram-binning approximation makes the objective differentiable, so it can be used for end-to-end learning.