LoFTR: Detector-Free Local Feature Matching with Transformers
Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, Xiaowei Zhou
- 01 Apr 2021
- pp 8922-8931
TL;DR: LoFTR uses self- and cross-attention layers in a Transformer to obtain feature descriptors conditioned on both images, which enables it to produce dense matches in low-texture areas.
Abstract: We present a novel method for local image feature matching. Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level and later refine the good matches at a fine level. In contrast to dense methods that use a cost volume to search correspondences, we use self and cross attention layers in Transformer to obtain feature descriptors that are conditioned on both images. The global receptive field provided by Transformer enables our method to produce dense matches in low-texture areas, where feature detectors usually struggle to produce repeatable interest points. The experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin. LoFTR also ranks first on two public benchmarks of visual localization among the published methods. Code is available at our project page: https://zju3dv.github.io/loftr/.
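The idea of descriptors "conditioned on both images" can be illustrated with a minimal sketch. It assumes standard softmax scaled dot-product attention, a single head, and no learned projections or positional encodings; the function names (`attend`, `self_then_cross`) are hypothetical and this is not the authors' implementation, whose attention layers may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: each query becomes a weighted mix of values."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def add(x, y):
    """Element-wise sum of two equally shaped lists of feature vectors."""
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]


def self_then_cross(feat_a, feat_b):
    """One hypothetical layer: self-attention per image, then cross-attention.

    Residual additions keep the incoming features. Learned projections,
    multiple heads, and positional encodings are omitted (assumption).
    """
    a = add(feat_a, attend(feat_a, feat_a, feat_a))
    b = add(feat_b, attend(feat_b, feat_b, feat_b))
    # Cross-attention: queries come from one image, keys/values from the other.
    a_new = add(a, attend(a, b, b))
    b_new = add(b, attend(b, a, a))
    return a_new, b_new


# Toy features: 2 locations in image A, 3 locations in image B, dimension 2.
feat_a = [[1.0, 0.0], [0.0, 1.0]]
feat_b = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
new_a, new_b = self_then_cross(feat_a, feat_b)
```

Because the cross-attention step mixes in features from the other image, the updated descriptors of image A change whenever image B changes, which is the sense in which they are conditioned on both images.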
Figures
![Figure 1](/figures/figure1-1-2z16ejavm0s4.png)
Figure 1: Comparison between the proposed method LoFTR and the detector-based method SuperGlue [37]. This example demonstrates that LoFTR is capable of finding correspondences on the texture-less wall and the floor with repetitive patterns, where detector-based methods struggle to find repeatable interest points.
![Table 3](/figures/table3-1-719iw5evnzee.png)
Table 3: Evaluation on MegaDepth [21] for outdoor pose estimation. Matching with LoFTR results in better performance in the outdoor pose estimation task.
![Table 1](/figures/table1-1-1odru6m6v34g.png)
Table 1: Homography estimation on HPatches [7]. The AUC of the corner error in percentage is reported. The suffix DS indicates differentiable matching with dual-softmax.
![Table 2](/figures/table2-1-atecwm4swiar.png)
Table 2: Evaluation on ScanNet [7] for indoor pose estimation. The AUC of the pose error in percentage is reported. LoFTR improves the state-of-the-art methods by a large margin. † indicates models trained on MegaDepth. The suffixes OT and DS indicate differentiable matching with optimal transport and dual-softmax, respectively.
![Table 4](/figures/table4-1-6yghqw78y9i1.png)
Table 4: Visual localization evaluation on the Aachen Day-Night [54] benchmark v1.1. The evaluation results on both the local feature evaluation track and the full visual localization track are reported.
![Table 5](/figures/table5-1-4ry8peczgi7p.png)
Table 5: Visual localization evaluation on the InLoc [41] benchmark.
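The captions of Tables 1 and 2 refer to differentiable matching with dual-softmax (the DS suffix). A minimal sketch of that idea follows, assuming a precomputed score matrix between coarse features of the two images; the function names, the `threshold` value of 0.2, and the mutual-best selection step are illustrative assumptions rather than settings taken from this page.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(scores):
    """Confidence P[i][j] = (softmax over row i)[j] * (softmax over column j)[i]."""
    rows = [softmax(r) for r in scores]
    cols = [softmax(list(c)) for c in zip(*scores)]  # cols[j][i]
    return [[rows[i][j] * cols[j][i] for j in range(len(scores[0]))]
            for i in range(len(scores))]


def mutual_matches(conf, threshold=0.2):
    """Keep (i, j) pairs that are the best in both their row and their column.

    The threshold is a hypothetical value chosen for this toy example.
    """
    matches = []
    for i, row in enumerate(conf):
        j = max(range(len(row)), key=row.__getitem__)
        best_i = max(range(len(conf)), key=lambda r: conf[r][j])
        if best_i == i and row[j] > threshold:
            matches.append((i, j))
    return matches


# Toy score matrix: locations 0 and 1 have clear partners, location 2 is ambiguous.
scores = [[5.0, 0.0, 0.0],
          [0.0, 5.0, 0.0],
          [0.0, 0.0, 0.0]]
conf = dual_softmax(scores)
print(mutual_matches(conf))  # [(0, 0), (1, 1)]
```

Multiplying the row-wise and column-wise softmax makes a pair confident only when each location prefers the other, so the ambiguous third location falls below the threshold and is dropped.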
Citations
Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation
Jiankun Li, Peisen Wang, Peng Xiong, Tao Cai, Zi-Ping Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, Shuaicheng Liu
- 22 Mar 2022
TL;DR: This work introduces a hierarchical network with recurrent refinement that updates disparities in a coarse-to-fine manner, a stacked cascaded architecture for inference, and a new synthetic dataset that pays special attention to difficult cases to generalize better to real-world scenes.
150
Geometric Transformer for Fast and Robust Point Cloud Registration
Zheng Qin, Yu-Yan Peng, Kaiping Xu, Hao Yu, Changjian Wang, Yulan Guo
- 14 Feb 2022
TL;DR: This work proposes Geometric Transformer, a simplistic design that attains surprisingly high matching accuracy such that no RANSAC is required in the estimation of alignment transformation, leading to 100 times acceleration.
148
TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers
- 01 Jun 2022
TL;DR: TransMVSNet uses a feature matching transformer to aggregate long-range context information within and across images, and achieves state-of-the-art performance on the DTU dataset, the Tanks and Temples benchmark, and the BlendedMVS dataset.
134
Neural 3D Reconstruction in the Wild
Jiaming Sun, Xin Chen, Qianqian Wang, Zhengqi Li, Hadar Averbuch-Elor, Xiaowei Zhou, Noah Snavely
- 25 May 2022
TL;DR: This work introduces a new method that enables efficient and accurate surface reconstruction from Internet photo collections in the presence of varying illumination and proposes a hybrid voxel- and surface-guided sampling technique that allows for more efficient ray sampling around surfaces and leads to significant improvements in reconstruction quality.
References
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- 27 Jun 2016
TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.
•Posted Content
Deep Residual Learning for Image Recognition
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
117.9K
•Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- 12 Jun 2017
TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Distinctive Image Features from Scale-Invariant Keypoints
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- 01 Jan 2017
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
51.8K