Abstract
We introduce a comprehensive benchmark for local features and robust estimation algorithms, focusing on the downstream task—the accuracy of the reconstructed camera pose—as our primary metric. Our pipeline’s modular structure allows easy integration, configuration, and combination of different methods and heuristics. This is demonstrated by embedding dozens of popular algorithms and evaluating them, from seminal works to the cutting edge of machine learning research. We show that with proper settings, classical solutions may still outperform the perceived state of the art. Besides establishing the actual state of the art, the conducted experiments reveal unexpected properties of structure from motion pipelines that can help improve their performance, for both algorithmic and learned methods. Data and code are online (https://github.com/ubc-vision/image-matching-benchmark), providing an easy-to-use and flexible framework for the benchmarking of local features and robust estimation methods, both alongside and against top-performing methods. This work provides a basis for the Image Matching Challenge (https://image-matching-challenge.github.io).
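The benchmark scores methods by the angular error of the recovered relative camera pose rather than by intermediate matching metrics. As an illustrative sketch only (not the benchmark's actual API; function names are ours), the two standard angular errors, rotation error and translation-direction error, can be computed as follows with NumPy:

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """Angular difference between two rotation matrices, in degrees."""
    # trace(R_gt^T R_est) = 1 + 2*cos(angle) for the residual rotation.
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    """Angle between translation directions, in degrees.

    The essential matrix only determines translation up to scale and sign,
    so we compare directions and take the absolute dot product.
    """
    cos_angle = abs(np.dot(t_gt, t_est)) / (
        np.linalg.norm(t_gt) * np.linalg.norm(t_est))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```

A pose is then typically counted as correct if both angles fall below a threshold, and accuracy is aggregated over a range of thresholds (mean average accuracy).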
Notes
In (Barroso-Laguna et al. 2019) the models are converted to TensorFlow—we use the original PyTorch version.
Time measured on ‘1-standard-2’ VMs on Google Cloud Compute: 2 vCPUs with 7.5 GB of RAM and no GPU.
References
Aanaes, H., Dahl, A. L., & Steenstrup-Pedersen, K. (2012). Interesting interest points. International Journal of Computer Vision, 97, 18–35.
Aanaes, H., & Kahl, F. (2002). Estimation of deformable structure and motion. In Vision and modelling of dynamic scenes workshop.
Agarwal, S., Snavely, N., Simon, I., Seitz, S., & Szeliski, R. (2009). Building Rome in one day. In International conference on computer vision.
Alahi, A., Ortiz, R., & Vandergheynst, P. (2012). FREAK: Fast retina keypoint. In Conference on computer vision and pattern recognition.
Alcantarilla, P. F., Nuevo, J., & Bartoli, A. (2013). Fast explicit diffusion for accelerated features in nonlinear scale spaces. In British machine vision conference.
Aldana-Iuit, J., Mishkin, D., Chum, O., & Matas, J. (2019). Saddle: Fast and repeatable features with good coverage. Image and Vision Computing, 92, 103807.
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., & Sivic, J. (2016). NetVLAD: CNN architecture for weakly supervised place recognition. In Conference on computer vision and pattern recognition.
Arandjelovic, R. & Zisserman, A. (2012). Three things everyone should know to improve object retrieval. In Conference on computer vision and pattern recognition.
Badino, H., Huber, D., & Kanade, T. (2011). The CMU visual localization data set. http://3dvis.ri.cmu.edu/data-sets/localization.
Balntas, V. (2018). SILDa: A multi-task dataset for evaluating visual localization. https://research.scape.io/silda/.
Balntas, V., Lenc, K., Vedaldi, A., & Mikolajczyk, K. (2017). HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Conference on computer vision and pattern recognition.
Balntas, V., Li, S., & Prisacariu, V. (2018). RelocNet: Continuous metric learning relocalisation using neural nets. In European conference on computer vision.
Balntas, V., Riba, E., Ponsa, D., & Mikolajczyk, K. (2016). Learning local feature descriptors with triplets and shallow convolutional neural networks. In British machine vision conference.
Barath, D., & Matas, J. (2018). Graph-cut RANSAC. In Conference on computer vision and pattern recognition.
Barath, D., Matas, J., & Noskova, J. (2019). MAGSAC: Marginalizing sample consensus. In Conference on computer vision and pattern recognition.
Barroso-Laguna, A., Riba, E., Ponsa, D., & Mikolajczyk, K. (2019). Key.Net: Keypoint detection by handcrafted and learned CNN filters. In International conference on computer vision.
Baumberg, A. (2000). Reliable feature matching across widely separated views. In Conference on computer vision and pattern recognition.
Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. In European conference on computer vision.
Beaudet, P. R. (1978). Rotationally invariant image operators. In Proceedings of the 4th international joint conference on pattern recognition (pp. 579–583). Kyoto.
Bellavia, F., & Colombo, C. (2020). Is there anything new to say about SIFT matching? International Journal of Computer Vision, 2020, 1–20.
Bian, J.-W., Wu, Y.-H., Zhao, J., Liu, Y., Zhang, L., Cheng, M.-M., & Reid, I. (2019). An evaluation of feature matchers for fundamental matrix estimation. In British machine vision conference.
Brachmann, E., & Rother, C. (2019). Neural-guided RANSAC: learning where to sample model hypotheses. In International conference on computer vision.
Bradski, G. (2000). The OpenCV library. Dr. Dobb’s Journal of Software Tools, 120, 122–125.
Brown, M., Hua, G., & Winder, S. (2011). Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 43–57.
Brown, M., & Lowe, D. (2007). Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74, 59–73.
Bui, M., Baur, C., Navab, N., Ilic, S., & Albarqouni, S. (2019). Adversarial networks for camera pose regression and refinement. In International conference on computer vision.
Chum, O., & Matas, J. (2005). Matching with PROSAC—progressive sample consensus. In Conference on computer vision and pattern recognition.
Chum, O., Matas, J., & Kittler, J. (2003). Locally optimized RANSAC. In Pattern recognition.
Chum, O., Werner, T., & Matas, J. (2005). Two-view geometry estimation unaffected by a dominant plane. In Conference on computer vision and pattern recognition.
Cui, H., Gao, X., Shen, S., & Hu, Z. (2017). HSfM: Hybrid structure-from-motion. In Conference on computer vision and pattern recognition.
Dang, Z., Yi, K. M., Hu, Y., Wang, F., Fua, P., & Salzmann, M. (2018). Eigendecomposition-free training of deep networks with zero eigenvalue-based losses. In European conference on computer vision.
Detone, D., Malisiewicz, T., & Rabinovich, A. (2017). Toward geometric deep SLAM. Preprint arXiv:1707.07410.
Detone, D., Malisiewicz, T., & Rabinovich, A. (2018). SuperPoint: Self-supervised interest point detection and description. In CVPR workshop on deep learning for visual SLAM.
Dong, J., Karianakis, N., Davis, D., Hernandez, J., Balzer, J., & Soatto, S. (2015). Multi-view feature engineering and learning. In Conference on computer vision and pattern recognition.
Dong, J. & Soatto, S. (2015). Domain-size pooling in local descriptors: DSP-SIFT. In Conference on computer vision and pattern recognition.
Dusmanu, M., Rocco, I., Pajdla, T., Pollefeys, M., Sivic, J., Torii, A., & Sattler, T. (2019). D2-Net: A trainable CNN for joint detection and description of local features. In Conference on computer vision and pattern recognition.
Ebel, P., Mishchuk, A., Yi, K. M., Fua, P., & Trulls, E. (2019). Beyond Cartesian representations for local descriptors. In International conference on computer vision.
Fischler, M., & Bolles, R. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395.
Gay, P., Bansal, V., Rubino, C., & Bue, A. D. (2017). Probabilistic structure from motion with objects (PSfMO). In International conference on computer vision.
Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on computer vision and pattern recognition.
Hartley, R. (1997). In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6), 580–593.
Hartley, R., & Zisserman, A. (2000). Multiple view geometry in computer vision. Cambridge: Cambridge University Press.
Hartley, R. I. (1994). Projective reconstruction and invariants from multiple images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(10), 1036–1041.
He, K., Lu, Y., & Sclaroff, S. (2018). Local descriptors optimized for average precision. In Conference on computer vision and pattern recognition.
Heinly, J., Schoenberger, J., Dunn, E., & Frahm, J.-M. (2015). Reconstructing the world in six days. In Conference on computer vision and pattern recognition.
Jacobs, N., Roman, N., & Pless, R. (2007). Consistent temporal variations in many outdoor scenes. In Conference on computer vision and pattern recognition.
Kendall, A., Grimes, M., & Cipolla, R. (2015). Posenet: A convolutional network for real-time 6-DOF camera relocalization. In International conference on computer vision.
Krishna Murthy, J., Iyer, G., & Paull, L. (2019). gradSLAM: Dense SLAM meets automatic differentiation.
Lenc, K., Gulshan, V., & Vedaldi, A. (2011). VLBenchmarks. http://www.vlfeat.org/benchmarks/.
Leutenegger, S., Chli, M., & Siegwart, R. Y. (2011). BRISK: Binary robust invariant scalable keypoints. In International conference on computer vision.
Li, Z., & Snavely, N. (2018). MegaDepth: Learning single-view depth prediction from internet photos. In Conference on computer vision and pattern recognition.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Luo, Z., Shen, T., Zhou, L., Zhang, J., Yao, Y., Li, S., Fang, T., & Quan, L. (2019). ContextDesc: Local descriptor augmentation with cross-modality context. In Conference on computer vision and pattern recognition.
Luo, Z., Shen, T., Zhou, L., Zhu, S., Zhang, R., Yao, Y., Fang, T., & Quan, L. (2018). Geodesc: Learning local descriptors by integrating geometry constraints. In European conference on computer vision.
Lynen, S., Zeisl, B., Aiger, D., Bosse, M., Hesch, J., Pollefeys, M., Siegwart, R., & Sattler, T. (2019). Large-scale, real-time visual-inertial localization revisited. Preprint.
Maddern, W., Pascoe, G., Linegar, C., & Newman, P. (2017). 1 year, 1000 km: The Oxford RobotCar dataset. International Journal of Robotics Research, 36(1), 3–15.
Matas, J., Chum, O., Urban, M., & Pajdla, T. (2004). Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10), 761–767.
Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1615–1630.
Mikolajczyk, K., Schmid, C., & Zisserman, A. (2004). Human detection based on a probabilistic assembly of robust part detectors. In European conference on computer vision.
Mishchuk, A., Mishkin, D., Radenovic, F., & Matas, J. (2017). Working hard to know your neighbor’s margins: Local descriptor learning loss. In Advances in neural information processing systems.
Mishkin, D., Matas, J., & Perdoch, M. (2015). MODS: Fast and robust method for two-view matching. Computer Vision and Image Understanding, 141, 81–93.
Mishkin, D., Radenovic, F., & Matas, J. (2018). Repeatability is not enough: Learning affine regions via discriminability. In European conference on computer vision.
Muja, M. & Lowe, D. G. (2009). Fast approximate nearest neighbors with automatic algorithm configuration. In International conference on computer vision.
Mukundan, A., Tolias, G., & Chum, O. (2019). Explicit spatial encoding for deep local descriptors. In Conference on computer vision and pattern recognition.
Mur-Artal, R., Montiel, J., & Tardós, J. (2015). ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5), 1147–1163.
Nister, D. (2003). An efficient solution to the five-point relative pose problem. In Conference on computer vision and pattern recognition.
Noh, H., Araujo, A., Sim, J., Weyand, T., & Han, B. (2017). Large-scale image retrieval with attentive deep local features. In International conference on computer vision.
Ono, Y., Trulls, E., Fua, P., & Yi, K. M. (2018). LF-Net: Learning local features from images. In Advances in neural information processing systems.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pizer, S. M., Amburn, E. P., Austin, J. D., Cromartie, R., Geselowitz, A., Greer, T., ter Haar Romeny, B., Zimmerman, J. B., & Zuiderveld, K. (1987). Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing, 39(3), 355–368.
Pritchett, P., & Zisserman, A. (1998). Wide baseline stereo matching. In International conference on computer vision (pp. 754–760).
Pultar, M., Mishkin, D., & Matas, J. (2019). Leveraging outdoor webcams for local descriptor learning. In Computer vision winter workshop.
Qi, C., Su, H., Mo, K., & Guibas, L. (2017). Pointnet: Deep learning on point sets for 3D classification and segmentation. In Conference on computer vision and pattern recognition.
Radenovic, F., Tolias, G., & Chum, O. (2016). CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In European conference on computer vision.
Ranftl, R. & Koltun, V. (2018). Deep fundamental matrix estimation. In European conference on computer vision.
Revaud, J., Weinzaepfel, P., de Souza, C. R., Pion, N., Csurka, G., Cabon, Y., & Humenberger, M. (2019). R2D2: Repeatable and reliable detector and descriptor. In Advances in neural information processing systems.
Rosten, E., Porter, R., & Drummond, T. (2010). Faster and better: A machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 105–119.
Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). ORB: An efficient alternative to SIFT or SURF. In International conference on computer vision.
Sarlin, P., DeTone, D., Malisiewicz, T., & Rabinovich, A. (2020). SuperGlue: Learning feature matching with graph neural networks. In Conference on computer vision and pattern recognition.
Sattler, T., Leibe, B., & Kobbelt, L. (2012). Improving image-based localization by active correspondence search. In European conference on computer vision.
Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand, L., Stenborg, E., Safari, D., Okutomi, M., Pollefeys, M., Sivic, J., Kahl, F., & Pajdla, T. (2018). Benchmarking 6DOF outdoor visual localization in changing conditions. In Conference on computer vision and pattern recognition.
Sattler, T., Weyand, T., Leibe, B., & Kobbelt, L. (2012). Image retrieval for image-based localization revisited. In British machine vision conference.
Sattler, T., Zhou, Q., Pollefeys, M., & Leal-Taixe, L. (2019). Understanding the limitations of CNN-based absolute camera pose regression. In Conference on computer vision and pattern recognition.
Savinov, N., Seki, A., Ladicky, L., Sattler, T., & Pollefeys, M. (2017). Quad-networks: Unsupervised learning to rank for interest point detection. In Conference on computer vision and pattern recognition.
Schönberger, J., & Frahm, J. (2016). Structure-from-motion revisited. In Conference on computer vision and pattern recognition.
Schönberger, J., Hardmeier, H., Sattler, T., & Pollefeys, M. (2017). Comparative evaluation of hand-crafted and learned local features. In Conference on computer vision and pattern recognition.
Schönberger, J., Zheng, E., Pollefeys, M., & Frahm, J. (2016). Pixelwise view selection for unstructured multi-view stereo. In European conference on computer vision.
Shi, Y., Zhu, J., Fang, Y., Lien, K., & Gu, J. (2019). Self-supervised learning of depth and ego-motion with differentiable bundle adjustment. Preprint.
Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., & Moreno-Noguer, F. (2015). Discriminative learning of deep convolutional feature point descriptors. In International conference on computer vision.
Strecha, C., Hansen, W., Van Gool, L., Fua, P., & Thoennessen, U. (2008). On benchmarking camera calibration and multi-view stereo for high resolution imagery. In Conference on computer vision and pattern recognition.
Sturm, J., Engelhard, N., Endres, F., Burgard, W., & Cremers, D. (2012). A benchmark for the evaluation of RGB-D SLAM systems. In International conference on intelligent robots and systems.
Sun, W., Jiang, W., Trulls, E., Tagliasacchi, A., & Yi, K. M. (2020). ACNe: Attentive context normalization for robust permutation-equivariant learning. In Conference on computer vision and pattern recognition.
Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., et al. (2019). InLoc: indoor visual localization with dense matching and view synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1744–1756.
Tang, C., & Tan, P. (2019). Ba-Net: dense bundle adjustment network. In International conference on learning representations.
Tateno, K., Tombari, F., Laina, I., & Navab, N. (2017). CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In Conference on computer vision and pattern recognition.
Thomee, B., Shamma, D., Friedland, G., Elizalde, B., Ni, K., Poland, D., et al. (2016). YFCC100M: the new data in multimedia research. Communications of the ACM, 59, 64–73.
Tian, Y., Fan, B., & Wu, F. (2017). L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In Conference on computer vision and pattern recognition.
Tian, Y., Yu, X., Fan, B., Wu, F., Heijnen, H., & Balntas, V. (2019). SOSNet: Second order similarity regularization for local descriptor learning. In Conference on computer vision and pattern recognition.
Tolias, G., Avrithis, Y., & Jégou, H. (2016). Image search with selective match kernels: Aggregation across single and multiple images. International Journal of Computer Vision, 116(3), 247–261.
Torr, P., & Zisserman, A. (2000). MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78, 138–156.
Triggs, B., Mclauchlan, P., Hartley, R., & Fitzgibbon, A. (2000). Bundle adjustment—A modern synthesis. In Vision algorithms: Theory and practice (pp. 298–372).
Vedaldi, A., & Fulkerson, B. (2010). Vlfeat: An open and portable library of computer vision algorithms. In Proceedings of the 18th ACM international conference on multimedia, MM’10 (pp. 1469–1472).
Verdie, Y., Yi, K. M., Fua, P., & Lepetit, V. (2015). TILDE: A temporally invariant learned detector. In Conference on computer vision and pattern recognition.
Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., & Fragkiadaki, K. (2017). SFM-Net: Learning of structure and motion from video. Preprint.
Wei, X., Zhang, Y., Gong, Y., & Zheng, N. (2018). Kernelized subspace pooling for deep local descriptors. In Conference on computer vision and pattern recognition.
Wei, X., Zhang, Y., Li, Z., Fu, Y., & Xue, X. (2020). DeepSFM: Structure from motion via deep bundle adjustment. In European conference on computer vision.
Wu, C. (2013). Towards linear-time incremental structure from motion. In International conference on 3D vision.
Yi, K. M., Trulls, E., Lepetit, V., & Fua, P. (2016). LIFT: Learned invariant feature transform. In European conference on computer vision.
Yi, K. M., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., & Fua, P. (2018). Learning to find good correspondences. In Conference on computer vision and pattern recognition.
Yoo, A. B., Jette, M. A., & Grondona, M. (2003). SLURM: Simple Linux utility for resource management. In Workshop on job scheduling strategies for parallel processing (pp. 44–60). Berlin: Springer.
Zagoruyko, S., & Komodakis, N. (2015). Learning to compare image patches via convolutional neural networks. In Conference on computer vision and pattern recognition.
Zhang, J., Sun, D., Luo, Z., Yao, A., Zhou, L., Shen, T., Chen, Y., Quan, L., & Liao, H. (2019). Learning two-view correspondences and geometry using order-aware network. In International conference on computer vision.
Zhang, X., Yu, F. X., Karaman, S., & Chang, S.-F. (2017). Learning discriminative and transformation covariant local feature detectors. In Conference on computer vision and pattern recognition.
Zhao, C., Cao, Z., Li, C., Li, X., & Yang, J. (2019). NM-Net: Mining reliable neighbors for robust feature correspondences. In Conference on computer vision and pattern recognition.
Zhou, Q., Sattler, T., Pollefeys, M., & Leal-Taixe, L. (2020). To learn or not to learn: Visual localization from essential matrices. In International conference on robotics and automation.
Zhu, S., Zhang, R., Zhou, L., Shen, T., Fang, T., Tan, P., & Quan, L. (2018). Very large-scale global SfM by distributed motion averaging. In Conference on computer vision and pattern recognition.
Zitnick, C., & Ramnath, K. (2011). Edge foci interest points. In International conference on computer vision.
Additional information
Communicated by Konrad Schindler.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant “Deep Visual Geometry Machines” (RGPIN-2018-03788), by systems supplied by Compute Canada, and by Google’s Visual Positioning Service. DM and JM were supported by OP VVV funded Project CZ.02.1.01/0.0/0.0/16 019/0000765 “Research Center for Informatics”. DM was also supported by CTU student Grant SGS17/185/OHK3/3T/13 and by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry for Digital and Economic Affairs, and the Province of Upper Austria in the frame of the COMET center SCCH. AM was supported by the Swiss National Science Foundation.
Cite this article
Jin, Y., Mishkin, D., Mishchuk, A. et al. Image Matching Across Wide Baselines: From Paper to Practice. Int J Comput Vis 129, 517–547 (2021). https://doi.org/10.1007/s11263-020-01385-0