Abstract
2D-3D matching determines the spatial relationship between 2D and 3D space; it underpins Augmented Reality (AR) and robot pose estimation and provides support for multi-sensor fusion. A common route to 2D-3D matching is to extract cross-domain descriptors from 2D images and 3D point clouds: 3D point cloud volumes and 2D image patches are sampled around the keypoints of the point clouds and images, and cross-domain descriptors are learned from these samples. However, handcrafted descriptors struggle to achieve 2D-3D matching, while learned cross-domain descriptors are vulnerable to translation, scale, and rotation of the cross-domain data. In this paper, we propose a novel network, HAS-Net, for learning cross-domain descriptors that match 2D image patches with 3D point cloud volumes. HAS-Net introduces a spatial transformer network (STN) to handle translation, scaling, rotation, and more generic warping of 2D image patches. In addition, HAS-Net adopts the negative-sampling strategy of the hard triplet loss, which removes the uncertainty of randomly sampling negatives during training and thereby improves the ability to distinguish the hardest samples. Experiments demonstrate the superiority of HAS-Net on 2D-3D retrieval and matching. To demonstrate the robustness of the learned descriptors, the 3D half of the cross-domain descriptors learned by HAS-Net is applied to 3D global registration.
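To make the two components named in the title concrete, the following is a minimal PyTorch sketch of a hard triplet loss with HardNet-style in-batch hardest-negative mining, as the abstract describes for replacing random negative sampling. The function name, tensor shapes, margin value, and single-direction mining are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def hardest_triplet_loss(anchors, positives, margin=1.0):
    """Triplet margin loss with in-batch hardest-negative mining.

    anchors:   (B, D) L2-normalized descriptors of 2D image patches
    positives: (B, D) L2-normalized descriptors of 3D point cloud volumes,
               where row i of `positives` matches row i of `anchors`
    """
    # Pairwise Euclidean distances between the two domains: (B, B).
    dist = torch.cdist(anchors, positives)
    pos = dist.diag()  # distances of the matching (anchor, positive) pairs
    # Mask the diagonal so a positive can never be selected as a negative.
    eye = torch.eye(dist.size(0), device=dist.device, dtype=torch.bool)
    neg = dist.masked_fill(eye, float('inf'))
    # Hardest negative per anchor: the closest non-matching 3D descriptor.
    # (HardNet mines in both directions; one direction is shown for brevity.)
    hardest_neg = neg.min(dim=1).values
    return F.relu(margin + pos - hardest_neg).mean()
```

Likewise, a minimal sketch of a spatial transformer module in the spirit of Jaderberg et al., which regresses one affine transform per 2D patch and resamples the patch accordingly; the patch size (64x64 grayscale) and localization-network layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Spatial transformer for 64x64 single-channel image patches."""
    def __init__(self):
        super().__init__()
        # Localization network: regresses 6 affine parameters per patch.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(10 * 12 * 12, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialize the final layer to output the identity transform.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)  # per-patch 2x3 affine matrix
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

In such a design the STN sits in front of the 2D descriptor branch, so translated, scaled, rotated, or warped patches are rectified toward a canonical frame before description, which is what makes the learned descriptors less sensitive to those transformations.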
Acknowledgements
This work was funded by China Postdoctoral Science Foundation (No. 2021M690094).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Lai, B. et al. (2021). Learning Cross-Domain Descriptors for 2D-3D Matching with Hard Triplet Loss and Spatial Transformer Network. In: Peng, Y., Hu, SM., Gabbouj, M., Zhou, K., Elad, M., Xu, K. (eds) Image and Graphics. ICIG 2021. Lecture Notes in Computer Science(), vol 12890. Springer, Cham. https://doi.org/10.1007/978-3-030-87361-5_2
DOI: https://doi.org/10.1007/978-3-030-87361-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87360-8
Online ISBN: 978-3-030-87361-5