A Comparison of SuperPoint + SuperGlue and SIFT-Based Pipelines for Sparse-View Structure-from-Motion
Shoya Morizaki (a), Masayuki Matsuoka (a*)
a) Department of Information Engineering, Mie University
1577 Kurima-machiya, Tsu, 514-8507 Japan
* matsuoka[at]info.mie-u.ac.jp
Abstract
Structure-from-Motion (SfM) is a widely used technique for reconstructing three-dimensional structures from two-dimensional images, with applications in cultural heritage preservation, urban planning, and environmental monitoring. However, its performance often deteriorates under sparse-view conditions, where image coverage is limited by restricted viewpoints or flight paths, such as in UAV-based surveys or narrow indoor environments. To address this challenge, we explored the integration of machine-learning-based keypoint detection and matching methods, SuperPoint and SuperGlue, into the SfM workflow and compared their performance with that of conventional SIFT-based pipelines implemented in OpenMVG. In this approach, SuperPoint extracted robust keypoints, while SuperGlue performed context-aware feature matching. The resulting correspondences were then passed to OpenMVG for geometric verification, incremental reconstruction, and camera pose estimation. Our comparative study evaluated three configurations: (1) a fully traditional SIFT-based pipeline, (2) a hybrid approach combining SuperPoint with OpenMVG, and (3) a deep-learning-based pipeline combining SuperPoint, SuperGlue, and OpenMVG. The evaluation focused on reconstruction success rate, point cloud density, and camera pose accuracy under varying conditions of image overlap and viewpoint sparsity. Initial trials suggested that the deep-learning-based approach may offer improved performance in challenging conditions involving occlusion and limited image data. This study aims to contribute to a deeper understanding of how recent advances in feature extraction and image matching can be integrated into classical SfM frameworks to enhance robustness and accuracy, especially in scenarios constrained by sparse visual input.
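As a concrete illustration of the hybrid matching stage described in the abstract, the following minimal sketch runs SuperPoint and SuperGlue on a single image pair and saves the raw correspondences. It assumes the pretrained Matching wrapper from the magicleap/SuperGluePretrainedNetwork repository is importable; the file names, thresholds, and output arrays are illustrative assumptions, and converting the saved keypoints and matches into OpenMVG's feature/match files is handled in a separate step.

import numpy as np
import torch
import cv2

# Pretrained SuperPoint + SuperGlue wrapper from SuperGluePretrainedNetwork (assumed on the path)
from models.matching import Matching

device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = {
    'superpoint': {'max_keypoints': 2048, 'keypoint_threshold': 0.005},
    'superglue': {'weights': 'outdoor', 'match_threshold': 0.2},
}
matching = Matching(config).eval().to(device)

def load_gray_tensor(path):
    # Read an image as grayscale and convert it to a normalized 1x1xHxW tensor
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return torch.from_numpy(img / 255.0).float()[None, None].to(device)

img0 = load_gray_tensor('images/view_000.jpg')  # hypothetical file names
img1 = load_gray_tensor('images/view_001.jpg')

with torch.no_grad():
    pred = matching({'image0': img0, 'image1': img1})

kpts0 = pred['keypoints0'][0].cpu().numpy()
kpts1 = pred['keypoints1'][0].cpu().numpy()
matches = pred['matches0'][0].cpu().numpy()   # index into kpts1, or -1 if unmatched
valid = matches > -1
pairs = np.stack([np.where(valid)[0], matches[valid]], axis=1)

# Save raw correspondences; a converter then writes OpenMVG's feature and match files
np.save('out/kpts_000.npy', kpts0)
np.save('out/kpts_001.npy', kpts1)
np.save('out/matches_000_001.npy', pairs)

The 'outdoor' weights and a match_threshold of 0.2 are the repository defaults for outdoor scenes; in practice these settings would be tuned per dataset and per level of viewpoint sparsity.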