CLIP–Depth Guided Multi-View Pair Selection for Large-Scale 3D Reconstruction

A view selection strategy based on CLIP and Depth-Anything v2 that improves the robustness and scalability of agricultural 3D reconstruction with COLMAP.

Traditional structure-from-motion pipelines struggle in large, repetitive agricultural scenes because matches are ambiguous and viewpoint coverage is weak. In this project, we design an end-to-end 3D reconstruction pipeline that fuses global descriptors from CLIP and Depth-Anything v2 to shortlist image pairs, then uses rotation-robust local features to build stable maps with COLMAP.

Methodology

The pipeline combines global semantic and geometric cues with robust feature matching:

Global descriptor fusion. Extract CLIP and Depth-Anything v2 global descriptors for each image to capture both high-level semantics and depth-aware geometry.
View pair clustering. Project the fused descriptors to a low-dimensional embedding with t-SNE and cluster the embedding with KMeans; the clusters are used to shortlist view pairs that are likely to have good overlap and complementary coverage.
Robust local matching. Detect rotation-robust ALIKED keypoints and match them with LightGlue to obtain accurate, efficient correspondences within the shortlisted pairs.
Incremental COLMAP mapping. Feed the matched pairs into an incremental COLMAP pipeline to reconstruct large field scenes with improved stability and fewer gross matching failures.
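
The fusion and shortlisting steps above can be sketched in a few lines. This is a minimal illustration rather than the project's actual code, and it rests on two assumptions not stated above: that fusion means L2-normalizing each descriptor and concatenating them, and that a shortlisted pair is any two images assigned to the same cluster. The function names, the toy descriptors, and the cluster labels are hypothetical; in practice the labels would come from KMeans run on the t-SNE embedding.

```python
import math
from collections import defaultdict
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_descriptors(clip_desc, depth_desc):
    """Assumed fusion rule: normalize each descriptor, then concatenate."""
    return l2_normalize(clip_desc) + l2_normalize(depth_desc)


def pairs_from_clusters(labels):
    """Return sorted (i, j) image-index pairs, i < j, sharing a cluster label."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)
    pairs = []
    for idxs in members.values():
        pairs.extend(combinations(idxs, 2))
    return sorted(pairs)


# Toy data: hypothetical 3-D CLIP and 2-D depth descriptors for one image.
fused = fuse_descriptors([3.0, 0.0, 4.0], [0.0, 2.0])

# Hypothetical cluster labels for five images (stand-in for KMeans output).
labels = [0, 0, 1, 0, 1]
shortlist = pairs_from_clusters(labels)
print(shortlist)
```

Normalizing before concatenation keeps either descriptor from dominating the fused vector purely because of its scale. The resulting shortlist is what would be passed to the local matching step in place of an exhaustive pair list.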

Results & Impact

The CLIP–Depth guided pairing improves reconstruction completeness and consistency in long, repetitive rows compared with naive nearest-neighbor or fully connected pairing strategies.
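
To give a sense of why shortlisting matters for scalability, the snippet below compares the number of candidate pairs under fully connected pairing with the number under within-cluster pairing. The image count and cluster sizes are hypothetical round numbers chosen for illustration, not measurements from this project, and within-cluster pairing is an assumed reading of the shortlisting step.

```python
from math import comb


def exhaustive_pairs(n_images):
    """Number of pairs when every image is matched against every other."""
    return comb(n_images, 2)


def clustered_pairs(cluster_sizes):
    """Number of pairs when matching is restricted to images in the same cluster."""
    return sum(comb(size, 2) for size in cluster_sizes)


# Hypothetical dataset: 1000 images split into 20 clusters of 50 images each.
n_images = 1000
cluster_sizes = [50] * 20

print(exhaustive_pairs(n_images))      # 499500
print(clustered_pairs(cluster_sizes))  # 24500
```

Under these assumed numbers the matcher would process about 20 times fewer pairs, and the pairs it skips are the cross-cluster ones.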

This project shows that integrating foundation model descriptors with classic SfM pipelines can improve 3D mapping robustness in agricultural environments without requiring modifications to the underlying reconstruction engine.