CLIP–Depth Guided Multi-View Pair Selection for Large-Scale 3D Reconstruction

A view-selection strategy based on CLIP and Depth-Anything v2 that improves the robustness and scalability of agricultural 3D reconstruction with COLMAP.

Traditional structure-from-motion (SfM) pipelines struggle in large, repetitive agricultural scenes because of ambiguous matches and weak viewpoint coverage. In this project, we design an end-to-end 3D reconstruction pipeline that fuses global descriptors from CLIP and Depth-Anything v2 to shortlist image pairs, then uses rotation-robust local features to build stable maps with COLMAP.

Methodology

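The pipeline combines global semantic and geometric cues with robust feature matching: CLIP and Depth-Anything v2 descriptors shortlist candidate image pairs, and rotation-robust local features are then matched on those pairs before reconstruction in COLMAP.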
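As a rough illustration of the pair-shortlisting idea, the sketch below fuses two sets of per-image global descriptors and keeps each image's top-k most similar neighbors. It is not the project's actual implementation: the function names, the weighted-sum fusion rule, and the `top_k` and `depth_weight` parameters are all assumptions made for the example, and the descriptors are assumed to be precomputed (one CLIP embedding and one depth-derived vector per image). The helper that writes a plain "name1 name2" pair list assumes the one-pair-per-line text format that COLMAP's custom matching accepts.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def shortlist_pairs(clip_desc, depth_desc, top_k=5, depth_weight=0.5):
    """Return sorted unique (i, j) index pairs with i < j.

    clip_desc:  (N, Dc) array, one global CLIP embedding per image (assumed precomputed).
    depth_desc: (N, Dd) array, one global depth descriptor per image (assumed precomputed).
    The weighted sum of similarities is a hypothetical fusion rule, not the source's.
    """
    clip_n = l2_normalize(np.asarray(clip_desc, dtype=np.float64))
    depth_n = l2_normalize(np.asarray(depth_desc, dtype=np.float64))

    # Fuse the semantic and geometric cosine-similarity matrices.
    sim = (1.0 - depth_weight) * (clip_n @ clip_n.T) + depth_weight * (depth_n @ depth_n.T)
    np.fill_diagonal(sim, -np.inf)  # never pair an image with itself

    n = sim.shape[0]
    k = min(top_k, n - 1)
    pairs = set()
    for i in range(n):
        # Indices of the k images most similar to image i.
        for j in np.argsort(-sim[i])[:k]:
            j = int(j)
            pairs.add((min(i, j), max(i, j)))  # store each pair once
    return sorted(pairs)


def write_pair_list(pairs, image_names, path):
    """Write one 'name_a name_b' line per shortlisted pair."""
    with open(path, "w", encoding="utf-8") as f:
        for i, j in pairs:
            f.write(f"{image_names[i]} {image_names[j]}\n")
```

Called as `shortlist_pairs(clip_desc, depth_desc, top_k=20)`, this returns at most `N * top_k` pairs, which can then be written with `write_pair_list` and passed to the matcher in place of exhaustive pairing. Taking the union of each image's neighbors, rather than requiring mutual neighbors, is a design choice that favors coverage over strictness.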

Results & Impact

The CLIP–Depth guided pairing improves reconstruction completeness and consistency in long, repetitive rows compared with naive nearest-neighbor or fully-connected pairing strategies.
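To give a sense of why shortlisting matters for scalability, the sketch below compares the number of candidate pairs under fully connected pairing with an upper bound for top-k shortlisting. The function names and the values of `n` and `k` are illustrative assumptions, not figures from the project.

```python
def exhaustive_pair_count(n):
    """Number of unordered image pairs when every image is matched to every other."""
    return n * (n - 1) // 2


def topk_pair_upper_bound(n, k):
    """Upper bound on unique pairs when each image proposes at most k neighbors."""
    return min(n * min(k, max(n - 1, 0)), exhaustive_pair_count(n))


if __name__ == "__main__":
    n, k = 1000, 20  # hypothetical dataset size and shortlist length
    print(exhaustive_pair_count(n))     # 499500
    print(topk_pair_upper_bound(n, k))  # 20000
```

Fully connected pairing grows quadratically with the number of images, while the shortlist grows linearly for a fixed k, so the gap widens as the dataset gets larger.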

This project shows that integrating foundation-model descriptors with classic SfM pipelines can make 3D mapping in agricultural environments more robust, without requiring modifications to the underlying reconstruction engine.