Vision–Language Model Adaptation for Open-Vocabulary Agricultural Perception

Adapting general-purpose vision–language models to agricultural scenes using lightweight adapters, enabling open-vocabulary perception and robust downstream yield prediction under real field conditions.

This project focuses on tailoring large vision–language models (VLMs) to agriculture-specific imagery and semantics. By inserting lightweight adapter modules into pretrained VLMs, we enable open-vocabulary understanding of field scenes and repurpose the learned multimodal embeddings for downstream tasks such as yield prediction and decision support for farmers.
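The adapter idea can be sketched in a few lines. Below is a minimal, framework-free illustration of a residual bottleneck adapter, the kind of small module typically inserted into a frozen transformer layer; the dimensions (`d_model=768`, `d_bottleneck=64`) are illustrative assumptions, not values from this project.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckAdapter:
    """Residual bottleneck adapter: y = x + relu(x @ W_down) @ W_up.

    Hypothetical sizes: d_model is the frozen VLM's hidden width,
    d_bottleneck is a much smaller adapter width.
    """
    def __init__(self, d_model=768, d_bottleneck=64, seed=0):
        rng = np.random.default_rng(seed)
        # The two projections are the only new parameters being trained.
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        # Zero-initializing the up-projection means the adapter starts as an
        # identity function, leaving pretrained behavior untouched initially.
        self.w_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, x):
        # x: (batch, d_model) token or pooled features from the frozen backbone.
        return x + relu(x @ self.w_down) @ self.w_up

adapter = BottleneckAdapter()
features = np.ones((2, 768))
out = adapter(features)
print(out.shape)  # (2, 768); with the zero-init up-projection, out == features
```

Because the residual path dominates at initialization, training can only gradually deviate from the pretrained model, which is one reason adapters tend to be stable to tune.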

Methodology

We design the pipeline to keep the pretrained backbone frozen, training only small, efficient adapter components.
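To give a sense of the parameter efficiency this buys, the back-of-the-envelope calculation below compares a hypothetical frozen CLIP-scale backbone against per-layer bottleneck adapters; all counts are illustrative assumptions, not measurements from this project.

```python
# Illustrative parameter budget for adapter tuning (all numbers hypothetical).
backbone_params = 428_000_000  # frozen vision-language backbone, not trained

d_model, d_bottleneck, n_layers = 768, 64, 24
# Each adapter contributes a down- and an up-projection matrix.
adapter_params = n_layers * 2 * d_model * d_bottleneck

trainable_fraction = adapter_params / (backbone_params + adapter_params)
print(f"trainable params: {adapter_params:,} "
      f"({trainable_fraction:.2%} of the total)")
```

Under these assumptions well under one percent of parameters are trainable, which is what makes adapting a foundation model feasible on field-project compute budgets.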

Results & Impact

The adapted VLM significantly improves robustness and cross-season generalization compared with vision-only baselines. In particular, downstream yield prediction benefits from the richer, more semantically aligned representations, leading to more stable performance across varying lighting, growth stages, and field conditions.
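One simple way to exploit those embeddings downstream is to fit a lightweight regression head on top of frozen multimodal features. The sketch below uses synthetic stand-in data and a closed-form ridge regression; in practice the embeddings would come from the adapted VLM's image encoder, so everything here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pooled multimodal embeddings of field images (synthetic data;
# a real pipeline would extract these from the adapted VLM).
n_fields, d_embed = 200, 32
Z = rng.normal(size=(n_fields, d_embed))
true_w = rng.normal(size=d_embed)
yield_t_per_ha = Z @ true_w + 0.1 * rng.normal(size=n_fields)

# Closed-form ridge regression head on top of the frozen embeddings.
lam = 1.0
w = np.linalg.solve(Z.T @ Z + lam * np.eye(d_embed), Z.T @ yield_t_per_ha)
pred = Z @ w

# Coefficient of determination as a quick sanity check of the fit.
r2 = 1.0 - np.sum((yield_t_per_ha - pred) ** 2) / np.sum(
    (yield_t_per_ha - yield_t_per_ha.mean()) ** 2)
print(f"in-sample R^2: {r2:.3f}")
```

Keeping the head this simple makes cross-season evaluation cheap: the expensive representation is computed once, and only the small regression weights are refit per season or per region.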

This work highlights how lightweight VLM adaptation can turn generic foundation models into practical tools for agricultural decision-making without the cost of training large models from scratch.