Vision–Language Model Adaptation for Open-Vocabulary Agricultural Perception
Adapting general-purpose vision–language models to agricultural scenes using lightweight adapters, enabling open-vocabulary perception and robust downstream yield prediction under real field conditions.
This project focuses on tailoring large vision–language models (VLMs) to agriculture-specific imagery and semantics. By inserting lightweight adapter modules into pretrained VLMs, we enable open-vocabulary understanding of field scenes and repurpose the learned multimodal embeddings for downstream tasks such as yield prediction and decision support for farmers.
Methodology
We design the pipeline so that the pretrained backbone stays frozen and only small, efficient components are adapted:
- Adapter-based fine-tuning. Insert lightweight adapter layers into a pretrained VLM and fine-tune them on agriculture-specific datasets, capturing domain-specific visual and textual cues like crop stages, disease patterns, and management operations.
- Open-vocabulary scene understanding. Use the adapted VLM to parse field scenes with free-form text prompts, enabling flexible querying of objects and conditions that are not explicitly annotated in the training set.
- Multimodal embedding reuse. Extract the joint vision–language embeddings and feed them into downstream models for yield and risk prediction, integrating semantic understanding with quantitative forecasting.
Results & Impact
The adapted VLM significantly improves robustness and cross-season generalization compared with vision-only baselines. In particular, downstream yield prediction benefits from the richer, more semantically aligned representations, leading to more stable performance across varying lighting, growth stages, and field conditions.
This work highlights how lightweight VLM adaptation can turn generic foundation models into practical tools for agricultural decision-making without the cost of training large models from scratch.