Multi-View Transformer for Lightweight 3D Object Representation
A transformer-based architecture that fuses RGB and depth views to predict compact superquadric parameters, offering an efficient alternative to heavy 3D representations for object modeling and segmentation.
High-fidelity 3D representations such as Gaussian splatting provide excellent visual quality but can be computationally heavy for large-scale deployment. This project explores a lightweight alternative: representing agricultural objects with superquadrics predicted by a multi-view RGB–Depth transformer, enabling efficient downstream tasks such as 3D segmentation and tracking.
Methodology
The pipeline transforms multi-view image streams into compact parametric 3D shapes:
- Multi-view feature encoding. Fuse RGB and depth inputs from multiple viewpoints using a transformer encoder that aggregates spatial and cross-view information.
- Superquadric decoding. Decode the aggregated features into superquadric parameters (shape, scale, orientation, and position), providing a compact and differentiable 3D representation.
- Regularization for shape plausibility. Apply geometric and smoothness constraints to encourage realistic object shapes suited to plant structures and fruits.
- Downstream 3D segmentation. Use the predicted superquadrics as proxies for object-level segmentation and as input to further refinement modules where needed.
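The pipeline above can be sketched in a minimal, hypothetical form: single-head attention pooling over per-view RGB-D feature tokens (standing in for the transformer encoder), plus the standard superquadric inside-outside function used to evaluate the decoded parameters. The function and variable names (`fuse_views`, `superquadric_inside_outside`) are illustrative assumptions, not the project's actual API; the inside-outside formula itself is the canonical axis-aligned superquadric implicit function.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(view_tokens, query):
    """Attention-pool per-view feature tokens (V, D) with a query vector (D,).

    A stand-in for the cross-view transformer encoder: scaled dot-product
    scores weight each view, and the fused (D,) descriptor is their
    attention-weighted sum.
    """
    d = view_tokens.shape[-1]
    scores = view_tokens @ query / np.sqrt(d)   # (V,) cross-view scores
    weights = softmax(scores)                   # attention over views
    return weights @ view_tokens                # fused (D,) descriptor

def superquadric_inside_outside(points, scale, eps1, eps2):
    """Canonical superquadric implicit function F(x) in the object frame.

    F < 1 inside, F = 1 on the surface, F > 1 outside. `scale` holds the
    three semi-axes; eps1/eps2 control squareness along z and in the
    xy-plane respectively.
    """
    x, y, z = (points / scale).T
    f_xy = (np.abs(x) ** (2 / eps2) + np.abs(y) ** (2 / eps2)) ** (eps2 / eps1)
    return f_xy + np.abs(z) ** (2 / eps1)

# Toy usage: fuse 4 views of 8-dim features, then evaluate a unit-sphere
# superquadric (eps1 = eps2 = 1) at a surface point.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
fused = fuse_views(tokens, rng.normal(size=8))
f = superquadric_inside_outside(
    np.array([[0.0, 0.0, 1.0]]), np.ones(3), eps1=1.0, eps2=1.0
)
# with eps1 = eps2 = 1 the superquadric reduces to a sphere, so F = 1 here
```

Because F is differentiable in the shape, scale, and pose parameters, losses on F (or on sampled surface points) can be backpropagated through the decoder, which is what makes the representation convenient for end-to-end training.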
Results & Impact
The transformer–superquadric framework provides a lightweight 3D representation that is easy to store, manipulate, and integrate into robotics or analytics pipelines, while still capturing essential object geometry.
This work points toward scalable 3D modeling solutions that balance fidelity and efficiency for real-world agricultural applications.