EmbodiedMAE

Multi-modal Masked Autoencoder for 3D Plant Phenotyping

EmbodiedMAE is a multi-modal masked autoencoder designed for 3D reconstruction and phenotyping of sorghum plants. The model jointly encodes RGB images, depth maps, and point clouds to learn 3D-aware representations through self-supervised reconstruction, without requiring fully annotated training data.

Motivation

Plant phenotyping, the measurement of structural traits such as plant height, leaf angle, and biomass, is critical for precision agriculture and crop breeding. Traditional methods rely on manual measurement and are slow. EmbodiedMAE aims to automate this via scalable self-supervised learning from readily available sensor data.

Approach

  • Multi-modal encoder: separate ViT-based encoders for RGB, depth, and point cloud modalities
  • Cross-modal masked reconstruction: randomly mask tokens across modalities; reconstruct them using visible tokens from all modalities
  • 3D-aware loss: combines a pixel-level RGB reconstruction loss, a depth reconstruction loss, and a point-wise Chamfer distance loss on point clouds
  • Target crop: Sorghum phenotyping using the ISU field robot dataset
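
The masking and loss ideas listed above can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation. The token counts, the 0.75 mask ratio, the loss weights, and the function names (random_mask, chamfer_distance, combined_loss) are all assumptions made for illustration. The Chamfer distance shown is the common mean-of-squared-nearest-neighbour variant; the project may use a different one.

```python
import math
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into (visible, masked) index lists."""
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    visible = [i for i in range(num_tokens) if i not in masked]
    return visible, sorted(masked)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Averages the squared distance from each point to its nearest
    neighbour in the other set, then sums both directions.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def combined_loss(rgb_loss, depth_loss, chamfer_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reconstruction terms (weights are hypothetical)."""
    w_rgb, w_depth, w_cd = weights
    return w_rgb * rgb_loss + w_depth * depth_loss + w_cd * chamfer_loss


rng = random.Random(0)

# Hypothetical token counts per modality; each modality is masked independently.
token_counts = {"rgb": 196, "depth": 196, "points": 64}
masks = {name: random_mask(n, 0.75, rng) for name, n in token_counts.items()}

# Toy point sets standing in for a predicted and a ground-truth cloud.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(predicted, target)
total = combined_loss(rgb_loss=0.2, depth_loss=0.1, chamfer_loss=cd)
```

In a real training loop the encoders would see only the visible tokens from every modality, and each loss term would be computed on the masked tokens of its own modality. A brute-force nearest-neighbour search like the one above is quadratic in the number of points, so larger clouds would normally use a batched tensor or KD-tree implementation.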

Status

Active development. Part of PhD research at SCSLab, Iowa State University.