Vision encoder
Visual input tokenization — converting observation images into vector representations compatible with the language backbone
Processes raw RGB images from the robot's cameras into sequences of visual tokens, typically using a Vision Transformer (ViT) or a convolutional network. More recent VLA architectures fuse features from multiple visual backbones (e.g., DINOv2 + SigLIP in OpenVLA) to improve both spatial and semantic understanding.
[B, H, W, C] → [B, N_vis, d_model]
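The shape transformation above can be sketched with a ViT-style patch embedding: the image is cut into non-overlapping patches, each patch is flattened, and a linear projection maps it into the model's embedding dimension. This is a minimal NumPy sketch with a random (untrained) projection matrix; the function name and default sizes are illustrative, not taken from any specific VLA implementation.

```python
import numpy as np

def patchify_and_embed(images, patch_size=16, d_model=512, rng=None):
    """Toy ViT-style visual tokenizer: [B, H, W, C] -> [B, N_vis, d_model]."""
    rng = np.random.default_rng(0) if rng is None else rng
    B, H, W, C = images.shape
    assert H % patch_size == 0 and W % patch_size == 0
    nh, nw = H // patch_size, W // patch_size
    # Split the image into non-overlapping patches and flatten each one.
    patches = images.reshape(B, nh, patch_size, nw, patch_size, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(B, nh * nw, -1)
    # Linear projection into the language backbone's embedding space
    # (in a real encoder this is a learned weight matrix).
    W_proj = rng.standard_normal((patches.shape[-1], d_model)) * 0.02
    return patches @ W_proj

imgs = np.zeros((2, 224, 224, 3), dtype=np.float32)
tokens = patchify_and_embed(imgs)
print(tokens.shape)  # (2, 196, 512): 224/16 = 14 patches per side, 14*14 = 196 tokens
```

With a 224×224 input and 16×16 patches, N_vis = 196; a real encoder would add positional embeddings and transformer layers on top of this projection.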