Robots Atlas

Vision-Language-Action

Extends pretrained vision-language models (VLMs) with the ability to directly generate robot action tokens via joint fine-tuning on internet data and robot trajectories, enabling knowledge transfer from the web to physical control without separate planning or control modules.

Category
Abstraction level
Operation level
01

Vision Encoder

Visual input tokenization — converting observational images into vector representations compatible with the language backbone

Modular

Processes raw RGB images from robot cameras into sequences of visual tokens. Typically based on a Vision Transformer (ViT) or a convolutional network. More recent VLA architectures apply feature fusion from multiple visual backbones (e.g., DINOv2 + SigLIP in OpenVLA) to improve both spatial and semantic understanding.

i/o
in
[B, H, W, C]
out
[B, N_vis, d_model]
ViT with CLIP · DINOv2 + SigLIP (fusion) · EfficientNet + FiLM
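
As a rough illustration of the fusion variant, the sketch below concatenates patch features from two backbones along the channel dimension; the patch-conv modules are stand-ins for pretrained DINOv2 and SigLIP encoders, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of dual-backbone feature fusion (OpenVLA-style); the patch-conv
# modules are stand-ins for pretrained DINOv2 and SigLIP encoders, and all
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FusedVisionEncoder(nn.Module):
    def __init__(self, d_dino=768, d_siglip=1152):
        super().__init__()
        # Each stand-in maps [B, 3, H, W] (channels-first, PyTorch convention)
        # to a sequence of patch features; real systems load pretrained ViTs here.
        self.dino = nn.Sequential(nn.Conv2d(3, d_dino, kernel_size=14, stride=14), nn.Flatten(2))
        self.siglip = nn.Sequential(nn.Conv2d(3, d_siglip, kernel_size=14, stride=14), nn.Flatten(2))

    def forward(self, images):                      # images: [B, 3, H, W]
        f1 = self.dino(images).transpose(1, 2)      # [B, N_vis, d_dino]
        f2 = self.siglip(images).transpose(1, 2)    # [B, N_vis, d_siglip]
        return torch.cat([f1, f2], dim=-1)          # [B, N_vis, d_dino + d_siglip]

enc = FusedVisionEncoder()
print(enc(torch.randn(1, 3, 224, 224)).shape)       # torch.Size([1, 256, 1920])
```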
02

Language Backbone (LLM/VLM)

Instruction understanding, contextual reasoning, and action token generation from visual and language inputs.

Modular

A large language model or vision-language model forming the core of the VLA architecture. It processes a token sequence composed of visual tokens from the encoder, text instruction tokens, and action history tokens, and generates an output sequence consisting of action tokens.

PaLM-E (12B) · PaLI-X (5B/55B) · LLaMA 2 (7B) · Gemma-2B
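
A minimal sketch of how the input sequence is assembled is shown below; the tiny Transformer, vocabulary size, and shared embedding are illustrative assumptions, not the layout of any particular VLA.

```python
# Illustrative composition of the VLA input sequence (all module names and
# dimensions are assumptions, not a specific model's API).
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000

embed = nn.Embedding(vocab_size, d_model)             # shared text/action-token embedding
backbone = nn.TransformerEncoder(                      # stand-in for a decoder-only LLM
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)               # scores over text + action tokens

vis_tokens = torch.randn(1, 256, d_model)              # projected visual tokens  [B, N_vis, d]
instr_ids = torch.randint(0, vocab_size, (1, 16))      # tokenized instruction
hist_ids = torch.randint(0, vocab_size, (1, 7))        # previous action tokens

seq = torch.cat([vis_tokens, embed(instr_ids), embed(hist_ids)], dim=1)
logits = lm_head(backbone(seq))                        # [B, N_total, vocab]
next_action_token = logits[:, -1].argmax(-1)           # greedy decode of the next action token
```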
03

Action Decoder / Action Output Head

Converts language model outputs into executable robot control signals (velocities, positions, torques).

Modular

The component responsible for converting the backbone's output representation into concrete robot control signals. In the tokenized approach, action tokens are mapped to discrete action bin values (e.g., 256 bins per dimension). In the continuous approach, a diffusion head or flow-matching head generates continuous action vectors.

Discrete action tokens · Diffusion / flow-matching head · MLP head
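
The discrete-token variant can be illustrated with a simple uniform binning round trip; the [-1, 1] normalization range is an assumption, and real systems typically normalize each action dimension from dataset statistics.

```python
# Sketch of 256-bin action tokenization (RT-2 / OpenVLA style); the [-1, 1]
# range and uniform bins are assumptions for illustration.
import numpy as np

N_BINS = 256

def actions_to_tokens(action, low=-1.0, high=1.0):
    """Map a continuous action vector to integer bin indices in [0, N_BINS-1]."""
    clipped = np.clip(action, low, high)
    return np.round((clipped - low) / (high - low) * (N_BINS - 1)).astype(np.int64)

def tokens_to_actions(tokens, low=-1.0, high=1.0):
    """Map bin indices back to quantized continuous values on the bin grid."""
    return low + tokens.astype(np.float64) / (N_BINS - 1) * (high - low)

a = np.array([0.031, -0.74, 0.002, 1.0, -1.0, 0.5, 0.12])  # e.g. 6-DoF delta + gripper
recon = tokens_to_actions(actions_to_tokens(a))
print(np.max(np.abs(a - recon)))  # worst-case error ~ (high - low) / (2 * (N_BINS - 1))
```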
04

Vision-Language Projector

Visual-language feature space alignment — enables the LLM to process visual embeddings as if they were text tokens.

Modular

A linear layer or MLP that maps the output dimension of the visual encoder to the token space dimension of the language backbone (d_model). This enables visual tokens to be integrated with text tokens into a single sequence processed by the LLM.
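
A minimal sketch under assumed dimensions (1920-d fused visual features into a 4096-d backbone) follows.

```python
# Minimal sketch of a vision-language projector; dimensions are illustrative.
import torch
import torch.nn as nn

class Projector(nn.Module):
    def __init__(self, d_vis=1920, d_model=4096):
        super().__init__()
        # A 2-layer MLP; some models use a single linear layer instead.
        self.net = nn.Sequential(nn.Linear(d_vis, d_model), nn.GELU(), nn.Linear(d_model, d_model))

    def forward(self, vis_feats):        # [B, N_vis, d_vis]
        return self.net(vis_feats)       # [B, N_vis, d_model], ready to join the text sequence

proj = Projector()
print(proj(torch.randn(1, 256, 1920)).shape)  # torch.Size([1, 256, 4096])
```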

Bottleneck: Real-time on-robot inference latency

Typical VLA models (7B–55B parameters) generate action tokens at 1–6 Hz on A100/RTX 4090-class GPUs, which is insufficient for tasks requiring high-frequency control (e.g., bimanual manipulation at >50 Hz). Deploying a large model directly on a robot, or routing inference through a network link to a GPU server, introduces additional latency.
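
A back-of-the-envelope budget makes the bottleneck concrete; every timing number below is an illustrative assumption, not a measurement.

```python
# Rough control-rate estimate (all timings are illustrative assumptions).
prefill_ms = 60.0        # vision + instruction processing per model call
per_token_ms = 15.0      # autoregressive decode time per action token
tokens_per_step = 7      # e.g. 6-DoF end-effector delta + gripper
chunk_steps = 8          # action chunking: control steps emitted per model call
chunk_head_ms = 40.0     # one-shot diffusion / flow-matching head producing the whole chunk

tokenized_hz = 1000.0 / (prefill_ms + per_token_ms * tokens_per_step)
chunked_hz = chunk_steps * 1000.0 / (prefill_ms + chunk_head_ms)

print(f"tokenized, step-by-step: {tokenized_hz:.1f} Hz")   # ~6 Hz, matching the range above
print(f"chunked continuous head: {chunked_hz:.1f} Hz")     # ~80 Hz effective command rate
```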

Parallelism

Partially parallel

Training is fully parallel across tokens (full trajectories are processed as sequences of vision-language tokens). Inference is sequential per action token, but visual and linguistic processing (prefill) is parallel.

Paradigm

Dense

All paths active

Standard VLA models (RT-2, OpenVLA) use a dense Transformer backbone that processes all tokens — visual, language, and action — through every layer. There is no routing or sparse activation, in contrast to MoE-VLA variants proposed in later work.

Visual-Language Backbone

Critical
  • PaLM-E 12B (RT-2).
  • LLaMA 2 7B (OpenVLA).
  • Gemma-2B (π0, Physical Intelligence).

Selection of a pretrained VLM as the backbone of a VLA. This choice determines reasoning capabilities, model size, and hardware requirements.

Output Action Representation

Critical
  • discrete_tokens_256bins (RT-2, OpenVLA): simplicity, compatibility with the LLM tokenizer.
  • continuous_diffusion (π0): higher motor precision, higher latency.

The method by which the model encodes robot actions: discrete tokens (bins) or continuous output (diffusion, flow matching).

Training data mixture

Standard
  • co-fine-tuning on robot + VQA + OKVQA + captioning data (RT-2): joint fine-tuning on web and robotics tasks.
  • 970k robot trajectories from Open X-Embodiment (OpenVLA): robotics-only data following VLM pretraining.

The ratio of robotic data (demonstration trajectories) to internet data (vision-language tasks). This affects the balance between language understanding and motor generalization.

Control Frequency

Standard
  • 1–6 Hz: typical range for a single-model 7B VLA (OpenVLA on an RTX 4090).
  • 50+ Hz: required for precise manipulation; achievable via dual-system VLAs (Helix, Groot N1).

The frequency at which a VLA generates and executes actions. It is constrained by the model's inference speed and system architecture (single-model vs. dual-system).

Architecture System Type

Standard
  • single-model (RT-2, OpenVLA, π0): simplicity, with one model handling both perception and action.
  • dual-system, slow VLM + fast action module (Helix from Figure AI, Groot N1 from NVIDIA): improved precision and higher control frequency.

Whether a VLA operates as a single end-to-end model or a dual-system architecture with separate planning and execution components.

Common pitfalls

Control frequency too low for precision tasks
HIGH

VLA models built on large LLMs generate actions at 1–6 Hz, which is insufficient for tasks requiring smooth manipulation — such as folding, screwing, or assembly — that typically demand >50 Hz. This low frequency leads to oscillations, latency, and motion instability.

Use a dual-system architecture with a fast action module (flow-matching, diffusion). Implement action chunking — the model generates N steps ahead and executes them sequentially without additional LLM queries.
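
A sketch of the chunked execution loop follows; `predict_chunk` and the robot interface are hypothetical placeholders, not a real library API.

```python
# Sketch of action chunking at deployment time (hypothetical model/robot interfaces).
import time

CHUNK = 8            # steps planned per model call
CONTROL_DT = 0.02    # 50 Hz low-level control loop

def run_episode(model, robot, instruction, max_steps=400):
    step = 0
    while step < max_steps:
        obs = robot.get_observation()                      # image + proprioception
        actions = model.predict_chunk(obs, instruction)    # shape [CHUNK, action_dim]
        for a in actions:                                  # execute without re-querying the VLM
            robot.send_action(a)
            time.sleep(CONTROL_DT)
            step += 1
            if step >= max_steps:
                break
```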

Training-Deployment Distribution Shift
HIGH

VLA models trained on demonstrations collected under specific conditions (lighting, background, camera, robot configuration) generalize poorly to new environments. Changing the camera, viewing angle, background, or robot platform can drastically reduce performance.

Collect training data with visual augmentation (lighting, background, and viewpoint variation). Apply PEFT (LoRA) for rapid fine-tuning to new environments with minimal demonstrations. Use multi-embodiment datasets (Open X-Embodiment) to improve generalization.
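
A possible LoRA fine-tuning setup with the Hugging Face peft library is sketched below; the checkpoint name and target module names are assumptions that depend on the backbone.

```python
# Sketch of LoRA fine-tuning for adapting a VLA to a new environment; the
# checkpoint name and target_modules are assumptions and may differ per model.
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True)

lora_cfg = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed names)
)
model = get_peft_model(model, lora_cfg)        # base weights frozen, low-rank adapters trainable
model.print_trainable_parameters()             # typically well under 1% of the 7B parameters

# ...then fine-tune on a small set of in-domain demonstrations with the usual
# next-action-token cross-entropy loss.
```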

Action Discretization Limits — Loss of Precision
MEDIUM

Discretizing the action space into 256 bins (as in RT-2 and OpenVLA) introduces quantization error, which is especially noticeable in tasks requiring sub-millimeter precision. Converting continuous trajectories into tokens can lose important motor details.

Use continuous action decoding via diffusion or flow-matching instead of discrete tokens for precision-demanding tasks. Alternatively, increase the number of bins or apply adaptive discretization.
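
One way to realize adaptive discretization is quantile-based binning, sketched below as a generic technique rather than a specific paper's recipe: bins are densest where demonstration actions are concentrated, which shrinks quantization error for small, precise motions.

```python
# Sketch of quantile-based adaptive binning (generic alternative to uniform bins).
import numpy as np

def fit_adaptive_bins(actions_1d, n_bins=256):
    """Place bin edges at empirical quantiles, so resolution is finest where data is dense."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(actions_1d, qs)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return edges, centers

def encode(actions_1d, edges):
    return np.clip(np.digitize(actions_1d, edges[1:-1]), 0, len(edges) - 2)

def decode(tokens, centers):
    return centers[tokens]

rng = np.random.default_rng(0)
data = rng.normal(0.0, 0.02, size=100_000)        # small, precise motions dominate
edges, centers = fit_adaptive_bins(data)
err = np.abs(data - decode(encode(data, edges), centers))
print(err.mean())   # far smaller than uniform 256-bin quantization over a wide fixed range
```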

Trade-off Between Catastrophic Forgetting and Knowledge Transfer
HIGH

When fine-tuning a VLM on robotics data, the model may lose the general language and visual capabilities of the pretrained VLM (catastrophic forgetting). RT-2 addresses this through co-fine-tuning on robotics and internet data simultaneously — omitting this mixture degrades the model.

Apply co-fine-tuning with robotic and internet data mixed in appropriate proportions. With PEFT (LoRA), freezing the LLM backbone preserves VLM knowledge while training action generation.
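
A minimal sketch of such a mixture is an interleaved sampler over the two data sources; the 50/50 split below is an illustrative assumption, not a published recipe.

```python
# Sketch of a co-fine-tuning data mixture; the split ratio is an assumption.
import random

def mixture_sampler(robot_batches, web_batches, p_robot=0.5, steps=10_000):
    """Yield one batch per optimizer step, drawn from robot or web data."""
    for _ in range(steps):
        source = robot_batches if random.random() < p_robot else web_batches
        yield next(source)

# for batch in mixture_sampler(robot_loader_iter, web_loader_iter):
#     loss = model(**batch).loss; loss.backward(); ...
```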

Hardware requirements preventing on-robot deployment
HIGH

Models in the 7B–55B range require A100-class GPUs (40–80 GB VRAM) or an external GPU server. Direct deployment on resource-constrained robot hardware (Jetson Orin, CPU) is not feasible without quantization or distillation.

Use INT4/INT8 quantization (OpenVLA reports negligible accuracy loss at 4-bit). Train smaller models (e.g., SmolVLA, 450M parameters). Apply a dual-system architecture with a lightweight action module deployed on-robot and a heavy VLM on a remote server.
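
A possible 4-bit loading path with transformers and bitsandbytes is sketched below; the checkpoint name, quantization settings, and memory figures are assumptions.

```python
# Sketch of 4-bit quantized loading for on-robot inference; adjust the checkpoint
# and settings to the model actually deployed.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=bnb_cfg,
    device_map="auto",
    trust_remote_code=True,
)
# VRAM drops from roughly 15 GB (bf16) to roughly 4-5 GB for a 7B model,
# within reach of Jetson-class devices (rough estimates, not benchmarks).
```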

GENESIS · Source paper

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
2023 · CoRL 2023 (Conference on Robot Learning, PMLR 229) · Anthony Brohan, Noah Brown, Justice Carbajal et al.
2022

RT-1 — Robotics Transformer for real-time control

breakthrough

Brohan et al. (Google) publish RT-1, a Transformer trained on 130k robot demonstrations and conditioned on natural-language instructions. It is the first large-scale model combining vision, language, and robot control, but without pretraining on internet-scale data.

2023

RT-2 — first VLA model transferring web knowledge to robot control

breakthrough

Zitkovich, Brohan et al. (Google DeepMind) formalize the VLA paradigm by co-fine-tuning PaLI-X and PaLM-E on robotic and internet-scale tasks, with actions encoded as text tokens. The paper coins the term "vision-language-action model" and demonstrates emergent reasoning on novel tasks without additional training data.

2023

Open X-Embodiment — multi-platform robotics dataset

breakthrough

A collaboration of 21 institutions produces Open X-Embodiment, a dataset of roughly 1M trajectories from 22 robot types. It enables VLA training across diverse embodiments and tasks, and serves as a foundational resource for RT-X and OpenVLA.

2024

OpenVLA — open-source 7B-parameter VLA

breakthrough

Kim et al. (Stanford) publish OpenVLA — an open-source 7B VLA built on LLaMA 2 + DINOv2 + SigLIP, trained on 970k trajectories from Open X-Embodiment. It outperforms the closed RT-2-X (55B) while using 7× fewer parameters. It is the first open platform for VLA research with PEFT and quantization support.

2024

π0 (Physical Intelligence) — VLA with continuous diffusion-based output

breakthrough

Black et al. (Physical Intelligence) publish π0 — a VLA with a Gemma-2B backbone and a flow-matching action head in place of discrete tokens, achieving higher motor precision on dexterity-demanding tasks such as folding clothes and washing dishes.

2025

Dual-system VLA — Helix (Figure AI) and Groot N1 (NVIDIA)

Dual-model architecture: a slower VLM acts as a high-level planner, paired with a fast action-generation module for high-frequency control. Figure AI (Helix) and NVIDIA (Groot N1) demonstrate dual-system VLAs for humanoids operating in real time.

GPU Tensor Cores · PRIMARY

VLA models based on large LLMs (7B–55B parameters) require GPUs with tensor cores for efficient inference. Training demands A100/H100-class GPU clusters (OpenVLA: 64×A100 for 14 days). Real-time inference on a robot requires at minimum an RTX 4090 (6 Hz for a 7B model).

Hardware requirements are determined by the LLM backbone size. Flash Attention and quantization can reduce VRAM requirements without sacrificing effectiveness.

TPU · GOOD

Google DeepMind trained RT-2 (PaLM-E, PaLI-X backbone) on TPUs. TPU v4/v5 efficiently handle LLM matrix operations in VLA models.

Real-time robotic inference on TPUs is uncommon due to hardware availability and cost.
