Robots Atlas

Video Pretraining

Learning visual representations by predicting video frame sequences in a self-supervised manner, rather than through supervised image classification.

Category
Abstraction level
Robotics foundation models · World models · Video generation · Action prediction · Sim-to-real transfer

A model (typically a video transformer or diffusion network) processes frame sequences and is trained to predict masked or future frames. Because the loss spans the whole clip, gradients propagate across time steps, teaching the model temporal coherence and scene physics. After pretraining, the model is fine-tuned for downstream tasks such as robot control or scene understanding.
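The masked-prediction objective described above can be sketched as follows. This is a minimal illustration of the loss computation only: the array shapes, the 75% masking ratio, and the trivial mean-of-visible-tokens "predictor" are illustrative assumptions, not any specific model's setup (a real system would use a video transformer).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": T frames, each split into N_PATCHES patch embeddings of dim D.
T, N_PATCHES, D = 8, 16, 32
video = rng.standard_normal((T, N_PATCHES, D))

# Masked video modelling: hide a large fraction of patch tokens and
# train the model to reconstruct only the hidden ones.
mask_ratio = 0.75
tokens = video.reshape(T * N_PATCHES, D)
n_masked = int(mask_ratio * len(tokens))
masked_idx = rng.choice(len(tokens), size=n_masked, replace=False)

# Stand-in "model": predict every masked token as the mean of the
# visible tokens. Only the loss bookkeeping is realistic here.
visible = np.delete(tokens, masked_idx, axis=0)
prediction = np.broadcast_to(visible.mean(axis=0), (n_masked, D))

# Self-supervised objective: MSE on the masked positions only, so no
# human labels are required.
loss = np.mean((prediction - tokens[masked_idx]) ** 2)
print(f"masked {n_masked}/{len(tokens)} tokens, reconstruction MSE: {loss:.3f}")
```

The key point is that the targets come from the video itself, which is what makes the pretraining self-supervised.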

Lack of large-scale labelled visual data; need to teach a model scene physics and motion dynamics without human supervision.

Parallelism

Partially parallel

Paradigm

Dense

All paths active

2022

VideoCLIP and VideoMAE: first scalable video pretraining with masked modelling

breakthrough
2023

Sora (OpenAI) and Genie (DeepMind) demonstrate generative video pretraining at scale

breakthrough
2025

UnifoLM-WMA-0 (Unitree) applies video pretraining as the foundation of a world-model-action framework for robotics

GPU Tensor Cores (PRIMARY)

Massive attention matrices over frame sequences require high-throughput GPUs with tensor cores.
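A back-of-envelope calculation shows why. Full self-attention over a clip of T frames with P patches each builds a (T·P) × (T·P) score matrix per head, so cost grows quadratically in the token count. The frame and patch counts below are illustrative assumptions (ViT-style 14×14 patching), not figures from the text above.

```python
# Attention cost grows quadratically in the number of tokens.
T, P = 16, 196              # 16 frames, 14x14 = 196 patches per frame
tokens = T * P              # tokens in the full clip
attn_entries = tokens ** 2  # entries in one attention score matrix
print(f"{tokens} tokens -> {attn_entries:,} attention entries per head")

# A single image has only P tokens; a 16-frame clip costs 16^2 = 256x more.
ratio = attn_entries // P**2
print(f"video/image attention ratio: {ratio}x")  # -> 256x
```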

BUILT ON

Pretraining

Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data (next tokens in text, masked words, future video frames, future robot states) without human labels. This unlocks virtually unlimited raw data: web crawls, code, books, YouTube video, robot telemetry. The result is a set of weights encoding "world knowledge", dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
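The next-token flavour of this objective can be shown in a few lines: the label for each position is simply the next element of the sequence, so supervision is free. The toy vocabulary and the uniform stand-in "model" below are illustrative assumptions, not any real system's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A raw token sequence; in practice this would come from web text,
# code, or tokenized video, with no human annotation.
VOCAB = 10
sequence = rng.integers(0, VOCAB, size=32)
inputs, targets = sequence[:-1], sequence[1:]  # labels come for free

# Stand-in "model": uniform logits over the vocabulary at every
# position (a real model would be a transformer conditioned on inputs).
logits = np.zeros((len(inputs), VOCAB))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Next-token cross-entropy: the standard pretraining loss.
loss = -log_probs[np.arange(len(targets)), targets].mean()
print(f"next-token cross-entropy: {loss:.3f}")  # log(10) ~ 2.303 for uniform
```

Minimizing this loss over enough data is what produces the "world knowledge" weights described above.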
