Robots Atlas
Training

Video Pretraining

Published: 2022
Status: Active
Updated: 5 May 2026
Key innovation
Learning visual representations by predicting video frame sequences in a self-supervised manner, rather than by supervised image classification.
Category
Training
Abstraction level
Pattern
Use cases
Robotics foundation models · World models · Video generation · Action prediction · Sim-to-real transfer

How it works

A model (typically a video transformer or diffusion network) processes frame sequences and is trained to predict masked or future frames. Backpropagating the prediction loss through the frame sequence teaches the model temporal coherence and scene physics. After pretraining, the model is fine-tuned for downstream tasks such as robot control or scene understanding.
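The masked-frame objective can be sketched in a few lines. This is a toy illustration, not any particular model's training code: the "model" here is just nearest-visible-frame copying, standing in for a video transformer, and all names (`masked_frame_loss`, the mask ratio) are illustrative assumptions.

```python
import numpy as np

def masked_frame_loss(video, mask_ratio=0.5, seed=0):
    """Toy sketch of masked-frame self-supervision: hide a subset of
    frames, 'predict' each hidden frame from its visible neighbours,
    and score the reconstruction with MSE. A real system would use a
    learned video transformer as the predictor."""
    rng = np.random.default_rng(seed)
    T = video.shape[0]
    n_masked = max(1, int(T * mask_ratio))
    masked = rng.choice(T, size=n_masked, replace=False)
    visible = [i for i in range(T) if i not in masked]
    losses = []
    for t in masked:
        # naive stand-in predictor: copy the nearest visible frame
        nearest = min(visible, key=lambda i: abs(i - t))
        pred = video[nearest]
        losses.append(np.mean((pred - video[t]) ** 2))
    return float(np.mean(losses))

# frames that change smoothly over time, mimicking motion
video = np.stack([np.full((4, 4), float(t)) for t in range(8)])
loss = masked_frame_loss(video)
```

Minimising this loss with a learned predictor is what forces the model to internalise temporal coherence: a static clip is trivially reconstructable, while genuine motion can only be predicted by modelling the dynamics.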

Problem solved

Scarcity of large-scale labelled visual data; the need to teach a model scene physics and motion dynamics without human supervision.

Evolution

2022
VideoCLIP and VideoMAE: first scalable video pretraining with masked modelling
Inflection point
2024
Sora (OpenAI) and Genie (DeepMind) demonstrate generative video pretraining at scale
Inflection point
2025
UnifoLM-WMA-0 (Unitree) applies video pretraining as the foundation of a world-model-action framework for robotics
Technical details

Execution paradigm

Primary mode
dense
Activation pattern
all_paths_active

Parallelism

Parallelism level
partially_parallel
Scope
training · across_devices

Hardware requirements

Primary

Massive attention matrices over frame sequences require high-throughput GPUs with tensor cores.
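A back-of-envelope calculation shows why full spatiotemporal attention drives this requirement. The clip dimensions, patch/tube sizes, and head count below are illustrative assumptions in the style of common ViT tokenisation, not any specific model's configuration.

```python
def attention_footprint(frames=16, height=224, width=224, patch=16,
                        tube=2, heads=12, bytes_per_el=2):
    """Token count and per-layer attention-score memory for one clip.

    Tokenisation assumption: non-overlapping tube x patch x patch cubes,
    as in ViT-style video encoders. Scores stored in fp16 (2 bytes).
    """
    tokens = (frames // tube) * (height // patch) * (width // patch)
    # one (tokens x tokens) score matrix per head
    attn_bytes = heads * tokens * tokens * bytes_per_el
    return tokens, attn_bytes

tokens, attn_bytes = attention_footprint()
# 8 temporal x 14 x 14 spatial positions = 1568 tokens per clip;
# the score matrices alone cost 12 * 1568^2 * 2 bytes ~ 59 MB
# per layer, per clip, before activations or gradients.
```

Because the cost grows quadratically with token count (and token count grows linearly with clip length and resolution), long or high-resolution clips quickly saturate GPU memory and bandwidth, hence the need for high-throughput GPUs with tensor cores.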