Training
Video Pretraining
2022 · Active · Updated: 5 May 2026 · Published
Key innovation
Learning visual representations by predicting video frame sequences in a self-supervised manner, instead of relying on supervised image classification.
Category
Training
Abstraction level
Pattern
Use cases
Robotics foundation models · World models · Video generation · Action prediction · Sim-to-real transfer
How it works
A model (typically a video transformer or diffusion network) processes frame sequences and is trained to predict masked or future frames. Gradients propagate through the temporal dimension (via attention across frames, or backpropagation through time in recurrent variants), teaching the model temporal coherence and scene physics. After pretraining, the model is fine-tuned for downstream tasks such as robot control or scene understanding.
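The masked-frame objective described above can be sketched in a few lines. Everything here is illustrative: the array shapes, the `mask_ratio`, and the mean-of-visible-frames "model" are stand-ins for a real video transformer, but the target construction and the masked-only reconstruction loss follow the pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": T frames, each flattened to D pixel features.
T, D = 8, 16
video = rng.normal(size=(T, D))

# Self-supervised target construction: mask a random subset of frames.
mask_ratio = 0.5
n_masked = int(T * mask_ratio)
masked_idx = rng.choice(T, size=n_masked, replace=False)

inputs = video.copy()
inputs[masked_idx] = 0.0  # hide the masked frames from the model

# Stand-in "model": predict each masked frame as the mean of the
# visible frames (a real model would attend over them instead).
visible = np.ones(T, dtype=bool)
visible[masked_idx] = False
prediction = inputs[visible].mean(axis=0)

# The reconstruction loss is computed only on the masked positions,
# so the learning signal comes from the video itself -- no labels.
loss = np.mean((video[masked_idx] - prediction) ** 2)
```

The key point is that the supervision target (`video[masked_idx]`) is taken from the raw data itself, which is what lets this objective scale to unlabelled video corpora.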
Problem solved
Lack of large-scale labelled visual data; need to teach a model scene physics and motion dynamics without human supervision.
Evolution
2022
VideoCLIP (contrastive) and VideoMAE (masked modelling) — first scalable video pretraining approaches
Inflection point: 2023
Sora (OpenAI) and Genie (DeepMind) demonstrate generative video pretraining at scale
Inflection point: 2025
UnifoLM-WMA-0 (Unitree) applies video pretraining as the foundation of a world-model-action framework for robotics
Technical details
Execution paradigm
Primary mode
dense
Activation pattern
all_paths_active
Parallelism
Parallelism level
partially_parallel
Scope
training, across_devices
Hardware requirements
Primary
Massive attention matrices over frame sequences require high-throughput GPUs with tensor cores.