Training
Video Pretraining
2022 · Active · Updated: 5 May 2026 · Published
Key innovation
Learning visual representations by predicting video frame sequences in a self-supervised manner, instead of relying on supervised image classification.
Category
Training
Abstraction level
Pattern
Use cases
Robotics foundation models · World models · Video generation · Action prediction · Sim-to-real transfer
How it works
A model (typically a video transformer or diffusion network) processes frame sequences and is trained to predict masked or future frames. Gradients propagate through the temporal dimension (via attention across frames, or backpropagation through time in recurrent variants), teaching the model temporal coherence and scene physics. After pretraining, the model is fine-tuned for downstream tasks such as robot control or scene understanding.
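The masked-frame objective described above can be sketched in a few lines. Everything here is illustrative: the array shapes, the `mask_ratio`, and the mean-of-visible-frames "model" are stand-ins for a real video transformer, but the target construction and the masked-only reconstruction loss follow the pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": T frames, each flattened to D pixel features.
T, D = 8, 16
video = rng.normal(size=(T, D))

# Self-supervised target construction: mask a random subset of frames.
mask_ratio = 0.5
n_masked = int(T * mask_ratio)
masked_idx = rng.choice(T, size=n_masked, replace=False)

inputs = video.copy()
inputs[masked_idx] = 0.0  # hide the masked frames from the model

# Stand-in "model": predict each masked frame as the mean of the
# visible frames (a real model would attend over them instead).
visible = np.ones(T, dtype=bool)
visible[masked_idx] = False
prediction = inputs[visible].mean(axis=0)

# The reconstruction loss is computed only on the masked positions,
# so the learning signal comes from the video itself -- no labels.
loss = np.mean((video[masked_idx] - prediction) ** 2)
```

The key point is that the supervision target (`video[masked_idx]`) is taken from the raw data itself, which is what lets this objective scale to unlabelled video corpora.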
Problem solved
Lack of large-scale labelled visual data; need to teach a model scene physics and motion dynamics without human supervision.
Evolution
2022
VideoCLIP (contrastive) and VideoMAE (masked modelling) — first scalable video pretraining approaches
Inflection point: 2023
Sora (OpenAI) and Genie (DeepMind) demonstrate generative video pretraining at scale
Inflection point: 2025
UnifoLM-WMA-0 (Unitree) applies video pretraining as the foundation of a world-model-action framework for robotics
Technical details
Execution paradigm
Primary mode
dense
Activation pattern
all_paths_active
Parallelism
Parallelism level
partially_parallel
Scope
training, across_devices
Hardware requirements
Primary
Massive attention matrices over frame sequences require high-throughput GPUs with tensor cores.