Robots Atlas

World Models

Formalizes a paradigm in which an agent learns an internal model of environment dynamics and trains its control policy entirely within that model's generated simulations, dramatically improving sample efficiency.

Category
Abstraction level
Operation level
01

Perception Model / Observation Encoder (V)

Dimensionality reduction of observations — conversion of raw sensory data into a compact latent representation

Modular

Compresses high-dimensional environment observations (e.g., pixel images) into a low-dimensional latent space representation. In the original World Models (2018), this is implemented via a Variational Autoencoder (VAE). It is responsible for extracting salient spatial features from observations.

Variational Autoencoder (VAE) · CNN Encoder · RSSM (stochastic and deterministic representation)
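
Below is a minimal PyTorch sketch of such an encoder, assuming 64×64 RGB observations and a 32-dimensional latent as in Ha & Schmidhuber (2018); the exact layer sizes are illustrative, not the paper's verbatim architecture.

```python
# Sketch of the V component: a convolutional VAE encoder mapping 64x64 RGB
# observations to a 32-dimensional latent. Sizes follow Ha & Schmidhuber
# (2018); the code itself is illustrative.
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64 -> 31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31 -> 14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14 -> 6
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(), # 6 -> 2
        )
        self.fc_mu = nn.Linear(256 * 2 * 2, latent_dim)
        self.fc_logvar = nn.Linear(256 * 2 * 2, latent_dim)

    def forward(self, obs: torch.Tensor):
        h = self.conv(obs).flatten(start_dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```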
02

Environment Dynamics Model (M)

Temporal environment dynamics modeling — predicting future latent states conditioned on agent actions

Modular

Predicts the next latent states based on the current latent state and the agent's action. It forms the core of the world model — its capacity for temporal extrapolation enables the generation of synthetic trajectories. In the original World Models paper, this component is implemented as an MDN-RNN (Mixture Density Network + LSTM).

MDN-RNN (Mixture Density Network + LSTM) · RSSM (Recurrent State Space Model) · MuZero Dynamics Network
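
A minimal sketch of one MDN-RNN step, assuming z = 32, hidden = 256, and 5 mixture components as in Ha & Schmidhuber (2018); module and function names are illustrative. The temperature knob shown in the sampler is the same mechanism recommended later against model exploitation.

```python
# Sketch of the M component: an MDN-RNN that, given (z_t, a_t), predicts a
# Gaussian-mixture distribution over z_{t+1}.
import torch
import torch.nn as nn

class MDNRNN(nn.Module):
    def __init__(self, latent_dim=32, action_dim=3, hidden_dim=256, n_mix=5):
        super().__init__()
        self.lstm = nn.LSTMCell(latent_dim + action_dim, hidden_dim)
        # Per mixture component: one weight logit, a mean vector, a log-std vector.
        self.head = nn.Linear(hidden_dim, n_mix * (1 + 2 * latent_dim))
        self.n_mix, self.latent_dim = n_mix, latent_dim

    def forward(self, z, a, state=None):
        h, c = self.lstm(torch.cat([z, a], dim=-1), state)
        logit, mu, log_std = self.head(h).split(
            [self.n_mix,
             self.n_mix * self.latent_dim,
             self.n_mix * self.latent_dim], dim=-1)
        mu = mu.view(-1, self.n_mix, self.latent_dim)
        log_std = log_std.view(-1, self.n_mix, self.latent_dim)
        return logit, mu, log_std, (h, c)

def sample_next_z(logit, mu, log_std, temperature=1.0):
    # Pick a mixture component, then sample its Gaussian; higher temperature
    # widens the distribution (the "uncertainty injection" trick).
    k = torch.distributions.Categorical(logits=logit / temperature).sample()
    idx = k.view(-1, 1, 1).expand(-1, 1, mu.size(-1))
    mu_k = mu.gather(1, idx).squeeze(1)
    std_k = log_std.gather(1, idx).squeeze(1).exp() * temperature ** 0.5
    return mu_k + std_k * torch.randn_like(mu_k)
```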
03

Controller / Policy (C)

Maps the internal state (latent + RNN hidden state) to agent actions; optimized against reward within generated trajectories

Modular

The agent's decision-making module that maps the current latent state and the hidden state of the dynamics model to actions executed in the environment. In the original World Models architecture, it is compact (linear or a small MLP) and trained separately from the world model — using an evolutionary method (CMA-ES) on generated "dreams".

Linear Controller · Latent-Space Actor-Critic
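
A sketch of the linear controller and its CMA-ES training loop, using the pycma package (`pip install cma`); `dream_rollout_return` is a hypothetical function returning the cumulative imagined reward of one rollout under the candidate parameters.

```python
# Sketch of the C component: a linear controller a_t = tanh(W [z_t; h_t] + b),
# trained with CMA-ES on imagined rollouts. Dimensions are illustrative.
import numpy as np
import cma

Z_DIM, H_DIM, A_DIM = 32, 256, 3
N_PARAMS = (Z_DIM + H_DIM + 1) * A_DIM  # weights plus bias

def act(params, z, h):
    W = params[:-A_DIM].reshape(A_DIM, Z_DIM + H_DIM)
    b = params[-A_DIM:]
    return np.tanh(W @ np.concatenate([z, h]) + b)  # bounded actions

es = cma.CMAEvolutionStrategy(np.zeros(N_PARAMS), 0.1)
while not es.stop():
    candidates = es.ask()
    # CMA-ES minimizes, so negate the imagined return.
    fitnesses = [-dream_rollout_return(act, p) for p in candidates]
    es.tell(candidates, fitnesses)
best_params = es.result.xbest
```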
04

Imagined Trajectory Generation (Dreaming)

Generates synthetic training data for the controller via internal simulation — replacing costly interactions with the real environment.

The mechanism for generating synthetic trajectories by unrolling a dynamics model over time — without interaction with the real environment. The agent "dreams": it initializes a latent state, then sequentially predicts subsequent states by applying the dynamics model and selecting actions via a controller. The resulting sequences are used for policy optimization.
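
A minimal sketch of one imagined rollout, reusing the encoder, MDN-RNN, and controller sketched above plus a hypothetical learned reward head; all names are illustrative, and the default horizon of 15 follows DreamerV1.

```python
# Sketch of "dreaming": unroll the learned dynamics model for `horizon`
# steps entirely in latent space, with no environment interaction.
import torch

def dream_rollout(first_obs, horizon=15):
    z, _, _ = encoder(first_obs)   # seed the latent from a real observation
    state = None                   # LSTM state starts empty
    total_reward = 0.0
    for _ in range(horizon):
        h = state[0] if state is not None else torch.zeros(z.size(0), 256)
        a = controller(z, h)                          # act inside the dream
        logit, mu, log_std, state = mdn_rnn(z, a, state)
        z = sample_next_z(logit, mu, log_std)         # imagined next latent
        total_reward += reward_model(z, state[0])     # predicted, not real, reward
    return total_reward
```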

Bottleneck: Training dynamics models on long trajectories and generating imagined sequences

The dynamics model (RNN/RSSM) requires sequential processing of time steps, which limits parallelism during training. With long imagination horizons (e.g., 15–50 steps in Dreamer), the cost of training the actor-critic via backpropagation through the unrolled dynamics model becomes the dominant computational expense.

Parallelism

Partially parallel

Training the perception model (encoder) is fully parallel (batch processing). Training the dynamics model (RNN/RSSM) is sequential along the time dimension, but parallel across batch elements.
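
A toy illustration of this pattern: the loop below steps through time sequentially, while each step processes the entire batch of trajectories in parallel (all shapes are illustrative).

```python
# Sequential in time, parallel over the batch: the typical recurrent
# world-model training pattern.
import torch

B, T, Z, A, H = 64, 50, 32, 3, 256    # batch, time, latent, action, hidden
z_seq = torch.randn(B, T, Z)          # encoder outputs: computed fully in parallel
a_seq = torch.randn(B, T, A)

rnn = torch.nn.LSTMCell(Z + A, H)
state = None
for t in range(T):                    # sequential along the time dimension...
    state = rnn(torch.cat([z_seq[:, t], a_seq[:, t]], dim=-1), state)
    # ...but every step handles all B trajectories at once.
```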

Paradigm

Dense

All paths active

Standard world models (VAE + RNN/RSSM + controller) use dense neural networks without routing or sparse activation. MuZero relies solely on a deterministic dynamics network without observation reconstruction — an architecturally simplified approach, but still dense.

Latent Space Dimensionality

Standard
  • 32 — Ha & Schmidhuber (2018), VAE latent dimension.
  • 1024 — DreamerV3, stochastic + deterministic.

The size of the latent vector generated by the observation encoder. It determines representation capacity and information compression. Too small — information loss; too large — slower controller training.

Number of hidden units in the dynamics model

Standard
  • 256 — Ha & Schmidhuber (2018), MDN-RNN.
  • 2048 — DreamerV3.

The size of the RNN or RSSM hidden state determines the capacity of the dynamics model to retain history and predict future states.

Imagination Horizon (prediction steps)

Critical
  • 15 — DreamerV1.
  • 64 — long horizons for tasks requiring long-term planning.

The number of timesteps simulated internally by the dynamics model when generating an imagined trajectory for policy training. A longer horizon improves long-term planning at the cost of increased compounding errors and computational overhead.

Dynamics model type

Critical
  • MDN-RNN (LSTM + Mixture Density Network) — Ha & Schmidhuber (2018).
  • RSSM (Recurrent State Space Model) — PlaNet, Dreamer; deterministic + stochastic pathway.
  • Transformer-based world model — IRIS, Genie.

The architecture used to model transitions between latent states over time. This choice determines the model's ability to capture the complexity of environment dynamics.

Common pitfalls

Agent Exploitation of World Model Imperfections
CRITICAL

An agent trained exclusively inside an imagined world model may discover policies that achieve high rewards within that imagination but fail to transfer to the real environment — by exploiting the model's prediction errors rather than learning genuine skills.

Use model temperature (uncertainty injection) to control prediction confidence and penalize overly optimistic imaginations. Regularly validate the policy in the real environment. Apply pessimistic planners that penalize uncertainty.
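
One way to build such a pessimistic planner is to penalize imagined rewards by the disagreement of an ensemble of dynamics models; the sketch below is illustrative, and the names and penalty coefficient are assumptions rather than any specific paper's API.

```python
# Sketch of a pessimistic imagined reward: subtract ensemble disagreement,
# so the policy avoids regions where the world model is unreliable.
import torch

def pessimistic_reward(ensemble, z, a, reward_model, beta=1.0):
    # Each ensemble member predicts the next latent; their spread measures
    # the epistemic uncertainty of the world model at (z, a).
    preds = torch.stack([m(z, a) for m in ensemble])  # (n_models, B, Z)
    disagreement = preds.std(dim=0).mean(dim=-1)      # (B,)
    return reward_model(z, a) - beta * disagreement
```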

Prediction Error Accumulation over Long Imagination Horizons
HIGH

Errors in the dynamics model accumulate over each step of the imagined trajectory. At long horizons (>20 steps), imagined trajectories can deviate significantly from real ones, degrading policy quality.

Limit the imagination horizon to values where cumulative errors remain acceptable. Apply uncertainty calibration techniques. Train the dynamics model on diverse inputs, including actions produced by the trained policy (on-policy data).

Catastrophic forgetting in the dynamics model under distribution shift
HIGH

When an agent explores previously unseen regions of the environment, the dynamics model may fail to generalize correctly to those states, producing unrealistic imagined trajectories in new parts of the state space.

Use a replay buffer containing data collected throughout training. Train the dynamics model on a mixture of old and new data. Apply adaptive data collection to ensure adequate state-space coverage.
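
A minimal sketch of such a mixed replay buffer; the 50/50 sampling split and the window size are illustrative choices, not prescribed values.

```python
# Sketch of a replay buffer that mixes old and recent experience, so new
# exploration does not overwrite previously learned dynamics.
import random

class MixedReplayBuffer:
    def __init__(self):
        self.old, self.recent = [], []

    def add(self, transition, recent_window=10_000):
        self.recent.append(transition)
        if len(self.recent) > recent_window:   # age out into the long-term store
            self.old.append(self.recent.pop(0))

    def sample(self, batch_size):
        half = batch_size // 2
        batch = random.sample(self.recent, min(half, len(self.recent)))
        if self.old:
            batch += random.sample(self.old,
                                   min(batch_size - len(batch), len(self.old)))
        return batch
```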

Difficulty modeling stochastic and multimodal environments
HIGH

Environments with stochastic elements or multimodal distributions over future states are difficult to capture with deterministic dynamics models. Such models tend to average across modes rather than preserving multimodality, resulting in blurry and unreliable predictions.

Use models with an explicit stochastic component (RSSM, MDN-RNN, diffusion). Model uncertainty via calibrated distributions rather than point predictions. Avoid MSE as the sole reconstruction criterion.

High computational cost in visually complex environments
MEDIUM

Training a VAE on pixel images and a dynamics model on imagined sequences demands substantial GPU resources. DreamerV3 on complex environments such as Minecraft requires tens of GPU-days.

Use low-dimensional state spaces instead of pixels where possible. Compress the latent space aggressively. Apply mixed-precision training and efficient implementations (JAX, TensorRT).

GENESIS · Source paper

Recurrent World Models Facilitate Policy Evolution
2018 · NeurIPS 2018 · David Ha, Jürgen Schmidhuber
1990

Schmidhuber — first formal work on RNN-based world models and controllers

breakthrough

Jürgen Schmidhuber published a series of papers (1990a, 1990b, 1991a) formally defining the concept of a learnable world model and a separate controller trained through that model. These works established the foundations of the MBRL paradigm with internal simulation.

2018

Ha & Schmidhuber — World Models: V-M-C with VAE, MDN-RNN, and evolutionary controller

breakthrough

Ha and Schmidhuber formalize and demonstrate a three-component architecture (Vision: VAE, Memory: MDN-RNN, Controller: CMA-ES), showing that a controller can be trained entirely inside the imagined "dreams" of a world model and then transferred to real environments (Car Racing, VizDoom).

2019

PlaNet (Hafner et al.) — latent-space planning via RSSM

breakthrough

Hafner et al. (Google Brain) propose PlaNet: a world model using a Recurrent State Space Model (RSSM) that combines deterministic and stochastic state transitions. Planning is performed by optimizing action sequences in latent space via CEM, without an actor network — the first demonstration of latent-space planning from pixels across multiple continuous control environments.

2020

DreamerV1 (Hafner et al.) — actor-critic trained entirely in imagination

breakthrough

Hafner et al. combine RSSM with an actor-critic optimized solely via backpropagation through imagined trajectories. DreamerV1 outperforms model-free baselines on the DeepMind Control Suite benchmarks.

2020

MuZero (DeepMind) — world model without observation reconstruction

breakthrough

Schrittwieser et al. (DeepMind) publish MuZero — a world model that learns only rewards, values, and policies without reconstructing observations, combined with MCTS. It achieves human-level performance in Go, Chess, Shogi, and Atari without knowledge of the game rules.

2023

DreamerV3 — generalist algorithm across 150+ tasks

breakthrough

Hafner et al. publish DreamerV3 — a generalized version of Dreamer using a single hyperparameter configuration that operates across more than 150 diverse tasks, including diamond collection in Minecraft. This is the first demonstration of world model RL generality across such a broad spectrum of environments.

2024

Genie (Google DeepMind) — interactive world model generating environments from video

Bruce et al. (Google DeepMind) publish Genie — a world model trained on unlabeled internet videos, capable of generating interactive 2D environments controlled by learned latent actions. This extends the world models paradigm to generative environment simulators.

GPU Tensor Cores
PRIMARY

Training world models — particularly the encoder (VAE/CNN), dynamics model (RNN/RSSM/Transformer), and actor-critic — is dominated by matrix operations executed efficiently by GPU tensor cores. DreamerV3 is trained on V100/A100 GPUs.

Sequential RNN/RSSM processing limits sequence-dimension scaling across multiple GPUs; parallelism is achievable through multiple independent environments or batch elements. JAX-based implementations (DreamerV3) efficiently compile and parallelize computations.
