Robots Atlas

World Models

Formalizes a paradigm in which an agent learns an internal model of environment dynamics and trains its control policy entirely within that model's generated simulations, dramatically improving sample efficiency.

Category
Abstraction level
Operation level
01

Perception Model / Observation Encoder (V)

Dimensionality reduction of observations — conversion of raw sensory data into a compact latent representation

Modular

Compresses high-dimensional environment observations (e.g., pixel images) into a low-dimensional latent space representation. In the original World Models (2018), this is implemented via a Variational Autoencoder (VAE). It is responsible for extracting salient spatial features from observations.

Variational Autoencoder (VAE) · CNN Encoder · RSSM (stochastic and deterministic representation)
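
Below is a minimal PyTorch sketch of such an encoder, assuming 64×64 RGB observations and a 32-dimensional latent as in Ha & Schmidhuber (2018); the exact layer sizes are illustrative, not the paper's verbatim architecture.

```python
# Sketch of the V component: a convolutional VAE encoder mapping 64x64 RGB
# observations to a 32-dimensional latent. Sizes follow Ha & Schmidhuber
# (2018); the code itself is illustrative.
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64 -> 31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31 -> 14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14 -> 6
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(), # 6 -> 2
        )
        self.fc_mu = nn.Linear(256 * 2 * 2, latent_dim)
        self.fc_logvar = nn.Linear(256 * 2 * 2, latent_dim)

    def forward(self, obs: torch.Tensor):
        h = self.conv(obs).flatten(start_dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```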
02

Environment Dynamics Model (M)

Temporal environment dynamics modeling — predicting future latent states conditioned on agent actions

Modular

Predicts the next latent states based on the current latent state and the agent's action. It forms the core of the world model — its capacity for temporal extrapolation enables the generation of synthetic trajectories. In the original World Models paper, this component is implemented as an MDN-RNN (Mixture Density Network + LSTM).

MDN-RNN (Mixture Density Network + LSTM) · RSSM (Recurrent State Space Model) · MuZero Dynamics Network
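
A minimal sketch of one MDN-RNN step, assuming z = 32, hidden = 256, and 5 mixture components as in Ha & Schmidhuber (2018); module and function names are illustrative. The temperature knob shown in the sampler is the same mechanism recommended later against model exploitation.

```python
# Sketch of the M component: an MDN-RNN that, given (z_t, a_t), predicts a
# Gaussian-mixture distribution over z_{t+1}.
import torch
import torch.nn as nn

class MDNRNN(nn.Module):
    def __init__(self, latent_dim=32, action_dim=3, hidden_dim=256, n_mix=5):
        super().__init__()
        self.lstm = nn.LSTMCell(latent_dim + action_dim, hidden_dim)
        # Per mixture component: one weight logit, a mean vector, a log-std vector.
        self.head = nn.Linear(hidden_dim, n_mix * (1 + 2 * latent_dim))
        self.n_mix, self.latent_dim = n_mix, latent_dim

    def forward(self, z, a, state=None):
        h, c = self.lstm(torch.cat([z, a], dim=-1), state)
        logit, mu, log_std = self.head(h).split(
            [self.n_mix,
             self.n_mix * self.latent_dim,
             self.n_mix * self.latent_dim], dim=-1)
        mu = mu.view(-1, self.n_mix, self.latent_dim)
        log_std = log_std.view(-1, self.n_mix, self.latent_dim)
        return logit, mu, log_std, (h, c)

def sample_next_z(logit, mu, log_std, temperature=1.0):
    # Pick a mixture component, then sample its Gaussian; higher temperature
    # widens the distribution (the "uncertainty injection" trick).
    k = torch.distributions.Categorical(logits=logit / temperature).sample()
    idx = k.view(-1, 1, 1).expand(-1, 1, mu.size(-1))
    mu_k = mu.gather(1, idx).squeeze(1)
    std_k = log_std.gather(1, idx).squeeze(1).exp() * temperature ** 0.5
    return mu_k + std_k * torch.randn_like(mu_k)
```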
03

Controller / Policy (C)

Maps the internal state (latent + RNN hidden state) to agent actions; optimized against reward within generated trajectories

Modular

The agent's decision-making module that maps the current latent state and the hidden state of the dynamics model to actions executed in the environment. In the original World Models architecture, it is compact (linear or a small MLP) and trained separately from the world model — using an evolutionary method (CMA-ES) on generated "dreams".

Linear Controller · Latent-Space Actor-Critic
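
A sketch of the linear controller and its CMA-ES training loop, using the pycma package (`pip install cma`); `dream_rollout_return` is a hypothetical function returning the cumulative imagined reward of one rollout under the candidate parameters.

```python
# Sketch of the C component: a linear controller a_t = tanh(W [z_t; h_t] + b),
# trained with CMA-ES on imagined rollouts. Dimensions are illustrative.
import numpy as np
import cma

Z_DIM, H_DIM, A_DIM = 32, 256, 3
N_PARAMS = (Z_DIM + H_DIM + 1) * A_DIM  # weights plus bias

def act(params, z, h):
    W = params[:-A_DIM].reshape(A_DIM, Z_DIM + H_DIM)
    b = params[-A_DIM:]
    return np.tanh(W @ np.concatenate([z, h]) + b)  # bounded actions

es = cma.CMAEvolutionStrategy(np.zeros(N_PARAMS), 0.1)
while not es.stop():
    candidates = es.ask()
    # CMA-ES minimizes, so negate the imagined return.
    fitnesses = [-dream_rollout_return(act, p) for p in candidates]
    es.tell(candidates, fitnesses)
best_params = es.result.xbest
```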
04

Imagined Trajectory Generation (Dreaming)

Generates synthetic training data for the controller via internal simulation — replacing costly interactions with the real environment.

The mechanism for generating synthetic trajectories by unrolling a dynamics model over time — without interaction with the real environment. The agent "dreams": it initializes a latent state, then sequentially predicts subsequent states by applying the dynamics model and selecting actions via a controller. The resulting sequences are used for policy optimization.
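
A minimal sketch of one imagined rollout, reusing the encoder, MDN-RNN, and controller sketched above plus a hypothetical learned reward head; all names are illustrative, and the default horizon of 15 follows DreamerV1.

```python
# Sketch of "dreaming": unroll the learned dynamics model for `horizon`
# steps entirely in latent space, with no environment interaction.
import torch

def dream_rollout(first_obs, horizon=15):
    z, _, _ = encoder(first_obs)   # seed the latent from a real observation
    state = None                   # LSTM state starts empty
    total_reward = 0.0
    for _ in range(horizon):
        h = state[0] if state is not None else torch.zeros(z.size(0), 256)
        a = controller(z, h)                          # act inside the dream
        logit, mu, log_std, state = mdn_rnn(z, a, state)
        z = sample_next_z(logit, mu, log_std)         # imagined next latent
        total_reward += reward_model(z, state[0])     # predicted, not real, reward
    return total_reward
```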

Bottleneck: Training dynamics models on long trajectories and generating imagined sequences

The dynamics model (RNN/RSSM) requires sequential processing of time steps, which limits parallelism during training. With long imagination horizons (e.g., 15–50 steps in Dreamer), the cost of training the actor-critic via backpropagation through the unrolled dynamics model becomes the dominant computational expense.

Parallelism

Partially parallel

Training the perception model (encoder) is fully parallel (batch processing). Training the dynamics model (RNN/RSSM) is sequential along the time dimension, but parallel across batch elements.
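
A toy illustration of this pattern: the loop below steps through time sequentially, while each step processes the entire batch of trajectories in parallel (all shapes are illustrative).

```python
# Sequential in time, parallel over the batch: the typical recurrent
# world-model training pattern.
import torch

B, T, Z, A, H = 64, 50, 32, 3, 256    # batch, time, latent, action, hidden
z_seq = torch.randn(B, T, Z)          # encoder outputs: computed fully in parallel
a_seq = torch.randn(B, T, A)

rnn = torch.nn.LSTMCell(Z + A, H)
state = None
for t in range(T):                    # sequential along the time dimension...
    state = rnn(torch.cat([z_seq[:, t], a_seq[:, t]], dim=-1), state)
    # ...but every step handles all B trajectories at once.
```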

Paradigm

Dense

All paths active

Standard world models (VAE + RNN/RSSM + controller) use dense neural networks without routing or sparse activation. MuZero relies solely on a deterministic dynamics network without observation reconstruction — an architecturally simplified approach, but still dense.

Latent Space Dimensionality

Standard
  • 32 — Ha & Schmidhuber (2018), VAE latent dimension.
  • 1024 — DreamerV3, stochastic + deterministic.

The size of the latent vector generated by the observation encoder. It determines representation capacity and information compression. Too small — information loss; too large — slower controller training.

Number of hidden units in the dynamics model

Standard
  • 256 — Ha & Schmidhuber (2018), MDN-RNN.
  • 2048 — DreamerV3.

The size of the RNN or RSSM hidden state determines the capacity of the dynamics model to retain history and predict future states.

Imagination Horizon (prediction steps)

Critical
  • 15 — DreamerV1.
  • 64 — long horizons for tasks requiring long-term planning.

The number of timesteps simulated internally by the dynamics model when generating an imagined trajectory for policy training. A longer horizon improves long-term planning at the cost of increased compounding errors and computational overhead.

Dynamics model type

Critical
  • MDN-RNN (LSTM + Mixture Density Network) — Ha & Schmidhuber (2018).
  • RSSM (Recurrent State Space Model) — PlaNet, Dreamer; deterministic + stochastic pathway.
  • Transformer-based world model — IRIS, Genie.

The architecture used to model transitions between latent states over time. This choice determines the model's ability to capture the complexity of environment dynamics.

Common pitfalls

Agent Exploitation of World Model Imperfections
CRITICAL

An agent trained exclusively inside an imagined world model may discover policies that achieve high rewards within that imagination but fail to transfer to the real environment — by exploiting the model's prediction errors rather than learning genuine skills.

Use model temperature (uncertainty injection) to control prediction confidence and penalize overly optimistic imaginations. Regularly validate the policy in the real environment. Apply pessimistic planners that penalize uncertainty.
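
One way to build such a pessimistic planner is to penalize imagined rewards by the disagreement of an ensemble of dynamics models; the sketch below is illustrative, and the names and penalty coefficient are assumptions rather than any specific paper's API.

```python
# Sketch of a pessimistic imagined reward: subtract ensemble disagreement,
# so the policy avoids regions where the world model is unreliable.
import torch

def pessimistic_reward(ensemble, z, a, reward_model, beta=1.0):
    # Each ensemble member predicts the next latent; their spread measures
    # the epistemic uncertainty of the world model at (z, a).
    preds = torch.stack([m(z, a) for m in ensemble])  # (n_models, B, Z)
    disagreement = preds.std(dim=0).mean(dim=-1)      # (B,)
    return reward_model(z, a) - beta * disagreement
```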

Prediction Error Accumulation over Long Imagination Horizons
HIGH

Errors in the dynamics model accumulate over each step of the imagined trajectory. At long horizons (>20 steps), imagined trajectories can deviate significantly from real ones, degrading policy quality.

Limit the imagination horizon to values where cumulative errors remain acceptable. Apply uncertainty calibration techniques. Train the dynamics model on diverse inputs, including actions produced by the trained policy (on-policy data).

Catastrophic forgetting in the dynamics model under distribution shift
HIGH

When an agent explores previously unseen regions of the environment, the dynamics model may fail to generalize correctly to those states, producing unrealistic imagined trajectories in new parts of the state space.

Use a replay buffer containing data collected throughout training. Train the dynamics model on a mixture of old and new data. Apply adaptive data collection to ensure adequate state-space coverage.
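
A minimal sketch of such a mixed replay buffer; the 50/50 sampling split and the window size are illustrative choices, not prescribed values.

```python
# Sketch of a replay buffer that mixes old and recent experience, so new
# exploration does not overwrite previously learned dynamics.
import random

class MixedReplayBuffer:
    def __init__(self):
        self.old, self.recent = [], []

    def add(self, transition, recent_window=10_000):
        self.recent.append(transition)
        if len(self.recent) > recent_window:   # age out into the long-term store
            self.old.append(self.recent.pop(0))

    def sample(self, batch_size):
        half = batch_size // 2
        batch = random.sample(self.recent, min(half, len(self.recent)))
        if self.old:
            batch += random.sample(self.old,
                                   min(batch_size - len(batch), len(self.old)))
        return batch
```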

Difficulty modeling stochastic and multimodal environments
HIGH

Environments with stochastic elements or multimodal distributions over future states are difficult to capture with deterministic dynamics models. Such models tend to average across modes rather than preserving multimodality, resulting in blurry and unreliable predictions.

Use models with an explicit stochastic component (RSSM, MDN-RNN, diffusion). Model uncertainty via calibrated distributions rather than point predictions. Avoid MSE as the sole reconstruction criterion.

High computational cost in visually complex environments
MEDIUM

Training a VAE on pixel images and a dynamics model on imagined sequences demands substantial GPU resources. DreamerV3 on complex environments such as Minecraft requires tens of GPU-days.

Use low-dimensional state spaces instead of pixels where possible. Compress the latent space aggressively. Apply mixed-precision training and efficient implementations (JAX, TensorRT).

GENESIS · Source paper

Recurrent World Models Facilitate Policy Evolution
2018 · NeurIPS 2018 · David Ha, Jürgen Schmidhuber
1990

Schmidhuber — first formal work on RNN-based world models and controllers

breakthrough

Jürgen Schmidhuber published a series of papers (1990a, 1990b, 1991a) formally defining the concept of a learnable world model and a separate controller trained through that model. These works established the foundations of the MBRL paradigm with internal simulation.

2018

Ha & Schmidhuber — World Models: V-M-C with VAE, MDN-RNN, and evolutionary controller

breakthrough

Ha and Schmidhuber formalize and demonstrate a three-component architecture (Vision: VAE, Memory: MDN-RNN, Controller: CMA-ES), showing that a controller can be trained entirely inside the imagined "dreams" of a world model and then transferred to real environments (Car Racing, VizDoom).

2019

PlaNet (Hafner et al.) — latent-space planning via RSSM

breakthrough

Hafner et al. (Google Brain) propose PlaNet: a world model using a Recurrent State Space Model (RSSM) that combines deterministic and stochastic state transitions. Planning is performed by optimizing action sequences in latent space via CEM, without an actor network — the first demonstration of latent-space planning from pixels across multiple continuous control environments.

2020

DreamerV1 (Hafner et al.) — actor-critic trained entirely in imagination

breakthrough

Hafner et al. combine RSSM with an actor-critic optimized solely via backpropagation through imagined trajectories. DreamerV1 outperforms model-free baselines on the DeepMind Control Suite benchmarks.

2020

MuZero (DeepMind) — world model without observation reconstruction

breakthrough

Schrittwieser et al. (DeepMind) publish MuZero — a world model that learns only rewards, values, and policies without reconstructing observations, combined with MCTS. It achieves human-level performance in Go, Chess, Shogi, and Atari without knowledge of the game rules.

2023

DreamerV3 — generalist algorithm across 150+ tasks

breakthrough

Hafner et al. publish DreamerV3 — a generalized version of Dreamer using a single hyperparameter configuration that operates across more than 150 diverse tasks, including diamond collection in Minecraft. This is the first demonstration of world model RL generality across such a broad spectrum of environments.

2024

Genie (Google DeepMind) — interactive world model generating environments from video

Bruce et al. (Google DeepMind) publish Genie — a world model trained on unlabeled internet videos, capable of generating interactive 2D environments controlled by learned latent actions. This extends the world models paradigm to generative environment simulators.

GPU Tensor Cores
PRIMARY

Training world models — particularly the encoder (VAE/CNN), dynamics model (RNN/RSSM/Transformer), and actor-critic — is dominated by matrix operations executed efficiently by GPU tensor cores. DreamerV3 is trained on V100/A100 GPUs.

Sequential RNN/RSSM processing limits sequence-dimension scaling across multiple GPUs; parallelism is achievable through multiple independent environments or batch elements. JAX-based implementations (DreamerV3) efficiently compile and parallelize computations.
