Robots Atlas

Diffusion Model

Diffusion models introduced a generative paradigm based on reversing a stochastic Gaussian noise-addition process, enabling stable training of deep generative models without adversarial objectives and without the architectural constraints imposed by invertible flow-based networks.

Category
Abstraction level
Operation level
01

Forward Diffusion Process

Gradually adds Gaussian noise to data over T steps, producing a sequence of noisy samples used as training targets.

A fixed, non-learnable Markov chain that iteratively corrupts a data sample x0 by adding Gaussian noise according to a predefined variance schedule {β1, ..., βT}, transforming it toward an isotropic Gaussian. Analytically tractable: any timestep t can be sampled in closed form.
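The closed-form property can be sketched in a few lines of plain Python. This is a minimal illustration using a scalar data point for clarity (real implementations operate on tensors and precompute the cumulative products); the function names `linear_betas`, `alpha_bar`, and `q_sample` are illustrative, not from any particular library:

```python
import math
import random

def linear_betas(T, beta1=1e-4, betaT=0.02):
    # Linear variance schedule {beta_1, ..., beta_T} as in the original DDPM.
    return [beta1 + (betaT - beta1) * t / (T - 1) for t in range(T)]

def alpha_bar(betas, t):
    # Cumulative product alpha_bar_t = prod_{s <= t} (1 - beta_s).
    prod = 1.0
    for s in range(t + 1):
        prod *= 1.0 - betas[s]
    return prod

def q_sample(x0, t, betas, rng=random):
    # Closed-form forward sample at an arbitrary timestep t:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1).
    ab = alpha_bar(betas, t)
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps, eps
```

Because any x_t is reachable in one step, training never has to simulate the full chain: a random t is drawn per example and `q_sample` produces the training target directly.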

i/o
in
[B, C, H, W]: Clean data sample x0 from the training dataset; shape depends on data modality.
out
[B, C, H, W]: Noised sample xt at a given timestep t, with added Gaussian noise.
Linear noise schedule · Cosine noise schedule
02

Reverse Diffusion Process

Iteratively denoises a noisy sample over T steps, progressing from pure Gaussian noise to a sample from the data distribution.

A learned Markov chain that models the reverse transition p_θ(x_{t-1}|x_t) as a Gaussian with mean and variance predicted by a neural network. At inference, T sequential denoising steps transform pure noise xT ~ N(0, I) into a sample x0.
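The reverse chain can be sketched as a loop of ancestral sampling steps. A scalar toy version, assuming an `eps_model(x, t)` callable that stands in for the trained noise-prediction network (the helper name `ddpm_sample` is illustrative):

```python
import math
import random

def ddpm_sample(eps_model, T, betas, rng=random):
    # Start from pure Gaussian noise x_T ~ N(0, 1) and denoise step by step.
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, t)  # predicted noise epsilon_theta(x_t, t)
        # Posterior mean of p_theta(x_{t-1} | x_t) under the epsilon parameterization:
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alphas[t])
        # Add fresh noise sigma_t * z at every step except the final one (t = 0).
        z = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * z
    return x
```

The loop makes the Markov property concrete: each x_{t-1} depends on the previous x_t, which is why inference cost scales linearly with T.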

03

Denoising network (backbone)

Noise-prediction network conditioned on the timestep index, estimating either the noise or the mean of the reverse distribution at each denoising step.

Modular

A neural network parameterized by θ, conditioned on the noised input xt and the timestep t, that predicts the noise component ε (in DDPM parameterization) or the score function. Typically implemented as a U-Net with sinusoidal timestep embeddings and self-attention layers; alternatively a Transformer (DiT).

Backbone U-Net · Diffusion Transformer (DiT)
04

Noise Schedule

Defines the variance schedule {β1, ..., βT} controlling the rate of noise addition during the diffusion process, directly affecting generation quality and training stability.

Modular

A sequence of hyperparameters {β1, ..., βT} specifying how much noise is added at each forward step. The schedule determines the rate at which the data distribution transitions to Gaussian noise and affects the difficulty of the reverse denoising task.
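The two most common schedules can be written down directly. A minimal sketch (function names are illustrative; the cosine variant follows the construction in Nichol and Dhariwal, 2021, deriving betas from a cosine-shaped alpha_bar curve and clipping at 0.999 for numerical stability):

```python
import math

def linear_betas(T, beta1=1e-4, betaT=0.02):
    # Original DDPM: beta grows linearly from 1e-4 to 0.02 over T steps.
    return [beta1 + (betaT - beta1) * t / (T - 1) for t in range(T)]

def cosine_betas(T, s=0.008):
    # Improved DDPM: define alpha_bar via a squared cosine and recover
    # beta_t = 1 - alpha_bar(t) / alpha_bar(t-1), clipped at 0.999.
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1.0 - f(t + 1) / f(t), 0.999) for t in range(T)]
```

The cosine schedule destroys signal more gently at early timesteps, which is the property the "schedule mismatch" pitfall below refers to.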

05

Timestep Embedding

Encodes the timestep index t as a continuous vector and injects it into the denoising network, enabling the model to adapt its behavior to the current noise level.

Modular

Sinusoidal or learned embedding of the integer timestep t, injected into each residual block of the denoising network to condition it on the current noise level.
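A plain-Python sketch of the sinusoidal variant (assumes an even embedding dimension; the function name is illustrative):

```python
import math

def timestep_embedding(t, dim, max_period=10000):
    # Sinusoidal embedding of the integer timestep t: half the channels are
    # sines and half cosines at geometrically spaced frequencies, so nearby
    # timesteps get similar but distinguishable vectors. dim must be even.
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```

In practice this vector is passed through a small MLP and added (or fed via scale/shift modulation) inside each residual block of the backbone.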

Time

T = number of diffusion steps (typically 50–1000); C_net = cost of a single denoising network forward pass. Training complexity: O(C_net) per sample (a single timestep t is sampled uniformly at random per training example).

Inference requires T sequential network passes, making it significantly slower than single-pass models (e.g., GANs). Accelerated samplers (DDIM, DPM-Solver) reduce the effective T to 10–50 steps.

Memory complexity

D = data dimensionality (equal to the input, since in DDPM the latent space has full dimensionality); P = number of parameters in the denoising network.

Unlike VAE, the latent space in a standard DDPM has the same dimensionality as the input data. Latent Diffusion Models address this by operating in a compressed encoder space.

Bottleneck: Sequential reverse-process sampling

Inference requires T sequential passes through the denoising network because each step depends on the output of the previous step (Markov property), making latency proportional to T and preventing naive step-level parallelism.

Parallelism

Partially parallel

Training is fully parallel: each sample uses a randomly selected step t, so batches of independent training examples can be processed in parallel. Inference is sequential for a single sample, but multiple samples can be generated in parallel (throughput parallelism). Approaches such as Picard iteration (ParaDiGMS) explore the compute–latency tradeoff.
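Why training parallelizes trivially can be seen from the simplified DDPM objective: each example draws its own timestep independently. A scalar sketch (the `training_loss` name and the `eps_model` callable are illustrative stand-ins for a real network and optimizer step):

```python
import math
import random

def training_loss(eps_model, x0_batch, betas, rng=random):
    # Simplified DDPM objective: E[ ||eps - eps_theta(x_t, t)||^2 ].
    # Each sample draws its own random t, so a batch mixes noise levels
    # and every example is processed independently (fully parallel).
    T = len(betas)
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    loss = 0.0
    for x0 in x0_batch:
        t = rng.randrange(T)
        eps = rng.gauss(0.0, 1.0)
        xt = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
        loss += (eps - eps_model(xt, t)) ** 2
    return loss / len(x0_batch)
```

No step of this loop depends on another example's result, in contrast to the strictly sequential reverse chain at inference time.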

Paradigm

Dense

All paths active

Each denoising step applies the full network to the entire data tensor. The base diffusion model concept includes no expert routing or conditional activation sparsity.

Diffusion Steps Count (T)

Critical
  • 100: Reduced T, faster inference at the cost of quality.
  • 1000: Default in the original DDPM (Ho et al., 2020).

Controls the number of forward and reverse Markov chain steps. Larger T generally improves sample quality but increases inference cost linearly.

Noise schedule type

Standard
  • linear: Original DDPM; β ranging from 1e-4 to 0.02.
  • cosine: Improved DDPM (Nichol and Dhariwal, 2021).

Defines the variance schedule {β1, ..., βT}. Common choices: linear (original DDPM), cosine (Improved DDPM), sigmoid.

Noise Prediction Parametrization

Standard
  • epsilon (ε): Standard DDPM parameterization (Ho et al., 2020).
  • x0: Direct data prediction.

Whether the denoising network predicts the noise ε (epsilon-parameterization, standard in DDPM), the original data x0, or the score function.
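The parameterizations are interchangeable through the closed-form forward identity x_t = sqrt(ᾱ_t)·x0 + sqrt(1−ᾱ_t)·ε, so a network trained to predict one quantity implicitly determines the other. A scalar sketch of the conversion (function names are illustrative):

```python
import math

def x0_from_eps(xt, eps, alpha_bar_t):
    # Invert x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps for x0.
    return (xt - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)

def eps_from_x0(xt, x0, alpha_bar_t):
    # Invert the same identity for eps.
    return (xt - math.sqrt(alpha_bar_t) * x0) / math.sqrt(1.0 - alpha_bar_t)
```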

Denoising network backbone

Standard
  • U-Net: Standard choice for image generation (DDPM, Stable Diffusion).
  • Transformer (DiT): Diffusion Transformer, used in Sora and similar systems.

Architecture of the neural network parameterizing the reverse process. Affects capacity, training speed, and generalization.

Common pitfalls

Very slow inference due to high step count
HIGH

The default DDPM reverse process requires T=1000 sequential denoising steps, each requiring a full network forward pass, making inference orders of magnitude slower than single-pass generative models like GANs.

Use accelerated samplers such as DDIM, DPM-Solver, or PNDM to reduce effective steps to 20–100. Alternatively, use Latent Diffusion Models to operate in a compressed latent space.
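The core of the DDIM speedup is a single deterministic update that predicts x0 and then jumps directly to an earlier point on the trajectory, so timesteps can be subsampled. A scalar sketch of one update with η = 0 (the function name is illustrative; `alpha_bar_prev` may belong to a much earlier timestep than t−1):

```python
import math

def ddim_step(xt, eps_hat, alpha_bar_t, alpha_bar_prev):
    # Deterministic DDIM update (eta = 0): recover the x0 estimate implied
    # by the predicted noise, then re-noise it to the target timestep
    # without drawing fresh Gaussian noise.
    x0_pred = (xt - math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_bar_t)
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_hat
```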

Noise schedule mismatch relative to data resolution and domain
MEDIUM

The linear noise schedule from the original DDPM can destroy data signal too aggressively at early timesteps for high-resolution images, leading to suboptimal training. This schedule is not universally optimal across data types.

Use a cosine noise schedule (Nichol and Dhariwal, 2021) or explore schedules tailored to the specific domain and data resolution.

Image saturation at high classifier-free guidance weights
MEDIUM

High classifier-free guidance (CFG) weights improve condition adherence but cause out-of-distribution denoised samples, resulting in oversaturated or artifact-ridden outputs due to a train-inference mismatch.

Use dynamic thresholding (Saharia et al., Imagen) or carefully tune the CFG weight. Values in the 5–15 range are typical for text-to-image; exceeding this range risks quality degradation.
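The CFG combination itself is a one-line extrapolation of the two noise predictions; under the common convention (w = 1 recovers the conditional model, w > 1 amplifies conditioning) it reads:

```python
def cfg_noise(eps_uncond, eps_cond, guidance_weight):
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the conditional one. w = 0 -> unconditional, w = 1 -> conditional,
    # w > 1 -> amplified conditioning, which at high w pushes the denoised
    # sample out of distribution (the saturation failure described above).
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```

Note that guidance-weight conventions differ between codebases (some parameterize the conditional term directly), so the "typical 5–15" range should be read against this formulation.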

Insufficient number of training steps
HIGH

Diffusion models typically require very long training runs (hundreds of thousands to millions of gradient steps) to converge to high sample quality, especially at high resolutions.

Monitor FID on the validation set. Use exponential moving average (EMA) of model weights during training — EMA weights consistently produce better samples than the raw model.
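The EMA bookkeeping is simple to maintain alongside the optimizer. A minimal sketch over flat parameter lists (real implementations track framework tensors in place; the class name and decay value are illustrative, with decay around 0.9999 being a common choice):

```python
class EMA:
    # Exponential moving average of model weights; the smoothed shadow
    # copy, not the raw weights, is what gets used for sampling/eval.
    def __init__(self, params, decay=0.9999):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        d = self.decay
        self.shadow = [d * s + (1.0 - d) * p for s, p in zip(self.shadow, params)]
```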

GENESIS · Source paper

Deep Unsupervised Learning using Nonequilibrium Thermodynamics
2015 · ICML 2015 · Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan et al.
2015

First formal definition of diffusion generative models (Sohl-Dickstein et al.)

breakthrough

Sohl-Dickstein et al. published 'Deep Unsupervised Learning using Nonequilibrium Thermodynamics' at ICML 2015, introducing the forward-reverse diffusion framework inspired by non-equilibrium thermodynamics as a tractable generative model.

2020

DDPM: practical high-quality image generation (Ho et al.)

breakthrough

Ho, Jain, and Abbeel published 'Denoising Diffusion Probabilistic Models' (DDPM) at NeurIPS 2020, reframing diffusion models with a simplified noise-prediction objective and achieving GAN-competitive image quality on CIFAR-10 (FID 3.17).

2020

DDIM: accelerated deterministic sampling (Song et al.)

breakthrough

Song, Meng, and Ermon proposed Denoising Diffusion Implicit Models (DDIM), enabling non-Markovian sampling that reduces required inference steps from 1000 to 50–100 without retraining.

2021

Improved DDPM: cosine schedule and log-likelihood (Nichol and Dhariwal)

Nichol and Dhariwal published 'Improved Denoising Diffusion Probabilistic Models', introducing the cosine noise schedule and learned variance, improving log-likelihoods and generation quality.

2021

Diffusion Models surpass GANs in image synthesis (Dhariwal and Nichol)

breakthrough

Dhariwal and Nichol demonstrated that diffusion models with classifier guidance surpass state-of-the-art GANs on FID metrics on ImageNet 256×256, establishing diffusion models as the leading paradigm for high-quality image generation.

2021

Unification via SDE (Song et al.)

Song et al. published 'Score-Based Generative Modeling through Stochastic Differential Equations' (ICLR 2021), unifying DDPM and score-based generative models under a continuous-time SDE framework.

2022

Latent Diffusion Models and Stable Diffusion (Rombach et al.)

breakthrough

Rombach et al. published 'High-Resolution Image Synthesis with Latent Diffusion Models' (CVPR 2022), applying diffusion in a learned latent space to reduce computational cost. This work led directly to Stable Diffusion, open-sourced by Stability AI.

GPU Tensor Cores · PRIMARY

Training and inference for diffusion models involve large batches of dense floating-point operations (convolutions, attention) on image-resolution tensors, which map well to GPU Tensor Core parallelism. Training at high resolutions requires substantial VRAM.

Training large diffusion models typically requires multi-GPU or multi-node configurations. Inference for a single sample is sequential, but can be batched across multiple samples simultaneously.

TPU · GOOD

TPUs are used to train large diffusion models (e.g., Imagen by Google Brain) and handle the dense matrix operations required by U-Net and Transformer backbones via JAX/Flax.

JAX/Flax implementations of diffusion models (e.g., via Hugging Face Diffusers) are compatible with TPU inference and training.

Related AI models

AlphaFold

Other