PPO
Components
PPO's main objective: L^CLIP(θ) = E_t[ min(r_t(θ) · Â_t, clip(r_t(θ), 1−ε, 1+ε) · Â_t) ]. The min operator selects the more pessimistic value, creating a lower bound on the objective and limiting the magnitude of a single policy update. ε ~ 0.1–0.2.
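A minimal sketch of this clipped surrogate loss in PyTorch (names such as logp_new/logp_old are illustrative; they denote per-sample log-probabilities under the current and old policies):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed in log-space for stability
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min() keeps the pessimistic branch; the objective is maximized, so the loss is its negative
    return -torch.mean(torch.min(unclipped, clipped))
```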
A parameterized policy π_θ(a|s) mapping states to probability distributions over actions. In LLM applications, the policy is the Transformer itself generating tokens: a = token, s = context prefix.
Value network (the critic) estimating V_φ(s) = E[R | s], used to compute the advantage Â_t. Trained by minimizing MSE against observed returns or TD-targets.
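A one-line sketch of that regression target, assuming values_pred are the critic's predictions and returns the targets (names illustrative):

```python
import torch

def value_loss(values_pred: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # MSE between V_phi(s) predictions and observed returns / TD-targets
    return torch.mean((values_pred - returns) ** 2)
```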
Advantage estimator: Â_t^GAE(γ,λ) = Σ_{l=0}^∞ (γλ)^l δ_{t+l}, where δ_t = r_t + γV(s_{t+1}) − V(s_t). Hyperparameter λ ∈ [0,1] controls the bias-variance trade-off: λ=0 → low variance high bias, λ=1 → high variance low bias.
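A sketch of the GAE recursion, assuming values holds T+1 entries (including a bootstrap value for the final state) and dones is a 0/1 mask:

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: [T]; values: [T+1] (bootstrap value for the last state included)
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    returns = adv + values[:-1]  # targets for the critic's MSE loss
    return adv, returns
```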
Implementation
Omitting per-batch normalization of the advantage estimate Â_t leads to training instability: the product ratio · Â can explode at large reward scales. Normalizing (Â − mean) / std within the batch is standard practice, though not explicitly covered in the original paper.
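The usual normalization step, as a sketch:

```python
import torch

def normalize_advantages(adv: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Per-batch standardization; eps guards against a zero standard deviation
    return (adv - adv.mean()) / (adv.std() + eps)
```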
PPO without a KL penalty (or with too small a β) can aggressively exploit reward model weaknesses, generating responses that score highly but are of low actual quality. This is the classic reward-hacking problem of RL against a learned reward model.
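A common mitigation is to fold a per-token KL penalty against the frozen SFT/reference policy into the reward; a sketch with illustrative names (rm_scores is assumed broadcastable per token for simplicity):

```python
import torch

def kl_shaped_rewards(logp_policy: torch.Tensor, logp_ref: torch.Tensor,
                      rm_scores: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    # Log-ratio against the reference policy; its mean over the policy's own samples
    # estimates KL(pi_theta || pi_ref). beta = 0.1 is an illustrative value.
    kl = logp_policy - logp_ref
    return rm_scores - beta * kl
```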
PPO is known for high variance of results across seeds and hyperparameters. Henderson et al., "Deep RL that Matters" (2018), showed that the same PPO implementation can yield results differing by on the order of 50% depending solely on the random seed.
When actor and critic share weights (typical for CNN backbones in Atari and for shared Transformer trunks), the critic objective can dominate the policy gradient. The c1 coefficient must be chosen carefully (typically 0.5).
PPO is on-policy: reusing samples collected too long ago (after many optimization iterations) causes drift between π_θ_old and π_θ. This manifests as ratios far from 1, where clipping zeroes out most of the gradient.
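A common diagnostic is the fraction of samples whose ratio has left the clipping band; a sketch:

```python
import torch

def clipped_fraction(logp_new, logp_old, eps=0.2):
    # A high value means the rollouts are too stale relative to the current policy
    ratio = torch.exp(logp_new - logp_old)
    return ((ratio - 1.0).abs() > eps).float().mean()
```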
Evolution
Trust Region Policy Optimization (arXiv:1502.05477) — PPO's direct predecessor. Optimizes a surrogate objective with a KL constraint via second-order constrained optimization (conjugate gradient + line search). Effective but complex.
Paper arXiv:1707.06347 introduces PPO as a simplified alternative to TRPO: instead of a constrained second-order problem, it clips the probability ratio directly in the objective. This allows using standard SGD/Adam and is compatible with network architectures sharing parameters between actor and critic.
OpenAI Five (5 neural networks trained with PPO at ~10⁵ CPU + 256 GPU scale) defeats team OG, Dota 2 world champions. Demonstration of PPO scalability to complex multi-agent environments with long time horizons.
Ouyang et al. (2022) used PPO as the third stage of the RLHF pipeline for GPT-3 with a KL penalty relative to the SFT model (PPO-ptx). This became the standard approach for LLM alignment (ChatGPT, Claude, Gemini). PPO found a second life beyond classical robotics and games.
Paper and blog post (ICLR Blog Track 2022 / "The 37 Implementation Details of PPO", Huang et al.) systematizing 37 underdocumented PPO implementation details (advantage normalization, orthogonal initialization, learning rate annealing, gradient clipping, etc.) critical to results. Crucial for the reproducibility of PPO results in the community.
Technical details
Hyperparameters (configurable axes)
ε (clip range): probability ratio clipping hyperparameter in PPO-Clip. Determines how much the ratio can deviate from 1 (i.e., from the old policy) before the gradient is clipped.
λ (GAE): Generalized Advantage Estimation hyperparameter controlling the advantage estimator's bias-variance trade-off.
γ (discount factor): discount on future rewards. Determines the effective planning horizon.
Number of PPO epochs: how many times the same data batch is used to update the policy before collecting new rollouts. Too many leads to over-optimization and instability.
c1 (value loss coefficient): weight of the critic MSE loss in the combined objective L = L^CLIP − c1 · L^VF + c2 · S[π], where S[π] is the policy entropy.
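Putting the coefficients together, a sketch of the combined loss (c2 = 0.01 is an illustrative entropy weight; value_loss is the critic MSE):

```python
import torch

def ppo_total_loss(l_clip, value_loss, entropy, c1=0.5, c2=0.01):
    # The paper maximizes L = L^CLIP - c1 * L^VF + c2 * S[pi];
    # as a loss for SGD/Adam we minimize its negative.
    return -(l_clip - c1 * value_loss + c2 * entropy)
```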
Execution paradigm
PPO itself is a policy-gradient optimization algorithm; actor and critic networks can use any architecture (MLP, CNN, Transformer). In the RLHF-for-LLMs setting, the actor is a dense Transformer and the critic is typically a smaller network or an additional head on top of the base Transformer.
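A sketch of such an additional value head on top of a Transformer trunk (a hypothetical module, not any specific library's API):

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Scalar value head attached to a shared Transformer trunk (RLHF-style critic)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.v_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size] from the base model's last layer;
        # returns per-token value estimates V_phi(s_t) of shape [batch, seq_len]
        return self.v_head(hidden_states).squeeze(-1)
```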
Parallelism
Commonly, N parallel actors are used to collect rollouts (e.g., vectorized environments in stable-baselines3, A3C-style worker pools). In RLHF for LLMs, parallelism is achieved via DDP/FSDP/DeepSpeed at the level of response generation and gradient computation.
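For the classical case, a minimal sketch with stable-baselines3's vectorized environments (hyperparameters are illustrative; API as of SB3 2.x):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 8 parallel environment copies feed a single PPO learner
vec_env = make_vec_env("CartPole-v1", n_envs=8)
model = PPO("MlpPolicy", vec_env, n_steps=128, batch_size=256, verbose=1)
model.learn(total_timesteps=100_000)
```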
Hardware requirements
PPO for deep networks (especially Transformer in RLHF) benefits from efficient Tensor Core GEMM operations. Rollout generation requires fast forward passes; gradient computation is classic backward pass. NVIDIA A100/H100 with large HBM (40–80 GB) is standard for RLHF on 7B+ models.
Classical PPO for MuJoCo/Atari can be efficiently trained on CPU with vectorized environments (environments are cheaper than network compute). OpenAI Five used ~128k CPU cores for rollout collection alone + 256 GPUs for training.
Implementable on TPUs via JAX/Flax (e.g., DeepMind's Acme). Used in some Google RL research. Requires adapting the environment interaction loop.