PPO
Components
PPO's main objective: L^CLIP(θ) = E_t[ min(r_t(θ) · Â_t, clip(r_t(θ), 1−ε, 1+ε) · Â_t) ]. The min operator selects the more pessimistic value, creating a lower bound on the objective and limiting the magnitude of a single policy update. ε ~ 0.1–0.2.
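A minimal sketch of this clipped surrogate loss in PyTorch (names such as logp_new/logp_old are illustrative; they denote per-sample log-probabilities under the current and old policies):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed in log-space for stability
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min() keeps the pessimistic branch; the objective is maximized, so the loss is its negative
    return -torch.mean(torch.min(unclipped, clipped))
```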
A parameterized policy π_θ(a|s) mapping states to probability distributions over actions. In LLM applications, the policy is the Transformer itself generating tokens: a = token, s = context prefix.
Value network (the critic) estimating V_φ(s) = E[R | s], used to compute the advantage Â_t. Trained by minimizing MSE against observed returns or TD-targets.
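A one-line sketch of that regression target, assuming values_pred are the critic's predictions and returns the targets (names illustrative):

```python
import torch

def value_loss(values_pred: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # MSE between V_phi(s) predictions and observed returns / TD-targets
    return torch.mean((values_pred - returns) ** 2)
```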
Advantage estimator: Â_t^GAE(γ,λ) = Σ_{l=0}^∞ (γλ)^l δ_{t+l}, where δ_t = r_t + γV(s_{t+1}) − V(s_t). Hyperparameter λ ∈ [0,1] controls the bias-variance trade-off: λ=0 → low variance high bias, λ=1 → high variance low bias.
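A sketch of the GAE recursion, assuming values holds T+1 entries (including a bootstrap value for the final state) and dones is a 0/1 mask:

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: [T]; values: [T+1] (bootstrap value for the last state included)
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    returns = adv + values[:-1]  # targets for the critic's MSE loss
    return adv, returns
```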
Implementation
Omitting per-batch normalization of the advantage estimate Â_t leads to training instability: the product ratio · Â can explode at large reward scales. Normalizing (Â − mean) / std within the batch is standard practice, though not explicitly covered in the original paper.
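The usual normalization step, as a sketch:

```python
import torch

def normalize_advantages(adv: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Per-batch standardization; eps guards against a zero standard deviation
    return (adv - adv.mean()) / (adv.std() + eps)
```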
PPO without a KL penalty (or with too small a β) can aggressively exploit reward model weaknesses, generating responses that score highly but are of low actual quality. This is the classic reward-hacking problem of RL against a learned reward model.
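A common mitigation is to fold a per-token KL penalty against the frozen SFT/reference policy into the reward; a sketch with illustrative names (rm_scores is assumed broadcastable per token for simplicity):

```python
import torch

def kl_shaped_rewards(logp_policy: torch.Tensor, logp_ref: torch.Tensor,
                      rm_scores: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    # Log-ratio against the reference policy; its mean over the policy's own samples
    # estimates KL(pi_theta || pi_ref). beta = 0.1 is an illustrative value.
    kl = logp_policy - logp_ref
    return rm_scores - beta * kl
```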
PPO is known for high variance of results across seeds and hyperparameters. Henderson et al., "Deep RL that Matters" (2018), showed that the same PPO implementation can yield results differing by on the order of 50% depending solely on the random seed.
When actor and critic share weights (typical for CNN backbones in Atari and for shared Transformer trunks), the critic objective can dominate the policy gradient. The c1 coefficient must be chosen carefully (typically 0.5).
PPO is on-policy: reusing samples collected too long ago (after many optimization iterations) causes drift between π_θ_old and π_θ. This manifests as ratios far from 1, where clipping zeroes out most of the gradient.
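A common diagnostic is the fraction of samples whose ratio has left the clipping band; a sketch:

```python
import torch

def clipped_fraction(logp_new, logp_old, eps=0.2):
    # A high value means the rollouts are too stale relative to the current policy
    ratio = torch.exp(logp_new - logp_old)
    return ((ratio - 1.0).abs() > eps).float().mean()
```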
Evolution
Trust Region Policy Optimization (arXiv:1502.05477) — PPO's direct predecessor. Optimizes a surrogate objective with a KL constraint via second-order constrained optimization (conjugate gradient + line search). Effective but complex.
Paper arXiv:1707.06347 introduces PPO as a simplified alternative to TRPO: instead of a constrained second-order problem, it clips the probability ratio directly in the objective. This allows using standard SGD/Adam and is compatible with network architectures sharing parameters between actor and critic.
OpenAI Five (5 neural networks trained with PPO at ~10⁵ CPU + 256 GPU scale) defeats team OG, Dota 2 world champions. Demonstration of PPO scalability to complex multi-agent environments with long time horizons.
Ouyang et al. (2022) used PPO as the third stage of the RLHF pipeline for GPT-3 with a KL penalty relative to the SFT model (PPO-ptx). This became the standard approach for LLM alignment (ChatGPT, Claude, Gemini). PPO found a second life beyond classical robotics and games.
Paper and blog post (ICLR Blog Track 2022 / "The 37 Implementation Details of PPO", Huang et al.) systematizing 37 underdocumented PPO implementation details (advantage normalization, orthogonal initialization, learning rate annealing, gradient clipping, etc.) critical to results. Crucial for the reproducibility of PPO results in the community.
Technical details
Hyperparameters (configurable axes)
ε (clip range): probability ratio clipping hyperparameter in PPO-Clip. Determines how much the ratio can deviate from 1 (i.e., from the old policy) before the gradient is clipped.
λ (GAE): Generalized Advantage Estimation hyperparameter controlling the advantage estimator's bias-variance trade-off.
γ (discount factor): discount on future rewards. Determines the effective planning horizon.
Number of PPO epochs: how many times the same data batch is used to update the policy before collecting new rollouts. Too many leads to over-optimization and instability.
c1 (value loss coefficient): weight of the critic MSE loss in the combined objective L = L^CLIP − c1 · L^VF + c2 · S[π], where S[π] is the policy entropy.
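Putting the coefficients together, a sketch of the combined loss (c2 = 0.01 is an illustrative entropy weight; value_loss is the critic MSE):

```python
import torch

def ppo_total_loss(l_clip, value_loss, entropy, c1=0.5, c2=0.01):
    # The paper maximizes L = L^CLIP - c1 * L^VF + c2 * S[pi];
    # as a loss for SGD/Adam we minimize its negative.
    return -(l_clip - c1 * value_loss + c2 * entropy)
```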
Execution paradigm
PPO itself is a policy-gradient optimization algorithm; actor and critic networks can use any architecture (MLP, CNN, Transformer). In the RLHF-for-LLMs setting, the actor is a dense Transformer and the critic is typically a smaller network or an additional head on top of the base Transformer.
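A sketch of such an additional value head on top of a Transformer trunk (a hypothetical module, not any specific library's API):

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Scalar value head attached to a shared Transformer trunk (RLHF-style critic)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.v_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size] from the base model's last layer;
        # returns per-token value estimates V_phi(s_t) of shape [batch, seq_len]
        return self.v_head(hidden_states).squeeze(-1)
```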
Parallelism
Commonly, N parallel actors are used to collect rollouts (e.g., vectorized environments in stable-baselines3, A3C-style worker pools). In RLHF for LLMs, parallelism is achieved via DDP/FSDP/DeepSpeed at the level of response generation and gradient computation.
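For the classical case, a minimal sketch with stable-baselines3's vectorized environments (hyperparameters are illustrative; API as of SB3 2.x):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 8 parallel environment copies feed a single PPO learner
vec_env = make_vec_env("CartPole-v1", n_envs=8)
model = PPO("MlpPolicy", vec_env, n_steps=128, batch_size=256, verbose=1)
model.learn(total_timesteps=100_000)
```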
Hardware requirements
PPO for deep networks (especially Transformer in RLHF) benefits from efficient Tensor Core GEMM operations. Rollout generation requires fast forward passes; gradient computation is classic backward pass. NVIDIA A100/H100 with large HBM (40–80 GB) is standard for RLHF on 7B+ models.
Classical PPO for MuJoCo/Atari can be efficiently trained on CPU with vectorized environments (environments are cheaper than network compute). OpenAI Five used ~128k CPU cores for rollout collection alone + 256 GPUs for training.
Implementable on TPUs via JAX/Flax (e.g., DeepMind's Acme). Used in some Google RL research. Requires adapting the environment interaction loop.