For each prompt x the policy π_θ generates a group G = {y_1, …, y_N} (N rollouts, typically 8–64). Each y_i gets a scalar reward r_i from the verifier (rule-based: correctness, format). GRPO computes each rollout's advantage as the normalised within-group relative reward:
Â_i = (r_i - mean(r_1..r_N)) / std(r_1..r_N)
so a rollout better than the group mean gets a positive advantage, a worse one negative — with no value model. The policy update then uses a PPO-style surrogate loss with importance-ratio clipping:
L = E[ min( ρ_i·Â_i, clip(ρ_i, 1-ε, 1+ε)·Â_i ) ] - β·KL(π_θ || π_ref)
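where ρ_i = π_θ(y_i|x)/π_old(y_i|x) is the importance ratio, ε is the clipping range (typically 0.2), and β·KL regularises the policy towards the reference policy (usually the SFT model). As written, L is an objective to maximise; implementations minimise its negative. Key differences from PPO: (1) no value network, the group mean serves as the baseline; (2) the advantage is computed once per sequence from the outcome reward and shared by all of its tokens, not estimated per token; (3) the KL term is added directly to the objective, estimated with an unbiased estimator, rather than folded into the reward. GRPO was introduced in DeepSeekMath for mathematical reasoning; DeepSeek-R1 later used it for general reasoning RL.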
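The unbiased KL estimator mentioned in point (3) can be sketched as follows. This is a minimal illustration, assuming the commonly used form ratio - log(ratio) - 1 with ratio = π_ref/π_θ evaluated on tokens sampled from π_θ; the function names and the per-token log-probability inputs are hypothetical, not taken from any particular library.

```python
import math

def kl_estimate(logp_theta, logp_ref):
    """Per-token estimate of KL(pi_theta || pi_ref) for a token sampled from pi_theta.

    Computes ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta.
    The value is non-negative for every sample and its expectation
    under pi_theta equals the true KL divergence.
    """
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0

def mean_kl(logps_theta, logps_ref):
    """Average the per-token estimates over one sampled sequence."""
    terms = [kl_estimate(a, b) for a, b in zip(logps_theta, logps_ref)]
    return sum(terms) / len(terms)
```

When the policy and the reference assign the same log-probability to a token the estimate is exactly zero, and it grows as the two diverge in either direction.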
Classical PPO in RLHF requires four networks simultaneously: the policy (trained), the reference model (for the KL term), the reward model, and the value model (a critic estimating V(s) as the advantage baseline). The value model is typically as large as the policy, so it roughly doubles the memory needed for trained networks, and it is hard to train (it needs its own stabilisation). For tasks with reward only at the end of the sequence (math/code: 0/1 for correctness) the critic is especially problematic, because it must predict per-token values from a single terminal signal. GRPO observes: if we generate N responses to the same prompt, their mean reward is a natural, parameter-free baseline, with no need to learn V(s). This removes the critic and its optimiser state, cuts memory substantially, and simplifies the pipeline.
Generation of N independent responses to the same prompt. The basis of GRPO — the group replaces the value network as the baseline source.
Normalised within-group relative reward of a rollout. Replaces A = r - V(s) from PPO without a value model. The core GRPO contribution.
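A minimal sketch of the group-relative advantage for one prompt. The small eps in the denominator and the use of the population standard deviation are assumptions made here for illustration; implementations differ on both.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Return (r_i - mean) / (std + eps) for every reward in one rollout group.

    eps (an assumption of this sketch) keeps the division finite when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; sample std is an equally plausible choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example group: four rollouts, two judged correct (1) and two wrong (0).
advantages = group_relative_advantages([1, 0, 0, 1])
```

Correct rollouts in the example receive an advantage close to +1 and wrong ones close to -1, and the advantages sum to zero within the group.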
Loss function: min(ρ·Â, clip(ρ,1±ε)·Â) with a KL penalty to the reference policy. Inherits PPO stability without its value model.
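The clipped term can be sketched for a single rollout as below. This is an illustration only: it works on sequence-level log-probabilities, omits the KL penalty, and the function name and arguments are hypothetical.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Return min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one rollout.

    rho is the importance ratio pi_theta / pi_old, computed from
    log-probabilities for numerical convenience.
    """
    rho = math.exp(logp_new - logp_old)
    rho_clipped = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(rho * advantage, rho_clipped * advantage)
```

With a positive advantage the objective stops growing once rho exceeds 1 + ε, and with a negative advantage it stops improving once rho falls below 1 - ε, so a single batch cannot push the policy arbitrarily far in the direction the advantage favours.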
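When all N rollouts get the same reward (all correct or all wrong), r - mean = 0 for every rollout and std ≈ 0, so (r-mean)/std is 0/0: numerically unstable without a small ε in the denominator, and exactly zero (no gradient) with one. More generally, dividing by std upweights prompts whose rewards barely vary, that is, very easy or very hard ones; Dr.GRPO identifies this as a bias of the original GRPO.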
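One common mitigation is to drop groups that carry no learning signal before computing advantages, similar in spirit to dynamic sampling. The sketch below is an assumption-laden illustration: the helper names and the tolerance are hypothetical.

```python
def is_degenerate_group(rewards, tol=1e-8):
    """True when every rollout in the group received (almost) the same reward."""
    return max(rewards) - min(rewards) <= tol

def keep_informative_groups(groups):
    """Filter out groups whose advantages would all be zero or undefined."""
    return [g for g in groups if not is_degenerate_group(g)]

# Three hypothetical groups: all correct, mixed, all wrong.
kept = keep_informative_groups([[1, 1, 1], [0, 1, 0], [0, 0]])
```

Only the mixed group survives the filter, so the batch contains no prompts with a vanishing gradient.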
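Per-sequence length normalisation (dividing each rollout's token-loss sum by its length) spreads the advantage over the tokens, so a long incorrect answer is penalised less per token than a short one, while a short correct answer is rewarded more per token than a long one. The net effect is that wrong answers drift longer, leading to uncontrolled CoT bloat. A known vanilla-GRPO artifact, analysed by Dr.GRPO.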
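The effect of length on the per-token update weight can be sketched as follows. The two modes and the max_len constant are assumptions for illustration: "sequence_mean" divides by the rollout's own length, "constant" divides by a fixed budget that does not depend on the rollout.

```python
def per_token_weight(advantage, length, mode="sequence_mean", max_len=1024):
    """Weight applied to each token's gradient for one rollout.

    sequence_mean: advantage / length, so longer rollouts get smaller weights.
    constant:      advantage / max_len, identical for every length.
    """
    if mode == "sequence_mean":
        return advantage / length
    if mode == "constant":
        return advantage / max_len
    raise ValueError(f"unknown mode: {mode}")

# Two hypothetical wrong answers (advantage -1) of different lengths.
short_wrong = per_token_weight(-1.0, 100)
long_wrong = per_token_weight(-1.0, 1000)
```

Under the per-sequence mean the 1000-token wrong answer is penalised ten times less per token than the 100-token one; with a constant divisor both receive the same per-token weight.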
If many updates are done on the same rollout batch, ρ = π_θ/π_old drifts from 1 and clipping is no longer sufficient — training diverges.
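Ratio drift of this kind is usually watched through simple diagnostics. The sketch below, with hypothetical names, reports the mean importance ratio and the fraction of samples falling outside the clipping range.

```python
import math

def ratio_diagnostics(logps_new, logps_old, clip_eps=0.2):
    """Summarise how far pi_theta has moved from pi_old on the current batch."""
    ratios = [math.exp(n - o) for n, o in zip(logps_new, logps_old)]
    outside = sum(1 for r in ratios if r < 1.0 - clip_eps or r > 1.0 + clip_eps)
    return {
        "mean_ratio": sum(ratios) / len(ratios),
        "clip_fraction": outside / len(ratios),
    }
```

A clip fraction that keeps rising across repeated passes over the same batch is a reasonable signal to stop reusing it and sample fresh rollouts; the threshold for doing so is a tuning choice, not something the text specifies.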
Proximal Policy Optimization — clipped surrogate objective with a value model as baseline. GRPO will remove its critic.
PPO with 4 networks (policy, reference, reward, value) becomes the RLHF standard — memory-costly and hard to stabilise.
Shao et al. (DeepSeek) publish DeepSeekMath (arXiv:2402.03300) and with it GRPO — removing the value model, baseline from a rollout group. First application: math reasoning on DeepSeekMath 7B.
DeepSeek-R1 (January 2025) uses GRPO at full 671B MoE scale to train long-CoT reasoning. GRPO becomes the de-facto open-source reasoning RL standard.
ByteDance DAPO (clip-higher, dynamic sampling, no KL), Dr.GRPO (removes std and length normalisation bias), GRPO++ — a wave of improvements addressing the original GRPO's instabilities and biases.
GRPOTrainer in Hugging Face TRL, native support in verl (ByteDance) and OpenRLHF make GRPO an off-the-shelf recipe for reasoning RL.
Time complexity: O(N · L · |θ|) sampling + O(N · L · |θ|) update, without a value-model forward pass. Space complexity: O(2 · |θ|) (policy + reference) vs O(3-4 · |θ|) in PPO.
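As a rough worked example of the space figures, the sketch below counts weights only. It assumes 2 bytes per parameter, equally sized models, and a hypothetical 7B-parameter policy; optimiser state, activations and the KV cache are deliberately ignored, so real footprints are larger.

```python
def weight_memory_gb(n_params, n_model_copies, bytes_per_param=2):
    """Weights-only footprint in decimal GB for several equally sized models."""
    return n_params * bytes_per_param * n_model_copies / 1e9

# Hypothetical 7B-parameter models stored in 16-bit precision.
grpo_gb = weight_memory_gb(7e9, n_model_copies=2)  # policy + reference
ppo_gb = weight_memory_gb(7e9, n_model_copies=4)   # policy + reference + reward + value
```

Under these assumptions the weights take 28 GB for GRPO and 56 GB for four-network PPO. With a rule-based verifier in place of a learned reward model, the PPO figure drops to three copies.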
Generating N long rollouts per prompt dominates the cost, as is typical in reasoning RL. Removing the value model saves memory, but sampling remains the bottleneck.
Number of rollouts per prompt — defines the baseline quality (group mean). Small N = high advantage variance; large N = better signal at the cost of compute. Typically 8–64.
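The trade-off can be made concrete for binary rewards. The sketch assumes each rollout is independently correct with probability p_correct (a simplification), in which case the group mean has standard error sqrt(p(1-p)/N).

```python
import math

def baseline_standard_error(p_correct, n_rollouts):
    """Standard error of the group-mean baseline for independent 0/1 rewards."""
    return math.sqrt(p_correct * (1.0 - p_correct) / n_rollouts)

# Hypothetical prompt that the policy solves half of the time.
se_small_group = baseline_standard_error(0.5, 4)
se_large_group = baseline_standard_error(0.5, 64)
```

Going from N = 4 to N = 64 cuts the baseline noise by a factor of four (0.25 to 0.0625) while multiplying the sampling cost by sixteen.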
Importance-ratio clipping bound (as in PPO). Limits the size of a single update, preventing destabilisation. Typically 0.2; DAPO experiments with asymmetric clipping.
KL regularisation strength to the reference policy. DeepSeek-R1 uses a very small β (aggressive RL); some variants (DAPO) drop KL entirely for more exploration.
How the within-group relative reward is normalised. Standard: (r - mean)/std. Some variants drop the std division (when std≈0 for easy prompts it causes instability — Dr.GRPO addresses this).
Whether the advantage applies to the whole sequence (outcome reward, vanilla GRPO) or per-token/per-step (process reward). Process supervision gives a denser signal but requires a PRM.
GRPO can be applied to dense and MoE models (DeepSeek-R1 = 671B MoE). The algorithm itself is agnostic to the policy architecture.
No routing — GRPO is an optimisation algorithm, not a model structure.
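Both group generation and the policy update parallelise fully across devices: rollouts are independent of one another, and the update is standard data-parallel training. The absence of a value network reduces communication and memory versus PPO.
GRPO is RL on LLMs, so sampling (vLLM/SGLang) and training (FSDP/Megatron) dominate the workload. The absence of a value model substantially reduces VRAM demand versus PPO, since one trained network and its optimiser state are no longer needed.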
Algorithmically agnostic; TPU implementations exist (JAX), though the ecosystem is CUDA-first.
The algorithm itself is hardware-neutral, but LLM RL scale requires GPU/TPU; CPU is infeasible for models > 1B.