Reasoning

GRPO

2024 · Active · Published: 10 June 2026 · Updated: 10 June 2026

Key innovation

Removes the value model (critic) from PPO and uses the within-group relative reward across N rollouts of the same prompt as the advantage baseline. With no separate value network to train, the memory and compute spent on trained networks drop by roughly half while PPO's clipped-update stability is retained, which makes reasoning-model training considerably cheaper.

How it works

For each prompt x the policy π_θ generates a group G = {y_1, …, y_N} (N rollouts, typically 8–64). Each y_i gets a scalar reward r_i from the verifier (rule-based: correctness, format). GRPO computes each rollout's advantage as the normalised within-group relative reward:

Â_i = (r_i - mean(r_1..r_N)) / std(r_1..r_N)

so a rollout better than the group mean gets a positive advantage and a worse one a negative advantage, with no value model involved. The policy update then maximises a PPO-style clipped surrogate objective with importance-ratio clipping:

J(θ) = E[ min( ρ_i·Â_i, clip(ρ_i, 1-ε, 1+ε)·Â_i ) ] - β·KL(π_θ || π_ref)

where ρ_i = π_θ(y_i|x)/π_old(y_i|x) is the importance ratio, ε is the clipping range (typically 0.2), and β·KL regularises the policy towards the reference policy (the SFT model). Key differences from PPO: (1) no value network; the group mean serves as the baseline, (2) the advantage is computed once per full sequence from the outcome reward and shared by all its tokens, rather than estimated per token, (3) the KL penalty is added directly to the objective, using an unbiased estimator, rather than being folded into the reward. GRPO was originally applied to mathematics in DeepSeekMath; DeepSeek-R1 later used it for full reasoning RL.
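A minimal sketch of this objective for a single group, assuming sequence-level log-probabilities are already available. The function name `grpo_objective`, its arguments, and the default β are illustrative choices, not taken from any library. The KL term uses the estimator r - log r - 1 with r = π_ref/π_θ, the form given in the DeepSeekMath paper; practical implementations typically apply the same computation per token.

```python
import math

def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Sequence-level clipped objective with KL penalty (to be maximised).

    logp_new / logp_old / logp_ref hold the log-probability of each rollout
    under the current, sampling (old) and reference policies.
    beta=0.04 is only an illustrative default.
    """
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        rho = math.exp(lp_new - lp_old)                # importance ratio
        clipped = max(1.0 - eps, min(rho, 1.0 + eps))  # clip(rho, 1-eps, 1+eps)
        surrogate = min(rho * adv, clipped * adv)
        # KL estimate: r - log r - 1 with r = pi_ref / pi_theta (never negative)
        log_r = lp_ref - lp_new
        kl = math.exp(log_r) - log_r - 1.0
        total += surrogate - beta * kl
    return total / len(advantages)

# With identical policies the ratio is 1 and the KL term is 0,
# so the objective reduces to the mean advantage.
print(grpo_objective([-1.0, -2.0], [-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0]))
```

Taking the minimum of the clipped and unclipped terms means a positive advantage stops being rewarded once ρ exceeds 1+ε, while a negative advantage is never softened by clipping.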

Problem solved

Classical PPO in RLHF keeps four networks in memory at once: the policy (trained), the reference model (for the KL term), the reward model, and the value model (a critic estimating V(s), used as the advantage baseline). The value model is typically as large as the policy and is itself trained, so it roughly doubles the memory needed for trained networks, and it is hard to train (it needs its own stabilisation). For tasks where reward arrives only at the end of the sequence (math/code: 0/1 for correctness), the critic is especially problematic because it must estimate per-token values from a single terminal signal. GRPO observes that if N responses are generated for the same prompt, their mean reward is a natural, parameter-free baseline, so there is no need to learn V(s). This removes the critic, roughly halves the memory for trained networks, and simplifies the pipeline.
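The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below rests on stated assumptions that are not from the source: mixed-precision Adam at about 16 bytes per trained parameter (2 for weights, 2 for gradients, 4 for master weights, 8 for the two Adam moments), 2 bytes per frozen parameter, all networks the same size, a hypothetical 7B-parameter policy, and activations and KV cache ignored.

```python
def training_memory_gb(n_params, trained, frozen, bytes_trained=16, bytes_frozen=2):
    """Rough weight + optimiser memory in GB; ignores activations and KV cache."""
    total_bytes = n_params * (trained * bytes_trained + frozen * bytes_frozen)
    return total_bytes / 1e9

N = 7e9  # hypothetical 7B-parameter policy; all networks assumed equal in size

# PPO: policy + critic are trained; reference + reward model are frozen.
ppo = training_memory_gb(N, trained=2, frozen=2)
# GRPO: the critic is gone; reference + reward model remain frozen.
grpo = training_memory_gb(N, trained=1, frozen=2)

print(f"PPO ~ {ppo:.0f} GB, GRPO ~ {grpo:.0f} GB, ratio {grpo / ppo:.2f}")
```

Under these assumptions the estimate is about 252 GB for PPO and 140 GB for GRPO. If a rule-based verifier also replaces the reward model (`frozen=1`), the GRPO figure falls to 126 GB, exactly half of the PPO estimate.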

Components

Group sampling · Source of the advantage baseline

Generation of N independent responses to the same prompt. This is the basis of GRPO: the group replaces the value network as the baseline source.

IN: N rollouts from policy π_θ.

OUT: Verifier rewards for each rollout.
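Group sampling with a rule-based verifier can be sketched as follows. Everything here is hypothetical and for illustration only: `toy_policy` stands in for π_θ, the `<answer>` tag format is an assumed convention, and the 0.1 / 1.0 reward weights are arbitrary.

```python
import random
import re

def verifier_reward(completion, gold_answer):
    """Rule-based reward: 0.1 for a well-formed answer, plus 1.0 if it is correct."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0                      # no parsable answer: no reward
    reward = 0.1                        # format reward (illustrative weight)
    if match.group(1).strip() == gold_answer:
        reward += 1.0                   # correctness reward
    return reward

def sample_group(policy, prompt, n):
    """Draw n independent rollouts for the same prompt."""
    return [policy(prompt) for _ in range(n)]

def toy_policy(prompt):
    """Hypothetical stand-in for the policy: random right, wrong or unformatted output."""
    return random.choice(["<answer>4</answer>", "<answer>5</answer>", "The answer is 4"])

random.seed(0)
group = sample_group(toy_policy, "What is 2 + 2?", n=8)
rewards = [verifier_reward(y, gold_answer="4") for y in group]
print(list(zip(group, rewards)))
```

The list `rewards` is the only input the advantage computation needs; no learned reward or value network is involved.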

Group-relative advantage · Advantage estimation without a critic

Normalised within-group relative reward of a rollout. Replaces A = r - V(s) from PPO without a value model. The core GRPO contribution.

(r - mean)/std: Vanilla GRPO.

(r - mean) per-token: Dr.GRPO, which removes the length/std bias.
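Both advantage variants can be written in a few lines. This is a sketch under assumptions: the function name and the `normalise_std` flag are illustrative, the population standard deviation is used (some implementations use the sample standard deviation instead), and a small `eps` is added to the denominator.

```python
import statistics

def group_advantages(rewards, normalise_std=True, eps=1e-6):
    """Advantage of each rollout relative to its own group.

    normalise_std=True  -> (r - mean) / (std + eps), the vanilla form
    normalise_std=False -> (r - mean), the form without std division
    """
    mean = statistics.fmean(rewards)
    centred = [r - mean for r in rewards]
    if not normalise_std:
        return centred
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [c / (std + eps) for c in centred]

rewards = [1.0, 0.0, 0.0, 1.0]        # two correct and two incorrect rollouts
print(group_advantages(rewards))                       # close to [1, -1, -1, 1]
print(group_advantages(rewards, normalise_std=False))  # [0.5, -0.5, -0.5, 0.5]
```

In both forms the advantages within a group sum to zero, so the update pushes probability towards the better half of the group and away from the worse half.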


Clipped surrogate loss + KL · Policy update with stabilisation

Objective (maximised): min(ρ·Â, clip(ρ, 1-ε, 1+ε)·Â) minus a KL penalty towards the reference policy. Inherits PPO's stability without its value model.


Implementation

Reference implementations

Hugging Face TRL — GRPOTrainer

Python (PyTorch) · Hugging Face

volcengine/verl (GRPO/PPO/DAPO)

Python (Ray) · ByteDance Volcengine

deepseek-ai/DeepSeek-Math (original GRPO)

Python · DeepSeek-AI (GRPO authors) · Official

OpenRLHF (scalable GRPO)

Python (Ray + DeepSpeed) · OpenRLHF community

Implementation pitfalls

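std≈0 for too easy/hard prompts → advantage explosion · High

When all N rollouts get the same reward (all correct or all wrong), r - mean is 0 for every rollout and std is 0, so (r - mean)/std becomes 0/0: undefined without an epsilon in the denominator, and a zero gradient with one. When rewards differ only slightly, the tiny std inflates the advantages instead. Dr.GRPO additionally identifies the std division as a bias of the original GRPO, because prompts with low reward variance are up-weighted.

Fix: Filter prompts with zero reward variance (dynamic sampling, DAPO); add epsilon to std; or use Dr.GRPO without std division.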
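The first two fixes can be sketched as below. The function names and the `batch` dictionary are hypothetical, and the filter only drops uninformative groups; full dynamic sampling as described for DAPO would also keep sampling new prompts until the batch is refilled.

```python
import statistics

def drop_zero_variance_groups(reward_groups, tol=1e-8):
    """Keep only prompts whose rollouts received differing rewards."""
    return {p: r for p, r in reward_groups.items() if max(r) - min(r) > tol}

def safe_advantages(rewards, eps=1e-6):
    """(r - mean) / (std + eps): stays finite even when all rewards are identical."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

batch = {
    "easy prompt": [1.0, 1.0, 1.0, 1.0],    # all correct: no learning signal
    "hard prompt": [0.0, 0.0, 0.0, 0.0],    # all wrong: no learning signal
    "useful prompt": [1.0, 0.0, 0.0, 1.0],  # mixed: informative
}
kept = drop_zero_variance_groups(batch)
print(list(kept))                             # ['useful prompt']
print(safe_advantages(batch["easy prompt"]))  # [0.0, 0.0, 0.0, 0.0]
```

With the epsilon in place, a uniform group yields all-zero advantages rather than NaN, so it contributes no gradient; filtering avoids spending update compute on such groups at all.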

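Length bias: model learns longer answers · Medium

Dividing each rollout's summed token loss by its own length (per-sequence normalisation) gives every token in a long rollout less weight than a token in a short one. A long incorrect answer is therefore penalised less per token than a short incorrect one, so incorrect answers tend to grow longer, leading to uncontrolled CoT bloat. A known vanilla-GRPO artifact.

Fix: Aggregate the loss per token across the group instead of per sequence (token-level loss in DAPO; Dr.GRPO drops the per-sequence length normalisation); or add an explicit length penalty to the reward.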
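To make the length dependence concrete, the sketch below computes the weight that a single token's loss term receives under three aggregation schemes. The mode names and the `max_tokens` constant are illustrative labels chosen for this example, not identifiers from any library.

```python
def per_token_weights(lengths, mode, max_tokens=None):
    """Weight given to each token's loss term, per rollout, for one group."""
    g = len(lengths)
    if mode == "sequence_mean":   # average over tokens, then over rollouts
        return [1.0 / (g * n) for n in lengths]
    if mode == "token_mean":      # average over all tokens in the group
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    if mode == "constant":        # fixed normaliser, independent of rollout length
        return [1.0 / (g * max_tokens) for _ in lengths]
    raise ValueError(f"unknown mode: {mode}")

lengths = [10, 100]  # a short and a long rollout in the same group
print(per_token_weights(lengths, "sequence_mean"))             # [0.05, 0.005]
print(per_token_weights(lengths, "token_mean"))
print(per_token_weights(lengths, "constant", max_tokens=100))  # [0.005, 0.005]
```

Under `sequence_mean`, a token in the 100-token rollout carries ten times less weight than a token in the 10-token rollout; the other two schemes weight every token equally regardless of which rollout it belongs to.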

Off-policy importance-ratio lag · Medium

If many updates are done on the same rollout batch, ρ = π_θ/π_old drifts away from 1, clipping is no longer sufficient, and training can diverge.

Fix: Limit the number of PPO epochs per rollout batch (usually 1); sample fresh rollouts for each update.
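One way to notice this drift early is to log how far the ratios have moved and how many fall outside the clip range. The helper below is a hypothetical diagnostic, assuming log-probabilities under the current and sampling policies are available; the metric names are illustrative.

```python
import math

def ratio_diagnostics(logp_new, logp_old, eps=0.2):
    """Mean importance ratio and fraction of samples outside [1-eps, 1+eps]."""
    ratios = [math.exp(a - b) for a, b in zip(logp_new, logp_old)]
    clipped = sum(1 for r in ratios if r < 1.0 - eps or r > 1.0 + eps)
    return {
        "mean_ratio": sum(ratios) / len(ratios),
        "clip_fraction": clipped / len(ratios),
    }

# First update on fresh rollouts: the policies coincide, nothing is clipped.
print(ratio_diagnostics([-1.0, -2.0, -3.0], [-1.0, -2.0, -3.0]))
# After several updates on the same batch the ratios spread out.
print(ratio_diagnostics([-0.5, -2.0, -3.5], [-1.0, -2.0, -3.0]))
```

A clip fraction that keeps rising across epochs on the same batch suggests the batch has been reused too often and fresh rollouts are needed.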

Evolution

Original paper · 2024 · arXiv:2402.03300 (DeepSeek-AI, 2024) · Zhihong Shao

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

2017

PPO (Schulman et al., OpenAI)

Proximal Policy Optimization: a clipped surrogate objective with a value model as baseline. GRPO later removes this critic.

PPO (concept)

2022

RLHF / InstructGPT — PPO in LLM alignment

PPO with 4 networks (policy, reference, reward, value) becomes the RLHF standard — memory-costly and hard to stabilise.

RLHF (concept)

2024

GRPO — introduced in DeepSeekMath

Inflection point

Shao et al. (DeepSeek) publish DeepSeekMath (arXiv:2402.03300) and with it GRPO, which removes the value model and takes the baseline from a group of rollouts. First application: math reasoning on DeepSeekMath 7B.

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (paper)

2025

DeepSeek-R1 — GRPO as the core of reasoning RL

Inflection point

DeepSeek-R1 (January 2025) uses GRPO at full 671B MoE scale to train long-CoT reasoning. GRPO becomes the de-facto open-source reasoning RL standard.

Reasoning RL (concept)

2025

Variants: DAPO, Dr.GRPO, GRPO++

ByteDance DAPO (clip-higher, dynamic sampling, no KL), Dr.GRPO (removes std and length normalisation bias), GRPO++ — a wave of improvements addressing the original GRPO's instabilities and biases.

2025

Integration into TRL, verl, OpenRLHF

GRPOTrainer in Hugging Face TRL, native support in verl (ByteDance) and OpenRLHF make GRPO a one-line recipe for reasoning RL.
