Reasoning RL

2024 · Active · Published: 10 June 2026 · Updated: 10 June 2026

Key innovation

LLM training via reinforcement learning on tasks with verifiable rewards (math correctness, code execution, format matching) instead of a trained reward model as in RLHF. This lets long, self-correcting reasoning chains (long chain-of-thought) emerge during training and removes the need for manual preference labels.

How it works

The pipeline has four components.

(1) Verifier: a function R(x, y) that assigns rewards without a learned model. For math it compares the final answer with ground truth (`\boxed{}` extraction, LaTeX equality); for code it runs tests (`pytest`); for logic it calls symbolic solvers; for format it regex-checks the `<think>...</think><answer>...</answer>` structure. Most often R is a composition of several terms: `R = α·R_correct + β·R_format`.

(2) Sampler: the model π_θ generates N (typically 8–64) independent rollouts per prompt at temperature 0.6–1.0; each rollout may have a different long-CoT and a different answer.

(3) Optimizer: an RL algorithm that updates the policy. GRPO (Group Relative Policy Optimization, DeepSeek) computes, within each group of N rollouts, the relative reward Â_i = (R_i - mean(R)) / std(R) and uses it as the advantage, so no value model is needed. The loss is -E[min(ρ·Â, clip(ρ, 1-ε, 1+ε)·Â)] with ρ = π_θ/π_old, plus a KL penalty to π_ref. PPO and REINFORCE++ are alternatives.

(4) Iteration: sample fresh rollouts from the current policy and apply another update. During training, CoT length grows organically (from ~500 to 5,000–10,000 tokens), and the model learns self-reflection without explicit instructions: the famous "aha moments" from DeepSeek-R1-Zero.
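
The group-relative advantage Â_i = (R_i - mean(R)) / std(R) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name, the small `eps` added to avoid division by zero, and the choice of population standard deviation are assumptions (real implementations differ on these details).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one prompt's rollout group.

    Returns (R_i - mean(R)) / (std(R) + eps) for each rollout.
    `eps` is an assumed guard against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the N rollouts
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four rollouts with binary rewards: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout got the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct rollouts end up with a positive advantage and wrong ones with a negative advantage, while a group with identical rewards yields all zeros and therefore no gradient.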

Problem solved

RLHF and DPO teach LLMs to match human preferences, but they are limited by label quality and do not scale to tasks where humans cannot judge correctness (advanced mathematics, formal proofs, complex code debugging). Yet classical math and programming tasks have a natural verifier (`==`, `pytest`) that provides a 0/1 reward without human labels. Reasoning RL exploits this asymmetry: the model attempts a task with many rollouts, each rollout receives a rule-based reward, and the policy is updated with an RL algorithm (GRPO/PPO/REINFORCE++). During exploration, long-CoT emerges with self-reflection moments ("wait, let me reconsider"), yielding reasoning quality that SFT and RLHF alone had not reached.

Components

Verifier (rule-based reward function)

Reward-model-free training signal

External function that scores the correctness of answer y for prompt x. It has no learned parameters: math → equality check, code → pytest, logic → solver, format → regex. Most often it is a composition: R = α·R_correct + β·R_format + γ·R_length.

IN: Task prompt and full model rollout.

OUT: Scalar reward, typically binary or a combination of binary components.

Math equality verifier: compares the boxed answer with ground truth (LaTeX/SymPy).

Code execution verifier: pytest / unit tests in a sandbox.

Format regex: checks for the presence of `<think>...</think><answer>...</answer>`.

Proof assistant (Lean/Coq): formal proof verification.
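
A composed rule-based reward R = α·R_correct + β·R_format might look like the sketch below. The function names and the weights `alpha=1.0`, `beta=0.1` are illustrative assumptions, and the exact string comparison is a deliberate simplification of the LaTeX/SymPy equality check described above (it also ignores nested braces inside `\boxed{}`).

```python
import re

# Whole rollout must be <think>...</think> followed by <answer>...</answer>.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)
# Captures the content of \boxed{...} (no nested braces in this sketch).
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(rollout: str) -> float:
    """1.0 if the rollout follows the think/answer structure, else 0.0."""
    return 1.0 if FORMAT_RE.match(rollout.strip()) else 0.0


def correctness_reward(rollout: str, ground_truth: str) -> float:
    """1.0 if the last boxed answer equals the ground truth as a string."""
    boxed = BOXED_RE.findall(rollout)
    if not boxed:
        return 0.0
    return 1.0 if boxed[-1].strip() == ground_truth.strip() else 0.0


def reward(rollout: str, ground_truth: str, alpha: float = 1.0, beta: float = 0.1) -> float:
    """Weighted sum of correctness and format terms (weights are assumptions)."""
    return alpha * correctness_reward(rollout, ground_truth) + beta * format_reward(rollout)


good = r"<think>2 + 2 = 4</think><answer>\boxed{4}</answer>"
bare = r"\boxed{4}"  # correct answer, but no reasoning structure
```

Here `good` earns both terms, while `bare` earns only the correctness term, which is the gap a format reward is meant to close.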

Sampler (rollout generator)

Answer-space exploration

Engine generating N (typically 8–64) independent rollouts per prompt from policy π_θ. Requires a high-throughput inference engine (vLLM, SGLang) with FlashAttention/PagedAttention for efficient batched generation.
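
To make the role of N and temperature concrete, here is a toy sampler written with the standard library only. It is a sequential sketch under stated assumptions, not how vLLM or SGLang work internally: `next_logits` is a hypothetical callable standing in for the policy, and tokens are plain integers.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_group(next_logits, prompt, n, temperature, max_tokens, eos, seed=0):
    """Generate n independent rollouts for one prompt.

    `next_logits(tokens)` is a hypothetical stand-in for the policy pi_theta.
    """
    rng = random.Random(seed)
    rollouts = []
    for _ in range(n):
        tokens = list(prompt)
        for _step in range(max_tokens):
            tok = sample_token(next_logits(tokens), temperature, rng)
            tokens.append(tok)
            if tok == eos:
                break
        rollouts.append(tokens[len(prompt):])
    return rollouts


# Toy policy over a 3-token vocabulary; token 2 acts as end-of-sequence.
def toy_policy(tokens):
    return [2.0, 1.0, 0.0]


rollouts = sample_group(toy_policy, prompt=[0], n=8, temperature=0.8,
                        max_tokens=16, eos=2)
```

Lower temperatures sharpen the distribution and make the N rollouts more alike; higher temperatures spread probability mass and increase diversity within the group.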

GRPO / PPO / REINFORCE++ optimizer

Steering the policy towards higher rewards

The algorithm that updates the policy. GRPO (DeepSeek): advantage computed relative to the rollout group for each prompt, no value model. PPO: the classic choice, requires a value model. REINFORCE++: REINFORCE with a normalised baseline.

GRPO (DeepSeek): Group Relative Policy Optimization, value-model-free.

PPO: Proximal Policy Optimization, requires a value model.

REINFORCE++: uses reward normalisation as the baseline.

DAPO (ByteDance): Decoupled Clip and Dynamic sAmpling Policy Optimization, more stable.
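
The policy update shared by PPO and GRPO can be illustrated with a PPO-style clipped surrogate. The sketch below works at sequence level on plain floats for readability; practical implementations typically apply the ratio per token on tensors, and `clip_eps=0.2` is an assumed default rather than a value from the source.

```python
import math


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate, averaged over the rollouts in a group.

    logp_new / logp_old: log-probabilities of each rollout under the
    current policy and under the policy that generated it.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# On-policy case: ratio is 1, so the loss is minus the mean advantage.
on_policy = clipped_surrogate_loss([-3.0, -4.0], [-3.0, -4.0], [1.0, -1.0])

# Ratio e ~ 2.72 with positive advantage: the gain is capped at 1 + clip_eps.
capped = clipped_surrogate_loss([0.0], [-1.0], [1.0])
```

Clipping limits how much a single batch can move the policy in the direction of a favourable rollout, while an unfavourable rollout whose probability has risen is still penalised in full.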

Reference policy π_ref + KL penalty

Policy stabiliser, prevents catastrophic forgetting

Frozen SFT model used as anchor — KL(π_θ || π_ref) penalises policy drift, protects base capabilities. Some variants (DAPO, SimPO-style) drop π_ref entirely.
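
One common way to estimate the KL term from sampled tokens is the non-negative estimator exp(d) - d - 1 with d = log π_ref - log π_θ, which is the form often seen in GRPO implementations. The sketch below assumes per-token log-probabilities are already available; the function names and `beta=0.001` are illustrative assumptions.

```python
import math


def kl_estimate(logp_policy, logp_ref):
    """Mean per-token estimate of KL(pi_theta || pi_ref) on sampled tokens."""
    terms = []
    for lp, lr in zip(logp_policy, logp_ref):
        d = lr - lp  # log(pi_ref / pi_theta) for the sampled token
        terms.append(math.exp(d) - d - 1.0)  # always >= 0
    return sum(terms) / len(terms)


def penalised_objective(surrogate, logp_policy, logp_ref, beta=0.001):
    """Surrogate objective minus the KL penalty (beta is an assumed weight)."""
    return surrogate - beta * kl_estimate(logp_policy, logp_ref)


same = kl_estimate([-1.0, -2.0], [-1.0, -2.0])      # policy equals reference
drifted = kl_estimate([-1.0, -2.0], [-2.0, -1.0])   # policy has moved away
```

The estimate is zero when the policy matches the reference on the sampled tokens and grows as the two diverge, which is what anchors the policy to the frozen SFT model.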

Implementation

Reference implementations

deepseek-ai/DeepSeek-R1 (official weights + paper)

Python · DeepSeek-AI

Official

huggingface/open-r1 (open R1 reproduction)

Python (PyTorch) · Hugging Face

volcengine/verl (LLM RL framework, GRPO/PPO)

Python · ByteDance Volcengine

OpenRLHF (scalable RLHF/Reasoning RL)

Python (Ray + DeepSpeed) · OpenRLHF community

Hugging Face TRL — GRPOTrainer

Python (PyTorch) · Hugging Face

Jiayi-Pan/TinyZero (R1-Zero for $30)

Python · Jiayi Pan (UC Berkeley)

Implementation pitfalls

Reward hacking: model exploits a verifier hole

Critical

The most serious problem in reasoning RL. Examples: the model writes only `\boxed{42}` without any reasoning (when there is no format reward), copies the expected answer out of a unit test (`assert answer == ...`), or emits short, noisy answers that occasionally happen to be correct. Training appears to succeed (reward grows), but the model is useless.

Fix: Compose multiple reward terms (correct + format + length); rigorously sandbox code verifiers; manually audit early-iteration rollouts; evaluate on a held-out benchmark distinct from the training set.

Mode collapse: all rollouts identical

High

An overly aggressive policy update without a KL penalty, combined with a low sampling temperature, makes the model produce N copies of the same rollout. Every rollout then receives the same reward, so all advantages are 0, the gradient is 0, and training stalls. It manifests as no benchmark progress despite a "normal" loss.

Fix: Monitor rollout variance (std(R) per group); keep temperature at 0.6–1.0; use a KL penalty coefficient ≥ 0.001 in early iterations.
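
Monitoring std(R) per group can be as simple as tracking the share of groups with no reward variance. This is a minimal sketch; the function name and tolerance are assumptions, and the metric by itself cannot tell true mode collapse apart from prompts that are simply too easy or too hard (all rollouts correct or all wrong).

```python
from statistics import pstdev


def dead_group_fraction(group_rewards, tol=1e-8):
    """Fraction of rollout groups whose rewards have (near-)zero std.

    Such groups yield zero advantages and contribute no gradient.
    """
    dead = sum(1 for rewards in group_rewards if pstdev(rewards) <= tol)
    return dead / len(group_rewards)


batch = [
    [1.0, 1.0, 1.0, 1.0],  # all correct: no signal
    [0.0, 1.0, 0.0, 1.0],  # mixed: useful signal
    [0.0, 0.0, 0.0, 0.0],  # all wrong: no signal
]
fraction = dead_group_fraction(batch)
```

A steadily rising fraction across training steps is a cue to inspect the rollouts for duplicates and to revisit temperature and KL settings.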

CoT length budget too small → training cuts thinking

High

If max_tokens is set smaller than the natural CoT length, the model learns to be artificially terse and loses reasoning quality. Symptom: quality positively correlates with the allowed budget.

Fix: Start with 8k tokens, monitor mean rollout length, raise the limit when it approaches max.
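
The monitoring step can be sketched as a small helper that tracks mean length and the share of rollouts that hit the limit, then proposes a larger budget. All thresholds here (`trunc_threshold`, `growth`, `cap`) are hypothetical values chosen for illustration, not recommendations from the source.

```python
def length_stats(rollout_lengths, max_tokens):
    """Return (mean length, fraction of rollouts that hit max_tokens)."""
    n = len(rollout_lengths)
    mean_len = sum(rollout_lengths) / n
    truncated = sum(1 for length in rollout_lengths if length >= max_tokens) / n
    return mean_len, truncated


def next_budget(rollout_lengths, max_tokens, trunc_threshold=0.1, growth=2, cap=32768):
    """Raise the token budget when too many rollouts are being cut off."""
    _, truncated = length_stats(rollout_lengths, max_tokens)
    if truncated > trunc_threshold:
        return min(max_tokens * growth, cap)
    return max_tokens


# Half of the rollouts hit the 8k limit, so the budget is doubled.
crowded = next_budget([8192, 8192, 4000, 3000], max_tokens=8192)
# No rollout comes close to the limit, so the budget stays as it is.
relaxed = next_budget([1000, 2000], max_tokens=8192)
```

Tracking the truncated fraction alongside the mean is useful because a moderate mean can hide a long tail of rollouts that are being cut mid-thought.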

Sampler ↔ trainer asynchrony causes stale data

Medium

In production pipelines (separate vLLM sampler cluster, separate trainer) rollouts may come from a policy lagging by tens of updates. With too large a lag, the importance ratio π_θ/π_old explodes and the loss diverges.

Fix: Off-policy correction (clip importance ratio, IS reweighting); regularly sync sampler-trainer weights; bound max lag (e.g. ≤ 5 updates).
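
Bounding the lag and capping the importance weight might be sketched as follows. The rollout record layout (a dict with a `policy_step` field), the function names, and the cap value are assumptions made for this example.

```python
import math


def filter_stale(rollouts, trainer_step, max_lag=5):
    """Keep only rollouts generated at most `max_lag` updates ago."""
    return [r for r in rollouts if trainer_step - r["policy_step"] <= max_lag]


def truncated_is_weight(logp_current, logp_behaviour, cap=2.0):
    """Importance weight pi_theta / pi_old, truncated at `cap`."""
    return min(math.exp(logp_current - logp_behaviour), cap)


buffer = [
    {"id": "a", "policy_step": 100},
    {"id": "b", "policy_step": 97},
    {"id": "c", "policy_step": 90},  # 10 updates behind: dropped
]
fresh = filter_stale(buffer, trainer_step=100)

unchanged = truncated_is_weight(-2.0, -2.0)  # same policy: weight 1
exploded = truncated_is_weight(-1.0, -4.0)   # e^3 ~ 20, truncated to the cap
```

Filtering removes data that is too old to correct, while truncation keeps moderately stale rollouts usable without letting a single large ratio dominate the update.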

No reward-domain diversification → narrow model

Medium

Training only on math tasks yields a great mathematician who fails everywhere else. Reasoning RL works best on a mix of domains (math + code + logic + general QA with LLM-as-judge).

Fix: Mix domains in mini-batches; add a small fraction of RLHF preference data (LLM-as-judge) for general quality.
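
Mixing domains within each mini-batch can be done with weighted sampling over per-domain prompt pools. The pool contents, domain names, and weights below are placeholders for illustration only.

```python
import random


def mixed_batch(pools, weights, batch_size, seed=0):
    """Draw a mini-batch of (domain, prompt) pairs across domains.

    pools:   dict mapping domain name -> list of prompts
    weights: dict mapping domain name -> sampling weight
    """
    rng = random.Random(seed)
    domains = list(pools)
    picks = rng.choices(domains, weights=[weights[d] for d in domains], k=batch_size)
    return [(d, rng.choice(pools[d])) for d in picks]


pools = {
    "math": ["m1", "m2", "m3"],
    "code": ["c1", "c2"],
    "logic": ["l1"],
    "general_qa": ["q1", "q2"],
}
weights = {"math": 0.4, "code": 0.3, "logic": 0.2, "general_qa": 0.1}
batch = mixed_batch(pools, weights, batch_size=16)
```

Carrying the domain label with each prompt lets the trainer route every rollout to the matching verifier (equality check, test runner, solver, or judge).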

Evolution

Original paper · 2025 · arXiv:2501.12948 (DeepSeek-AI, 2025) · DeepSeek-AI Team

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI Team, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu

2022

Chain-of-Thought (Wei et al., Google)

CoT prompting shows LLMs solve much harder problems when thinking "step by step". Opens the path to training models that think long instead of short.

CoT (concept)

2024

OpenAI o1 — first production deployment of reasoning RL

Inflection point

September 2024: OpenAI releases o1-preview, the first publicly available LLM trained on long-CoT via RL. Reported quality jumps on AIME (12% → 74%) and Codeforces (1,258 → 1,891 Elo). No algorithmic details published.

2025

DeepSeek-R1 / R1-Zero — open source with open algorithm

Inflection point

January 2025: DeepSeek publishes R1 and R1-Zero with a description of GRPO and the full RL pipeline on a 671B MoE. R1-Zero shows that RL works even without an SFT cold start. Open weights and an open method democratise reasoning RL: within weeks reproductions appear (TinyZero, Hugging Face Open-R1).

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (paper)

2025

Kimi k1.5 — parallel Moonshot AI publication

January 2025: Moonshot AI publishes Kimi k1.5, an independently developed reasoning RL pipeline that uses a value-model-free policy optimizer (a variant of online mirror descent) and a length penalty. Confirms the method is reproducible outside DeepSeek.

2025

Qwen QwQ-32B, OpenAI o3, Anthropic extended thinking

Spring 2025: Alibaba releases QwQ-32B (open-source, GRPO). OpenAI ships o3 with larger-scale reasoning RL. Anthropic adds "extended thinking" to Claude 3.7 / 4. Reasoning RL becomes mainstream.

2025

DAPO, Reinforce-Lite, GRPO++ — wave of algorithm variants

ByteDance publishes DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) with better stability. The community introduces variants without π_ref, with token-level rewards, with asynchronous rollouts. Reasoning RL stops being a GRPO monolith and becomes a research field.
