The most important element is the design of the reward function. The best results come from composing R_correct (binary) + R_format (binary) + optionally R_length (a penalty for overly long outputs). A weak or noisy verifier invites reward hacking: the model learns to exploit a hole instead of reasoning.
Optimizer choice. GRPO (DeepSeek) — no value model, uses rollout group as baseline, much cheaper than PPO. PPO — RLHF classic, allegedly used in o1. REINFORCE++ — simplified variant with normalisation.
Number of independent attempts per prompt. In GRPO, N=8–64 is typical; smaller groups give high advantage variance, while larger groups cost more compute with diminishing returns.
KL-penalty strength against the reference policy (usually the SFT model). Too low = policy drifts, generations become gibberish. Too high = no exploration, no long-CoT emergence.
Maximum rollout length in tokens. Determines sampling compute and the model's ceiling. CoT length grows during RL — leave headroom.
Whether the model starts from a plain base/SFT checkpoint or from a dedicated cold-start CoT checkpoint. DeepSeek-R1-Zero showed that RL without SFT works, but production R1 runs a short SFT on thousands of curated CoT examples before RL.
Time complexity: O(N · L · |θ|) sampling + O(N · L · |θ|) training per step (N rollouts, L = CoT length). Space complexity: O(2 · |θ|) parameters + O(N · L · d) rollout activations.
Sampling N=16–64 rollouts of L=8k–32k tokens dominates cost: roughly 0.1–2M inference tokens per prompt, multiplied by the number of prompts in each training step. That is why production pipelines typically keep a dedicated inference cluster (for example vLLM) just for sampling.
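The token budget quoted above is simply N · L per prompt. A minimal sketch of the arithmetic follows; the function names and the `n_prompts` batch size are hypothetical, chosen only for illustration.

```python
def tokens_per_prompt(n_rollouts: int, max_len: int) -> int:
    """Upper bound on sampled tokens for one prompt: N rollouts of at most L tokens."""
    return n_rollouts * max_len


def tokens_per_step(n_prompts: int, n_rollouts: int, max_len: int) -> int:
    """Upper bound for a whole training step (n_prompts is an assumed batch size)."""
    return n_prompts * tokens_per_prompt(n_rollouts, max_len)


if __name__ == "__main__":
    print(tokens_per_prompt(16, 8_000))   # 128000
    print(tokens_per_prompt(64, 32_000))  # 2048000
```

These are upper bounds: rollouts that finish before reaching L tokens cost less.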
Reasoning RL can be combined with MoE (DeepSeek-R1 = MoE 671B/37B active) — in which case model-level activation is sparse, but the training paradigm itself remains dense.
No learned routing in Reasoning RL itself. The conditional mode reflects that CoT length depends on prompt difficulty (the model "decides" at inference how much thinking to allocate).
Production scaling (DeepSeek-R1, o1) requires separate clusters for sampling (vLLM, ~50% of hardware) and training (Megatron/FSDP, the remaining ~50%) with asynchronous rollout passing.
Reasoning RL is dominated by long-CoT sampling (vLLM/SGLang with FlashAttention/PagedAttention) and training (FSDP/Megatron). Both stages are heavily GPU-bound.
Possible but non-standard — most RL frameworks (verl, OpenRLHF, TRL) have a CUDA-first path. Google DeepMind uses TPUs in Gemini Thinking.
Algorithmically yes — implementation-wise reasoning RL requires high memory bandwidth for long rollouts; CPU is infeasible for models > 1B.
The pipeline has four components. (1) Verifier: a function R(x, y) that assigns rewards without a learned model. For math it compares the final answer with the ground truth (`\boxed{}`, LaTeX equality), for code it runs tests (`pytest`), for logic it calls symbolic solvers, and for format it uses a regex that checks the `<think>...</think><answer>...</answer>` structure. Most often it is a composition of several rewards: `R = α·R_correct + β·R_format`. (2) Sampler: the model π_θ generates N (typically 8–64) independent rollouts per prompt at temperature 0.6–1.0; each rollout may contain a different long CoT and a different answer. (3) Optimizer: an RL algorithm that updates the policy. GRPO (Group Relative Policy Optimization, DeepSeek) computes, within a group of N rollouts, the relative reward Â_i = (R_i - mean(R)) / std(R) as the advantage; it needs no value model; the loss is -E[min(ρ·Â, clip(ρ, 1-ε, 1+ε)·Â)] with ρ = π_θ/π_old, plus a KL penalty to π_ref. PPO and REINFORCE++ are alternatives. (4) Iteration: resample fresh rollouts from the current policy and apply another update. During training, CoT length grows organically (from ~500 to 5,000–10,000 tokens), and the model learns self-reflection without explicit instructions, the famous "aha moments" from DeepSeek-R1-Zero.
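For reference, the GRPO update described in component (3) can be written out as the objective below. This is a sketch based on the published GRPO formulation; the clipping range ε and the KL weight β are hyperparameters that the text above does not specify, and the loss that is minimised is -J(θ).

```latex
\begin{aligned}
\hat{A}_i &= \frac{R_i - \operatorname{mean}(R_1,\dots,R_N)}{\operatorname{std}(R_1,\dots,R_N)},
\qquad
\rho_{i,t} = \frac{\pi_\theta(y_{i,t}\mid x,\, y_{i,<t})}{\pi_{\text{old}}(y_{i,t}\mid x,\, y_{i,<t})}, \\
J(\theta) &= \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}
\Big(\min\big(\rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)
- \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]\Big)\right]
\end{aligned}
```

Every token of rollout i shares the same advantage Â_i, which is why no per-token value model is needed.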
RLHF and DPO teach LLMs to match human preferences, but they are limited by label quality and do not scale to tasks where humans cannot judge correctness (advanced mathematics, formal proofs, complex code debugging). Yet classical math and programming tasks have a natural verifier (`==`, `pytest`) that provides a 0/1 reward without human labels. Reasoning RL exploits this asymmetry: the model solves a task with many rollouts, receives a rule-based reward, and the policy is updated with an RL algorithm (GRPO/PPO/REINFORCE++). During exploration, long-CoT emerges with self-reflection moments ("wait, let me reconsider"), yielding reasoning quality unreachable by SFT/RLHF.
External function evaluating answer correctness y for prompt x. No learned parameters: math → equality check, code → pytest, logic → solver, format → regex. Most often a composition: R = α·R_correct + β·R_format + γ·R_length.
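A minimal sketch of such a composite reward for math tasks, assuming the answer is wrapped in `\boxed{}` and the completion follows the `<think>...</think><answer>...</answer>` layout. The weights, the linear length penalty, and all function names are illustrative assumptions, not values taken from any published pipeline.

```python
import re

# Format: one <think> block followed by one <answer> block (a loose check).
FORMAT_RE = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)
# Answer: contents of \boxed{...} without nested braces (a simplifying assumption).
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def r_format(completion: str) -> float:
    """1.0 if the whole completion matches the expected tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(completion.strip()) else 0.0


def r_correct(completion: str, ground_truth: str) -> float:
    """1.0 if the last \\boxed{...} answer equals the ground truth as a string."""
    answers = BOXED_RE.findall(completion)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == ground_truth.strip() else 0.0


def r_length(n_tokens: int, budget: int) -> float:
    """0.0 within budget, then a linear penalty capped at -1.0 (assumed shape)."""
    if n_tokens <= budget:
        return 0.0
    return -min(1.0, (n_tokens - budget) / budget)


def reward(completion: str, ground_truth: str, n_tokens: int, budget: int,
           alpha: float = 1.0, beta: float = 0.2, gamma: float = 0.1) -> float:
    """R = alpha * R_correct + beta * R_format + gamma * R_length."""
    return (alpha * r_correct(completion, ground_truth)
            + beta * r_format(completion)
            + gamma * r_length(n_tokens, budget))
```

String equality is the weakest possible correctness check; a real verifier would likely normalise the expression (for example `1/2` versus `0.5`) before comparing.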
Engine generating N (typically 8–64) independent rollouts per prompt from policy π_θ. Requires a high-throughput inference engine (vLLM, SGLang) with FlashAttention/PagedAttention for efficient batched generation.
Algorithm updating the policy. GRPO (DeepSeek): relative advantage within rollout group per prompt, no value model. PPO: classic, requires a value model. REINFORCE++ with normalised baseline.
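The group-relative advantage that replaces the value model in GRPO can be sketched in a few lines. The population standard deviation and the small `eps` in the denominator are assumptions made here for numerical safety; implementations differ on both.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise the rewards of one prompt's rollouts: (R_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumed choice)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One correct rollout out of four gets a positive advantage, the rest negative.
    print(group_advantages([1.0, 0.0, 0.0, 0.0]))
    # Identical rewards carry no learning signal: every advantage is zero.
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second case shows why a group in which every rollout succeeds or every rollout fails contributes nothing to the gradient.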
Frozen SFT model used as anchor — KL(π_θ || π_ref) penalises policy drift, protects base capabilities. Some variants (DAPO, SimPO-style) drop π_ref entirely.
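The KL term is usually estimated per token from the log-probabilities that the policy and the reference assign to the sampled token. The sketch below uses the non-negative estimator ratio - log(ratio) - 1 with ratio = π_ref/π_θ, which is the form given in the GRPO formulation; the function names and the default `beta` are illustrative assumptions.

```python
import math


def kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-token KL estimate: r - log(r) - 1 with r = pi_ref / pi_theta (always >= 0)."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


def penalised_objective(surrogate: float, logps_theta: list[float],
                        logps_ref: list[float], beta: float = 0.04) -> float:
    """Subtract beta times the mean per-token KL estimate from the surrogate."""
    kls = [kl_estimate(a, b) for a, b in zip(logps_theta, logps_ref)]
    return surrogate - beta * sum(kls) / len(kls)
```

When the policy and the reference agree on a token, the estimate is exactly zero, so the penalty only bites where the policy has drifted.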
CoT prompting shows LLMs solve much harder problems when thinking "step by step". Opens the path to training models that think long instead of short.
September 2024: OpenAI releases o1-preview. The first publicly available LLM trained on long-CoT via RL. Reported quality jumps on AIME (12% → 74%) and Codeforces (1,258 → 1,891 Elo). No algorithmic details published.
January 2025: DeepSeek publishes R1 and R1-Zero with a description of GRPO and the full RL pipeline on a 671B MoE. R1-Zero shows that RL works even WITHOUT SFT cold-start. Open weights and an open method democratise reasoning RL: within weeks reproductions appear (TinyZero, Hugging Face Open-R1).
January 2025: Moonshot AI publishes Kimi k1.5, an independently developed reasoning RL pipeline that uses a policy-optimisation variant based on online mirror descent, without a value model, plus a length penalty. Confirms the method is reproducible outside DeepSeek.
Spring 2025: Alibaba releases QwQ-32B (open-source, GRPO). OpenAI ships o3 with larger-scale reasoning RL. Anthropic adds "extended thinking" to Claude 3.7 / 4. Reasoning RL becomes mainstream.
ByteDance publishes DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) with better stability. The community introduces variants without π_ref, with token-level rewards, with asynchronous rollouts. Reasoning RL stops being a GRPO monolith and becomes a research field.
The most serious problem in reasoning RL. Examples: the model only writes `\boxed{42}` without reasoning (if no format reward), steals the answer from a unit test (`assert answer == ...`), or generates short noisy answers that happen to land. Training "succeeds" (reward grows), but the model is useless.
An overly aggressive KL-free objective combined with a low sampling temperature makes the policy collapse: the model produces N copies of the same rollout, so every advantage is 0, the gradient is 0, and training stalls. It manifests as no benchmark progress despite a "normal"-looking loss.
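One way to catch this failure early is to track how many prompt groups carry no learning signal at all. The sketch below is a hypothetical monitoring helper, not part of any specific framework; the tolerance is an assumption.

```python
def zero_signal_fraction(reward_groups: list[list[float]], tol: float = 1e-8) -> float:
    """Fraction of prompts whose rollouts all received (nearly) the same reward."""
    if not reward_groups:
        return 0.0
    dead = sum(1 for group in reward_groups if max(group) - min(group) <= tol)
    return dead / len(reward_groups)


if __name__ == "__main__":
    batch = [
        [1.0, 1.0, 1.0, 1.0],  # all correct: advantages are zero
        [0.0, 1.0, 0.0, 0.0],  # mixed: useful signal
        [0.0, 0.0, 0.0, 0.0],  # all wrong: advantages are zero
        [1.0, 0.0, 1.0, 0.0],  # mixed: useful signal
    ]
    print(zero_signal_fraction(batch))  # 0.5
```

A value that climbs towards 1.0 over training suggests the rollouts have stopped being diverse, even if the logged loss looks unremarkable.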
If max_tokens is set smaller than the natural CoT length, the model learns to be artificially terse and loses reasoning quality. Symptom: quality positively correlates with the allowed budget.
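A simple diagnostic for this is the share of rollouts that stop because they ran out of budget. The helper below is a hypothetical sketch; it assumes a rollout whose length equals `max_tokens` was cut off rather than finished naturally.

```python
def truncation_rate(rollout_lengths: list[int], max_tokens: int) -> float:
    """Share of rollouts that reached the token limit (assumed to be truncated)."""
    if not rollout_lengths:
        return 0.0
    truncated = sum(1 for n in rollout_lengths if n >= max_tokens)
    return truncated / len(rollout_lengths)


if __name__ == "__main__":
    print(truncation_rate([100, 4096, 4096, 2000], max_tokens=4096))  # 0.5
```

If this rate stays high while CoT length is still growing, the budget is probably the binding constraint and raising `max_tokens` is worth testing.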
In production pipelines (separate vLLM sampler cluster, separate trainer) rollouts may come from a policy lagging by tens of updates. With too large lag, the importance ratio π_θ/π_old explodes and the loss diverges.
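Two common mitigations are PPO-style clipping of the importance ratio and discarding rollouts that are too stale. The sketch below illustrates both for a single token; `clip_eps`, `max_lag`, and the version counters are assumptions for illustration, not settings from any named system.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip_eps: float = 0.2) -> float:
    """PPO-style clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


def keep_rollout(policy_version: int, rollout_version: int, max_lag: int = 4) -> bool:
    """Drop rollouts sampled from a policy more than max_lag updates behind."""
    return policy_version - rollout_version <= max_lag


if __name__ == "__main__":
    # The ratio is exp(2), about 7.39, but the clipped term caps the gain at 1.2 * A.
    print(clipped_surrogate(logp_new=0.0, logp_old=-2.0, advantage=1.0))
    print(keep_rollout(policy_version=50, rollout_version=8))  # False
```

Clipping bounds how much a single stale sample can increase the objective, but it does not make stale data informative, which is why a lag limit is often used alongside it.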
Training only on math tasks yields a great mathematician who fails everywhere else. Reasoning RL works best on a mix of domains (math + code + logic + general QA with LLM-as-judge).