
Reasoning model

Training a language model with reinforcement learning to generate an extended chain of thought before producing an answer, enabling performance scaling through increased test-time compute independently of model size.

Mathematics and logical reasoning tasks · Programming and debugging · Document analysis and compliance · Planning and decision support · Research and research automation

A reasoning model uses a more deliberative inference mode, in which the model allocates additional tokens or computational steps to think through the task. This can involve decomposing the problem into stages, comparing multiple solution paths, checking for consistency, and only then generating the final answer.

Standard generative models often respond too quickly to difficult questions, increasing the risk of logical errors, skipped steps, and shallow reasoning. A reasoning model is designed to improve response quality on tasks that require deeper analysis.

01

LLM backbone (pretrained Transformer)

Generates tokens — both reasoning tokens and final answer tokens — via autoregressive prediction.

Modular

Pretrained decoder-only language model (Transformer) forming the base of the reasoning model. The architecture is identical to standard LLMs — a reasoning model differs from a standard LLM exclusively in post-training.
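
In practice this means a reasoning model is loaded and served exactly like any other decoder-only LLM; only the checkpoint (i.e., the post-training) differs. A minimal sketch using the Hugging Face transformers API, with one of the published DeepSeek-R1 distillations as an illustrative checkpoint:

```python
# Minimal sketch: the same autoregressive backbone produces both the
# <think>...</think> trace and the final answer. The checkpoint below is one
# published DeepSeek-R1 distillation, used purely as an illustrative example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How many primes are there below 20?"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# One autoregressive loop; reasoning tokens and answer tokens are not
# distinguished architecturally, the model simply generates a longer sequence.
output = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```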

02

Chain of Thought (CoT)

Extended intermediate processing that enables multiple passes over a problem before generating a final answer.

Sequence of tokens generated by the model before the final answer, containing reasoning steps, problem decomposition, self-verification, and corrections. Forms the model's working scratchpad and is the key mechanism for test-time scaling.
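
Because the scratchpad is just a span of tokens, separating it from the visible answer is a simple parsing step. A minimal sketch, assuming the <think>...</think> convention referenced in the pitfalls section below:

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split a completion into its CoT scratchpad and the final answer.

    Assumes the DeepSeek-R1-style convention where the reasoning trace is
    wrapped in <think>...</think> and the answer follows the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()           # no explicit scratchpad was emitted
    reasoning = match.group(1).strip()      # intermediate steps (working scratchpad)
    answer = output[match.end():].strip()   # everything after </think> is the answer
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>2 apples + 3 apples = 5 apples</think>The answer is 5."
)
```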

03

Reward Model / Process Reward Model

Supplies the learning signal to the RL algorithm that drives the development of reasoning capabilities.

Modular

Component evaluating output quality during RL training. May be an outcome reward model (ORM, evaluating only the final answer correctness) or a process reward model (PRM, evaluating individual reasoning steps). The reward signal drives CoT generation policy learning.

Outcome Reward Model (ORM) · Process Reward Model (PRM)
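
A minimal sketch contrasting the two reward types listed above; the exact-match check and the step scorer are hypothetical stand-ins for a ground-truth verifier and a learned step-level PRM:

```python
from typing import Callable, Sequence

def outcome_reward(final_answer: str, reference: str) -> float:
    """ORM-style signal: one scalar that depends only on the final answer."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_reward(steps: Sequence[str],
                   step_scorer: Callable[[str], float]) -> list[float]:
    """PRM-style signal: one score per reasoning step.

    `step_scorer` stands in for a learned step-level reward model, which is
    why PRMs require step-annotated training data.
    """
    return [step_scorer(step) for step in steps]
```
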
04

RL Algorithm

Trains the model to generate CoT reasoning that leads to correct answers on verifiable tasks.

Modular

Algorithm optimizing the model's chain-of-thought generation policy based on reward signals. DeepSeek-R1 uses GRPO (Group Relative Policy Optimization). The specific RL algorithm used in OpenAI o1 has not been published.

GRPO (Group Relative Policy Optimization)
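
The core of GRPO, as described in the DeepSeek-R1 report, is a group-relative advantage: several rollouts are sampled for the same prompt and each reward is normalized against the group mean and standard deviation, removing the need for a separate critic model. A minimal sketch of that normalization (KL regularization and clipping are omitted):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: the reward of each rollout is normalized
    within the group of rollouts sampled for the same prompt."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0   # guard against zero variance
    return [(r - mean) / std for r in group_rewards]

# Example: 4 rollouts for one prompt, two of which reached the correct answer.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [1.0, -1.0, 1.0, -1.0]
```
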
Bottleneck: Reasoning token sequence length at inference

Reasoning models generate significantly longer token sequences than standard LLMs due to extended CoT before the answer. Inference cost grows linearly with CoT length per query. For complex tasks, reasoning traces can span thousands of tokens, multiplying per-query cost relative to a standard LLM.
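
A back-of-envelope illustration of that multiplier; all prices and token counts below are assumptions, not measured figures:

```python
# Illustrative per-query cost comparison (every number here is an assumption).
price_per_1k_output_tokens = 0.01    # assumed price in USD
answer_tokens = 300                  # assumed final-answer length
cot_tokens = 8_000                   # assumed reasoning trace for a hard problem

standard_llm_cost = (answer_tokens / 1000) * price_per_1k_output_tokens
reasoning_cost = ((cot_tokens + answer_tokens) / 1000) * price_per_1k_output_tokens

print(f"standard LLM:    ${standard_llm_cost:.4f} per query")
print(f"reasoning model: ${reasoning_cost:.4f} per query "
      f"(~{reasoning_cost / standard_llm_cost:.0f}x)")
```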

Parallelism

Partially parallel

RL training can be parallelized by processing multiple rollouts simultaneously. Inference for different queries is independent and can be handled in parallel by multiple model instances.

Paradigm

Dense

Stage dependent

The model processes both reasoning tokens and answer tokens through the same dense decoder layers. The activation pattern is stage-dependent: the CoT generation phase (reasoning stage) can run many times longer than the final answer generation phase (answer stage), though both use the same underlying model architecture.

Thinking Budget

Critical
  • low / medium / high: Discrete selection of reasoning intensity level (e.g., the o3-mini reasoning_effort API parameter).
  • 1024 – 32768 tokens: Numeric token budget for CoT reasoning (e.g., Claude Extended Thinking).

Limit or setting controlling the maximum number of CoT tokens generated before the final answer. Directly controls the quality/inference-cost trade-off.
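
A sketch of the two control styles listed above. The parameter names (reasoning_effort, budget_tokens) follow the OpenAI and Anthropic Python SDKs; model identifiers and exact fields are assumptions that should be checked against the current API documentation:

```python
# Thinking-budget control, two styles. Verify model names and parameters
# against the providers' current API references before relying on them.
from openai import OpenAI
from anthropic import Anthropic

# Style 1: discrete effort level (e.g., o3-mini).
resp = OpenAI().chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",          # low / medium / high
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# Style 2: explicit token budget for the thinking phase (Claude Extended Thinking).
msg = Anthropic().messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=16_000,                                     # must exceed the budget
    thinking={"type": "enabled", "budget_tokens": 8_192},  # within the 1024-32768 range noted above
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
```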

Training RL compute budget

Critical

Amount of compute dedicated to RL training (number of RL steps, rollout data size). OpenAI reports that o1 performance consistently improves with more RL training compute.

Reward model type

Standard
  • ORM (outcome-based): Rewards only the correctness of the final answer. Simpler to implement; used in DeepSeek-R1-Zero.
  • PRM (process-based): Rewards the correctness of individual CoT steps. Improves reasoning faithfulness but requires step-level annotated data.

Choice between outcome reward model (ORM) and process reward model (PRM). Affects CoT quality, interpretability, and training cost.

Common pitfalls

Unstable and poorly readable CoT with pure RL and no cold-start data
HIGH

As shown by DeepSeek-R1-Zero, training via pure RL without SFT leads to emergent but poorly formatted reasoning chains: language mixing, endless repetition, poor readability. DeepSeek-R1 addresses this via cold-start data (SFT on a small set of exemplary CoT data before RL).

Using cold-start data (SFT on curated CoT examples) before the RL phase to establish a baseline reasoning format. Explicitly defining the CoT format (e.g., via the <think>...</think> template).
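
A single, hypothetical cold-start SFT record illustrating the explicit CoT template; the actual DeepSeek-R1 cold-start data is not public, so this only shows the shape of such an example:

```python
# One hypothetical cold-start record: a curated, well-formatted CoT example
# that fixes the reasoning format (<think>...</think>) before RL begins.
cold_start_example = {
    "prompt": "What is 17 * 24?",
    "completion": (
        "<think>\n"
        "17 * 24 = 17 * 20 + 17 * 4\n"
        "17 * 20 = 340\n"
        "17 * 4 = 68\n"
        "340 + 68 = 408\n"
        "</think>\n"
        "408"
    ),
}
```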

Reward hacking – model exploits shortcuts in the reward system
HIGH

With an underspecified reward function, the model may find ways to obtain high rewards without actually solving the problem (reward hacking). OpenAI noted this risk in the o1 system card.

Using precise, verifiable reward functions (e.g., formal mathematical verification, code execution with unit tests). Avoiding rewards based solely on CoT length or other easily gamed metrics.
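
A minimal sketch of a verifiable outcome reward for math-style tasks, assuming the <think>...</think> format; a code-generation variant would instead execute the produced program against unit tests. Note that no term depends on CoT length or style:

```python
def verifiable_math_reward(completion: str, ground_truth: str) -> float:
    """Outcome reward computed by a verifiable check, not a learned judge.

    Takes the text after the reasoning block and compares it to the known
    answer; rewarding only verified correctness removes easy targets for
    reward hacking such as CoT length.
    """
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```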

Overthinking – unnecessary CoT elongation for simple queries
MEDIUM

Reasoning models may generate unnecessarily long chains of thought for simple tasks, increasing inference cost without improving answer quality. The 'overthinking' phenomenon has been described in 2025 research literature as a significant efficiency challenge.

Using configurable thinking budgets (thinking budget / reasoning effort settings). Routing complex queries to reasoning models and simple ones to standard LLMs.
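
A minimal routing sketch; the difficulty estimate, threshold, and model identifiers are all illustrative assumptions:

```python
def route_query(estimated_difficulty: float, threshold: float = 0.6) -> dict:
    """Send a query to a reasoning model only when its estimated difficulty
    justifies the extra CoT cost.

    `estimated_difficulty` (0..1) would come from a cheap upstream classifier
    or heuristic; the 0.6 threshold and model names are assumptions.
    """
    if estimated_difficulty >= threshold:
        return {"model": "reasoning-model", "reasoning_effort": "high"}
    return {"model": "standard-llm"}
```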

CoT Infidelity – reasoning trace does not reflect actual inference
MEDIUM

Chain of thought in reasoning models does not guarantee that the visible reasoning trace corresponds to the model's actual internal computations. The CoT may be a 'post-hoc rationalization', complicating debugging and safety evaluation.

Apply CoT monitoring as described in the OpenAI o1 system card. Assess CoT faithfulness through perturbations and ablations. Account for CoT interpretability limitations when deploying in safety-critical systems.

2022

Wei et al. (Google Brain) formalize Chain-of-Thought prompting

breakthrough

Wei et al. published 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models', showing that prompting LLMs to generate intermediate steps significantly improves performance on arithmetic and symbolic tasks.

2023

Lightman et al. (OpenAI) demonstrate effectiveness of Process Reward Models

breakthrough

Paper 'Let's Verify Step by Step' showed that supervising each reasoning step (PRM) 'significantly outperforms outcome supervision' on challenging math problems.

2024

OpenAI introduces the term and category "reasoning model" with the o1 release (September 2024)

breakthrough

OpenAI released o1-preview and o1-mini on September 12, 2024 as the first publicly available 'reasoning model' series. Models trained via large-scale RL to use CoT. The term 'reasoning model' entered widespread use as a category name.

2025

DeepSeek-R1 – first open, fully documented reasoning model (January 2025)

breakthrough

DeepSeek-AI published arXiv:2501.12948, the first open, comprehensive technical documentation of training a reasoning model with RL (GRPO). DeepSeek-R1-Zero showed that reasoning capabilities can emerge via pure RL with no supervised fine-tuning, while DeepSeek-R1 adds a small cold-start SFT stage before RL. The model weights were released publicly.

GPU Tensor Cores
PRIMARY

Reasoning models use the same Transformer decoder architecture as standard LLMs and require GPUs with Tensor Cores for efficient inference. Generating long CoT chains substantially increases VRAM demand (KV cache for long sequences) and GPU time per query.

Due to very long token sequences (CoT + answer), reasoning models require GPUs with large HBM capacity. Models in the 7B–70B range require 16 to 80 GB of VRAM respectively. Inference for large reasoning models (DeepSeek-R1 671B) requires multi-GPU configurations. Test-time compute scaling directly translates to higher GPU cost per query compared to standard LLMs of the same size.
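
A back-of-envelope KV-cache estimate showing why long reasoning traces inflate VRAM demand; the architecture numbers are assumptions roughly matching a 70B-class dense Transformer with grouped-query attention:

```python
# Rough KV-cache size for one long reasoning sequence (illustrative numbers).
layers = 80                 # decoder layers
kv_heads = 8                # grouped-query attention KV heads
head_dim = 128              # per-head dimension
bytes_per_value = 2         # fp16 / bf16
seq_len = 32_000            # long CoT + answer

kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value  # 2 = K and V
print(f"{kv_cache_bytes / 1e9:.1f} GB of KV cache per sequence")  # ~10.5 GB
```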

TPU
GOOD

TPU v4/v5 are used to train large reasoning models (e.g., by Google). They handle long token sequences efficiently thanks to high-bandwidth HBM and a GEMM-optimized architecture.

RL training of reasoning models on TPUs requires infrastructure adapted to long rollouts and dynamic sequence lengths.

Reasoning models

Official overview of how reasoning models and reasoning tokens work.

documentation · OpenAI
Reasoning best practices

Practical differences between reasoning models and classical GPT models.

documentation · OpenAI