Robots Atlas

Chain-of-Thought Reasoning

Demonstrating that prompting large language models to generate a series of intermediate natural-language reasoning steps before producing a final answer significantly improves performance on complex multi-step tasks, a capability that emerges only at sufficient model scale.

Category

Abstraction level: Operation level

  • Arithmetic reasoning (GSM8K, MultiArith, SVAMP)
  • Commonsense reasoning (CommonsenseQA, StrategyQA)
  • Symbolic reasoning (Last Letter Concatenation, Coin Flip)
  • Math word problem solving
  • Step-explained code generation (chain-of-thought code generation)
  • LLM agents making multi-step decisions (ReAct, Tree of Thoughts)
  • Automated business data analysis and interpretation
  • Generating explanations for model decisions (explainability)

1. Few-shot CoT: 4–8 exemplars are inserted into the prompt, each containing a full reasoning chain ending with a final answer (e.g. "Anna has 5 apples, gets 3 more, so 5+3=8. The answer is 8"). Conditioned on this pattern, the model produces an analogous chain for the new question.
2. Zero-shot CoT: a trigger phrase ("Let's think step by step") is appended to the question, and the model produces the chain and answer in a single pass.
3. Decoding: standard greedy decoding of one chain, or Self-Consistency: sample 10–40 independent chains with temperature > 0 and select the most frequent final answer by majority vote.
4. Extraction: the final answer is parsed from the model output after a marker like "The answer is" or as the final sentence of the chain.
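A minimal sketch of step 2 (zero-shot CoT), assuming a generic text-completion client; `llm_generate` is a hypothetical placeholder, not any vendor's API:

```python
# Minimal zero-shot CoT sketch. `llm_generate` stands in for any
# text-completion call (hosted API or local model); it is not a real library.
def llm_generate(prompt: str, temperature: float = 0.0) -> str:
    raise NotImplementedError("plug in your LLM client here")

def zero_shot_cot(question: str) -> str:
    # Kojima et al. (2022): the trigger phrase elicits a reasoning chain
    # followed by the final answer in a single pass.
    prompt = f"Q: {question}\nA: Let's think step by step."
    return llm_generate(prompt)
```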

Standard few-shot prompting fails on multi-step tasks — models produce immediate, incorrect answers because they try to solve a complex problem in one pass. Without explicit decomposition, models cannot reliably perform arithmetic, commonsense reasoning, or symbolic manipulations that require multiple dependent steps.

01

Prompt with CoT Examples

Conditioning the model to generate reasoning steps before producing the final answer.

Modular

In few-shot CoT, the prompt contains a small number (typically 4–8) of exemplar problems whose answers are preceded by a chain of intermediate reasoning steps. In zero-shot CoT, a trigger phrase (e.g., 'Let's think step by step') is appended instead.

Few-shot CoT · Zero-shot CoT · Auto-CoT
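A sketch of few-shot CoT prompt assembly in the format described above; the exemplars are illustrative stand-ins for hand-written task demonstrations:

```python
# Few-shot CoT prompt construction (Wei et al., 2022 style). The exemplar
# texts below are illustrative; real exemplars are written for the target task.
EXEMPLARS = [
    ("Anna has 5 apples and gets 3 more. How many apples does she have?",
     "Anna starts with 5 apples. She gets 3 more, so 5 + 3 = 8. The answer is 8."),
    ("A train travels 60 km per hour. How far does it go in 3 hours?",
     "The train covers 60 km each hour. Over 3 hours, 60 * 3 = 180. The answer is 180."),
]

def build_few_shot_prompt(question: str) -> str:
    # Each exemplar ends with "The answer is ...", so the model learns to
    # produce a chain first and a parseable answer last.
    blocks = [f"Q: {q}\nA: {chain}" for q, chain in EXEMPLARS]
    blocks.append(f"Q: {question}\nA:")  # model continues with its own chain
    return "\n\n".join(blocks)
```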
02

Chain of Thought

Decomposition of a complex problem into verifiable intermediate steps

The reasoning chain is the core output artifact of CoT. It consists of natural-language sentences that articulate sub-problems, intermediate computations, or logical deductions. It appears between the question and the final answer in the model output.

03

Final Answer Extraction

Parses the final answer from model output that contains a chain-of-thought reasoning trace.

Modular

After the model generates its reasoning chain, the final answer is extracted from the output: by taking the final sentence of a single greedily decoded chain, by matching a pattern (e.g., 'The answer is'), or by majority vote over answers parsed from multiple sampled chains (self-consistency).

Greedy decoding (single chain) · Majority voting (self-consistency)
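A sketch of both extraction routes, assuming chains end with a "The answer is" marker as in the exemplar format above:

```python
import re
from collections import Counter

# Answer extraction and self-consistency voting. The marker regex assumes
# chains end with "The answer is <value>", as in the exemplars above.
ANSWER_RE = re.compile(r"The answer is\s*(-?[\d.,]+)")

def extract_answer(chain: str) -> str | None:
    match = ANSWER_RE.search(chain)
    return match.group(1).rstrip(".,") if match else None

def self_consistency(chains: list[str]) -> str | None:
    # Majority vote over final answers of independently sampled chains
    # (Wang et al., 2022); greedy decoding is the special case of one chain.
    votes = Counter(a for a in map(extract_answer, chains) if a is not None)
    return votes.most_common(1)[0][0] if votes else None
```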
Time complexity

k = number of CoT examples in the prompt; T = average reasoning chain length in tokens; C = cost of a single LLM inference pass. For self-consistency, this cost is multiplied by the number of sampled paths (typically 10–40).

CoT increases the number of output tokens relative to direct prompting, proportionally raising inference cost. Self-consistency multiplies this cost further by the number of sampled reasoning chains.
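A back-of-the-envelope illustration of that growth; all token counts here are assumed for the sake of the arithmetic, not measurements:

```python
# Illustrative cost arithmetic (assumed token counts, not measurements).
direct_tokens = 10      # direct-answer output
cot_tokens = 200        # reasoning chain + answer
n_paths = 40            # self-consistency samples, upper end of 10–40

print(cot_tokens / direct_tokens)             # 20.0x output tokens vs direct
print(cot_tokens * n_paths / direct_tokens)   # 800.0x with majority voting
```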

Memory complexity

k = number of examples, T = tokens per example chain, L = question length. The total prompt token count (roughly k·T + L) determines KV-cache memory requirements.

CoT examples increase prompt length, consuming more of the context window and KV-cache compared to standard prompting.
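A sketch of the resulting KV-cache footprint, under assumed model dimensions (all values illustrative):

```python
# KV-cache estimate for a CoT prompt. Model dimensions and token counts
# are assumptions for illustration, not any specific model's values.
k, T, L = 8, 150, 60                  # exemplars, tokens per exemplar, question tokens
prompt_tokens = k * T + L             # total prompt length

n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_scalar = 2                  # fp16
# 2x for keys and values, per layer, per KV head, per token
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_scalar * prompt_tokens
print(f"{kv_bytes / 1e6:.0f} MB of KV-cache for {prompt_tokens} prompt tokens")
```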

Bottleneck: Extended autoregressive reasoning step generation

Generating the reasoning chain requires producing many more output tokens than a direct-answer approach. Each token requires one autoregressive model forward pass, making inference latency and compute proportional to chain length.

Parallelism

Sequential

Multiple independent chains (self-consistency) can be generated in parallel across a batch dimension, provided the compute budget allows.
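A sketch of that batching, using client-side threads over the same hypothetical `llm_generate` placeholder; hosted APIs often expose an equivalent server-side parameter for multiple completions:

```python
from concurrent.futures import ThreadPoolExecutor

def llm_generate(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("plug in your LLM client here")  # placeholder

def sample_chains(prompt: str, n: int = 10, temperature: float = 0.7) -> list[str]:
    # Chains are mutually independent, so they can be sampled concurrently
    # and fed to the majority vote in the extraction step.
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(llm_generate, prompt, temperature) for _ in range(n)]
        return [f.result() for f in futures]
```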

Paradigm

Dense

All paths active

All model parameters are active during every inference pass. There is no sparse or conditional activation. CoT is a prompting strategy applied at inference time to a standard dense LLM.

CoT Example Count

Standard
  • 0: Zero-shot CoT, no examples, only a trigger phrase.
  • 4–8: Standard few-shot CoT range used in Wei et al. (2022).

The number of (question, reasoning chain, answer) demonstrations included in the prompt. The original paper used 8 exemplars across benchmarks.

Zero-Shot CoT Trigger Phrase

Standard
  • Let's think step by step
  • Think carefully and solve step by step.

In zero-shot CoT, the phrase appended to the question to elicit reasoning. 'Let's think step by step' was introduced by Kojima et al. (2022).

Number of sampled reasoning paths (self-consistency)

Standard
  • 1: Greedy decoding of a single chain (standard CoT).
  • 10–40: Self-consistency range used in Wang et al. (2022).

Number of independently sampled chains for self-consistency decoding. Higher values improve accuracy but multiply compute cost.

Model parameter count

Critical
  • ≥100B (2022 threshold for emergent CoT): for base models without CoT fine-tuning.
  • 7B–70B (fine-tuned CoT models): smaller models fine-tuned on CoT data can exhibit reasoning behavior.

CoT performance gains are strongly dependent on model scale. In Wei et al. (2022), benefits were observed primarily in models above ~100B parameters (PaLM 540B, GPT-3 175B). This threshold has shifted with later fine-tuned smaller models.

Common pitfalls

Unfaithful chains of reasoning
HIGH

A model may produce a plausible-looking reasoning chain that does not actually causally determine its final answer — the reasoning post-hoc rationalizes a decision made by other internal mechanisms. The chain may be misleading rather than explanatory.

Do not treat CoT outputs as reliable explanations. Verify final answers independently. Apply process reward models when faithful reasoning is required.

Scale dependency — small models show degraded performance
HIGH

In base models without CoT-specific fine-tuning, CoT prompting may hurt performance in small models (below ~100B parameters in the original 2022 results), as they generate plausible-sounding but incorrect intermediate steps.

Use sufficiently large models, or models fine-tuned on CoT data when working with smaller parameter counts.

Sensitivity to example quality and selection
MEDIUM

The choice of few-shot exemplars significantly affects CoT performance. Poorly constructed, ambiguous, or domain-mismatched exemplars can degrade reasoning quality.

Carefully select examples; apply active selection methods (Active-Prompt) or automatic chain generation to identify the most informative examples for the target task.

Increased inference cost
MEDIUM

Generating reasoning chains increases output token count, proportionally increasing latency and API cost relative to direct-answer prompting.

Use CoT selectively for tasks where it demonstrably improves accuracy; for simple tasks, direct prompting may suffice at lower cost.

Error accumulation across reasoning steps
HIGH

An error in an early intermediate step propagates to all subsequent steps, often yielding a confidently stated but incorrect final answer.

Use self-consistency (majority voting over multiple sampled chains) to reduce the impact of single-chain errors; apply verification steps or external tool calls to check intermediate computations.
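As a minimal illustration of the last mitigation, a sketch that re-checks arithmetic steps found in a chain; the "a op b = c" step format is an assumption, and a production system would route each step to an external tool instead:

```python
import re

# Re-check arithmetic steps written inline in a chain, e.g. "5 + 3 = 8".
# The step format is an assumption; real systems call out to a tool per step.
STEP_RE = re.compile(r"(\d+(?:\s*[-+*/]\s*\d+)+)\s*=\s*(-?\d+)")

def check_arithmetic(chain: str) -> list[str]:
    errors = []
    for expr, claimed in STEP_RE.findall(chain):
        # eval is safe here: the regex admits only digits, operators, whitespace
        if eval(expr) != int(claimed):
            errors.append(f"{expr} = {claimed}")
    return errors

print(check_arithmetic("She has 5 + 3 = 8 apples, then 8 * 2 = 17 total."))
# ['8 * 2 = 17']
```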

GENESIS · Source paper

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
2022 · NeurIPS 2022 · Jason Wei, Xuezhi Wang, Dale Schuurmans et al.
2022

Few-shot Chain-of-Thought Prompting (Wei et al.)

breakthrough

Wei et al. demonstrate that few-shot prompting with reasoning-chain exemplars significantly improves LLM performance on arithmetic, commonsense, and symbolic reasoning. Establishes CoT as an emergent capability of large-scale models.

2022

Zero-shot Chain-of-Thought (Kojima et al.)

breakthrough

Kojima et al. show that appending 'Let's think step by step' to a prompt elicits reasoning chains without any exemplars, making CoT applicable without manual annotation.

2022

Self-consistency decoding for CoT (Wang et al.)

breakthrough

Wang et al. propose sampling multiple diverse reasoning paths and selecting the most consistent final answer by majority vote, substantially improving CoT accuracy over greedy decoding.

2023

Tree of Thoughts (Yao et al.)

Yao et al. generalize CoT from linear chains to tree-structured search over intermediate thoughts, enabling backtracking and look-ahead in multi-step problem solving.

2024

Native reasoning models internalize CoT via RL (OpenAI o1)

breakthrough

OpenAI releases o1, a model trained via reinforcement learning on process-level reward signals to produce extended internal reasoning chains, rather than relying on CoT prompting. This represents a shift from prompting-elicited to trained-in reasoning.

2025

Open reasoning models released (DeepSeek-R1)

DeepSeek releases R1, an open-source model trained with group relative policy optimization (GRPO) to produce long reasoning chains natively, achieving performance comparable to o1 on reasoning benchmarks.

GPU Tensor Cores · PRIMARY

CoT is an inference-time technique applied to LLMs, which operate most efficiently on GPUs with tensor cores for matrix multiplications in the attention and feed-forward layers of the transformer.

Hardware requirements are entirely determined by the underlying LLM. CoT itself adds no hardware requirements beyond those of base model inference.

TPU · GOOD

TPUs are commonly used for large-scale LLM inference; CoT is compatible with any hardware capable of running the base model.

BUILT ON

LLM

A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.

GO TO CONCEPT

Commonly used with

RLHF

Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017) and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.

2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically with a Bradley-Terry preference objective: loss = -E[log σ(r(x, y_w) - r(x, y_l))], where y_w is the preferred and y_l the rejected response.

3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)).

The KL penalty with coefficient β is critical to prevent reward hacking. During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory. A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections, generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
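A minimal sketch of the stage-2 preference objective above, assuming PyTorch; the tensor values are illustrative:

```python
import torch
import torch.nn.functional as F

# Bradley-Terry preference loss for reward-model training (stage 2 above):
# loss = -E[log σ(r(x, y_w) - r(x, y_l))]
def preference_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_w - r_l).mean()

# Scalar reward-model outputs for preferred (y_w) and rejected (y_l)
# responses to a batch of three prompts; values are illustrative.
r_w = torch.tensor([1.2, 0.3, 2.1])
r_l = torch.tensor([0.4, 0.5, 1.0])
print(preference_loss(r_w, r_l))
```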

GO TO CONCEPT
Tool-augmented LLM

Tool-augmented LLM is an architectural pattern in which a large language model is equipped with access to one or more external tools that it can invoke during inference by generating structured function-call or API-call outputs. The model learns when and how to call tools by producing special tokens or structured output (e.g., JSON function calls) that are intercepted by a host runtime, executed against the tool, and whose results are returned to the model as new context for continued generation.

The canonical formalization appeared in the Toolformer paper (Schick et al., Meta AI, 2023), which demonstrated that LLMs can learn to self-supervise their own tool use through API-call annotation without requiring large labeled datasets. Toolformer showed that models trained this way can decide which tools to call, when, and with which arguments, and that tool use substantially improves performance on tasks requiring fresh information, arithmetic, multilingual lookup, and question answering.

The pattern encompasses several mechanisms:
1. In-context tool specification, where tool interfaces are described in the system prompt or context (JSON Schema, OpenAPI, natural language).
2. Function calling APIs, where the model produces structured output matched to a defined schema and the host dispatches the call.
3. ReAct-style interleaving, where the model alternates reasoning traces with tool-use observations.
4. Parallel tool calling, where the model emits multiple tool calls simultaneously to be executed concurrently.

Key implementations include OpenAI function calling (GPT-4, June 2023), Anthropic tool use (Claude, 2023), Google Gemini function calling, and the Model Context Protocol (MCP, 2024), which standardizes tool server connectivity.
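A sketch of the host-runtime dispatch loop described above; the JSON call format and `llm_generate` client are assumptions for illustration, not any vendor's actual function-calling schema:

```python
import json

# Host-side tool-dispatch loop. The model is assumed to emit either plain
# text (final answer) or a JSON call like {"tool": "calculator", "args": "5+3"}.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def llm_generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # placeholder

def run_with_tools(prompt: str, max_turns: int = 5) -> str:
    context = prompt
    for _ in range(max_turns):
        output = llm_generate(context)
        try:
            call = json.loads(output)
        except json.JSONDecodeError:
            return output                    # plain text: treat as final answer
        result = TOOLS[call["tool"]](call["args"])
        context += f"\n[tool:{call['tool']}] {result}"  # result becomes new context
    return context
```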

GO TO CONCEPT
Reasoning model

A reasoning model (also: large reasoning model, LRM, reasoning language model, RLM) is a type of large language model that has been specifically post-trained to solve complex multi-step problems by explicitly generating intermediate reasoning steps before committing to a final response. Unlike standard LLMs that generate a direct response in a single forward pass, reasoning models allocate additional computation at inference time, a property known as test-time compute scaling, by producing a long internal chain of thought (CoT). The reasoning trace typically includes steps such as problem decomposition, hypothesis generation, self-verification, reflection, and correction of errors.

The defining characteristics of reasoning models are:
1. Post-training via large-scale reinforcement learning (RL) using reward signals based on final answer correctness (and sometimes intermediate step quality via process reward models).
2. The emergence of extended, often hidden, reasoning traces that precede the final answer.
3. A consistent empirical relationship between the length or computational budget allocated to the reasoning trace and final answer quality (test-time scaling law).
4. Superior performance on verifiable tasks requiring multi-step logic, such as mathematics, competitive programming, and scientific reasoning.

The term 'reasoning model' was introduced as a product category by OpenAI in September 2024 with the release of the o1-preview model. OpenAI described o1 as trained via a large-scale RL algorithm teaching the model to use chain of thought productively. The approach does not rely on explicit tree search algorithms; instead, implicit search emerges via RL-trained CoT generation. In January 2025, DeepSeek published the first detailed open technical description of this class of models in the DeepSeek-R1 paper (arXiv:2501.12948), demonstrating that reasoning capabilities can be incentivized via pure RL without supervised fine-tuning, using Group Relative Policy Optimization (GRPO) as the RL framework.

Reasoning models typically employ the same base Transformer decoder architecture as standard LLMs, with the key difference residing entirely in the post-training pipeline: RL replaces or augments standard RLHF/SFT, and reward signals are grounded in verifiable outcomes. The resulting models generate substantially longer token sequences during inference (reasoning tokens), which are often hidden from end users but incur real compute costs. Performance consistently improves with both more training-time RL compute and more inference-time thinking budget.

GO TO CONCEPT