Robots Atlas

In-Context Learning

The ability of a large language model to learn a new task at inference time — solely from a handful of examples (demonstrations) provided in the prompt — without weight updates or fine-tuning.

Category
Abstraction level
Operation level
  • Few-example text classification (sentiment, intent)
  • Machine translation between language pairs without fine-tuning
  • Data structuring: extracting JSON from text with 2–3 examples
  • Domain-specific question answering with few-shot examples
  • Style transfer and paraphrasing with demonstrations
  • Prompt engineering in LLM applications (LangChain, DSPy)
  • Foundation models for robotics — learning policies from in-prompt demonstrations (RT-2, VLA)
  • Chatbot personalization without changing model weights

1. Prompt construction: an optional natural-language task instruction + k demonstration (input, output) pairs + the new query input. Demonstrations are separated by a delimiter (e.g. newline, '###', XML tag); see the prompt-construction sketch below.
2. Tokenization and forward pass: the full prompt is fed as context to the transformer decoder. Self-attention lets every token attend to all preceding tokens, including the demonstrations.
3. Pattern induction: attention layers (particularly induction heads, Olsson et al. 2022) detect [token A → token B] patterns in the demonstrations and propagate them to the new input. This is analogous to implicit gradient descent in activation space.
4. Output generation: the model autoregressively produces answer tokens, continuing the pattern established by the demonstrations.
5. No weight updates: unlike fine-tuning, no gradients are computed or backpropagated. All "learning" happens entirely in the activations of a single forward pass.
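A minimal sketch of step 1 (prompt construction) in Python. The separator, field labels, and example demonstrations are illustrative choices rather than a prescribed format; the resulting string would be passed unchanged to whatever frozen LLM is in use.

```python
# Few-shot prompt assembly: instruction + k demonstrations + open-ended query.
# The separator and "Input:/Output:" labels are arbitrary but must be consistent.

SEPARATOR = "\n###\n"

def build_icl_prompt(instruction, demonstrations, query):
    """Assemble an in-context learning prompt from its three parts."""
    parts = []
    if instruction:
        parts.append(instruction)
    for x, y in demonstrations:
        parts.append(f"Input: {x}\nOutput: {y}")
    # The query repeats the demonstration format but leaves the output open,
    # so the model continues the induced pattern.
    parts.append(f"Input: {query}\nOutput:")
    return SEPARATOR.join(parts)

demos = [
    ("The plot was dull and predictable.", "negative"),
    ("A warm, beautifully acted film.", "positive"),
]
prompt = build_icl_prompt(
    instruction="Classify the sentiment of each review as positive or negative.",
    demonstrations=demos,
    query="I couldn't stop smiling the whole time.",
)
print(prompt)  # sent as-is to a frozen LLM; no weights are updated
```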

Traditional supervised learning requires a training set for every new task, model fine-tuning (a separate copy of the weights), and training infrastructure. This hinders rapid adaptation to new tasks and blocks scaling to thousands of domains. ICL removes this bottleneck: a single frozen LLM can perform any task that can be specified in the prompt, without additional training and without extra weight copies.

01

Task instruction

Specifies the task for the model

Modular

Optional natural-language task description preceding the demonstrations. In instruction-tuned models (GPT-3.5+, Claude), the instruction alone is often sufficient (zero-shot ICL).

02

Demonstrations (shots)

Conditioning the model on the task pattern

(input, output) pairs illustrating the expected model behavior. The number of demonstrations k defines the variant: zero-shot (k=0), one-shot (k=1), few-shot (k=2–32). Demonstrations must fit within the model's context window.

Zero-shot · One-shot · Few-shot · Many-shot
03

Query input

Application point of the learned pattern

The actual input for which the model should generate an answer. It must follow the same format as the demonstration inputs so that the model recognizes the pattern.

04

Induction heads

Mechanistic substrate of in-context learning

Specific attention heads in transformer layers ≥2 that learn to recognize the [A][B] ... [A] → [B] pattern during pretraining. Olsson et al. (2022, Anthropic) showed that induction heads are the mechanistic substrate of ICL — their formation correlates with the ICL emergence phase during training.
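A toy illustration of the [A][B] ... [A] → [B] completion rule attributed to induction heads, written as plain Python over a token list. It mimics the behavior only; it is not an attention-head implementation.

```python
# Induction pattern: when the current token A occurred earlier, predict the
# token B that followed it there ([A][B] ... [A] -> [B]).

def induction_completion(tokens):
    """Predict the next token by prefix-matching on the last token."""
    last = tokens[-1]
    # Scan backwards for an earlier occurrence of the same token
    # and copy the token that followed it.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence, so no induction-based prediction

seq = ["Mr", "Dursley", "was", "thinking", "about", "Mr"]
print(induction_completion(seq))  # -> "Dursley"
```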

Time

k = number of demonstrations, L_demo = average demonstration length in tokens, L_query = query length, d = model dimension. The prompt contains N = k·L_demo + L_query tokens, and complexity is dominated by self-attention over the entire prompt, roughly O(N²·d) per layer.

Inference cost grows quadratically with the number of demonstrations under classical self-attention. Long-context mechanisms help on the memory side: FlashAttention avoids materializing the N×N attention matrix (linear memory, still quadratic compute), and ring attention shards the computation across devices.

Memory complexity

The number of prompt tokens determines KV-cache size and context-window usage.

Many-shot ICL requires long-context models (≥128k tokens). Standard few-shot fits in 4–32k.
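A back-of-the-envelope sketch relating the number of shots to prompt length, attention cost, and KV-cache size. The layer count, model dimension, fp16 storage, and per-demonstration lengths below are illustrative assumptions, not any specific model's configuration.

```python
# Rough cost of a few-shot prompt: N tokens, O(N^2) attention pairs per layer,
# and a KV cache of 2 * layers * N * d_model values (fp16, no GQA assumed).

def prompt_length(k, l_demo, l_query):
    return k * l_demo + l_query  # N = k * L_demo + L_query

def attention_pairs(n):
    return float(n) * n  # token pairs touched by self-attention per layer

def kv_cache_bytes(n, n_layers=32, d_model=4096, bytes_per_elem=2):
    # One key and one value vector of size d_model per token, per layer.
    return 2 * n_layers * n * d_model * bytes_per_elem

for k in (8, 64, 512):
    n = prompt_length(k, l_demo=120, l_query=60)
    print(f"k={k:4d}  N={n:6d} tokens  "
          f"attention pairs/layer={attention_pairs(n):.2e}  "
          f"KV cache={kv_cache_bytes(n) / 1e9:.2f} GB")
```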

Bottleneck: Quadratic self-attention over demonstrations

Self-attention scales as O(N²) in prompt length. With k demonstrations and long inputs, cost grows quickly — particularly in many-shot ICL.

Parallelism

Sequential

Prefill of demonstrations can be fully parallel (one forward pass over the whole prompt). Answer generation is sequential, as in any transformer decoder.

Paradigm

Dense

All paths active

ICL is a prompting technique applied to a standard dense Transformer at inference time. All parameters are active; no conditional routing.

Number of shots (k)

Critical
  • 0: Zero-shot — instruction only, no examples.
  • 4–8: Standard few-shot range from the GPT-3 paper.
  • 32: Upper bound used in Brown et al. (2020) benchmarks.
  • 100–1000+: Many-shot ICL in long contexts (Gemini 1.5 Pro, Claude 3).

Number of (input, output) pairs in the prompt. Affects both quality and inference cost (context length).

Demonstration order

Standard
  • random: Random ordering — high variance in results.
  • similarity-ranked: Demonstrations ordered by similarity to the query.

The order in which demonstrations appear in the prompt. Empirically, ICL quality is strongly permutation-dependent (Lu et al. 2022).

Demonstration selection strategy

Standard
  • static: The same demonstrations for all queries.
  • kNN retrieval (KATE): Demonstrations most semantically similar to the query (Liu et al. 2022).

How demonstrations are selected from a candidate pool. Static (fixed pool) vs. dynamic (retrieval-based, e.g. KATE — k-nearest demonstrations).
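A sketch of dynamic, KATE-style selection: embed the candidate inputs and the query, then keep the k nearest candidates. The `embed` function below is a hashing placeholder so the snippet runs stand-alone; in practice it would be replaced by a real sentence encoder.

```python
# KATE-style demonstration retrieval: pick the k candidates whose inputs are
# most similar to the query in embedding space.

import numpy as np

def embed(texts, dim=64):
    """Placeholder bag-of-words hashing embedder; swap in a real encoder."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            out[i, hash(tok) % dim] += 1.0
    return out

def select_demonstrations(query, candidates, k=4):
    """Return the k (input, output) pairs whose inputs are closest to the query."""
    vecs = embed([x for x, _ in candidates] + [query])
    cand, q = vecs[:-1], vecs[-1]
    sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]          # indices of the most similar candidates
    return [candidates[i] for i in top]

pool = [
    ("Refund my order, it never arrived.", "complaint"),
    ("What are your opening hours?", "question"),
    ("Great service, thank you!", "praise"),
    ("The package came damaged.", "complaint"),
]
print(select_demonstrations("The package I received was damaged.", pool, k=2))
```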

Demonstration format

Standard
  • 'Q: ... A: ...': Classic format from the GPT-3 paper.
  • XML tags ('<input>...</input>'): Preferred for Claude and structured outputs.

Convention for separating input/output fields (e.g. 'Q:/A:', '###', XML tags). Affects how well the model recognizes the pattern.
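A small illustration of the two conventions with made-up content; the exact strings are only examples, and the key point is that the query must reuse the demonstration template verbatim, with the answer slot left open.

```python
# Two common demonstration templates. Demonstrations and the query must share
# one template exactly.

def qa_demo(x, y):
    """Classic GPT-3 style Q/A demonstration."""
    return f"Q: {x}\nA: {y}"

def xml_demo(x, y):
    """XML-tagged demonstration, often used for structured outputs."""
    return f"<input>{x}</input>\n<output>{y}</output>"

print(qa_demo("Translate 'chien' into English.", "dog"))
print("Q: Translate 'chat' into English.\nA:")  # query in the matching format
```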

Common pitfalls

Sensitivity to demonstration order
HIGH

Lu et al. (2022) showed that the same demonstration set in different orders yields results differing by 20–30 accuracy percentage points. Some permutations perform worse than the random baseline.

Average results across several permutations or use sorting heuristics (from least to most similar to the query).
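A sketch of the permutation-averaging mitigation. `accuracy_with_order` is a placeholder for an actual evaluation run (build the prompt in the given order, query the model on a dev set, score the outputs).

```python
# Estimate ICL accuracy as the mean over several demonstration orderings,
# and report the spread to expose order sensitivity.

import itertools
import random
import statistics

def accuracy_with_order(demonstrations):
    # Placeholder score so the sketch runs; replace with a real evaluation loop.
    random.seed(hash(tuple(demonstrations)) % (2**32))
    return random.uniform(0.55, 0.85)

def order_robust_estimate(demonstrations, max_perms=6):
    perms = list(itertools.permutations(demonstrations))
    random.shuffle(perms)
    scores = [accuracy_with_order(p) for p in perms[:max_perms]]
    return statistics.mean(scores), statistics.pstdev(scores)

demos = [("great", "positive"), ("awful", "negative"),
         ("fine", "positive"), ("broken", "negative")]
mean_acc, spread = order_robust_estimate(demos)
print(f"mean accuracy {mean_acc:.3f} +/- {spread:.3f} across sampled orders")
```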

Recency bias — model favors the last demonstrations
MEDIUM

Models tend to focus mainly on the final demonstrations in the prompt, ignoring earlier information. Particularly problematic in many-shot ICL.

Place key demonstrations at the end of the list; for classification tasks, balance the order of labels.

Majority label bias
HIGH

If demonstrations are imbalanced (e.g. 6/8 labeled "positive"), the model will systematically predict the dominant label for new queries.

Balance labels in demonstrations (e.g. 4 per class). Apply output calibration (Zhao et al. 2021).
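A sketch of the calibration step in the spirit of Zhao et al. (2021): estimate the model's label bias from a content-free input (e.g. "N/A") and rescale predictions by it. The probability vectors below are hard-coded stand-ins for actual model outputs.

```python
# Contextual calibration: divide predicted label probabilities by the
# probabilities the model assigns to a content-free query, then renormalize.

import numpy as np

labels = ["negative", "positive"]

p_content_free = np.array([0.30, 0.70])  # bias toward "positive" with input "N/A"
p_query = np.array([0.45, 0.55])         # raw prediction for a real query

calibrated = p_query / p_content_free
calibrated /= calibrated.sum()

print(dict(zip(labels, np.round(calibrated, 3))))
# The raw majority choice ("positive") flips to "negative" after calibration.
```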

Format mismatch between demonstrations and query
MEDIUM

Subtle differences in format (e.g. space before the answer, period at the end of the input) between demonstrations and the query can drastically reduce ICL quality.

Normalize the format programmatically. Always test prompts using exactly the same separator for demonstrations and the query.

Test data leakage into demonstrations
HIGH

It is easy to accidentally include examples from the test split in demonstrations. This results in inflated benchmark scores.

Strictly separate the demonstration pool from the test set. Audit all demonstrations before evaluation.
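A minimal disjointness check with hypothetical example IDs; any overlap between the demonstration pool and the test split should stop the evaluation before it starts.

```python
# Fail fast if any test example also appears among the demonstrations.

demo_pool = {"ex-001", "ex-014", "ex-027"}
test_set = {"ex-102", "ex-117", "ex-203"}

overlap = demo_pool & test_set
if overlap:
    raise ValueError(f"test examples leaked into demonstrations: {sorted(overlap)}")
print("demonstration pool and test set are disjoint")
```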

GENESIS · Source paper

Language Models are Few-Shot Learners
2020 · NeurIPS 2020 (Best Paper Award) · Tom B. Brown, Benjamin Mann, Nick Ryder et al.
2019

GPT-2 — first observations of zero-shot transfer (Radford et al.)

Radford et al. show that GPT-2 (1.5B parameters) can perform NLP tasks without fine-tuning when prompts are appropriately framed. Precursor to full ICL.

2020

GPT-3 and formalization of few-shot ICL (Brown et al.)

breakthrough

Brown et al. introduce systematic terminology (zero-/one-/few-shot) and demonstrate that GPT-3 (175B) achieves competitive performance against fine-tuned models on dozens of NLP benchmarks, using ICL alone.

2022

Bayesian inference framework for ICL (Xie et al.)

Xie et al. propose a formal interpretation of ICL as Bayesian inference over a latent task concept, explaining why ICL works despite the absence of gradients.

2022

Induction heads as the mechanistic substrate of ICL (Olsson et al., Anthropic)

breakthrough

Anthropic identifies induction heads — attention heads forming during pretraining whose emergence correlates with a sharp jump in ICL ability. First mechanistic evidence of how ICL emerges in the transformer.

2022

Role of labels in ICL questioned (Min et al.)

Min et al. show that randomly replacing labels in demonstrations only marginally decreases ICL quality — suggesting that the model learns the format and the label space rather than the input→output mapping itself.

2023

ICL as implicit gradient descent (von Oswald et al.)

von Oswald et al. show, formally for linear self-attention and empirically for trained transformers, that a model in ICL mode can implement a gradient-descent step in activation space. This provides theoretical grounding for the mechanism.

2024

Many-shot ICL — hundreds/thousands of demonstrations (Agarwal et al., Google DeepMind)

breakthrough

With models supporting 1M+ tokens (Gemini 1.5, Claude 3), DeepMind shows that many-shot ICL (e.g. 1000+ demonstrations) can outperform fine-tuning on many tasks.

GPU Tensor Cores · PRIMARY

ICL is applied to a standard LLM, which runs most efficiently on GPUs with tensor cores for matrix multiplications in attention and feed-forward layers.

TPU · GOOD

TPUs are widely used for LLM inference. No special hardware requirements for ICL beyond the base model.

BUILT ON

LLM

A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.

Transformer

Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which allows every position in a sequence to directly participate in computations involving every other position, so the path between any two tokens has constant length (rather than growing linearly with distance, as in RNNs), making long-range dependencies easier to learn. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing: multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation — quadratic attention complexity with respect to sequence length (O(n²)) — is an active research direction (FlashAttention, sliding window, linear attention, SSM).

Self-Attention

Self-attention is a computational mechanism introduced in the Transformer architecture (Vaswani et al., 2017). For each token in the input sequence, it computes a contextual representation as a weighted sum of the values (V) of all tokens, where the weights come from a softmax over the scaled dot products between that token's query (Q) and the keys (K) of all tokens. This allows every token to directly attend to information from any other position in the sequence, regardless of distance, overcoming the limitations of recurrent neural networks in modeling long-range dependencies.

Pretraining

Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data — next tokens in text, masked words, future video frames, future robot states — without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge" — dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).


Connects

Prompt Engineering

Prompt Engineering is a set of techniques for precisely formulating text inputs (prompts) provided to language models, to guide their responses toward a desired format, style, level of detail, or correctness. Techniques include few-shot prompting (providing examples in context), zero-shot prompting (task without examples), role prompting (assigning a system role), chain-of-thought (requesting reasoning steps), format prompting (specifying output format), and many others. Prompt Engineering is particularly important when model fine-tuning is infeasible or uneconomical. Codification of techniques occurred mainly after GPT-3 (Brown et al., 2020), which demonstrated high sensitivity of performance to prompt formulation.

Induction Heads

Induction Heads are a two-head attention circuit discovered by Olsson et al. (2022) at Anthropic in two-layer attention-only models. The first head (a previous-token head) copies information about the preceding token into each position; the second (the induction head proper) uses this information to predict that after the pattern [A][B], a later occurrence of [A] should be followed by [B]. This mechanism implements a simple sequence-completion algorithm. The authors present strong evidence that induction heads develop at precisely the same point as the emergence of in-context learning ability, visible as a sharp drop in training loss. For small attention-only models, the evidence is causal; for larger models with MLPs, it is correlational. This work is foundational for the field of mechanistic interpretability.


ALTERNATIVE TO

SFT

Supervised Fine-Tuning (SFT) is a post-training stage in which a pre-trained language model is further optimized on a labeled set of (prompt, response) pairs. Each pair contains an instruction or question and a reference response written by a human or filtered automatically. The model minimizes cross-entropy loss on the response tokens. SFT is the first stage of the RLHF pipeline (Ouyang et al., 2022) and is critical for teaching the model to follow instructions. SFT alone can significantly improve model usability without requiring reinforcement learning. The method is used in InstructGPT, ChatGPT, Llama-2-Chat, and many other models.


Commonly used with

CoT

Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results. In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars. CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data. Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.

Instruction Tuning

Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions. The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and the model size. Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.

RLHF

Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017), and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages: 1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT. 2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = -E[log σ(r(x, y_w) - r(x, y_l))], where y_w is the preferred and y_l the rejected response. 3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β is critical to prevent reward hacking. During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory. A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections — generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.

Emergent Abilities

Emergent abilities of large language models is an observation, formalized by Wei et al. (2022), that certain LLM capabilities — such as multi-step reasoning, zero-shot instruction following, modular arithmetic, or answering questions in low-resource languages — do not appear gradually with scale but emerge discontinuously, only after crossing a threshold of parameter count, training data, or compute FLOPs. Below the threshold, performance is random or near-zero; above it, performance jumps abruptly to substantially better-than-random. The phenomenon has been documented across more than 130 tasks from the BIG-Bench benchmark and other suites (MMLU, TruthfulQA). Canonical examples include: Chain-of-Thought reasoning (~100B-parameter threshold for PaLM/GPT-3), InstructGPT-style instruction following, modular arithmetic, International Phonetic Alphabet transliteration, and multi-step question answering. In 2023, Schaeffer, Miranda, and Koyejo (NeurIPS 2023, "Are Emergent Abilities of Large Language Models a Mirage?") challenged emergence as a real fundamental phenomenon. They showed that non-linear or discontinuous evaluation metrics (e.g. exact-match accuracy) artificially create the appearance of a jump — replacing them with continuous metrics (token edit distance, log-likelihood) reveals a smooth, predictable scaling curve. This critique is now central to the debate: some abilities are emergent in a metric-dependent sense, while others (e.g. inductive reasoning) appear to show genuine phase discontinuities. The concept has critical practical significance: if emergence is real, certain abilities cannot be predicted or trained at smaller scale — forcing organizations to train large models "blindly." If emergence is a metric artifact, then scaling laws (Hoffmann et al., Chinchilla) are sufficient to predict the behavior of larger models.

RAG

Retrieval-Augmented Generation (RAG) was introduced by Lewis et al. (2020) as a general-purpose fine-tuning recipe combining pre-trained parametric memory (a seq2seq language model, specifically BART in the original paper) with non-parametric memory (a dense vector index of Wikipedia, accessed via Dense Passage Retrieval, DPR). In the original formulation, both the retriever and the generator are fine-tuned end-to-end: given an input query x, the retriever retrieves top-k documents z from the corpus, and the generator produces an output y conditioned on x and z. Two formulations were proposed: RAG-Sequence (the same retrieved documents condition the full output sequence) and RAG-Token (different documents may be used per generated token, marginalized during generation). In widespread contemporary usage (post-2022, with the growth of LLM applications), 'RAG' has expanded to describe a broader class of retrieve-then-generate pipelines, typically with a frozen LLM, a vector store containing pre-computed dense embeddings of document chunks, and a retrieval step that fetches top-k relevant chunks based on embedding similarity to the query. The retrieved chunks are appended to the prompt as context before the LLM generates a response. This non-trainable pipeline variant is technically distinct from the original Lewis et al. formulation but is the dominant practical interpretation of RAG as of 2023–2025. The canonical modern RAG pipeline consists of an offline indexing phase (document chunking, embedding computation, storage in a vector database) and an online query phase (query embedding, approximate nearest neighbor search, context-augmented generation). Key design decisions include: chunk size and overlap, embedding model choice, retrieval strategy (dense, sparse/BM25, or hybrid), number of retrieved documents k, and context integration method (prepend to prompt, cross-attention injection, or fusion-in-decoder). RAG addresses two fundamental limitations of parametric-only LLMs: the knowledge cutoff problem (inability to access post-training information) and hallucination (generation of factually incorrect content). However, RAG introduces its own failure modes, including retrieval of irrelevant or misleading context and the LLM's susceptibility to being distracted by retrieved content that contradicts its parametric knowledge.

ToT

Tree of Thoughts (ToT) is a reasoning framework for language models proposed by Yao et al. (NeurIPS 2023). It generalizes the Chain-of-Thought approach, allowing the model to explore multiple different reasoning paths instead of a single linear sequence. Each intermediate reasoning state ('thought') is assigned an evaluation of its promise by the LLM. The framework endows the model with the ability for deliberate decision-making, consideration of alternative reasoning branches, backtracking when a path is a dead end, and lookahead before making global choices. Experiments showed that ToT significantly improves LLM problem-solving capabilities on tasks requiring non-trivial planning or search (Game of 24: GPT-4+CoT 4% vs ToT 74%).

Self-Consistency

Self-Consistency (Wang et al., ICLR 2023) is a decoding strategy for language models using Chain-of-Thought. Instead of using greedy decoding (selecting the most probable token at each step), Self-Consistency samples multiple diverse reasoning paths and then selects the answer that appears most frequently across all paths (marginalization over reasoning paths). The method is based on the intuition that a complex reasoning problem typically admits multiple different ways of reaching the same correct answer. Experiments demonstrated impressive CoT performance gains: +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA (Google, ICLR 2023).

Reflexion

Reflexion (Shinn et al., NeurIPS 2023) is a framework for reinforcing LLM agents not through weight updates, but through linguistic feedback. Reflexion agents verbally reflect on task feedback signals, then maintain these verbal reflections in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible regarding types of feedback signals (scalar values or free-form language) and sources (external or internally simulated). The method achieved 91% pass@1 on the HumanEval benchmark (coding), surpassing the then-SOTA GPT-4 at 80%. The work aligns with the paradigm of agents learning from experience without expensive RL.

Reasoning model

A reasoning model (also: large reasoning model, LRM, reasoning language model, RLM) is a type of large language model that has been specifically post-trained to solve complex multi-step problems by explicitly generating intermediate reasoning steps before committing to a final response. Unlike standard LLMs that generate a direct response in a single forward pass, reasoning models allocate additional computation at inference time — a property known as test-time compute scaling — by producing a long internal chain of thought (CoT). The reasoning trace typically includes steps such as problem decomposition, hypothesis generation, self-verification, reflection, and correction of errors. The defining characteristics of reasoning models are: (1) post-training via large-scale reinforcement learning (RL) using reward signals based on final answer correctness (and sometimes intermediate step quality via process reward models); (2) the emergence of extended, often hidden, reasoning traces that precede the final answer; (3) a consistent empirical relationship between the length or computational budget allocated to the reasoning trace and final answer quality (test-time scaling law); (4) superior performance on verifiable tasks requiring multi-step logic, such as mathematics, competitive programming, and scientific reasoning. The term 'reasoning model' was introduced as a product category by OpenAI in September 2024 with the release of the o1-preview model. OpenAI described o1 as trained via a large-scale RL algorithm teaching the model to use chain of thought productively. The approach does not rely on explicit tree search algorithms; instead, implicit search emerges via RL-trained CoT generation. In January 2025, DeepSeek published the first detailed open technical description of this class of models in the DeepSeek-R1 paper (arXiv:2501.12948), demonstrating that reasoning capabilities can be incentivized via pure RL without supervised fine-tuning, using Group Relative Policy Optimization (GRPO) as the RL framework. Reasoning models typically employ the same base Transformer decoder architecture as standard LLMs, with the key difference residing entirely in the post-training pipeline: RL replaces or augments standard RLHF/SFT, and reward signals are grounded in verifiable outcomes. The resulting models generate substantially longer token sequences during inference (reasoning tokens), which are often hidden from end users but incur real compute costs. Performance consistently improves with both more training-time RL compute and more inference-time thinking budget.


Related AI models

TabPFN
