In-Context Learning (ICL)
How it works
1. Prompt construction: an optional natural-language task instruction + k demonstration (input, output) pairs + the new query input. Demonstrations are separated by a delimiter (e.g. a newline, '###', or an XML tag).
2. Tokenization and forward pass: the full prompt is fed as context to the transformer decoder. Self-attention lets every token attend to all preceding tokens, including the demonstrations.
3. Pattern induction: attention layers (particularly induction heads, Olsson et al. 2022) detect [token A → token B] patterns in the demonstrations and propagate them to the new input. This has been argued to be analogous to implicit gradient descent in activation space.
4. Output generation: the model autoregressively produces answer tokens, continuing the pattern established by the demonstrations.
5. No weight updates: unlike fine-tuning, no gradients are computed or backpropagated. All "learning" happens entirely in the activations of a single forward pass.
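A minimal sketch of step 1. The template, separator, and demonstration set below are illustrative choices, not a canonical format; the resulting string would be passed to whatever LLM completion API is available.

```python
# Minimal few-shot prompt construction (step 1 above).

def build_prompt(instruction, demonstrations, query, sep="\n###\n"):
    """Assemble instruction + k demonstrations + query into one prompt."""
    blocks = []
    if instruction:
        blocks.append(instruction)
    for x, y in demonstrations:
        blocks.append(f"Input: {x}\nOutput: {y}")
    # The query uses the same template as the demonstrations,
    # with the output left blank for the model to continue.
    blocks.append(f"Input: {query}\nOutput:")
    return sep.join(blocks)

demos = [
    ("The movie was wonderful.", "positive"),
    ("A dull, lifeless film.", "negative"),
]
prompt = build_prompt("Classify the sentiment of each review.",
                      demos, "An instant classic.")
print(prompt)  # feed this to the frozen model; no weights are updated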
Problem solved
Traditional supervised learning requires a labeled training set for every new task, a fine-tuning run that produces a separate copy of the weights, and training infrastructure. This prevents rapid adaptation to new tasks and blocks scaling to thousands of domains. ICL removes these requirements: a single frozen LLM can perform any task that can be specified in the prompt, with no training and no per-task weight copies.
Components
Task instruction: an optional natural-language task description preceding the demonstrations. In instruction-tuned models (GPT-3.5+, Claude), the instruction alone is often sufficient (zero-shot ICL).
Demonstrations: (input, output) pairs illustrating the expected model behavior. The number of demonstrations k defines the variant: zero-shot (k=0), one-shot (k=1), few-shot (k=2-32). Demonstrations must fit within the model's context window.
Query: the actual input for which the model should generate an answer. It must follow the same format as the demonstration inputs so that the model recognizes the pattern.
Induction heads: specific attention heads in transformer layers ≥ 2 that learn during pretraining to recognize the [A][B] ... [A] → [B] pattern. Olsson et al. (2022, Anthropic) showed that induction heads are the mechanistic substrate of ICL: their formation correlates with the ICL emergence phase during training.
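A toy illustration of the completion rule induction heads exploit: to predict the next token, find the most recent earlier occurrence of the current token and copy what followed it. This is plain Python lookup for intuition only, not actual attention arithmetic.

```python
# Toy model of the [A][B] ... [A] -> [B] rule that induction heads
# implement inside real transformers.

def induction_step(tokens):
    """Predict the next token by finding the most recent earlier
    occurrence of the last token and copying what followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # the B that followed A last time
    return None  # no earlier A: the rule makes no prediction

seq = ["one", "two", "three", "one", "two"]
print(induction_step(seq))  # -> "three"
```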
Implementation pitfalls
Order sensitivity: Lu et al. (2022) showed that the same demonstration set in different orders yields results differing by 20-30 accuracy percentage points. Some permutations perform worse than the random baseline (see the sketch after this list).
Recency bias: models tend to weight the final demonstrations in the prompt most heavily, ignoring earlier information. Particularly problematic in many-shot ICL.
Majority label bias: if the demonstration labels are imbalanced (e.g. 6 of 8 labeled "positive"), the model will systematically over-predict the dominant label for new queries.
Format sensitivity: subtle formatting differences between demonstrations and the query (e.g. a space before the answer, a period at the end of the input) can drastically reduce ICL quality.
Test-set contamination: it is easy to accidentally include examples from the test split among the demonstrations, which inflates benchmark scores.
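A minimal sketch of a permutation audit in the spirit of Lu et al. (2022). The scorer here is a toy stand-in so the snippet runs; a real score_fn would build a prompt from the ordered demonstrations, query the model on a held-out validation set, and return accuracy.

```python
from itertools import permutations
from statistics import mean

def order_sensitivity(demos, score_fn):
    """Score every ordering of the demonstration set and report the spread.
    Feasible only for small k, since there are k! orderings."""
    scores = [score_fn(list(p)) for p in permutations(demos)]
    return min(scores), mean(scores), max(scores)

# Toy stand-in scorer (deterministic, for illustration only).
def toy_score(ordered_demos):
    return sum(i * len(x) for i, (x, _) in enumerate(ordered_demos)) % 50 / 100

demos = [("great film", "pos"), ("boring mess", "neg"),
         ("loved it", "pos"), ("truly awful", "neg")]
worst, avg, best = order_sensitivity(demos, toy_score)
print(f"worst={worst:.2f} mean={avg:.2f} best={best:.2f}")
```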
Evolution
Radford et al. (2019) show that GPT-2 (1.5B parameters) can perform NLP tasks without fine-tuning when prompts are appropriately framed. A precursor to full ICL.
Brown et al. (2020) introduce the systematic terminology (zero-/one-/few-shot) and demonstrate that GPT-3 (175B) achieves competitive performance against fine-tuned models on dozens of NLP benchmarks, using ICL alone.
Xie et al. (2021) propose a formal interpretation of ICL as Bayesian inference over a latent task concept, explaining why ICL can work despite the absence of gradient updates.
Anthropic (Olsson et al. 2022) identifies induction heads: attention heads forming during pretraining whose emergence correlates with a sharp jump in ICL ability. The first mechanistic evidence of how ICL emerges in the transformer.
Min et al. (2022) show that randomly replacing the labels in demonstrations only marginally degrades ICL quality, suggesting that the model learns the format and label space rather than the input-output mapping itself.
von Oswald et al. (2023) show formally (for linear self-attention) and empirically that a transformer performing ICL can implement gradient-descent steps within its forward pass. This provides theoretical grounding for the implicit-gradient-descent view of the mechanism.
With models supporting 1M+ token contexts (Gemini 1.5, Claude 3), Agarwal et al. (2024, Google DeepMind) show that many-shot ICL (hundreds to thousands of demonstrations) can match or outperform fine-tuning on many tasks.
Technical details
Hyperparameters (configurable axes)
Number of shots (k): the number of (input, output) pairs in the prompt. Affects both quality and inference cost (context length).
Demonstration order: the order in which demonstrations appear in the prompt. Empirically, ICL quality is strongly permutation-dependent (Lu et al. 2022).
Selection strategy: how demonstrations are selected from a candidate pool. Static (fixed pool) vs. dynamic (retrieval-based, e.g. KATE: k-nearest demonstrations by embedding similarity; see the sketch after this list).
Formatting template: the convention for separating input/output fields (e.g. 'Q:/A:', '###', XML tags). Affects how reliably the model recognizes the pattern.
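A minimal sketch of KATE-style dynamic selection. The embed() function is a placeholder, not a specific library API; any sentence encoder could back it (random vectors are used here only so the snippet runs end to end).

```python
import numpy as np

# Placeholder embedding function: returns one vector per text.
def embed(texts, dim=64):
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), dim))

def select_demonstrations(query, pool, k=4):
    """KATE-style dynamic selection: return the k pool examples whose
    inputs lie nearest to the query in embedding space (cosine)."""
    inputs = [x for x, _ in pool]
    E = embed(inputs + [query])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize
    sims = E[:-1] @ E[-1]        # cosine similarity of each input to query
    nearest = np.argsort(-sims)[:k]
    return [pool[i] for i in nearest]

pool = [(f"example input {i}", f"label {i % 2}") for i in range(20)]
print(select_demonstrations("a brand-new query", pool, k=4))
```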
Computational complexity
Time complexity: O((k·L_demo + L_query)² · d) for self-attention. Space complexity: O((k·L_demo + L_query) · d) for the KV cache.
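A back-of-the-envelope sketch of how the quadratic term grows with the shot count; the lengths and model width below are made-up example values, not measurements.

```python
# Rough scaling of the O(N^2 * d) attention term with shot count k.
d = 4096                     # model width (illustrative)
L_demo, L_query = 120, 60    # tokens per demonstration / per query

base = L_query**2 * d        # zero-shot cost as the reference point
for k in (0, 4, 32, 256):    # zero-, few-, and many-shot regimes
    N = k * L_demo + L_query # total prompt length in tokens
    print(f"k={k:4d}  N={N:6d}  relative attention cost={N**2 * d / base:10.1f}x")
```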
Compute bottleneck
Self-attention scales as O(N²) in prompt length. With k demonstrations and long inputs, cost grows quickly, particularly in many-shot ICL.
Execution paradigm
ICL is a prompting technique applied to a standard dense Transformer at inference time. All parameters are active; no conditional routing.
Parallelism
Prefill of demonstrations can be fully parallel (one forward pass over the whole prompt). Answer generation is sequential, as in any transformer decoder.
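A minimal numpy sketch of this prefill/decode split, assuming a single attention head and ignoring the rest of the transformer (projections, layers, normalization are reduced to one toy head). Prefill handles all prompt positions in one batched matrix product; decode appends one token at a time to a KV cache.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def prefill(X):
    """Process the whole prompt (demonstrations + query) in parallel:
    one matmul per projection, causal attention over all positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    mask = np.tril(np.ones((len(X), len(X))))
    scores = np.where(mask == 1, Q @ K.T / np.sqrt(d), -np.inf)
    return softmax(scores) @ V, (K, V)   # outputs + KV cache

def decode_step(x, cache):
    """Generate one position sequentially, reusing the cached keys/values."""
    K, V = cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K, V = np.vstack([K, k]), np.vstack([V, v])
    return softmax(q @ K.T / np.sqrt(d)) @ V, (K, V)

prompt = rng.normal(size=(16, d))        # 16 prompt-token embeddings
_, cache = prefill(prompt)               # parallel over all 16 positions
new_tok = rng.normal(size=(1, d))
out, cache = decode_step(new_tok, cache) # sequential, one position at a time
```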
Hardware requirements
ICL is applied to a standard LLM, which runs most efficiently on GPUs with tensor cores for matrix multiplications in attention and feed-forward layers.
TPUs are also widely used for LLM inference. ICL itself imposes no hardware requirements beyond those of the base model.