Robots Atlas
Inference

ICL (In-Context Learning)

2020 · Active · Updated: 6 May 2026 · Published
Key innovation
Demonstrating that a large language model can learn a new task at inference time, solely from a handful of examples (demonstrations) provided in the prompt, without weight updates or fine-tuning.
Category
Inference
Abstraction level
Pattern
Operation level
Inference
Use cases
Few-example text classification (sentiment, intent)
Machine translation between language pairs without fine-tuning
Data structuring: extracting JSON from text with 2–3 examples
Domain-specific question answering with few-shot examples
Style transfer and paraphrasing with demonstrations
Prompt engineering in LLM applications (LangChain, DSPy)
Foundation models for robotics: learning policies from in-prompt demonstrations (RT-2, VLA)
Chatbot personalization without changing model weights

How it works

1. Prompt construction: optional natural-language task instruction + k demonstration (input, output) pairs + the new query input. Demonstrations are separated by a delimiter (e.g. newline, '###', XML tag).
2. Tokenization and forward pass: the full prompt is fed as context to the transformer decoder. Self-attention lets every token attend to all preceding tokens, including the demonstrations.
3. Pattern induction: attention layers (particularly induction heads, Olsson et al. 2022) detect [token A → token B] patterns in the demonstrations and propagate them to the new input. This has been analyzed as implicit gradient descent in activation space.
4. Output generation: the model autoregressively produces answer tokens, continuing the pattern established by the demonstrations.
5. No weight updates: unlike fine-tuning, no gradients are computed or backpropagated. All "learning" happens in the activations of a single forward pass.
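Step 1 above can be sketched in a few lines of Python. The `Input:`/`Output:` labels and the `###` separator are illustrative choices, not a fixed standard:

```python
# Sketch of few-shot prompt construction: instruction + k demonstrations + query.
# Labels and separator are assumptions for illustration, not a required format.

def build_prompt(instruction, demonstrations, query, sep="\n###\n"):
    """Assemble a few-shot prompt from (input, output) demonstration pairs."""
    parts = []
    if instruction:
        parts.append(instruction)
    for x, y in demonstrations:
        parts.append(f"Input: {x}\nOutput: {y}")
    # The query uses the same format but leaves the output slot empty.
    parts.append(f"Input: {query}\nOutput:")
    return sep.join(parts)

demos = [("The movie was great", "positive"),
         ("Terrible service", "negative")]
prompt = build_prompt("Classify the sentiment.", demos, "I loved it")
print(prompt)
```

The query deliberately reuses the demonstration template and ends at `Output:`, so the model's continuation is the answer.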

Problem solved

Traditional supervised learning requires a training set for every new task, model fine-tuning (a separate copy of the weights), and training infrastructure. This prevents rapid adaptation to new tasks and blocks scaling to thousands of domains. ICL removes this bottleneck: a single frozen LLM can perform a wide range of tasks specified directly in the prompt, without training and without per-task weight copies.

Components

Task instruction · Specifies the task for the model

Optional natural-language task description preceding the demonstrations. In instruction-tuned models (GPT-3.5+, Claude), the instruction alone is often sufficient (zero-shot ICL).


Demonstrations (shots) · Conditioning the model on the task pattern

(input, output) pairs illustrating the expected model behavior. The number of demonstrations k defines the variant: zero-shot (k=0), one-shot (k=1), few-shot (k=2–32). Demonstrations must fit within the model's context window.

Zero-shot · No demonstrations, only a natural-language instruction.
One-shot · A single demonstration before the query.
Few-shot · Typically 4–8 demonstrations; the standard regime from the GPT-3 paper.
Many-shot · Hundreds/thousands of demonstrations in long context windows (Agarwal et al. 2024, Google DeepMind).
Query input · Application point of the learned pattern

The actual input for which the model should generate an answer. It must follow the same format as the demonstration inputs so that the model recognizes the pattern.

Induction heads · Mechanistic substrate of in-context learning

Specific attention heads in transformer layers ≥ 2 that learn to recognize the [A][B] … [A] → [B] pattern during pretraining. Olsson et al. (2022, Anthropic) showed that induction heads are the mechanistic substrate of ICL: their formation correlates with the ICL emergence phase during training.
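A toy illustration of that pattern: the function below predicts the next token by finding the most recent earlier occurrence of the current token and copying its successor. This is a hard-coded caricature; real induction heads implement a soft, learned version of this lookup inside attention:

```python
# Toy version of the induction pattern [A][B] ... [A] -> [B]:
# to predict the next token, find the previous occurrence of the current
# token and copy whatever followed it.

def induction_predict(tokens):
    """Predict the next token by matching the last token against history."""
    last = tokens[-1]
    # Scan earlier positions for the same token; the most recent match wins.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence: the pattern gives no prediction

seq = ["A", "B", "C", "D", "A"]
print(induction_predict(seq))  # -> "B": the token that followed the earlier "A"
```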

Implementation

Implementation pitfalls
Sensitivity to demonstration order · High

Lu et al. (2022) showed that the same demonstration set in different orders yields results differing by 20–30 accuracy percentage points. Some permutations perform worse than the random baseline.

Fix: Average results across several permutations or use ordering heuristics (from least to most similar to the query).
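The permutation fix can be sketched as a majority vote over several orderings. `fake_classify` below is a deterministic stand-in for a real model call, used only so the sketch runs without an LLM:

```python
# Sketch of the permutation-averaging fix: run the "model" on several
# demonstration orderings and majority-vote the predicted label.
import itertools
from collections import Counter

def vote_over_permutations(demos, query, classify, max_perms=6):
    """Classify the query under several demo orderings; return majority label."""
    votes = []
    for perm in itertools.islice(itertools.permutations(demos), max_perms):
        votes.append(classify(list(perm), query))
    return Counter(votes).most_common(1)[0][0]

def fake_classify(demos, query):
    # Hypothetical stub standing in for an actual LLM call.
    return "positive" if "love" in query else "negative"

label = vote_over_permutations(
    [("great", "positive"), ("awful", "negative"), ("fine", "positive")],
    "I love it", fake_classify)
print(label)  # "positive" with the stub above
```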
Recency bias: model favors the last demonstrations · Medium

Models tend to focus mainly on the final demonstrations in the prompt, ignoring earlier information. Particularly problematic in many-shot ICL.

Fix: Place key demonstrations at the end of the list; for classification tasks, balance the order of labels.
Majority label bias · High

If demonstrations are imbalanced (e.g. 6/8 labeled "positive"), the model will systematically predict the dominant label for new queries.

Fix: Balance labels in demonstrations (e.g. 4 per class). Apply output calibration (Zhao et al. 2021).
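A minimal sketch of calibration in the spirit of Zhao et al. (2021): measure the model's label distribution on a content-free input such as "N/A", divide that bias out of the query's label probabilities, and renormalize. The probability values below are made up for illustration, not real model outputs:

```python
# Contextual-calibration sketch: divide out the label bias measured on a
# content-free input, then renormalize the scores into a distribution.

def calibrate(p_query, p_content_free):
    """Correct query label probabilities by the content-free bias."""
    scores = {lbl: p_query[lbl] / p_content_free[lbl] for lbl in p_query}
    total = sum(scores.values())
    return {lbl: s / total for lbl, s in scores.items()}

# Illustrative numbers: the model is skewed toward "positive" because the
# demonstrations were imbalanced.
p_cf = {"positive": 0.8, "negative": 0.2}   # P(label | "N/A")
p_q  = {"positive": 0.6, "negative": 0.4}   # P(label | actual query)

calibrated = calibrate(p_q, p_cf)
print(calibrated)  # the prediction flips to "negative" after calibration
```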
Format mismatch between demonstrations and query · Medium

Subtle differences in format (e.g. space before the answer, period at the end of the input) between demonstrations and the query can drastically reduce ICL quality.

Fix: Normalize the format programmatically. Always test prompts using exactly the same separator for demonstrations and the query.
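One way to normalize programmatically is to render every demonstration and the query through a single template, so whitespace and separators cannot drift. A sketch, with the `Q:`/`A:` labels as an assumed convention:

```python
# Render demonstrations and the query through one template so that stray
# whitespace or inconsistent separators cannot creep in.

def render(pair_input, pair_output=None):
    """Render one example; the query (output=None) leaves the answer slot empty."""
    x = pair_input.strip()
    if pair_output is None:
        return f"Q: {x}\nA:"  # query: no trailing space after "A:"
    return f"Q: {x}\nA: {pair_output.strip()}"

demos = [(" What is 2+2? ", "4"), ("Capital of France?", "Paris ")]
prompt = "\n\n".join([render(x, y) for x, y in demos] + [render("Largest planet?")])
print(prompt)
```

Because both demonstrations and the query go through `render`, stripping and labeling are applied uniformly.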
Test data leakage into demonstrations · High

It is easy to accidentally include examples from the test split in demonstrations. This results in inflated benchmark scores.

Fix: Strictly separate the demonstration pool from the test set. Audit all demonstrations before evaluation.

Evolution

Original paper · 2020 · NeurIPS 2020 (Best Paper Award) · Tom B. Brown
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
2019
GPT-2 โ€” first observations of zero-shot transfer (Radford et al.)

Radford et al. show that GPT-2 (1.5B parameters) can perform NLP tasks without fine-tuning when prompts are appropriately framed. Precursor to full ICL.

2020
GPT-3 and formalization of few-shot ICL (Brown et al.)
Inflection point

Brown et al. introduce systematic terminology (zero-/one-/few-shot) and demonstrate that GPT-3 (175B) achieves competitive performance against fine-tuned models on dozens of NLP benchmarks, using ICL alone.

2022
Bayesian inference framework for ICL (Xie et al.)

Xie et al. propose a formal interpretation of ICL as Bayesian inference over a latent task concept, explaining why ICL works despite the absence of gradients.

2022
Induction heads as the mechanistic substrate of ICL (Olsson et al., Anthropic)
Inflection point

Anthropic identifies induction heads: attention heads that form during pretraining and whose emergence correlates with a sharp jump in ICL ability. First mechanistic evidence of how ICL emerges in the transformer.

2022
Role of labels in ICL questioned (Min et al.)

Min et al. show that randomly replacing labels in demonstrations only marginally decreases ICL quality, suggesting that the model learns the format and label space rather than the input → output mapping itself.

2023
ICL as implicit gradient descent (von Oswald et al.)

von Oswald et al. show, formally for simplified (linear) attention and empirically, that a transformer performing ICL can implement gradient-descent steps in activation space. This provides theoretical grounding for the mechanism.

2024
Many-shot ICL: hundreds/thousands of demonstrations (Agarwal et al., Google DeepMind)
Inflection point

With models supporting 1M+ tokens (Gemini 1.5, Claude 3), DeepMind shows that many-shot ICL (e.g. 1000+ demonstrations) can outperform fine-tuning on many tasks.

Technical details

Hyperparameters (configurable axes)

Number of shots (k) · Critical

Number of (input, output) pairs in the prompt. Affects both quality and inference cost (context length).

0 · Zero-shot: instruction only, no examples.
4–8 · Standard few-shot range from the GPT-3 paper.
32 · Upper bound used in Brown et al. (2020) benchmarks.
100–1000+ · Many-shot ICL in long contexts (Gemini 1.5 Pro, Claude 3).
Demonstration order · High

The order in which demonstrations appear in the prompt. Empirically, ICL quality is strongly permutation-dependent (Lu et al. 2022).

random · Random ordering: high variance in results.
similarity-ranked · Demonstrations ordered by similarity to the query.
Demonstration selection strategy · High

How demonstrations are selected from a candidate pool. Static (fixed pool) vs. dynamic (retrieval-based, e.g. KATE, k-nearest demonstrations).

static · The same demonstrations for all queries.
kNN retrieval (KATE) · Demonstrations most semantically similar to the query (Liu et al. 2022).
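A minimal sketch of retrieval-based selection in the spirit of KATE: rank a candidate pool by cosine similarity to the query embedding and keep the top k. The 2-d vectors below are toy stand-ins for real sentence embeddings:

```python
# kNN demonstration selection sketch: rank pool examples by cosine
# similarity of their (toy) embeddings to the query embedding.
import math

def cosine(u, v):
    """Cosine similarity of two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def select_demos(pool, query_vec, k=2):
    """Return the k (input, output) pairs whose embeddings best match the query."""
    ranked = sorted(pool, key=lambda item: cosine(item[2], query_vec), reverse=True)
    return [(x, y) for x, y, _ in ranked[:k]]

# Pool entries: (input text, label, toy embedding).
pool = [("great movie", "positive", (0.9, 0.1)),
        ("bad plot",    "negative", (0.1, 0.9)),
        ("loved it",    "positive", (0.8, 0.2))]
print(select_demos(pool, (0.85, 0.15), k=2))
```

In practice the toy vectors would be replaced by embeddings from a sentence-encoder model, and the selected pairs fed into the prompt in place of a static demonstration set.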
Demonstration format · Medium

Convention for separating input/output fields (e.g. 'Q:/A:', '###', XML tags). Affects how well the model recognizes the pattern.

'Q: ... A: ...' · Classic format from the GPT-3 paper.
XML tags ('<input>...</input>') · Preferred for Claude and structured outputs.

Computational complexity

Time complexity: O((k·L_demo + L_query)² · d). Space complexity: O((k·L_demo + L_query) · d) for the KV cache.
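A quick back-of-envelope check of how the quadratic term grows with the shot count, using illustrative (assumed) token lengths:

```python
# Attention cost ~ (k*L_demo + L_query)^2, ignoring the constant d factor.
# The token lengths below are illustrative assumptions.
L_demo, L_query = 50, 30

def attn_cost(k):
    """Relative self-attention cost for k demonstrations."""
    n = k * L_demo + L_query
    return n * n

for k in (0, 8, 64):
    print(k, attn_cost(k))
# Going from 8 to 64 shots (8x more demonstrations) multiplies the
# attention cost by roughly (64*50+30)^2 / (8*50+30)^2, i.e. about 56x.
```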

Compute bottleneck

Quadratic self-attention over demonstrations

Self-attention scales as O(N²) in prompt length N. With k demonstrations and long inputs, cost grows quickly, particularly in many-shot ICL.

Depends on
Number of demonstrations (k)
Length of a single demonstration

Execution paradigm

Primary mode
dense

ICL is a prompting technique applied to a standard dense Transformer at inference time. All parameters are active; no conditional routing.

Activation pattern
all_paths_active
Routing mechanism
none (no conditional routing)

Parallelism

Parallelism level
sequential

Prefill of demonstrations can be fully parallel (one forward pass over the whole prompt). Answer generation is sequential, as in any transformer decoder.

Scope
inference
Constraints
Output tokens are generated sequentially; each depends on all preceding tokens.

Hardware requirements

Primary

ICL is applied to a standard LLM, which runs most efficiently on GPUs with tensor cores for matrix multiplications in attention and feed-forward layers.

Good fit

TPUs are widely used for LLM inference. No special hardware requirements for ICL beyond the base model.