Robots Atlas

In-Context Learning

The ability of a large language model to learn a new task at inference time — solely from a handful of examples (demonstrations) provided in the prompt — without weight updates or fine-tuning.

Category
Abstraction level
Operation level
  • Few-example text classification (sentiment, intent)
  • Machine translation between language pairs without fine-tuning
  • Data structuring: extracting JSON from text with 2–3 examples
  • Domain-specific question answering with few-shot examples
  • Style transfer and paraphrasing with demonstrations
  • Prompt engineering in LLM applications (LangChain, DSPy)
  • Foundation models for robotics — learning policies from in-prompt demonstrations (RT-2, VLA)
  • Chatbot personalization without changing model weights

1. Prompt construction: an optional natural-language task instruction + k demonstration (input, output) pairs + the new query input. Demonstrations are separated by a delimiter (e.g. newline, '###', XML tag); see the prompt-construction sketch below.
2. Tokenization and forward pass: the full prompt is fed as context to the transformer decoder. Self-attention lets every token attend to all preceding tokens, including the demonstrations.
3. Pattern induction: attention layers (particularly induction heads, Olsson et al. 2022) detect [token A → token B] patterns in the demonstrations and propagate them to the new input. This is analogous to implicit gradient descent in activation space.
4. Output generation: the model autoregressively produces answer tokens, continuing the pattern established by the demonstrations.
5. No weight updates: unlike fine-tuning, no gradients are computed or backpropagated. All "learning" happens entirely in the activations of a single forward pass.
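A minimal sketch of step 1 (prompt construction) in Python. The separator, field labels, and example demonstrations are illustrative choices rather than a prescribed format; the resulting string would be passed unchanged to whatever frozen LLM is in use.

```python
# Few-shot prompt assembly: instruction + k demonstrations + open-ended query.
# The separator and "Input:/Output:" labels are arbitrary but must be consistent.

SEPARATOR = "\n###\n"

def build_icl_prompt(instruction, demonstrations, query):
    """Assemble an in-context learning prompt from its three parts."""
    parts = []
    if instruction:
        parts.append(instruction)
    for x, y in demonstrations:
        parts.append(f"Input: {x}\nOutput: {y}")
    # The query repeats the demonstration format but leaves the output open,
    # so the model continues the induced pattern.
    parts.append(f"Input: {query}\nOutput:")
    return SEPARATOR.join(parts)

demos = [
    ("The plot was dull and predictable.", "negative"),
    ("A warm, beautifully acted film.", "positive"),
]
prompt = build_icl_prompt(
    instruction="Classify the sentiment of each review as positive or negative.",
    demonstrations=demos,
    query="I couldn't stop smiling the whole time.",
)
print(prompt)  # sent as-is to a frozen LLM; no weights are updated
```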

Traditional supervised learning requires a training set for every new task, model fine-tuning (a separate copy of the weights), and training infrastructure. This hinders rapid adaptation to new tasks and blocks scaling to thousands of domains. ICL removes this bottleneck: a single frozen LLM can perform any task that can be specified in the prompt, without additional training and without extra weight copies.

01

Task instruction

Specifies the task for the model

Modular

Optional natural-language task description preceding the demonstrations. In instruction-tuned models (GPT-3.5+, Claude), the instruction alone is often sufficient (zero-shot ICL).

02

Demonstrations (shots)

Conditioning the model on the task pattern

(input, output) pairs illustrating the expected model behavior. The number of demonstrations k defines the variant: zero-shot (k=0), one-shot (k=1), few-shot (k=2–32). Demonstrations must fit within the model's context window.

Zero-shot · One-shot · Few-shot · Many-shot
03

Query input

Application point of the learned pattern

The actual input for which the model should generate an answer. It must follow the same format as the demonstration inputs so that the model recognizes the pattern.

04

Induction heads

Mechanistic substrate of in-context learning

Specific attention heads in transformer layers ≥2 that learn to recognize the [A][B] ... [A] → [B] pattern during pretraining. Olsson et al. (2022, Anthropic) showed that induction heads are the mechanistic substrate of ICL — their formation correlates with the ICL emergence phase during training.
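A toy illustration of the [A][B] ... [A] → [B] completion rule attributed to induction heads, written as plain Python over a token list. It mimics the behavior only; it is not an attention-head implementation.

```python
# Induction pattern: when the current token A occurred earlier, predict the
# token B that followed it there ([A][B] ... [A] -> [B]).

def induction_completion(tokens):
    """Predict the next token by prefix-matching on the last token."""
    last = tokens[-1]
    # Scan backwards for an earlier occurrence of the same token
    # and copy the token that followed it.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence, so no induction-based prediction

seq = ["Mr", "Dursley", "was", "thinking", "about", "Mr"]
print(induction_completion(seq))  # -> "Dursley"
```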

Time

k = number of demonstrations, L_demo = average demonstration length in tokens, L_query = query length, d = model dimension. The prompt contains N = k·L_demo + L_query tokens, and complexity is dominated by self-attention over the entire prompt, roughly O(N²·d) per layer.

Inference cost grows quadratically with the number of demonstrations under classical self-attention. Long-context mechanisms help on the memory side: FlashAttention avoids materializing the N×N attention matrix (linear memory, still quadratic compute), and ring attention shards the computation across devices.

Memory complexity

The number of prompt tokens determines KV-cache size and context-window usage.

Many-shot ICL requires long-context models (≥128k tokens). Standard few-shot fits in 4–32k.
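A back-of-the-envelope sketch relating the number of shots to prompt length, attention cost, and KV-cache size. The layer count, model dimension, fp16 storage, and per-demonstration lengths below are illustrative assumptions, not any specific model's configuration.

```python
# Rough cost of a few-shot prompt: N tokens, O(N^2) attention pairs per layer,
# and a KV cache of 2 * layers * N * d_model values (fp16, no GQA assumed).

def prompt_length(k, l_demo, l_query):
    return k * l_demo + l_query  # N = k * L_demo + L_query

def attention_pairs(n):
    return float(n) * n  # token pairs touched by self-attention per layer

def kv_cache_bytes(n, n_layers=32, d_model=4096, bytes_per_elem=2):
    # One key and one value vector of size d_model per token, per layer.
    return 2 * n_layers * n * d_model * bytes_per_elem

for k in (8, 64, 512):
    n = prompt_length(k, l_demo=120, l_query=60)
    print(f"k={k:4d}  N={n:6d} tokens  "
          f"attention pairs/layer={attention_pairs(n):.2e}  "
          f"KV cache={kv_cache_bytes(n) / 1e9:.2f} GB")
```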

Bottleneck: Quadratic self-attention over demonstrations

Self-attention scales as O(N²) in prompt length. With k demonstrations and long inputs, cost grows quickly — particularly in many-shot ICL.

Parallelism

Sequential

Prefill of demonstrations can be fully parallel (one forward pass over the whole prompt). Answer generation is sequential, as in any transformer decoder.

Paradigm

Dense

All paths active

ICL is a prompting technique applied to a standard dense Transformer at inference time. All parameters are active; no conditional routing.

Number of shots (k)

Critical
  • 0: Zero-shot — instruction only, no examples.
  • 4–8: Standard few-shot range from the GPT-3 paper.
  • 32: Upper bound used in Brown et al. (2020) benchmarks.
  • 100–1000+: Many-shot ICL in long contexts (Gemini 1.5 Pro, Claude 3).

Number of (input, output) pairs in the prompt. Affects both quality and inference cost (context length).

Demonstration order

Standard
  • random: Random ordering — high variance in results.
  • similarity-ranked: Demonstrations ordered by similarity to the query.

The order in which demonstrations appear in the prompt. Empirically, ICL quality is strongly permutation-dependent (Lu et al. 2022).

Demonstration selection strategy

Standard
  • static: The same demonstrations for all queries.
  • kNN retrieval (KATE): Demonstrations most semantically similar to the query (Liu et al. 2022).

How demonstrations are selected from a candidate pool. Static (fixed pool) vs. dynamic (retrieval-based, e.g. KATE — k-nearest demonstrations).
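A sketch of dynamic, KATE-style selection: embed the candidate inputs and the query, then keep the k nearest candidates. The `embed` function below is a hashing placeholder so the snippet runs stand-alone; in practice it would be replaced by a real sentence encoder.

```python
# KATE-style demonstration retrieval: pick the k candidates whose inputs are
# most similar to the query in embedding space.

import numpy as np

def embed(texts, dim=64):
    """Placeholder bag-of-words hashing embedder; swap in a real encoder."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            out[i, hash(tok) % dim] += 1.0
    return out

def select_demonstrations(query, candidates, k=4):
    """Return the k (input, output) pairs whose inputs are closest to the query."""
    vecs = embed([x for x, _ in candidates] + [query])
    cand, q = vecs[:-1], vecs[-1]
    sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]          # indices of the most similar candidates
    return [candidates[i] for i in top]

pool = [
    ("Refund my order, it never arrived.", "complaint"),
    ("What are your opening hours?", "question"),
    ("Great service, thank you!", "praise"),
    ("The package came damaged.", "complaint"),
]
print(select_demonstrations("The package I received was damaged.", pool, k=2))
```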

Demonstration format

Standard
  • 'Q: ... A: ...': Classic format from the GPT-3 paper.
  • XML tags ('<input>...</input>'): Preferred for Claude and structured outputs.

Convention for separating input/output fields (e.g. 'Q:/A:', '###', XML tags). Affects how well the model recognizes the pattern.
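A small illustration of the two conventions with made-up content; the exact strings are only examples, and the key point is that the query must reuse the demonstration template verbatim, with the answer slot left open.

```python
# Two common demonstration templates. Demonstrations and the query must share
# one template exactly.

def qa_demo(x, y):
    """Classic GPT-3 style Q/A demonstration."""
    return f"Q: {x}\nA: {y}"

def xml_demo(x, y):
    """XML-tagged demonstration, often used for structured outputs."""
    return f"<input>{x}</input>\n<output>{y}</output>"

print(qa_demo("Translate 'chien' into English.", "dog"))
print("Q: Translate 'chat' into English.\nA:")  # query in the matching format
```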

Common pitfalls

Sensitivity to demonstration order
HIGH

Lu et al. (2022) showed that the same demonstration set in different orders yields results differing by 20–30 accuracy percentage points. Some permutations perform worse than the random baseline.

Average results across several permutations or use sorting heuristics (from least to most similar to the query).
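A sketch of the permutation-averaging mitigation. `accuracy_with_order` is a placeholder for an actual evaluation run (build the prompt in the given order, query the model on a dev set, score the outputs).

```python
# Estimate ICL accuracy as the mean over several demonstration orderings,
# and report the spread to expose order sensitivity.

import itertools
import random
import statistics

def accuracy_with_order(demonstrations):
    # Placeholder score so the sketch runs; replace with a real evaluation loop.
    random.seed(hash(tuple(demonstrations)) % (2**32))
    return random.uniform(0.55, 0.85)

def order_robust_estimate(demonstrations, max_perms=6):
    perms = list(itertools.permutations(demonstrations))
    random.shuffle(perms)
    scores = [accuracy_with_order(p) for p in perms[:max_perms]]
    return statistics.mean(scores), statistics.pstdev(scores)

demos = [("great", "positive"), ("awful", "negative"),
         ("fine", "positive"), ("broken", "negative")]
mean_acc, spread = order_robust_estimate(demos)
print(f"mean accuracy {mean_acc:.3f} +/- {spread:.3f} across sampled orders")
```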

Recency bias — model favors the last demonstrations
MEDIUM

Models tend to focus mainly on the final demonstrations in the prompt, ignoring earlier information. Particularly problematic in many-shot ICL.

Place key demonstrations at the end of the list; for classification tasks, balance the order of labels.

Majority label bias
HIGH

If demonstrations are imbalanced (e.g. 6/8 labeled "positive"), the model will systematically predict the dominant label for new queries.

Balance labels in demonstrations (e.g. 4 per class). Apply output calibration (Zhao et al. 2021).
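A sketch of the calibration step in the spirit of Zhao et al. (2021): estimate the model's label bias from a content-free input (e.g. "N/A") and rescale predictions by it. The probability vectors below are hard-coded stand-ins for actual model outputs.

```python
# Contextual calibration: divide predicted label probabilities by the
# probabilities the model assigns to a content-free query, then renormalize.

import numpy as np

labels = ["negative", "positive"]

p_content_free = np.array([0.30, 0.70])  # bias toward "positive" with input "N/A"
p_query = np.array([0.45, 0.55])         # raw prediction for a real query

calibrated = p_query / p_content_free
calibrated /= calibrated.sum()

print(dict(zip(labels, np.round(calibrated, 3))))
# The raw majority choice ("positive") flips to "negative" after calibration.
```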

Format mismatch between demonstrations and query
MEDIUM

Subtle differences in format (e.g. space before the answer, period at the end of the input) between demonstrations and the query can drastically reduce ICL quality.

Normalize the format programmatically. Always test prompts using exactly the same separator for demonstrations and the query.

Test data leakage into demonstrations
HIGH

It is easy to accidentally include examples from the test split in demonstrations. This results in inflated benchmark scores.

Strictly separate the demonstration pool from the test set. Audit all demonstrations before evaluation.
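A minimal disjointness check with hypothetical example IDs; any overlap between the demonstration pool and the test split should stop the evaluation before it starts.

```python
# Fail fast if any test example also appears among the demonstrations.

demo_pool = {"ex-001", "ex-014", "ex-027"}
test_set = {"ex-102", "ex-117", "ex-203"}

overlap = demo_pool & test_set
if overlap:
    raise ValueError(f"test examples leaked into demonstrations: {sorted(overlap)}")
print("demonstration pool and test set are disjoint")
```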

GENESIS · Source paper

Language Models are Few-Shot Learners
2020 · NeurIPS 2020 (Best Paper Award) · Tom B. Brown, Benjamin Mann, Nick Ryder et al.
2019

GPT-2 — first observations of zero-shot transfer (Radford et al.)

Radford et al. show that GPT-2 (1.5B parameters) can perform NLP tasks without fine-tuning when prompts are appropriately framed. Precursor to full ICL.

2020

GPT-3 and formalization of few-shot ICL (Brown et al.)

breakthrough

Brown et al. introduce systematic terminology (zero-/one-/few-shot) and demonstrate that GPT-3 (175B) achieves competitive performance against fine-tuned models on dozens of NLP benchmarks, using ICL alone.

2022

Bayesian inference framework for ICL (Xie et al.)

Xie et al. propose a formal interpretation of ICL as Bayesian inference over a latent task concept, explaining why ICL works despite the absence of gradients.

2022

Induction heads as the mechanistic substrate of ICL (Olsson et al., Anthropic)

breakthrough

Anthropic identifies induction heads — attention heads forming during pretraining whose emergence correlates with a sharp jump in ICL ability. First mechanistic evidence of how ICL emerges in the transformer.

2022

Role of labels in ICL questioned (Min et al.)

Min et al. show that randomly replacing labels in demonstrations only marginally decreases ICL quality — suggesting that the model learns the format and the label space rather than the input→output mapping itself.

2023

ICL as implicit gradient descent (von Oswald et al.)

von Oswald et al. show, formally for linear self-attention and empirically for trained transformers, that a model in ICL mode can implement a gradient-descent step in activation space. This provides theoretical grounding for the mechanism.

2024

Many-shot ICL — hundreds/thousands of demonstrations (Agarwal et al., Google DeepMind)

breakthrough

With models supporting 1M+ tokens (Gemini 1.5, Claude 3), DeepMind shows that many-shot ICL (e.g. 1000+ demonstrations) can outperform fine-tuning on many tasks.

GPU Tensor Cores · PRIMARY

ICL is applied to a standard LLM, which runs most efficiently on GPUs with tensor cores for matrix multiplications in attention and feed-forward layers.

TPU · GOOD

TPUs are widely used for LLM inference. No special hardware requirements for ICL beyond the base model.

BUILT ON

LLM

A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.

Transformer

Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which allows every position in a sequence to directly participate in computations involving every other position, so the path between any two tokens has constant length (rather than growing linearly with distance, as in RNNs), making long-range dependencies easier to learn. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing: multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation — quadratic attention complexity with respect to sequence length (O(n²)) — is an active research direction (FlashAttention, sliding window, linear attention, SSM).

Self-Attention

Self-attention is a computational mechanism introduced in the Transformer architecture (Vaswani et al., 2017). For each token in the input sequence, it computes a contextual representation as a weighted sum of the values (V) of all tokens, where the weights come from a softmax over the scaled dot products between that token's query (Q) and the keys (K) of all tokens. This allows every token to directly attend to information from any other position in the sequence, regardless of distance, overcoming the limitations of recurrent neural networks in modeling long-range dependencies.

Pretraining

Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data — next tokens in text, masked words, future video frames, future robot states — without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge" — dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).


Connects

Prompt Engineering

Prompt Engineering is a set of techniques for precisely formulating text inputs (prompts) provided to language models, to guide their responses toward a desired format, style, level of detail, or correctness. Techniques include few-shot prompting (providing examples in context), zero-shot prompting (task without examples), role prompting (assigning a system role), chain-of-thought (requesting reasoning steps), format prompting (specifying output format), and many others. Prompt Engineering is particularly important when model fine-tuning is infeasible or uneconomical. Codification of techniques occurred mainly after GPT-3 (Brown et al., 2020), which demonstrated high sensitivity of performance to prompt formulation.

Induction Heads

Induction Heads are a two-head attention circuit discovered by Olsson et al. (2022) at Anthropic in two-layer attention-only models. The first head (a previous-token head) copies information about the preceding token into each position; the second (the induction head proper) uses this information to predict that after the pattern [A][B], a later occurrence of [A] should be followed by [B]. This mechanism implements a simple sequence-completion algorithm. The authors present strong evidence that induction heads develop at precisely the same point as the emergence of in-context learning ability, visible as a sharp drop in training loss. For small attention-only models, the evidence is causal; for larger models with MLPs, it is correlational. This work is foundational for the field of mechanistic interpretability.


ALTERNATIVE TO

SFT

Supervised Fine-Tuning (SFT) is a post-training stage in which a pre-trained language model is further optimized on a labeled set of (prompt, response) pairs. Each pair contains an instruction or question and a reference response written by a human or filtered automatically. The model minimizes cross-entropy loss on the response tokens. SFT is the first stage of the RLHF pipeline (Ouyang et al., 2022) and is critical for teaching the model to follow instructions. SFT alone can significantly improve model usability without requiring reinforcement learning. The method is used in InstructGPT, ChatGPT, Llama-2-Chat, and many other models.


Commonly used with

CoT

Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results. In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars. CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data. Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.

Instruction Tuning

Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions. The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and the model size. Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.

RLHF

Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017), and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages: 1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT. 2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = -E[log σ(r(x, y_w) - r(x, y_l))], where y_w is the preferred and y_l the rejected response. 3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β is critical to prevent reward hacking. During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory. A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections — generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.

Emergent Abilities

Emergent abilities of large language models is an observation, formalized by Wei et al. (2022), that certain LLM capabilities — such as multi-step reasoning, zero-shot instruction following, modular arithmetic, or answering questions in low-resource languages — do not appear gradually with scale but emerge discontinuously, only after crossing a threshold of parameter count, training data, or compute FLOPs. Below the threshold, performance is random or near-zero; above it, performance jumps abruptly to substantially better-than-random. The phenomenon has been documented across more than 130 tasks from the BIG-Bench benchmark and other suites (MMLU, TruthfulQA). Canonical examples include: Chain-of-Thought reasoning (~100B-parameter threshold for PaLM/GPT-3), InstructGPT-style instruction following, modular arithmetic, International Phonetic Alphabet transliteration, and multi-step question answering. In 2023, Schaeffer, Miranda, and Koyejo (NeurIPS 2023, "Are Emergent Abilities of Large Language Models a Mirage?") challenged emergence as a real fundamental phenomenon. They showed that non-linear or discontinuous evaluation metrics (e.g. exact-match accuracy) artificially create the appearance of a jump — replacing them with continuous metrics (token edit distance, log-likelihood) reveals a smooth, predictable scaling curve. This critique is now central to the debate: some abilities are emergent in a metric-dependent sense, while others (e.g. inductive reasoning) appear to show genuine phase discontinuities. The concept has critical practical significance: if emergence is real, certain abilities cannot be predicted or trained at smaller scale — forcing organizations to train large models "blindly." If emergence is a metric artifact, then scaling laws (Hoffmann et al., Chinchilla) are sufficient to predict the behavior of larger models.

RAG

Retrieval-Augmented Generation (RAG) was introduced by Lewis et al. (2020) as a general-purpose fine-tuning recipe combining pre-trained parametric memory (a seq2seq language model, specifically BART in the original paper) with non-parametric memory (a dense vector index of Wikipedia, accessed via Dense Passage Retrieval, DPR). In the original formulation, both the retriever and the generator are fine-tuned end-to-end: given an input query x, the retriever retrieves top-k documents z from the corpus, and the generator produces an output y conditioned on x and z. Two formulations were proposed: RAG-Sequence (the same retrieved documents condition the full output sequence) and RAG-Token (different documents may be used per generated token, marginalized during generation). In widespread contemporary usage (post-2022, with the growth of LLM applications), 'RAG' has expanded to describe a broader class of retrieve-then-generate pipelines, typically with a frozen LLM, a vector store containing pre-computed dense embeddings of document chunks, and a retrieval step that fetches top-k relevant chunks based on embedding similarity to the query. The retrieved chunks are appended to the prompt as context before the LLM generates a response. This non-trainable pipeline variant is technically distinct from the original Lewis et al. formulation but is the dominant practical interpretation of RAG as of 2023–2025. The canonical modern RAG pipeline consists of an offline indexing phase (document chunking, embedding computation, storage in a vector database) and an online query phase (query embedding, approximate nearest neighbor search, context-augmented generation). Key design decisions include: chunk size and overlap, embedding model choice, retrieval strategy (dense, sparse/BM25, or hybrid), number of retrieved documents k, and context integration method (prepend to prompt, cross-attention injection, or fusion-in-decoder). RAG addresses two fundamental limitations of parametric-only LLMs: the knowledge cutoff problem (inability to access post-training information) and hallucination (generation of factually incorrect content). However, RAG introduces its own failure modes, including retrieval of irrelevant or misleading context and the LLM's susceptibility to being distracted by retrieved content that contradicts its parametric knowledge.

ToT

Tree of Thoughts (ToT) is a reasoning framework for language models proposed by Yao et al. (NeurIPS 2023). It generalizes the Chain-of-Thought approach, allowing the model to explore multiple different reasoning paths instead of a single linear sequence. Each intermediate reasoning state ('thought') is assigned an evaluation of its promise by the LLM. The framework endows the model with the ability for deliberate decision-making, consideration of alternative reasoning branches, backtracking when a path is a dead end, and lookahead before making global choices. Experiments showed that ToT significantly improves LLM problem-solving capabilities on tasks requiring non-trivial planning or search (Game of 24: GPT-4+CoT 4% vs ToT 74%).

Self-Consistency

Self-Consistency (Wang et al., ICLR 2023) is a decoding strategy for language models using Chain-of-Thought. Instead of using greedy decoding (selecting the most probable token at each step), Self-Consistency samples multiple diverse reasoning paths and then selects the answer that appears most frequently across all paths (marginalization over reasoning paths). The method is based on the intuition that a complex reasoning problem typically admits multiple different ways of reaching the same correct answer. Experiments demonstrated impressive CoT performance gains: +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA (Google, ICLR 2023).

Reflexion

Reflexion (Shinn et al., NeurIPS 2023) is a framework for reinforcing LLM agents not through weight updates, but through linguistic feedback. Reflexion agents verbally reflect on task feedback signals, then maintain these verbal reflections in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible regarding types of feedback signals (scalar values or free-form language) and sources (external or internally simulated). The method achieved 91% pass@1 on the HumanEval benchmark (coding), surpassing the then-SOTA GPT-4 at 80%. The work aligns with the paradigm of agents learning from experience without expensive RL.

Reasoning model

A reasoning model (also: large reasoning model, LRM, reasoning language model, RLM) is a type of large language model that has been specifically post-trained to solve complex multi-step problems by explicitly generating intermediate reasoning steps before committing to a final response. Unlike standard LLMs that generate a direct response in a single forward pass, reasoning models allocate additional computation at inference time — a property known as test-time compute scaling — by producing a long internal chain of thought (CoT). The reasoning trace typically includes steps such as problem decomposition, hypothesis generation, self-verification, reflection, and correction of errors. The defining characteristics of reasoning models are: (1) post-training via large-scale reinforcement learning (RL) using reward signals based on final answer correctness (and sometimes intermediate step quality via process reward models); (2) the emergence of extended, often hidden, reasoning traces that precede the final answer; (3) a consistent empirical relationship between the length or computational budget allocated to the reasoning trace and final answer quality (test-time scaling law); (4) superior performance on verifiable tasks requiring multi-step logic, such as mathematics, competitive programming, and scientific reasoning. The term 'reasoning model' was introduced as a product category by OpenAI in September 2024 with the release of the o1-preview model. OpenAI described o1 as trained via a large-scale RL algorithm teaching the model to use chain of thought productively. The approach does not rely on explicit tree search algorithms; instead, implicit search emerges via RL-trained CoT generation. In January 2025, DeepSeek published the first detailed open technical description of this class of models in the DeepSeek-R1 paper (arXiv:2501.12948), demonstrating that reasoning capabilities can be incentivized via pure RL without supervised fine-tuning, using Group Relative Policy Optimization (GRPO) as the RL framework. Reasoning models typically employ the same base Transformer decoder architecture as standard LLMs, with the key difference residing entirely in the post-training pipeline: RL replaces or augments standard RLHF/SFT, and reward signals are grounded in verifiable outcomes. The resulting models generate substantially longer token sequences during inference (reasoning tokens), which are often hidden from end users but incur real compute costs. Performance consistently improves with both more training-time RL compute and more inference-time thinking budget.


Related AI models

TabPFN
