In-Context Learning (ICL)
How it works
1. Prompt construction: an optional natural-language task instruction + k demonstration (input, output) pairs + the new query input. Demonstrations are separated by a delimiter (e.g. a newline, '###', or an XML tag).
2. Tokenization and forward pass: the full prompt is fed as context to the transformer decoder. Self-attention lets every token attend to all preceding tokens, including the demonstrations.
3. Pattern induction: attention layers (particularly induction heads, Olsson et al. 2022) detect [token A → token B] patterns in the demonstrations and propagate them to the new input. This has been argued to be analogous to implicit gradient descent in activation space.
4. Output generation: the model autoregressively produces answer tokens, continuing the pattern established by the demonstrations.
5. No weight updates: unlike fine-tuning, no gradients are computed or backpropagated. All "learning" happens entirely in the activations of a single forward pass.
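A minimal sketch of step 1. The template, separator, and demonstration set below are illustrative choices, not a canonical format; the resulting string would be passed to whatever LLM completion API is available.

```python
# Minimal few-shot prompt construction (step 1 above).

def build_prompt(instruction, demonstrations, query, sep="\n###\n"):
    """Assemble instruction + k demonstrations + query into one prompt."""
    blocks = []
    if instruction:
        blocks.append(instruction)
    for x, y in demonstrations:
        blocks.append(f"Input: {x}\nOutput: {y}")
    # The query uses the same template as the demonstrations,
    # with the output left blank for the model to continue.
    blocks.append(f"Input: {query}\nOutput:")
    return sep.join(blocks)

demos = [
    ("The movie was wonderful.", "positive"),
    ("A dull, lifeless film.", "negative"),
]
prompt = build_prompt("Classify the sentiment of each review.",
                      demos, "An instant classic.")
print(prompt)  # feed this to the frozen model; no weights are updated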
Problem solved
Traditional supervised learning requires a labeled training set for every new task, a fine-tuning run that produces a separate copy of the weights, and training infrastructure. This prevents rapid adaptation to new tasks and blocks scaling to thousands of domains. ICL removes these requirements: a single frozen LLM can perform any task that can be specified in the prompt, with no training and no per-task weight copies.
Components
Task instruction: an optional natural-language task description preceding the demonstrations. In instruction-tuned models (GPT-3.5+, Claude), the instruction alone is often sufficient (zero-shot ICL).
Demonstrations: (input, output) pairs illustrating the expected model behavior. The number of demonstrations k defines the variant: zero-shot (k=0), one-shot (k=1), few-shot (k=2-32). Demonstrations must fit within the model's context window.
Query: the actual input for which the model should generate an answer. It must follow the same format as the demonstration inputs so that the model recognizes the pattern.
Induction heads: specific attention heads in transformer layers ≥ 2 that learn during pretraining to recognize the [A][B] ... [A] → [B] pattern. Olsson et al. (2022, Anthropic) showed that induction heads are the mechanistic substrate of ICL: their formation correlates with the ICL emergence phase during training.
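A toy illustration of the completion rule induction heads exploit: to predict the next token, find the most recent earlier occurrence of the current token and copy what followed it. This is plain Python lookup for intuition only, not actual attention arithmetic.

```python
# Toy model of the [A][B] ... [A] -> [B] rule that induction heads
# implement inside real transformers.

def induction_step(tokens):
    """Predict the next token by finding the most recent earlier
    occurrence of the last token and copying what followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # the B that followed A last time
    return None  # no earlier A: the rule makes no prediction

seq = ["one", "two", "three", "one", "two"]
print(induction_step(seq))  # -> "three"
```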
Implementation pitfalls
Order sensitivity: Lu et al. (2022) showed that the same demonstration set in different orders yields results differing by 20-30 accuracy percentage points. Some permutations perform worse than the random baseline (see the sketch after this list).
Recency bias: models tend to weight the final demonstrations in the prompt most heavily, ignoring earlier information. Particularly problematic in many-shot ICL.
Majority label bias: if the demonstration labels are imbalanced (e.g. 6 of 8 labeled "positive"), the model will systematically over-predict the dominant label for new queries.
Format sensitivity: subtle formatting differences between demonstrations and the query (e.g. a space before the answer, a period at the end of the input) can drastically reduce ICL quality.
Test-set contamination: it is easy to accidentally include examples from the test split among the demonstrations, which inflates benchmark scores.
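A minimal sketch of a permutation audit in the spirit of Lu et al. (2022). The scorer here is a toy stand-in so the snippet runs; a real score_fn would build a prompt from the ordered demonstrations, query the model on a held-out validation set, and return accuracy.

```python
from itertools import permutations
from statistics import mean

def order_sensitivity(demos, score_fn):
    """Score every ordering of the demonstration set and report the spread.
    Feasible only for small k, since there are k! orderings."""
    scores = [score_fn(list(p)) for p in permutations(demos)]
    return min(scores), mean(scores), max(scores)

# Toy stand-in scorer (deterministic, for illustration only).
def toy_score(ordered_demos):
    return sum(i * len(x) for i, (x, _) in enumerate(ordered_demos)) % 50 / 100

demos = [("great film", "pos"), ("boring mess", "neg"),
         ("loved it", "pos"), ("truly awful", "neg")]
worst, avg, best = order_sensitivity(demos, toy_score)
print(f"worst={worst:.2f} mean={avg:.2f} best={best:.2f}")
```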
Evolution
Radford et al. (2019) show that GPT-2 (1.5B parameters) can perform NLP tasks without fine-tuning when prompts are appropriately framed. A precursor to full ICL.
Brown et al. (2020) introduce the systematic terminology (zero-/one-/few-shot) and demonstrate that GPT-3 (175B) achieves competitive performance against fine-tuned models on dozens of NLP benchmarks, using ICL alone.
Xie et al. (2021) propose a formal interpretation of ICL as Bayesian inference over a latent task concept, explaining why ICL can work despite the absence of gradient updates.
Anthropic (Olsson et al. 2022) identifies induction heads: attention heads forming during pretraining whose emergence correlates with a sharp jump in ICL ability. The first mechanistic evidence of how ICL emerges in the transformer.
Min et al. (2022) show that randomly replacing the labels in demonstrations only marginally degrades ICL quality, suggesting that the model learns the format and label space rather than the input-output mapping itself.
von Oswald et al. (2023) show formally (for linear self-attention) and empirically that a transformer performing ICL can implement gradient-descent steps within its forward pass. This provides theoretical grounding for the implicit-gradient-descent view of the mechanism.
With models supporting 1M+ token contexts (Gemini 1.5, Claude 3), Agarwal et al. (2024, Google DeepMind) show that many-shot ICL (hundreds to thousands of demonstrations) can match or outperform fine-tuning on many tasks.
Technical details
Hyperparameters (configurable axes)
Number of shots (k): the number of (input, output) pairs in the prompt. Affects both quality and inference cost (context length).
Demonstration order: the order in which demonstrations appear in the prompt. Empirically, ICL quality is strongly permutation-dependent (Lu et al. 2022).
Selection strategy: how demonstrations are selected from a candidate pool. Static (fixed pool) vs. dynamic (retrieval-based, e.g. KATE: k-nearest demonstrations by embedding similarity; see the sketch after this list).
Formatting template: the convention for separating input/output fields (e.g. 'Q:/A:', '###', XML tags). Affects how reliably the model recognizes the pattern.
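A minimal sketch of KATE-style dynamic selection. The embed() function is a placeholder, not a specific library API; any sentence encoder could back it (random vectors are used here only so the snippet runs end to end).

```python
import numpy as np

# Placeholder embedding function: returns one vector per text.
def embed(texts, dim=64):
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), dim))

def select_demonstrations(query, pool, k=4):
    """KATE-style dynamic selection: return the k pool examples whose
    inputs lie nearest to the query in embedding space (cosine)."""
    inputs = [x for x, _ in pool]
    E = embed(inputs + [query])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize
    sims = E[:-1] @ E[-1]        # cosine similarity of each input to query
    nearest = np.argsort(-sims)[:k]
    return [pool[i] for i in nearest]

pool = [(f"example input {i}", f"label {i % 2}") for i in range(20)]
print(select_demonstrations("a brand-new query", pool, k=4))
```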
Computational complexity
Time complexity: O((k·L_demo + L_query)² · d) for self-attention. Space complexity: O((k·L_demo + L_query) · d) for the KV cache.
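A back-of-the-envelope sketch of how the quadratic term grows with the shot count; the lengths and model width below are made-up example values, not measurements.

```python
# Rough scaling of the O(N^2 * d) attention term with shot count k.
d = 4096                     # model width (illustrative)
L_demo, L_query = 120, 60    # tokens per demonstration / per query

base = L_query**2 * d        # zero-shot cost as the reference point
for k in (0, 4, 32, 256):    # zero-, few-, and many-shot regimes
    N = k * L_demo + L_query # total prompt length in tokens
    print(f"k={k:4d}  N={N:6d}  relative attention cost={N**2 * d / base:10.1f}x")
```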
Compute bottleneck
Self-attention scales as O(N²) in prompt length. With k demonstrations and long inputs, cost grows quickly, particularly in many-shot ICL.
Execution paradigm
ICL is a prompting technique applied to a standard dense Transformer at inference time. All parameters are active; no conditional routing.
Parallelism
Prefill of demonstrations can be fully parallel (one forward pass over the whole prompt). Answer generation is sequential, as in any transformer decoder.
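A minimal numpy sketch of this prefill/decode split, assuming a single attention head and ignoring the rest of the transformer (projections, layers, normalization are reduced to one toy head). Prefill handles all prompt positions in one batched matrix product; decode appends one token at a time to a KV cache.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def prefill(X):
    """Process the whole prompt (demonstrations + query) in parallel:
    one matmul per projection, causal attention over all positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    mask = np.tril(np.ones((len(X), len(X))))
    scores = np.where(mask == 1, Q @ K.T / np.sqrt(d), -np.inf)
    return softmax(scores) @ V, (K, V)   # outputs + KV cache

def decode_step(x, cache):
    """Generate one position sequentially, reusing the cached keys/values."""
    K, V = cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K, V = np.vstack([K, k]), np.vstack([V, v])
    return softmax(q @ K.T / np.sqrt(d)) @ V, (K, V)

prompt = rng.normal(size=(16, d))        # 16 prompt-token embeddings
_, cache = prefill(prompt)               # parallel over all 16 positions
new_tok = rng.normal(size=(1, d))
out, cache = decode_step(new_tok, cache) # sequential, one position at a time
```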
Hardware requirements
ICL is applied to a standard LLM, which runs most efficiently on GPUs with tensor cores for matrix multiplications in attention and feed-forward layers.
TPUs are also widely used for LLM inference. ICL itself imposes no hardware requirements beyond those of the base model.