Emergent Abilities
How it works
1. Choose a benchmark and metric: a specific task (e.g. 3-digit addition, MMLU, BIG-Bench Hard) with a discrete success metric (exact-match, multiple-choice accuracy).
2. Train/evaluate models of varying scale: a series of models from the same family with fixed architecture but different sizes (e.g. GPT-3: 125M → 175B; PaLM: 8B → 540B; LaMDA: 137B; Gopher: 280B).
3. Measure task performance as a function of scale (parameters, data, FLOPs). On most tasks, performance sits at the random level below the threshold (e.g. 25% for 4-way multiple choice) and rises sharply above it.
4. Identify the threshold: the point where performance significantly exceeds random. For CoT: ~100B parameters. For modular arithmetic: ~10²² FLOPs.
5. Critical analysis (Schaeffer et al. 2023): replace the discrete metric with a continuous one (e.g. token edit distance instead of exact-match). If the curve becomes smooth, the emergence was a metric artifact, not a model phenomenon.
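A minimal sketch of steps 2-3, assuming a user-supplied `generate_fn(model_name, prompt)` inference hook and a list of (model_name, parameter_count) checkpoints; the names and signatures here are illustrative, not taken from the cited papers:

```python
from typing import Callable, List, Tuple

def emergence_curve(
    checkpoints: List[Tuple[str, float]],    # (model_name, n_params) for one model family
    dataset: List[Tuple[str, str]],          # (prompt, reference_answer) pairs for the task
    generate_fn: Callable[[str, str], str],  # hypothetical hook: (model_name, prompt) -> output text
    metric: Callable[[str, str], float],     # scoring function, discrete or continuous
) -> List[Tuple[float, float]]:
    """Score every model size on the task and return the (scale, mean_score)
    points that form the scaling curve (steps 2-3 above)."""
    curve = []
    for name, n_params in sorted(checkpoints, key=lambda c: c[1]):
        scores = [metric(generate_fn(name, prompt), answer)
                  for prompt, answer in dataset]
        curve.append((n_params, sum(scores) / len(scores)))
    return curve

# Step 4 is a threshold search over the resulting curve (see the threshold
# sketch further down); step 5 repeats the same call with a continuous metric
# and compares the shapes of the two curves.
```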
Problem solved
Language-model pretraining loss (cross-entropy) scales smoothly according to scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, i.e. Chinchilla), but downstream task performance is not smooth. In practice, it is hard to predict when a model will acquire a specific capability (reasoning, code generation, instruction following) based solely on smaller-scale curves, making it difficult to plan training budgets and pick architectures.
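For contrast with the smooth loss side of the picture, the parametric loss fit from Hoffmann et al. (2022) can be written down directly; the constants below are the approximate published values, so treat this as an illustrative sketch rather than an exact reproduction:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Approximate parametric pretraining-loss fit from Hoffmann et al. (2022):
    L(N, D) = E + A / N**alpha + B / D**beta."""
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss declines smoothly with scale -- no discontinuities anywhere:
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params, 20x tokens -> loss ~ {chinchilla_loss(n, 20 * n):.3f}")
```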
Components
The dimension along which emergence is measured, most commonly parameter count (e.g. 8B → 540B), training tokens, or compute FLOPs. More rigorous analyses use FLOPs as a single axis capturing both parameter and data scale.
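A sketch of the single-axis view, using the common rough approximation of about 6 FLOPs per parameter per training token (C ≈ 6·N·D); the parameter/token pairs below are purely illustrative:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rough standard estimate of training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

# Illustrative configurations collapsed onto a single FLOPs axis.
configs = [(125e6, 300e9), (1.3e9, 300e9), (13e9, 300e9), (175e9, 300e9)]
for n, d in configs:
    print(f"N={n:.2e}, D={d:.2e} -> ~{train_flops(n, d):.2e} FLOPs")
```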
Official
A function mapping model output to a scalar task-performance score. The choice of metric directly determines whether we observe a discontinuity (emergence) or a smooth scaling curve. Schaeffer et al. (2023) showed that discrete metrics (exact-match) create apparent emergence.
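A sketch of the two metric families in question, assuming whitespace tokenization: a discrete exact-match score and a continuous token-level edit-distance similarity (the function names are ours, not from Schaeffer et al.):

```python
def exact_match(pred: str, target: str) -> float:
    """Discrete: all-or-nothing credit."""
    return float(pred.strip() == target.strip())

def levenshtein(a: list, b: list) -> int:
    """Token-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(pred: str, target: str) -> float:
    """Continuous: 1.0 for an identical answer, degrading gradually with errors."""
    p, t = pred.split(), target.split()
    if not p and not t:
        return 1.0
    return 1.0 - levenshtein(p, t) / max(len(p), len(t))

print(exact_match("7 2 4", "7 2 8"))                 # 0.0  -- no partial credit
print(round(edit_similarity("7 2 4", "7 2 8"), 2))   # 0.67 -- partial credit
```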
Official
A point on the scale axis where task performance abruptly exceeds the random baseline. Task- and metric-specific. For Chain-of-Thought on arithmetic: ~100B parameters. For modular arithmetic: ~10²² FLOPs. For simple classification: no threshold (smooth scaling).
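One way to operationalize "significantly exceeds the random baseline" is a binomial standard-error margin over the measured accuracy points; the curve below is made up for illustration:

```python
import math

def emergence_threshold(curve, baseline, n_examples, z=3.0):
    """Return the smallest scale whose accuracy exceeds the random baseline
    by z binomial standard errors; None if no measured point does."""
    se = math.sqrt(baseline * (1 - baseline) / n_examples)
    for scale, acc in sorted(curve):
        if acc > baseline + z * se:
            return scale
    return None

# Hypothetical 4-way multiple-choice results (baseline 0.25, 500 examples).
curve = [(1e8, 0.24), (1e9, 0.26), (1e10, 0.27), (1e11, 0.58), (5e11, 0.71)]
print(emergence_threshold(curve, baseline=0.25, n_examples=500))  # 1e+11
```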
Implementation
Most reported "emergences" disappear when discrete metrics (exact-match) are replaced with continuous ones (token edit distance, log-likelihood); a numerical sketch of this effect follows these notes. Interpreting an exact-match jump as a fundamental capability jump leads to unjustified claims of unpredictability.
Pre-Chinchilla, most model families were undertrained. Apparent emergence thresholds as a function of parameters may have been artifacts of insufficient data: a larger model simply utilized the same corpus better.
Wei et al. (2022) selected emergent tasks from a large BIG-Bench pool. Many tasks show no emergence and scale smoothly. Selective reporting of only emergent tasks distorts the picture of LLM behavior.
Many "emergence thresholds" come from a single training seed. The same model trained with a different seed may show the threshold at a different point or not at all.
Evolution
Kaplan et al. show that language-model pretraining loss scales smoothly with parameters, data, and FLOPs. This creates the expectation of smooth scaling on downstream tasks too.
Beyond the Imitation Game Benchmark (BIG-Bench): a collaborative benchmark with 204 tasks evaluated across model scales. It forms the empirical basis for emergence observations.
Wei et al. publish "Emergent Abilities of Large Language Models" in TMLR, documenting discontinuous capability emergence across 137 tasks from BIG-Bench and other benchmarks. They introduce the formal definition: a capability is emergent if it is absent in smaller models but present in larger ones.
Hoffmann et al. show that prior models (GPT-3, Gopher) were undertrained: optimal scaling balances parameters and data equally. This shifts the interpretation of "emergence thresholds" โ some apparent thresholds may be artifacts of undertraining rather than parameter scale.
Schaeffer, Miranda, and Koyejo (NeurIPS 2023, Outstanding Paper Award) show that emergence is largely an artifact of choosing discrete, non-linear evaluation metrics. Replacing these with continuous metrics (token edit distance, log-likelihood) makes the scaling curve smooth and predictable.
In the GPT-4 technical report, OpenAI demonstrates that certain capabilities (HumanEval pass-rate) can be predicted with <1% error from models 10,000× smaller, provided an appropriate continuous metric is used. This reinforces Schaeffer et al.'s argument.
Technical details
Hyperparameters (configurable axes)
Parameter count or training FLOPs against which emergence is measured.
The metric choice determines whether emergence is observed: discrete metrics amplify it, while continuous metrics largely eliminate it.
Some task types (multi-step reasoning, instruction following) show clear emergence; others (sentiment classification) scale smoothly.
Execution paradigm
Emergent abilities are an observed behavioral phenomenon of LLMs, not an execution mode in themselves. This field therefore describes the execution of the models in which emergence is observed, standardly dense Transformers.
Parallelism
Not directly applicable to an observational phenomenon. This field describes the parallelism used to train the models in which emergence is measured, standardly the fully parallel training of dense Transformers.
Hardware requirements
Emergent abilities are a behavioral observation, independent of any specific hardware. Hardware requirements are determined by the underlying LLM, not by the concept itself.