Evaluation

Emergent Abilities

2022 · Active · Updated: 5 May 2026 · Published
Key innovation
Formalizing and empirically documenting the observation that certain capabilities of large language models appear discontinuously only above a specific scale threshold (parameters, data, FLOPs) and cannot be predicted from the performance of smaller models.
Category
Evaluation
Abstraction level
Pattern
Operation level
Model · Training
Use cases
Training-budget planning for a new LLM family
Choosing a scale threshold for the emergence of reasoning (CoT)
Evaluating LLMs on BIG-Bench Hard, MMLU, GSM8K as a function of scale
Arguing for further scaling of frontier models
Critiquing LLM evaluation methodology (Schaeffer et al.)
Deciding between instruction tuning and pure pretraining
AI safety analysis (unpredictable emergence of capabilities)
Alignment research (whether dangerous capabilities are also emergent)

How it works

1. Choose a benchmark and metric: a specific task (e.g. 3-digit addition, MMLU, BIG-Bench Hard) with a discrete success metric (exact-match, multiple-choice accuracy).
2. Train/evaluate models of varying scale: a series of models from the same family with fixed architecture but different sizes (e.g. GPT-3: 125M → 175B; PaLM: 8B → 540B; LaMDA: 137B; Gopher: 280B).
3. Measure task performance as a function of scale (parameters, data, FLOPs). Most emergent tasks show random performance below the threshold (e.g. 25% for 4-way multiple choice) and a sharp rise above it.
4. Identify the threshold: the point where performance significantly exceeds random. For CoT: ~100B parameters. For modular arithmetic: ~10²² FLOPs. (A minimal scan of this kind is sketched below.)
5. Critical analysis (Schaeffer et al. 2023): replace the discrete metric with a continuous one (e.g. token edit distance instead of exact-match). If the curve becomes smooth, the emergence was a metric artifact, not a model phenomenon.
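A minimal sketch of the scan described in steps 3–4 (the accuracy_at_scale callable, the baseline, and the margin are illustrative assumptions, not values from the paper):

```python
from typing import Callable, Optional, Sequence

def emergence_scan(
    scales: Sequence[float],                      # e.g. parameter counts [1.25e8, ..., 1.75e11]
    accuracy_at_scale: Callable[[float], float],  # hypothetical: runs the benchmark for that model
    random_baseline: float = 0.25,                # e.g. 4-way multiple choice
    margin: float = 0.10,                         # how far above chance counts as "emerged"
) -> Optional[float]:
    """Return the smallest scale whose accuracy exceeds baseline + margin, else None."""
    for scale in sorted(scales):
        if accuracy_at_scale(scale) > random_baseline + margin:
            return scale
    return None
```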

Problem solved

Language-model pretraining loss (cross-entropy) scales smoothly according to scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, Chinchilla), but downstream task performance is not smooth. In practice, it is hard to predict when a model will acquire a specific capability (reasoning, code generation, instruction following) from smaller-scale curves alone, which makes it difficult to plan training budgets and pick architectures.
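The smooth loss scaling referred to here is commonly summarized by the parametric fit of Hoffmann et al. (2022), with N the parameter count, D the training tokens, E the irreducible loss, and A, B, α, β empirically fitted constants:

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Emergence is precisely the observation that downstream task metrics do not follow such a smooth form even when the pretraining loss does.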

Components

Model scale axis: x-axis of the emergence plot

The dimension along which emergence is measured, most commonly parameter count (e.g. 8B → 540B), training tokens, or compute FLOPs. More rigorous analyses use FLOPs as a single axis capturing both parameter and data scale.

Parameter count: the most common measure in the original Wei et al. paper.
Training FLOPs: a single scalar measure capturing both parameters and data (Kaplan et al. 2020); see the helper sketched below.
Training tokens: critical post-Chinchilla (Hoffmann et al. 2022); undertrained models show apparent emergence.
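Parameters and tokens are commonly collapsed onto the single FLOPs axis via the rule of thumb C ≈ 6·N·D for dense Transformers (forward plus backward pass); a minimal illustrative helper:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the common C ≈ 6·N·D rule of thumb
    (forward + backward pass of a dense Transformer; Kaplan et al. 2020)."""
    return 6.0 * n_params * n_tokens

# Example: a 70B-parameter model trained on 1.4T tokens (Chinchilla-like)
# gives training_flops(70e9, 1.4e12) ≈ 5.9e23 FLOPs.
```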


Evaluation metric: determines whether emergence is observed

A function mapping model output to a scalar task-performance score. The choice of metric directly determines whether we observe a discontinuity (emergence) or a smooth scaling curve. Schaeffer et al. (2023) showed that discrete metrics (exact-match) create apparent emergence.

Exact-match accuracy: discrete; amplifies apparent emergence.
Multiple-choice accuracy: discrete; jumps above the random baseline (e.g. 25%).
Token edit distance: continuous; reveals smooth scaling (Schaeffer et al.; sketched below).
Log-likelihood of correct answer: continuous; the model's direct loss objective, smooth across scale.
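An illustrative contrast between a discrete and a continuous metric (toy strings only, no specific benchmark or harness assumed): a near-miss answer scores 0 under exact match but earns partial credit under edit distance, which is why the latter yields smoother scaling curves.

```python
def exact_match(pred: str, target: str) -> float:
    """Discrete: full credit only for a perfect string match."""
    return float(pred.strip() == target.strip())

def edit_distance(pred: str, target: str) -> int:
    """Continuous(ish): Levenshtein distance, giving partial credit for near misses."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, 1):
        cur = [i]
        for j, t in enumerate(target, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (p != t)))   # substitution / match
        prev = cur
    return prev[-1]

# "3 + 4 + 5 = 13" vs the gold "3 + 4 + 5 = 12": exact_match = 0.0,
# but edit_distance = 1, so improvement with scale shows up gradually.
```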


Emergence threshold: defines where the capability emerges

A point on the scale axis where task performance abruptly exceeds the random baseline. Task- and metric-specific. For Chain-of-Thought on arithmetic: ~100B parameters. For modular arithmetic: ~10²² FLOPs. For simple classification: no threshold (smooth scaling).
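One way to operationalize "significantly exceeds the random baseline" is a one-sided binomial test at each scale point; a sketch assuming SciPy is available (the alpha and example numbers are illustrative):

```python
from scipy.stats import binomtest  # assumes SciPy >= 1.7

def exceeds_random(n_correct: int, n_items: int, chance: float = 0.25,
                   alpha: float = 0.01) -> bool:
    """One-sided binomial test: is accuracy on this benchmark significantly
    above the random baseline? Applied per scale point to locate the threshold."""
    return binomtest(n_correct, n_items, chance, alternative="greater").pvalue < alpha

# Example: 70/200 correct on 4-way multiple choice (35% vs 25% chance) -> True.
```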

Implementation

Implementation pitfalls
Confusing metric emergence with model emergence (severity: High)

Most reported "emergences" disappear when discrete metrics (exact-match) are replaced with continuous ones (token edit distance, log-likelihood). Interpreting an exact-match jump as a fundamental capability jump leads to unjustified claims of unpredictability.

Fix: Always report results with at least one continuous metric (log-likelihood of the correct answer). Use predictive methods as in the GPT-4 technical report.
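A minimal sketch of the recommended continuous metric, computed from the per-token log-probabilities of the gold answer however the evaluation harness exposes them (function names are illustrative):

```python
import math

def answer_log_likelihood(token_logprobs: list[float]) -> float:
    """Sum of per-token log-probabilities of the gold answer under the model.
    Unlike exact match, this moves smoothly as the model improves with scale."""
    return sum(token_logprobs)

def answer_perplexity(token_logprobs: list[float]) -> float:
    """Length-normalized variant, comparable across answers of different length."""
    return math.exp(-sum(token_logprobs) / max(len(token_logprobs), 1))
```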
Undertrained models along the scale axis (severity: High)

Pre-Chinchilla, most model families were undertrained. Apparent emergence thresholds as a function of parameters may have been artifacts of insufficient data: a larger model simply utilized the same corpus better.

Fix: Scale parameters and data together per Chinchilla scaling laws. Report emergence as a function of FLOPs, not parameters alone.
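A rough sketch of the data budget this fix implies, using the commonly cited Chinchilla approximation of about 20 training tokens per parameter (the exact ratio depends on the fitted constants):

```python
def chinchilla_token_budget(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training-token budget in the spirit of Hoffmann et al. (2022):
    scale data together with parameters, at roughly ~20 tokens per parameter."""
    return tokens_per_param * n_params

# Example: a 70B-parameter model -> ~1.4T tokens, the Chinchilla configuration.
```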
Cherry-picking benchmarks that show emergence (severity: Medium)

Wei et al. (2022) selected emergent tasks from a large BIG-Bench pool. Many tasks show no emergence and scale smoothly. Selective reporting of only emergent tasks distorts the picture of LLM behavior.

Fix: Report the full distribution of scaling behaviors, not only emergent tasks. Use aggregates like BIG-Bench Hard with a representative subset.
Lack of replication and variance analysis (severity: Medium)

Many "emergence thresholds" come from a single training seed. The same model trained with a different seed may show the threshold at a different point or not at all.

Fix: Train multiple seeds; report error bars and confidence intervals on scaling curves.
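A small sketch of the suggested error bars: mean accuracy across seeds with an approximate normal confidence interval per scale point (crude with very few seeds; a t-interval or bootstrap is preferable):

```python
import statistics

def seed_mean_ci(accuracies: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean accuracy across training seeds plus an approximate 95% normal CI,
    enough to put error bars on each point of a scaling curve."""
    mean = statistics.fmean(accuracies)
    sem = statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return mean, mean - z * sem, mean + z * sem
```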

Evolution

Original paper · TMLR 2022 · Jason Wei
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus
2020
Scaling laws for neural language models (Kaplan et al.)

Kaplan et al. show that language-model pretraining loss scales smoothly with parameters, data, and FLOPs. This creates the expectation of smooth scaling on downstream tasks too.

2022
BIG-Bench releases 200+ diverse tasks

Beyond the Imitation Game Benchmark (BIG-Bench): a collaborative benchmark with 204 tasks evaluated across model scales. Forms the empirical basis for emergence observations.

2022
Emergent abilities concept formalized (Wei et al.)
Inflection point

Wei et al. publish "Emergent Abilities of Large Language Models" in TMLR, documenting discontinuous capability emergence across 137 BIG-Bench and other benchmark tasks. They introduce the formal definition: a capability is emergent if it is absent in smaller models but present in larger ones.

2022
Chinchilla: optimal data allocation revisited (Hoffmann et al.)
Inflection point

Hoffmann et al. show that prior models (GPT-3, Gopher) were undertrained: optimal scaling balances parameters and data equally. This shifts the interpretation of "emergence thresholds": some apparent thresholds may be artifacts of undertraining rather than parameter scale.

2023
"Mirage" critique: emergence as a metric artifact (Schaeffer et al.)
Inflection point

Schaeffer, Miranda, and Koyejo (NeurIPS 2023, Outstanding Paper Award) show that emergence is largely an artifact of discrete, non-linear evaluation metric choice. Replacing them with continuous metrics (token edit distance, log-likelihood) makes the scaling curve smooth and predictable.

2023
Predictable metrics: GPT-4 capability prediction (OpenAI)

In the GPT-4 technical report, OpenAI demonstrates that certain capabilities (HumanEval pass rate) can be predicted with <1% error from models 10,000× smaller, provided an appropriate continuous metric is used. This reinforces Schaeffer et al.'s argument.

Technical details

Hyperparameters (configurable axes)

Model scale (importance: Critical)

Parameter count or training FLOPs against which emergence is measured.

8B–540B parameters: scale of the PaLM family used in Wei et al. (2022).
10²⁰–10²⁴ FLOPs: training FLOPs range for frontier models from 2020–2024.
Evaluation metric (importance: Critical)

The metric choice determines whether emergence is observed. Discrete metrics amplify it; continuous metrics eliminate it.

exact_match: discrete; exhibits jumps.
token_edit_distance: continuous; smooth curve.
Task type (importance: High)

Some task types (multi-step reasoning, instruction following) show clear emergence; others (sentiment classification) scale smoothly.

Multi-step reasoning (GSM8K): strong emergence above ~100B parameters.
Sentiment classification: smooth scaling, no threshold.

Execution paradigm

Primary mode
dense

Emergent abilities is an observed behavioral phenomenon of LLMs (typically dense Transformers), not an execution mode itself. This field describes the execution of the models exhibiting emergence, which are standardly dense Transformers.

Activation pattern
all_paths_active
Routing mechanism
Not applicable: the dense models in which emergence is observed have no routing.

Parallelism

Parallelism level
fully_parallel

Field not directly applicable to an observational phenomenon. Describes the parallelism of training the models in which emergence is measured, standardly the fully parallel training of dense Transformers.

Scope
training

Hardware requirements

Primary

Emergent abilities is a behavioral observation, independent of specific hardware. Hardware requirements are determined by the underlying LLM, not by the concept itself.