Evaluation

Emergent Abilities

2022 · Active · Updated: 5 May 2026 · Published
Key innovation
Formalizing and empirically documenting the observation that certain capabilities of large language models appear discontinuously only above a specific scale threshold (parameters, data, FLOPs) and cannot be predicted from the performance of smaller models.
Category
Evaluation
Abstraction level
Pattern
Operation level
Model · Training
Use cases
Training-budget planning for a new LLM family
Choosing a scale threshold for the emergence of reasoning (CoT)
Evaluating LLMs on BIG-Bench Hard, MMLU, GSM8K as a function of scale
Arguing for further scaling of frontier models
Critiquing LLM evaluation methodology (Schaeffer et al.)
Deciding between instruction tuning and pure pretraining
AI safety analysis (unpredictable emergence of capabilities)
Alignment research (whether dangerous capabilities are also emergent)

How it works

1. Choose a benchmark and metric: a specific task (e.g. 3-digit addition, MMLU, BIG-Bench Hard) with a discrete success metric (exact-match, multiple-choice accuracy).
2. Train/evaluate models of varying scale: a series of models from the same family with fixed architecture but different sizes (e.g. GPT-3: 125M → 175B; PaLM: 8B → 540B; LaMDA: 137B; Gopher: 280B).
3. Measure task performance as a function of scale (parameters, data, FLOPs). Most emergent tasks show random performance below the threshold (e.g. 25% for 4-way multiple choice) and a sharp rise above it.
4. Identify the threshold: the point where performance significantly exceeds random. For CoT: ~100B parameters. For modular arithmetic: ~10²² FLOPs. (A minimal scan of this kind is sketched below.)
5. Critical analysis (Schaeffer et al. 2023): replace the discrete metric with a continuous one (e.g. token edit distance instead of exact-match). If the curve becomes smooth, the emergence was a metric artifact, not a model phenomenon.
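A minimal sketch of the scan described in steps 3–4 (the accuracy_at_scale callable, the baseline, and the margin are illustrative assumptions, not values from the paper):

```python
from typing import Callable, Optional, Sequence

def emergence_scan(
    scales: Sequence[float],                      # e.g. parameter counts [1.25e8, ..., 1.75e11]
    accuracy_at_scale: Callable[[float], float],  # hypothetical: runs the benchmark for that model
    random_baseline: float = 0.25,                # e.g. 4-way multiple choice
    margin: float = 0.10,                         # how far above chance counts as "emerged"
) -> Optional[float]:
    """Return the smallest scale whose accuracy exceeds baseline + margin, else None."""
    for scale in sorted(scales):
        if accuracy_at_scale(scale) > random_baseline + margin:
            return scale
    return None
```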

Problem solved

Language-model pretraining loss (cross-entropy) scales smoothly according to scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, Chinchilla), but downstream task performance is not smooth. In practice, it is hard to predict when a model will acquire a specific capability (reasoning, code generation, instruction following) from smaller-scale curves alone, which makes it difficult to plan training budgets and pick architectures.
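The smooth loss scaling referred to here is commonly summarized by the parametric fit of Hoffmann et al. (2022), with N the parameter count, D the training tokens, E the irreducible loss, and A, B, α, β empirically fitted constants:

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Emergence is precisely the observation that downstream task metrics do not follow such a smooth form even when the pretraining loss does.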

Components

Model scale axis: x-axis of the emergence plot

The dimension along which emergence is measured, most commonly parameter count (e.g. 8B → 540B), training tokens, or compute FLOPs. More rigorous analyses use FLOPs as a single axis capturing both parameter and data scale.

Parameter count: the most common measure in the original Wei et al. paper.
Training FLOPs: a single scalar measure capturing both parameters and data (Kaplan et al. 2020); see the helper sketched below.
Training tokens: critical post-Chinchilla (Hoffmann et al. 2022); undertrained models show apparent emergence.
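Parameters and tokens are commonly collapsed onto the single FLOPs axis via the rule of thumb C ≈ 6·N·D for dense Transformers (forward plus backward pass); a minimal illustrative helper:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the common C ≈ 6·N·D rule of thumb
    (forward + backward pass of a dense Transformer; Kaplan et al. 2020)."""
    return 6.0 * n_params * n_tokens

# Example: a 70B-parameter model trained on 1.4T tokens (Chinchilla-like)
# gives training_flops(70e9, 1.4e12) ≈ 5.9e23 FLOPs.
```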


Evaluation metric: determines whether emergence is observed

A function mapping model output to a scalar task-performance score. The choice of metric directly determines whether we observe a discontinuity (emergence) or a smooth scaling curve. Schaeffer et al. (2023) showed that discrete metrics (exact-match) create apparent emergence.

Exact-match accuracy: discrete; amplifies apparent emergence.
Multiple-choice accuracy: discrete; jumps above the random baseline (e.g. 25%).
Token edit distance: continuous; reveals smooth scaling (Schaeffer et al.; sketched below).
Log-likelihood of correct answer: continuous; the model's direct loss objective, smooth across scale.
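An illustrative contrast between a discrete and a continuous metric (toy strings only, no specific benchmark or harness assumed): a near-miss answer scores 0 under exact match but earns partial credit under edit distance, which is why the latter yields smoother scaling curves.

```python
def exact_match(pred: str, target: str) -> float:
    """Discrete: full credit only for a perfect string match."""
    return float(pred.strip() == target.strip())

def edit_distance(pred: str, target: str) -> int:
    """Continuous(ish): Levenshtein distance, giving partial credit for near misses."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, 1):
        cur = [i]
        for j, t in enumerate(target, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (p != t)))   # substitution / match
        prev = cur
    return prev[-1]

# "3 + 4 + 5 = 13" vs the gold "3 + 4 + 5 = 12": exact_match = 0.0,
# but edit_distance = 1, so improvement with scale shows up gradually.
```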


Emergence threshold: defines where the capability emerges

A point on the scale axis where task performance abruptly exceeds the random baseline. Task- and metric-specific. For Chain-of-Thought on arithmetic: ~100B parameters. For modular arithmetic: ~10²² FLOPs. For simple classification: no threshold (smooth scaling).
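One way to operationalize "significantly exceeds the random baseline" is a one-sided binomial test at each scale point; a sketch assuming SciPy is available (the alpha and example numbers are illustrative):

```python
from scipy.stats import binomtest  # assumes SciPy >= 1.7

def exceeds_random(n_correct: int, n_items: int, chance: float = 0.25,
                   alpha: float = 0.01) -> bool:
    """One-sided binomial test: is accuracy on this benchmark significantly
    above the random baseline? Applied per scale point to locate the threshold."""
    return binomtest(n_correct, n_items, chance, alternative="greater").pvalue < alpha

# Example: 70/200 correct on 4-way multiple choice (35% vs 25% chance) -> True.
```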

Implementation

Implementation pitfalls
Confusing metric emergence with model emergence (severity: High)

Most reported "emergences" disappear when discrete metrics (exact-match) are replaced with continuous ones (token edit distance, log-likelihood). Interpreting an exact-match jump as a fundamental capability jump leads to unjustified claims of unpredictability.

Fix: Always report results with at least one continuous metric (log-likelihood of the correct answer). Use predictive methods as in the GPT-4 technical report.
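A minimal sketch of the recommended continuous metric, computed from the per-token log-probabilities of the gold answer however the evaluation harness exposes them (function names are illustrative):

```python
import math

def answer_log_likelihood(token_logprobs: list[float]) -> float:
    """Sum of per-token log-probabilities of the gold answer under the model.
    Unlike exact match, this moves smoothly as the model improves with scale."""
    return sum(token_logprobs)

def answer_perplexity(token_logprobs: list[float]) -> float:
    """Length-normalized variant, comparable across answers of different length."""
    return math.exp(-sum(token_logprobs) / max(len(token_logprobs), 1))
```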
Undertrained models along the scale axis (severity: High)

Pre-Chinchilla, most model families were undertrained. Apparent emergence thresholds as a function of parameters may have been artifacts of insufficient data: a larger model simply utilized the same corpus better.

Fix: Scale parameters and data together per Chinchilla scaling laws. Report emergence as a function of FLOPs, not parameters alone.
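A rough sketch of the data budget this fix implies, using the commonly cited Chinchilla approximation of about 20 training tokens per parameter (the exact ratio depends on the fitted constants):

```python
def chinchilla_token_budget(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training-token budget in the spirit of Hoffmann et al. (2022):
    scale data together with parameters, at roughly ~20 tokens per parameter."""
    return tokens_per_param * n_params

# Example: a 70B-parameter model -> ~1.4T tokens, the Chinchilla configuration.
```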
Cherry-picking benchmarks that show emergence (severity: Medium)

Wei et al. (2022) selected emergent tasks from a large BIG-Bench pool. Many tasks show no emergence and scale smoothly. Selective reporting of only emergent tasks distorts the picture of LLM behavior.

Fix: Report the full distribution of scaling behaviors, not only emergent tasks. Use aggregates like BIG-Bench Hard with a representative subset.
Lack of replication and variance analysis (severity: Medium)

Many "emergence thresholds" come from a single training seed. The same model trained with a different seed may show the threshold at a different point or not at all.

Fix: Train multiple seeds; report error bars and confidence intervals on scaling curves.
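A small sketch of the suggested error bars: mean accuracy across seeds with an approximate normal confidence interval per scale point (crude with very few seeds; a t-interval or bootstrap is preferable):

```python
import statistics

def seed_mean_ci(accuracies: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean accuracy across training seeds plus an approximate 95% normal CI,
    enough to put error bars on each point of a scaling curve."""
    mean = statistics.fmean(accuracies)
    sem = statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return mean, mean - z * sem, mean + z * sem
```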

Evolution

Original paper · TMLR 2022 · Jason Wei
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus
2020
Scaling laws for neural language models (Kaplan et al.)

Kaplan et al. show that language-model pretraining loss scales smoothly with parameters, data, and FLOPs. This creates the expectation of smooth scaling on downstream tasks too.

2022
BIG-Bench releases 200+ diverse tasks

Beyond the Imitation Game Benchmark (BIG-Bench): a collaborative benchmark with 204 tasks evaluated across model scales. Forms the empirical basis for emergence observations.

2022
Emergent abilities concept formalized (Wei et al.)
Inflection point

Wei et al. publish "Emergent Abilities of Large Language Models" in TMLR, documenting discontinuous capability emergence across 137 BIG-Bench and other benchmark tasks. They introduce the formal definition: a capability is emergent if it is absent in smaller models but present in larger ones.

2022
Chinchilla: optimal data allocation revisited (Hoffmann et al.)
Inflection point

Hoffmann et al. show that prior models (GPT-3, Gopher) were undertrained: optimal scaling balances parameters and data equally. This shifts the interpretation of "emergence thresholds": some apparent thresholds may be artifacts of undertraining rather than parameter scale.

2023
"Mirage" critique: emergence as a metric artifact (Schaeffer et al.)
Inflection point

Schaeffer, Miranda, and Koyejo (NeurIPS 2023, Outstanding Paper Award) show that emergence is largely an artifact of discrete, non-linear evaluation metric choice. Replacing them with continuous metrics (token edit distance, log-likelihood) makes the scaling curve smooth and predictable.

2023
Predictable metrics: GPT-4 capability prediction (OpenAI)

In the GPT-4 technical report, OpenAI demonstrates that certain capabilities (HumanEval pass rate) can be predicted with <1% error from models 10,000× smaller, provided an appropriate continuous metric is used. This reinforces Schaeffer et al.'s argument.

Technical details

Hyperparameters (configurable axes)

Model scale (importance: Critical)

Parameter count or training FLOPs against which emergence is measured.

8B–540B parameters: scale of the PaLM family used in Wei et al. (2022).
10²⁰–10²⁴ FLOPs: training FLOPs range for frontier models from 2020–2024.
Evaluation metric (importance: Critical)

The metric choice determines whether emergence is observed. Discrete metrics amplify it; continuous metrics eliminate it.

exact_match: discrete; exhibits jumps.
token_edit_distance: continuous; smooth curve.
Task type (importance: High)

Some task types (multi-step reasoning, instruction following) show clear emergence; others (sentiment classification) scale smoothly.

Multi-step reasoning (GSM8K): strong emergence above ~100B parameters.
Sentiment classification: smooth scaling, no threshold.

Execution paradigm

Primary mode
dense

Emergent abilities is an observed behavioral phenomenon of LLMs (typically dense Transformers), not an execution mode itself. This field describes the execution of the models exhibiting emergence, which are standardly dense Transformers.

Activation pattern
all_paths_active
Routing mechanism
Not applicable: the dense models in which emergence is observed have no routing.

Parallelism

Parallelism level
fully_parallel

Field not directly applicable to an observational phenomenon. Describes the parallelism of training the models in which emergence is measured, standardly the fully parallel training of dense Transformers.

Scope
training

Hardware requirements

Primary

Emergent abilities is a behavioral observation, independent of specific hardware. Hardware requirements are determined by the underlying LLM, not by the concept itself.