Robots Atlas

Emergent Abilities of Large Language Models

Formalizing and empirically documenting the observation that certain capabilities of large language models appear discontinuously only above a specific scale threshold (parameters, data, FLOPs) and cannot be predicted from the performance of smaller models.

Category
Abstraction level
Operation level
  • Training-budget planning for a new LLM family
  • Choosing a scale threshold for the emergence of reasoning (CoT)
  • Evaluating LLMs on BIG-Bench Hard, MMLU, GSM8K as a function of scale
  • Arguing for further scaling of frontier models
  • Critiquing LLM evaluation methodology (Schaeffer et al.)
  • Deciding between instruction tuning and pure pretraining
  • AI safety analysis (unpredictable emergence of capabilities)
  • Alignment research (whether dangerous capabilities are also emergent)

1. Choose a benchmark and metric: a specific task (e.g. 3-digit addition, MMLU, BIG-Bench Hard) with a discrete success metric (exact-match, multiple-choice accuracy).
2. Train/evaluate models of varying scale: a series of models from the same family with fixed architecture but different sizes (e.g. GPT-3: 125M → 175B; PaLM: 8B → 540B; LaMDA: 137B; Gopher: 280B).
3. Measure task performance as a function of scale (parameters, data, FLOPs). Most tasks show random performance below the threshold (e.g. 25% for 4-way multiple choice) and a sharp rise above it.
4. Identify the threshold: the point where performance significantly exceeds random. For CoT: ~100B parameters. For modular arithmetic: ~10²² FLOPs.
5. Critical analysis (Schaeffer et al. 2023): replace the discrete metric with a continuous one (e.g. token edit distance instead of exact-match); if the curve becomes smooth, the emergence was a metric artifact, not a model phenomenon.
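Steps 3–4 can be sketched as a simple threshold search over a measured scaling curve. Everything below is illustrative: the accuracies are invented, and the 5% margin over the random baseline is an assumed convention, not a definition from the literature.

```python
# A minimal sketch of steps 3-4: locate the smallest scale at which
# accuracy clears the random baseline. All numbers are illustrative.

RANDOM_BASELINE = 0.25  # 4-way multiple choice

# (training FLOPs, exact-match accuracy) for a hypothetical model family
scaling_curve = [
    (1e19, 0.24), (1e20, 0.25), (1e21, 0.26),
    (1e22, 0.41), (1e23, 0.62), (1e24, 0.78),
]

def emergence_threshold(curve, baseline, margin=0.05):
    """Smallest scale whose accuracy exceeds baseline + margin, else None."""
    for scale, acc in sorted(curve):
        if acc > baseline + margin:
            return scale
    return None  # no emergence observed in the measured range

print(emergence_threshold(scaling_curve, RANDOM_BASELINE))  # 1e+22
```

With these numbers the threshold lands at 10²² FLOPs; on a task that scales smoothly, the function returns None over the measured range.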

Language-model pretraining loss (cross-entropy) scales smoothly according to scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, Chinchilla), but downstream task performance does not. In practice, it is hard to predict when a model will acquire a specific capability (reasoning, code generation, instruction following) from smaller-scale curves alone, which makes it difficult to plan training budgets and pick architectures.

01

Model scale axis

X-axis of the emergence plot

Modular

The dimension along which emergence is measured: most commonly parameter count (e.g. 8B → 540B), training tokens, or compute FLOPs. More rigorous analyses use FLOPs as a single axis capturing both parameter and data scale.

Parameter count · Training FLOPs · Training tokens
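The three axes can be related via the widely used approximation from Kaplan et al. (2020) that training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D training tokens; a minimal sketch:

```python
# Rule of thumb (Kaplan et al. 2020): training compute for a dense
# Transformer is approximately C = 6 * N * D FLOPs, where N is the
# parameter count and D the number of training tokens.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

# e.g. a 70B-parameter model trained on 1.4T tokens (Chinchilla-like):
print(f"{train_flops(70e9, 1.4e12):.1e}")  # ~5.9e+23 FLOPs
```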
02

Evaluation metric

Determines whether emergence is observed

Modular

A function mapping model output to a scalar task-performance score. The choice of metric directly determines whether we observe a discontinuity (emergence) or a smooth scaling curve. Schaeffer et al. (2023) showed that discrete metrics (exact-match) create apparent emergence.

Exact-match accuracy · Multiple-choice accuracy · Token edit distance · Log-likelihood of correct answer
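The contrast between a discrete and a continuous metric can be seen on a toy 3-digit-addition example. Here `difflib.SequenceMatcher` stands in for a token-level edit distance; the predictions are invented:

```python
import difflib

def exact_match(pred: str, target: str) -> float:
    """Discrete metric: credit only for a perfect answer."""
    return 1.0 if pred == target else 0.0

def edit_similarity(pred: str, target: str) -> float:
    """Continuous stand-in for token edit distance, scaled to [0, 1]."""
    return difflib.SequenceMatcher(None, pred, target).ratio()

target = "1234"
for guess in ["1284", "1235", "1234"]:
    print(guess, exact_match(guess, target),
          round(edit_similarity(guess, target), 2))
```

Nearly correct answers score 0.0 under exact-match but ~0.75 under the continuous metric, which is why the choice of metric changes the shape of the scaling curve.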
03

Emergence threshold

Defines where the capability emerges

A point on the scale axis where task performance abruptly exceeds the random baseline. Task- and metric-specific. For Chain-of-Thought on arithmetic: ~100B parameters. For modular arithmetic: ~10²² FLOPs. For simple classification: no threshold (smooth scaling).

Parallelism

Fully parallel

This field is not directly applicable to an observational phenomenon; it describes the parallelism of training the models in which emergence is measured, typically the fully parallel training of dense Transformers.

Paradigm

Dense

All paths active

Emergent abilities are an observed behavioral phenomenon of LLMs (typically dense Transformers), not an execution mode. This field describes the execution of the models exhibiting emergence, which are standardly dense Transformers.

Model scale

Critical
  • 8B–540B parameters: scale of the PaLM family used in Wei et al. (2022).
  • 10²⁰–10²⁴ FLOPs: training-FLOPs range for frontier models from 2020–2024.

Parameter count or training FLOPs against which emergence is measured.

Evaluation metric

Critical
  • exact_match: discrete; exhibits jumps.
  • token_edit_distance: continuous; smooth curve.

The metric choice determines whether emergence is observed. Discrete metrics amplify it; continuous metrics eliminate it.

Task type

Standard
  • Multi-step reasoning (GSM8K): strong emergence above ~100B parameters.
  • Sentiment classification: smooth scaling, no threshold.

Some task types (multi-step reasoning, instruction following) show clear emergence; others (sentiment classification) scale smoothly.

Common pitfalls

Confusing metric emergence with model emergence
HIGH

Most reported "emergences" disappear when discrete metrics (exact-match) are replaced with continuous ones (token edit distance, log-likelihood). Interpreting an exact-match jump as a fundamental capability jump leads to unjustified claims of unpredictability.

Always report results with at least one continuous metric (log-likelihood of the correct answer). Use predictive methods as in the GPT-4 technical report.
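The mechanics of the artifact are easy to reproduce: if per-token accuracy p improves smoothly with scale, exact-match on an L-token answer behaves like p^L, which turns a smooth curve into an apparent jump. A toy illustration of Schaeffer et al.'s argument, with assumed numbers:

```python
# Toy reproduction of the "mirage" mechanism: per-token accuracy p
# improves smoothly with scale, but exact-match on an L-token answer
# behaves like p**L, producing an apparent jump. Numbers are assumed.
L = 10  # answer length in tokens (illustrative)
for p in [0.5, 0.7, 0.9, 0.95, 0.99]:  # smooth per-token improvement
    print(f"per-token {p:.2f} -> exact-match {p ** L:.3f}")
```

Per-token accuracy rising smoothly from 0.5 to 0.99 takes exact-match from near zero to ~0.9, looking like a discontinuity even though nothing discontinuous happened.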

Undertrained models along the scale axis
HIGH

Pre-Chinchilla, most model families were undertrained. Apparent emergence thresholds as a function of parameter count may have been artifacts of insufficient data: a larger model simply used the same corpus more effectively.

Scale parameters and data together per Chinchilla scaling laws. Report emergence as a function of FLOPs, not parameters alone.
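The remedy can use the commonly quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter (an approximation of Hoffmann et al.'s result; the helper name is hypothetical):

```python
# Chinchilla rule of thumb (Hoffmann et al. 2022): compute-optimal
# training uses roughly 20 tokens per parameter.
def chinchilla_tokens(n_params: float) -> float:
    return 20.0 * n_params

# A 70B-parameter model would want on the order of 1.4T tokens:
print(f"{chinchilla_tokens(70e9):.2e}")
```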

Cherry-picking benchmarks that show emergence
MEDIUM

Wei et al. (2022) selected emergent tasks from a large BIG-Bench pool. Many tasks show no emergence and scale smoothly. Selective reporting of only emergent tasks distorts the picture of LLM behavior.

Report the full distribution of scaling behaviors, not only emergent tasks. Use aggregates like BIG-Bench Hard with a representative subset.

Lack of replication and variance analysis
MEDIUM

Many "emergence thresholds" come from a single training seed. The same model trained with a different seed may show the threshold at a different point or not at all.

Train multiple seeds, report error bars and confidence intervals on scaling curves.
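Error bars on each point of the scaling curve can come from a simple nonparametric bootstrap over per-example correctness; the 100-example run below is synthetic:

```python
# Bootstrap a confidence interval on benchmark accuracy so each point
# on the scaling curve carries an error bar. The correctness vector
# below is synthetic, not real model outputs.
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean of 0/1 correctness scores."""
    rng = random.Random(0)  # fixed seed for reproducibility
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

correct = [1] * 62 + [0] * 38  # 62% accuracy on 100 examples
lo, hi = bootstrap_ci(correct)
print(f"62% accuracy, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```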

GENESIS Β· Source paper

Emergent Abilities of Large Language Models
2022 · TMLR · Jason Wei, Yi Tay, Rishi Bommasani et al.
2020

Scaling laws for neural language models (Kaplan et al.)

Kaplan et al. show that language-model pretraining loss scales smoothly with parameters, data, and FLOPs. This creates the expectation of smooth scaling on downstream tasks too.

2022

BIG-Bench releases 200+ diverse tasks

Beyond the Imitation Game Benchmark (BIG-Bench): a collaborative benchmark with 204 tasks evaluated across model scales. It forms the empirical basis for many emergence observations.

2022

Emergent abilities concept formalized (Wei et al.)

breakthrough

Wei et al. publish "Emergent Abilities of Large Language Models" in TMLR, documenting discontinuous capability emergence across 137 tasks from BIG-Bench and other benchmarks. They introduce the formal definition: a capability is emergent if it is absent in smaller models but present in larger ones.

2022

Chinchilla: optimal data allocation revisited (Hoffmann et al.)

breakthrough

Hoffmann et al. show that prior models (GPT-3, Gopher) were undertrained: optimal scaling balances parameters and data equally. This shifts the interpretation of "emergence thresholds": some apparent thresholds may be artifacts of undertraining rather than parameter scale.

2023

"Mirage" critique: emergence as metric artifact (Schaeffer et al.)

breakthrough

Schaeffer, Miranda, and Koyejo (NeurIPS 2023, Outstanding Paper Award) show that emergence is largely an artifact of discrete, non-linear evaluation metric choice. Replacing them with continuous metrics (token edit distance, log-likelihood) makes the scaling curve smooth and predictable.

2024

Predictable metrics: GPT-4 capability prediction (OpenAI)

In the GPT-4 technical report, OpenAI demonstrates that certain capabilities (HumanEval pass rate) can be predicted with <1% error from models 10,000× smaller, provided an appropriate continuous metric is used. This reinforces Schaeffer et al.'s argument.
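The prediction approach can be sketched as a power-law fit in log-log space, extrapolated from small-scale points. The data here are synthetic and lie exactly on a power law, unlike real measurements:

```python
import math

def fit_power_law(points):
    """Least-squares fit of log(y) = log(a) + b*log(x); returns (a, b)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic small-scale points lying exactly on y = 50 * C**-0.05
small = [(c, 50 * c ** -0.05) for c in (1e18, 1e19, 1e20)]
a, b = fit_power_law(small)
pred = a * 1e24 ** b  # extrapolate four orders of magnitude in compute
print(f"fitted exponent {b:.3f}, predicted metric {pred:.3f}")
```

On clean synthetic data the extrapolation is exact; the practical point is that such fits only work for metrics that scale smoothly, which is why the continuous-metric choice matters.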

Hardware agnosticPRIMARY

Emergent abilities are a behavioral observation, independent of any specific hardware. Hardware requirements are determined by the underlying LLM, not by the concept itself.

BUILT ON

LLM

Large Language Models (LLMs) are a class of machine learning models based on the Transformer architecture, trained on large text corpora via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pretraining phase.

GO TO CONCEPT

Commonly used with

CoT

Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results.

In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars.

CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data.

Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.
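The few-shot and zero-shot variants differ only in prompt construction. A minimal sketch, with no model invoked: the exemplar paraphrases the well-known tennis-ball example from Wei et al. (2022), while the bakery question and variable names are hypothetical.

```python
# Minimal prompt construction for few-shot and zero-shot CoT.
# No LLM is called here; the strings would be sent to a model.
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

question = "Q: A baker fills 3 trays with 12 rolls each. How many rolls in total?"

# Few-shot CoT: exemplar chain(s) of thought precede the new question.
few_shot_cot = exemplar + question + "\nA:"
# Zero-shot CoT (Kojima et al. 2022): a single trigger phrase, no exemplars.
zero_shot_cot = question + "\nA: Let's think step by step."
```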

GO TO CONCEPT