Robots Atlas

Emergent Abilities of Large Language Models

Formalizing and empirically documenting the observation that certain capabilities of large language models appear discontinuously only above a specific scale threshold (parameters, data, FLOPs) and cannot be predicted from the performance of smaller models.

Category
Abstraction level
Operation level
  • Training-budget planning for a new LLM family
  • Choosing a scale threshold for the emergence of reasoning (CoT)
  • Evaluating LLMs on BIG-Bench Hard, MMLU, GSM8K as a function of scale
  • Arguing for further scaling of frontier models
  • Critiquing LLM evaluation methodology (Schaeffer et al.)
  • Deciding between instruction tuning and pure pretraining
  • AI safety analysis (unpredictable emergence of capabilities)
  • Alignment research (whether dangerous capabilities are also emergent)

1. Choose a benchmark and metric: a specific task (e.g. 3-digit addition, MMLU, BIG-Bench Hard) with a discrete success metric (exact-match, multiple-choice accuracy).
2. Train/evaluate models of varying scale: a series of models from the same family with fixed architecture but different sizes (e.g. GPT-3: 125M → 175B; PaLM: 8B → 540B; LaMDA: 137B; Gopher: 280B).
3. Measure task performance as a function of scale (parameters, data, FLOPs). Most tasks show random performance below the threshold (e.g. 25% for 4-way multiple choice) and a sharp rise above it.
4. Identify the threshold: the point where performance significantly exceeds random. For CoT: ~100B parameters. For modular arithmetic: ~10²² FLOPs.
5. Critical analysis (Schaeffer et al. 2023): replace the discrete metric with a continuous one (e.g. token edit distance instead of exact-match); if the curve becomes smooth, the emergence was a metric artifact, not a model phenomenon.
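Steps 3–4 can be sketched as a simple threshold search over a measured scaling curve. Everything below is illustrative: the accuracies are invented, and the 5% margin over the random baseline is an assumed convention, not a definition from the literature.

```python
# A minimal sketch of steps 3-4: locate the smallest scale at which
# accuracy clears the random baseline. All numbers are illustrative.

RANDOM_BASELINE = 0.25  # 4-way multiple choice

# (training FLOPs, exact-match accuracy) for a hypothetical model family
scaling_curve = [
    (1e19, 0.24), (1e20, 0.25), (1e21, 0.26),
    (1e22, 0.41), (1e23, 0.62), (1e24, 0.78),
]

def emergence_threshold(curve, baseline, margin=0.05):
    """Smallest scale whose accuracy exceeds baseline + margin, else None."""
    for scale, acc in sorted(curve):
        if acc > baseline + margin:
            return scale
    return None  # no emergence observed in the measured range

print(emergence_threshold(scaling_curve, RANDOM_BASELINE))  # 1e+22
```

With these numbers the threshold lands at 10²² FLOPs; on a task that scales smoothly, the function returns None over the measured range.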

Language-model pretraining loss (cross-entropy) scales smoothly according to scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, Chinchilla), but downstream task performance does not. In practice, it is hard to predict when a model will acquire a specific capability (reasoning, code generation, instruction following) from smaller-scale curves alone, which makes it difficult to plan training budgets and pick architectures.

01

Model scale axis

X-axis of the emergence plot

Modular

The dimension along which emergence is measured: most commonly parameter count (e.g. 8B → 540B), training tokens, or compute FLOPs. More rigorous analyses use FLOPs as a single axis capturing both parameter and data scale.

Parameter count · Training FLOPs · Training tokens
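The three axes can be related via the widely used approximation from Kaplan et al. (2020) that training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D training tokens; a minimal sketch:

```python
# Rule of thumb (Kaplan et al. 2020): training compute for a dense
# Transformer is approximately C = 6 * N * D FLOPs, where N is the
# parameter count and D the number of training tokens.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

# e.g. a 70B-parameter model trained on 1.4T tokens (Chinchilla-like):
print(f"{train_flops(70e9, 1.4e12):.1e}")  # ~5.9e+23 FLOPs
```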
02

Evaluation metric

Determines whether emergence is observed

Modular

A function mapping model output to a scalar task-performance score. The choice of metric directly determines whether we observe a discontinuity (emergence) or a smooth scaling curve. Schaeffer et al. (2023) showed that discrete metrics (exact-match) create apparent emergence.

Exact-match accuracy · Multiple-choice accuracy · Token edit distance · Log-likelihood of correct answer
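The contrast between a discrete and a continuous metric can be seen on a toy 3-digit-addition example. Here `difflib.SequenceMatcher` stands in for a token-level edit distance; the predictions are invented:

```python
import difflib

def exact_match(pred: str, target: str) -> float:
    """Discrete metric: credit only for a perfect answer."""
    return 1.0 if pred == target else 0.0

def edit_similarity(pred: str, target: str) -> float:
    """Continuous stand-in for token edit distance, scaled to [0, 1]."""
    return difflib.SequenceMatcher(None, pred, target).ratio()

target = "1234"
for guess in ["1284", "1235", "1234"]:
    print(guess, exact_match(guess, target),
          round(edit_similarity(guess, target), 2))
```

Nearly correct answers score 0.0 under exact-match but ~0.75 under the continuous metric, which is why the choice of metric changes the shape of the scaling curve.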
03

Emergence threshold

Defines where the capability emerges

A point on the scale axis where task performance abruptly exceeds the random baseline. Task- and metric-specific. For Chain-of-Thought on arithmetic: ~100B parameters. For modular arithmetic: ~10²² FLOPs. For simple classification: no threshold (smooth scaling).

Parallelism

Fully parallel

This field is not directly applicable to an observational phenomenon; it describes the parallelism of training the models in which emergence is measured, typically the fully parallel training of dense Transformers.

Paradigm

Dense

All paths active

Emergent abilities are an observed behavioral phenomenon of LLMs (typically dense Transformers), not an execution mode. This field describes the execution of the models exhibiting emergence, which are standardly dense Transformers.

Model scale

Critical
  • 8B–540B parameters: scale of the PaLM family used in Wei et al. (2022).
  • 10²⁰–10²⁴ FLOPs: training-FLOPs range for frontier models from 2020–2024.

Parameter count or training FLOPs against which emergence is measured.

Evaluation metric

Critical
  • exact_match: discrete; exhibits jumps.
  • token_edit_distance: continuous; smooth curve.

The metric choice determines whether emergence is observed. Discrete metrics amplify it; continuous metrics eliminate it.

Task type

Standard
  • Multi-step reasoning (GSM8K): strong emergence above ~100B parameters.
  • Sentiment classification: smooth scaling, no threshold.

Some task types (multi-step reasoning, instruction following) show clear emergence; others (sentiment classification) scale smoothly.

Common pitfalls

Confusing metric emergence with model emergence
HIGH

Most reported "emergences" disappear when discrete metrics (exact-match) are replaced with continuous ones (token edit distance, log-likelihood). Interpreting an exact-match jump as a fundamental capability jump leads to unjustified claims of unpredictability.

Always report results with at least one continuous metric (log-likelihood of the correct answer). Use predictive methods as in the GPT-4 technical report.
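The mechanics of the artifact are easy to reproduce: if per-token accuracy p improves smoothly with scale, exact-match on an L-token answer behaves like p^L, which turns a smooth curve into an apparent jump. A toy illustration of Schaeffer et al.'s argument, with assumed numbers:

```python
# Toy reproduction of the "mirage" mechanism: per-token accuracy p
# improves smoothly with scale, but exact-match on an L-token answer
# behaves like p**L, producing an apparent jump. Numbers are assumed.
L = 10  # answer length in tokens (illustrative)
for p in [0.5, 0.7, 0.9, 0.95, 0.99]:  # smooth per-token improvement
    print(f"per-token {p:.2f} -> exact-match {p ** L:.3f}")
```

Per-token accuracy rising smoothly from 0.5 to 0.99 takes exact-match from near zero to ~0.9, looking like a discontinuity even though nothing discontinuous happened.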

Undertrained models along the scale axis
HIGH

Pre-Chinchilla, most model families were undertrained. Apparent emergence thresholds as a function of parameter count may have been artifacts of insufficient data: a larger model simply used the same corpus more effectively.

Scale parameters and data together per Chinchilla scaling laws. Report emergence as a function of FLOPs, not parameters alone.
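The remedy can use the commonly quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter (an approximation of Hoffmann et al.'s result; the helper name is hypothetical):

```python
# Chinchilla rule of thumb (Hoffmann et al. 2022): compute-optimal
# training uses roughly 20 tokens per parameter.
def chinchilla_tokens(n_params: float) -> float:
    return 20.0 * n_params

# A 70B-parameter model would want on the order of 1.4T tokens:
print(f"{chinchilla_tokens(70e9):.2e}")
```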

Cherry-picking benchmarks that show emergence
MEDIUM

Wei et al. (2022) selected emergent tasks from a large BIG-Bench pool. Many tasks show no emergence and scale smoothly. Selective reporting of only emergent tasks distorts the picture of LLM behavior.

Report the full distribution of scaling behaviors, not only emergent tasks. Use aggregates like BIG-Bench Hard with a representative subset.

Lack of replication and variance analysis
MEDIUM

Many "emergence thresholds" come from a single training seed. The same model trained with a different seed may show the threshold at a different point or not at all.

Train multiple seeds, report error bars and confidence intervals on scaling curves.
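Error bars on each point of the scaling curve can come from a simple nonparametric bootstrap over per-example correctness; the 100-example run below is synthetic:

```python
# Bootstrap a confidence interval on benchmark accuracy so each point
# on the scaling curve carries an error bar. The correctness vector
# below is synthetic, not real model outputs.
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean of 0/1 correctness scores."""
    rng = random.Random(0)  # fixed seed for reproducibility
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

correct = [1] * 62 + [0] * 38  # 62% accuracy on 100 examples
lo, hi = bootstrap_ci(correct)
print(f"62% accuracy, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```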

GENESIS Β· Source paper

Emergent Abilities of Large Language Models
2022 · TMLR · Jason Wei, Yi Tay, Rishi Bommasani et al.
2020

Scaling laws for neural language models (Kaplan et al.)

Kaplan et al. show that language-model pretraining loss scales smoothly with parameters, data, and FLOPs. This creates the expectation of smooth scaling on downstream tasks too.

2022

BIG-Bench releases 200+ diverse tasks

Beyond the Imitation Game Benchmark (BIG-Bench): a collaborative benchmark with 204 tasks evaluated across model scales. It forms the empirical basis for many emergence observations.

2022

Emergent abilities concept formalized (Wei et al.)

breakthrough

Wei et al. publish "Emergent Abilities of Large Language Models" in TMLR, documenting discontinuous capability emergence across 137 tasks from BIG-Bench and other benchmarks. They introduce the formal definition: a capability is emergent if it is absent in smaller models but present in larger ones.

2022

Chinchilla: optimal data allocation revisited (Hoffmann et al.)

breakthrough

Hoffmann et al. show that prior models (GPT-3, Gopher) were undertrained: optimal scaling balances parameters and data equally. This shifts the interpretation of "emergence thresholds": some apparent thresholds may be artifacts of undertraining rather than parameter scale.

2023

"Mirage" critique: emergence as metric artifact (Schaeffer et al.)

breakthrough

Schaeffer, Miranda, and Koyejo (NeurIPS 2023, Outstanding Paper Award) show that emergence is largely an artifact of discrete, non-linear evaluation metric choice. Replacing them with continuous metrics (token edit distance, log-likelihood) makes the scaling curve smooth and predictable.

2024

Predictable metrics: GPT-4 capability prediction (OpenAI)

In the GPT-4 technical report, OpenAI demonstrates that certain capabilities (HumanEval pass rate) can be predicted with <1% error from models 10,000× smaller, provided an appropriate continuous metric is used. This reinforces Schaeffer et al.'s argument.
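The prediction approach can be sketched as a power-law fit in log-log space, extrapolated from small-scale points. The data here are synthetic and lie exactly on a power law, unlike real measurements:

```python
import math

def fit_power_law(points):
    """Least-squares fit of log(y) = log(a) + b*log(x); returns (a, b)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic small-scale points lying exactly on y = 50 * C**-0.05
small = [(c, 50 * c ** -0.05) for c in (1e18, 1e19, 1e20)]
a, b = fit_power_law(small)
pred = a * 1e24 ** b  # extrapolate four orders of magnitude in compute
print(f"fitted exponent {b:.3f}, predicted metric {pred:.3f}")
```

On clean synthetic data the extrapolation is exact; the practical point is that such fits only work for metrics that scale smoothly, which is why the continuous-metric choice matters.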

Hardware agnosticPRIMARY

Emergent abilities are a behavioral observation, independent of any specific hardware. Hardware requirements are determined by the underlying LLM, not by the concept itself.

BUILT ON

LLM

Large Language Models (LLMs) are a class of machine learning models based on the Transformer architecture, trained on large text corpora via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pretraining phase.

GO TO CONCEPT

Commonly used with

CoT

Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results.

In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars.

CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data.

Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.
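The few-shot and zero-shot variants differ only in prompt construction. A minimal sketch, with no model invoked: the exemplar paraphrases the well-known tennis-ball example from Wei et al. (2022), while the bakery question and variable names are hypothetical.

```python
# Minimal prompt construction for few-shot and zero-shot CoT.
# No LLM is called here; the strings would be sent to a model.
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

question = "Q: A baker fills 3 trays with 12 rolls each. How many rolls in total?"

# Few-shot CoT: exemplar chain(s) of thought precede the new question.
few_shot_cot = exemplar + question + "\nA:"
# Zero-shot CoT (Kojima et al. 2022): a single trigger phrase, no exemplars.
zero_shot_cot = question + "\nA: Let's think step by step."
```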

GO TO CONCEPT