Formalized empirical power-law relationships linking model performance to parameter count, data size, and compute budget, enabling performance prediction and optimal resource allocation.
For language models, loss L scales as L(N) ~ N^(-α_N), L(D) ~ D^(-α_D), L(C) ~ C^(-α_C), where the exponents α are characteristic of the model and task. Researchers fit these power laws to experimental results at various N, D, and C and extrapolate to larger scales.
Lack of predictable compute-allocation principles: it was unclear whether it is better to train a large model briefly or a smaller model for longer, or how many parameters are needed for a given compute budget.
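As an illustration of the fitting procedure described above, the sketch below fits a saturating power law L(N) = E + (N_c/N)^α to a handful of (N, loss) points and extrapolates. The data points, starting values, and constants are made up for illustration; only the functional form (a power law plus an irreducible-loss term E) reflects how such fits are typically done.

```python
# Minimal sketch: fit L(N) = E + (N_c / N)**alpha to a few training runs and extrapolate.
# The (N, loss) pairs and initial guesses below are placeholders, not real measurements.
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_params(N, E, log10_Nc, alpha):
    """Power law in N with an irreducible-loss asymptote E; N_c parameterized in log10."""
    return E + (10.0 ** log10_Nc / N) ** alpha

N_runs = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])        # model sizes in the sweep
L_runs = np.array([5.70, 5.38, 5.05, 4.80, 4.51, 4.30])  # hypothetical eval losses

popt, _ = curve_fit(loss_vs_params, N_runs, L_runs, p0=[2.0, 13.0, 0.1], maxfev=20000)
E, log10_Nc, alpha = popt
print(f"E ~ {E:.2f}, N_c ~ {10**log10_Nc:.2e}, alpha_N ~ {alpha:.3f}")

# Extrapolation: validate on a held-out larger scale before trusting it.
print("predicted loss at N = 1e10:", round(loss_vs_params(1e10, *popt), 3))
```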
Representational capacity axis
Number of learned weights in the model (excluding embeddings in Kaplan's original formulation). The primary axis of representational capacity.
Information capacity axis
Number of tokens (or examples) in the training set. Defines the maximum information available for the model to learn from.
Resource axis
Total training compute, typically expressed in FLOPs. For dense-attention transformers: C ≈ 6 · N · D.
Dependent (measured) variable
Cross-entropy loss (test/val) as the dependent variable in scaling laws: L(N), L(D), L(C) follow power-law forms with an irreducible-loss asymptote.
Curve-shape parameters
Empirically fitted exponents that control how fast loss decreases as N, D, or C grows. In Kaplan's original paper, α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050 (details depend on the fit).
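For intuition, the short sketch below (illustrative only) plugs the quoted exponents into the single-variable power laws and uses the C ≈ 6 · N · D estimate from the compute entry above; all constants are taken from the text or assumed, not fitted.

```python
# How much does a 10x scale-up along one axis shrink the reducible loss under
# L ~ x**(-alpha)?  Exponents are the approximate Kaplan et al. values quoted above.
exponents = {"N": 0.076, "D": 0.095, "C": 0.050}
for axis, alpha in exponents.items():
    factor = 10 ** (-alpha)
    print(f"10x more {axis}: reducible loss x{factor:.3f} ({(1 - factor) * 100:.1f}% lower)")

# Training-compute estimate for a dense transformer via C ~ 6 * N * D.
N, D = 1.0e9, 2.0e10   # 1B parameters, 20B tokens (roughly 20 tokens per parameter)
print(f"C ~ {6 * N * D:.2e} FLOPs")
```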
Parameter count (N)
Scales from ~10^6 (small probes) to ~10^12+ (frontier LLMs). Increasing N reduces loss as L(N) ~ N^(-α_N) when data and compute are not the bottleneck.
Dataset size in tokens (D)
Tokens in the training corpus. Chinchilla shows D should scale roughly linearly with N (≈ 20 tokens per parameter) for compute-optimal training.
Compute budget in FLOPs (C)
Total training compute. For a given C, the minimum loss is achieved at a specific (N*, D*) pair; Chinchilla gives N* ≈ D*/20 (see the allocation sketch after this list).
Critical batch size (B_crit)
Batch size above which the returns from greater data parallelism diminish. It also follows a power law in the loss L (McCandlish et al., 2018).
Learning rate schedule
Optimal LR and its cooldown depend on (N, D). A poor LR can mask the true scaling law in empirical sweeps.
Architecture shape (depth/width)
Kaplan et al. showed that at fixed N the architecture shape (depth vs. width) has only a marginal impact on L, so scale N rather than tune shape.
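The allocation sketch referenced above: combining the Chinchilla heuristic D ≈ 20 · N with C ≈ 6 · N · D gives C ≈ 120 · N², so N* = sqrt(C / 120). A minimal sketch assuming exactly 20 tokens per parameter; real fits vary (see the Epoch AI refit below).

```python
# Rough compute-optimal sizing from a FLOPs budget under the Chinchilla heuristic.
# Assumes D ~= 20 * N and C ~= 6 * N * D, hence N* = sqrt(C / 120); illustration only.
import math

def chinchilla_allocation(flops_budget, tokens_per_param=20.0):
    n_star = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    d_star = tokens_per_param * n_star
    return n_star, d_star

for budget in (1e21, 1e23, 1e25):
    n_opt, d_opt = chinchilla_allocation(budget)
    print(f"C = {budget:.0e} FLOPs -> N* ~ {n_opt:.2e} params, D* ~ {d_opt:.2e} tokens")
```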
Kaplan's paper suggested scaling N much faster than D (GPT-3 was the result). Chinchilla showed this stemmed from suboptimal LR cooldown and undertraining; the actual compute-optimal allocation scales N and D roughly equally (≈ 20 tokens per parameter).
Use Chinchilla-optimal allocation (~20 tokens/parameter) as a baseline. For deployment-cost-aware training, over-training (>>20:1) is rational: a smaller model trained on more tokens lowers inference cost.
The exponents α are fit over a limited range of (N, D, C). Extrapolating by 2–3 orders of magnitude can be inaccurate, especially near the irreducible-loss asymptote.
Measure scaling laws on overlapping ranges (small + mid models) and validate the fit on a held-out scale. Include the irreducible-loss term in the fitting function.
Chinchilla optimizes training cost. In production, inference cost matters too: for models served at scale it pays off to train smaller models longer (Llama, Mistral).
Define the objective as training_cost + λ · inference_cost · usage_volume. For high-usage products, λ shifts the optimum toward smaller models trained on more data; see the sketch after this block.
The α exponents differ across modalities (vision, code, multimodal) and tasks (capability vs perplexity). Directly transferring Kaplan or Chinchilla numbers to other domains yields incorrect predictions.
For a new domain, fit your own scaling laws on small models before any large run. Be cautious with capability benchmarks: they do not scale as smoothly as loss.
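The sketch referenced above makes the deployment-aware objective concrete. It uses a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β with constants close to those reported by Hoffmann et al., solves for the tokens each candidate N would need to reach a target loss, and adds an inference term of roughly 2 · N FLOPs per generated token. The target loss, usage volume, λ, and candidate sizes are all hypothetical.

```python
# Sketch of deployment-aware sizing: minimize
#   total = training_FLOPs + lam * inference_FLOPs_per_token * expected_tokens_served
# over (N, D) pairs predicted to reach the same loss. Loss model: Chinchilla-style
# L(N, D) = E + A / N**a + B / D**b with constants close to Hoffmann et al.'s fit.
# Target loss, usage volume, lam, and candidate sizes are hypothetical.
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_needed(N, target_loss):
    """Training tokens D predicted to bring an N-parameter model to target_loss."""
    reducible = target_loss - E - A / N**a
    if reducible <= 0:
        return float("inf")   # this N cannot reach the target under the loss model
    return (B / reducible) ** (1.0 / b)

target_loss = 2.05
expected_tokens_served = 1e13      # lifetime inference volume (hypothetical)
lam = 1.0                          # weight on inference cost

for N in (3e9, 7e9, 13e9, 30e9, 70e9):
    D = tokens_needed(N, target_loss)
    train_flops = 6 * N * D                               # C ~ 6 * N * D
    serve_flops = lam * 2 * N * expected_tokens_served    # ~2 * N FLOPs per token served
    print(f"N = {N:.0e}: D = {D:.2e}, train = {train_flops:.2e}, "
          f"serve = {serve_flops:.2e}, total = {train_flops + serve_flops:.2e}")
# With heavy usage, the total is minimized by a smaller N trained on more tokens.
```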
GENESIS · Source paper
Scaling Laws for Neural Language Models (OpenAI)
Breakthrough: Kaplan et al. establish power-law relationships between compute, data, parameters, and loss.
Chinchilla scaling laws by Hoffmann et al.
Breakthrough: Hoffmann et al. show GPT-3-era models were undertrained; the optimal compute split scales N and D roughly equally.
Scaling laws for specific domains and modalities
Researchers extend scaling laws to code, vision, multimodal, and reasoning tasks.
Empirical batch-size scaling laws (precursor)
McCandlish et al. (OpenAI) describe the scaling of critical batch size via the gradient noise scale, a methodological precursor to Kaplan.
Over-training era (Llama, Mistral, Gemma)
Breakthrough: For production-served models, training beyond Chinchilla-optimal pays off: smaller N and much larger D (e.g. 100+ tokens/parameter in Llama 3) lower inference cost.
Chinchilla critique and refit (Epoch AI)
Independent replications (Epoch AI) showed Chinchilla's original fits may underestimate optimal D: the effective tokens/parameter ratio may exceed 20.
Scaling laws are empirical observations about the (N, D, C) → L relationship. They do not depend on a specific hardware architecture; they hold as long as training FLOPs and loss can be measured.
In practice, fitting scaling laws requires running many training jobs at varying N and D, which depends on efficient LLM training hardware (H100/A100/B200/TPU). The critical-batch-size analysis comes from the data-parallel GPU training literature.
Chinchilla was trained on TPUs (Google). Scaling laws apply to TPU training just as much as to GPU training.
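To connect a FLOPs budget to wall-clock hardware cost, a rough sketch follows. The per-GPU peak throughput and utilization are assumptions (roughly H100-class BF16 dense peak and a typical model-FLOPs utilization), not measured figures.

```python
# Rough GPU-hours estimate for a training budget. Peak throughput and MFU below are
# assumptions (approximately H100-class BF16 dense peak, typical utilization).
PEAK_FLOPS_PER_GPU = 1.0e15   # ~1 PFLOP/s per GPU (assumed)
MFU = 0.40                    # assumed model-FLOPs utilization

def gpu_hours(total_flops):
    effective_flops_per_gpu = PEAK_FLOPS_PER_GPU * MFU
    return total_flops / effective_flops_per_gpu / 3600.0

C = 6 * 70e9 * 1.4e12          # Chinchilla-scale budget via C ~ 6 * N * D
print(f"~{gpu_hours(C):,.0f} GPU-hours at the assumed 40% MFU")
```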
The Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so long-range dependencies are learned with a constant number of sequential steps between positions (rather than a number that grows with distance, as in RNNs). The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, quadratic attention complexity in sequence length (O(n²)), is an active research direction (FlashAttention, sliding-window attention, linear attention, SSMs).
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data (next tokens in text, masked words, future video frames, future robot states) without human labels. This unlocks virtually unlimited raw data: web crawls, code, books, YouTube video, robot telemetry. The result is a set of weights encoding "world knowledge", dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
Emergent abilities of large language models is an observation, formalized by Wei et al. (2022), that certain LLM capabilities (such as multi-step reasoning, zero-shot instruction following, modular arithmetic, or answering questions in low-resource languages) do not appear gradually with scale but emerge discontinuously, only after crossing a threshold of parameter count, training data, or compute FLOPs. Below the threshold, performance is random or near-zero; above it, performance jumps abruptly to substantially better than random. The phenomenon has been documented across more than 130 tasks from the BIG-Bench benchmark and other suites (MMLU, TruthfulQA). Canonical examples include chain-of-thought reasoning (~100B-parameter threshold for PaLM/GPT-3), InstructGPT-style instruction following, modular arithmetic, International Phonetic Alphabet transliteration, and multi-step question answering. In 2023, Schaeffer, Miranda, and Koyejo (NeurIPS 2023, "Are Emergent Abilities of Large Language Models a Mirage?") challenged emergence as a real, fundamental phenomenon. They showed that non-linear or discontinuous evaluation metrics (e.g. exact-match accuracy) artificially create the appearance of a jump; replacing them with continuous metrics (token edit distance, log-likelihood) reveals a smooth, predictable scaling curve. This critique is now central to the debate: some abilities are emergent only in a metric-dependent sense, while others (e.g. inductive reasoning) appear to show genuine phase discontinuities. The concept has critical practical significance: if emergence is real, certain abilities cannot be predicted or trained at smaller scale, forcing organizations to train large models "blindly." If emergence is a metric artifact, then scaling laws (Hoffmann et al., Chinchilla) are sufficient to predict the behavior of larger models.
Mixture of Experts (MoE) is an architecture in which a model is composed of multiple parallel sub-networks, the experts, along with a gating (routing) network that determines, for each input, which subset of experts to activate and how to combine their outputs. The gating network produces a weighting over experts; in the original soft formulation (Jacobs et al., 1991), all experts are weighted and summed. In the sparse formulation (Shazeer et al., 2017), only the top-k scoring experts are activated, and the remaining experts produce no output and incur no compute cost for that input. In the context of large language models, MoE is typically applied as a replacement for the feed-forward network (FFN) sub-layer within each Transformer block. Each token is routed to a small number of expert FFNs (commonly top-1 or top-2), with the router being a learned linear projection followed by a softmax. The outputs of the selected experts are weighted by the corresponding router scores and summed. A central challenge in sparse MoE is load balancing: without explicit regularization, the router tends to collapse onto a small set of preferred experts, leaving others undertrained. This is addressed via auxiliary load-balancing losses added to the training objective, which encourage a roughly uniform distribution of tokens across experts. Expert parallelism is the standard distributed training and inference strategy for large MoE models: each expert is placed on a separate device, so the total parameter count scales with the number of devices without increasing per-device memory or per-token FLOPs proportionally. The capacity factor controls the maximum number of tokens each expert can process per batch; tokens that overflow the capacity are either dropped or passed through a residual connection. Tuning the capacity factor is a critical practical consideration.
| Title | Publisher | Type |
|---|---|---|
| Scaling Laws for Neural Language Models | OpenAI | scientific article |
| Training Compute-Optimal Large Language Models (Chinchilla) | DeepMind | scientific article |
| An Empirical Model of Large-Batch Training (McCandlish et al.) | OpenAI | scientific article |
| Chinchilla's scaling law fits are not as accurate as they seem (Epoch AI) | Epoch AI | blog |