Scaling Laws (Kaplan / Chinchilla)
How it works
For language models, loss L scales as L(N) ~ N^(-α_N), L(D) ~ D^(-α_D), and L(C) ~ C^(-α_C), where N is parameter count, D is dataset size, C is training compute, and the exponents α are characteristic of the model and task. Researchers fit these power laws to experimental results at various N, D, and C, then extrapolate to larger scales.
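The fitting step is a least-squares line in log-log space. A minimal sketch with synthetic points (real fits use measured validation losses from training runs at each scale; the constants here are illustrative):

```python
import numpy as np

# Synthetic losses following L(N) = c * N^(-alpha); in practice these
# would be measured validation losses from runs at each model size N.
alpha_true, c = 0.076, 10.0
N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
L = c * N ** (-alpha_true)

# A power law is a straight line in log-log space:
# log L = log c - alpha * log N, so a linear fit recovers the exponent.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_fit = -slope

# Extrapolate to a scale beyond any fitted point (the risky step).
L_pred = np.exp(intercept) * (1e12) ** slope
```

The extrapolation in the last line is exactly the step that breaks down far outside the fitted range, as discussed under Implementation below.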
Problem solved
Lack of predictable compute-allocation principles: it was unclear whether it is better to train a large model briefly or a smaller model for longer, or how many parameters a given compute budget warrants.
Components
N (parameters): Number of learned weights in the model (excluding embeddings in Kaplan's original formulation). The primary axis of representational capacity.
D (dataset size): Number of tokens (or examples) in the training set. Defines the maximum information available for the model to learn from.
C (compute): Total training compute, typically expressed in FLOPs. For dense-attention transformers: C ≈ 6 · N · D.
L (loss): Cross-entropy loss (test/validation) as the dependent variable in scaling laws: L(N), L(D), L(C) follow power-law forms with an irreducible-loss asymptote.
α (scaling exponents): Empirically fitted exponents that control how fast loss decreases as N, D, or C grows. In Kaplan's original paper α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050 (details depend on the fit).
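The components above combine directly. A sketch tying C ≈ 6·N·D to a Kaplan-style single-variable loss law (the constants N_c ≈ 8.8e13 and α_N ≈ 0.076 are approximate values from Kaplan et al.'s reported fits, not exact):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """C ≈ 6·N·D for a dense transformer: roughly 2·N FLOPs per token
    for the forward pass and 4·N per token for the backward pass."""
    return 6.0 * n_params * n_tokens

def loss_from_params(n_params: float, n_c: float = 8.8e13,
                     alpha_n: float = 0.076) -> float:
    """Kaplan-style L(N) = (N_c / N)^alpha_N, valid when data and
    compute are not the bottleneck; constants are approximate fits."""
    return (n_c / n_params) ** alpha_n

# Example: a 70B-parameter model trained on 1.4T tokens.
c = train_flops(70e9, 1.4e12)  # ≈ 5.9e23 FLOPs
```

The symmetric forms L(D) = (D_c/D)^α_D and L(C) = (C_c/C)^α_C follow the same pattern with their own fitted constants.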
Implementation
Kaplan's paper suggested scaling N much faster than D (GPT-3 was the result). Chinchilla showed this conclusion stemmed from learning-rate schedules not matched to training duration and from undertraining; the actual compute-optimal allocation scales N and D roughly equally (≈ 20 tokens per parameter).
The exponents α are fit over a limited range of (N, D, C). Extrapolating by 2–3 orders of magnitude can be inaccurate, especially near the irreducible-loss asymptote.
Chinchilla optimizes training cost only. In production, inference cost matters too: for models served at scale it pays off to train smaller models for longer (as Llama and Mistral do).
The α exponents differ across modalities (vision, code, multimodal) and tasks (capability vs perplexity). Directly transferring Kaplan or Chinchilla numbers to other domains yields incorrect predictions.
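The ≈ 20 tokens/parameter rule of thumb can be turned into a small budget allocator. A sketch assuming C ≈ 6·N·D and a configurable tokens-per-parameter ratio:

```python
import math

def chinchilla_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Given a compute budget C and the constraint C = 6·N·D with
    D = r·N, we get C = 6·r·N^2, hence N* = sqrt(C / (6·r)) and
    D* = r·N*.  r = 20 is the Chinchilla rule of thumb."""
    n_opt = math.sqrt(c_flops / (6.0 * tokens_per_param))
    return n_opt, tokens_per_param * n_opt

# Hypothetical budget of 1e24 FLOPs:
n_star, d_star = chinchilla_optimal(1e24)
```

For 1e24 FLOPs this yields roughly 90B parameters and 1.8T tokens; raising `tokens_per_param` shifts the split toward a smaller, longer-trained model.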
Evolution
Kaplan et al. establish power-law relationships between compute, data, parameters, and loss.
Hoffmann et al. show GPT-3-era models were undertrained; optimal compute split is ~equal between N and D.
Researchers extend scaling laws to code, vision, multimodal, and reasoning tasks.
McCandlish et al. (OpenAI) describe the scaling of critical batch size via gradient noise scale — a methodological precursor to Kaplan.
For production-served models, training beyond Chinchilla-optimal pays off: smaller N, much larger D (e.g. 100+ tokens/parameter in Llama-3) lowers inference cost.
Independent replications (Epoch AI) showed Chinchilla's original fits may underestimate optimal D — the effective tokens/parameter ratio may exceed 20.
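The inference-aware trade-off above can be illustrated with lifetime FLOPs. A sketch in which the model sizes, token counts, and serving volume are all hypothetical:

```python
def lifetime_flops(n_params: float, train_tokens: float,
                   served_tokens: float) -> float:
    """Training ≈ 6·N·D; inference (forward pass only) ≈ 2·N FLOPs
    per generated token."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens

# Hypothetical comparison: a Chinchilla-style 70B model at 20 tokens/param
# vs a smaller 8B model trained far past that ratio (Llama-3 style).
served = 1e13  # tokens served over the deployment lifetime (assumption)
big = lifetime_flops(70e9, 1.4e12, served)    # ~20 tokens/param
small = lifetime_flops(8e9, 1.6e12, served)   # ~200 tokens/param
```

At this serving volume the smaller over-trained model costs several times fewer lifetime FLOPs, which is why training-compute-optimal is not deployment-optimal.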
Technical details
Hyperparameters (configurable axes)
N (parameters): Scales from ~10^6 (small probes) to ~10^12+ (frontier LLMs). Increasing N reduces loss as L(N) ~ N^(-α_N) when data and compute are not the bottleneck.
D (training tokens): Tokens in the training corpus. Chinchilla shows D should scale roughly linearly with N (≈ 20 tokens per parameter) for compute-optimal training.
C (compute budget): Total training compute. For a given C, the minimum loss is achieved at a specific (N*, D*) pair; Chinchilla gives N* ≈ D*/20.
Critical batch size: The batch size above which returns from greater data parallelism diminish. It also follows a power law in the loss L (McCandlish et al. 2018).
Learning-rate schedule: The optimal LR and its cooldown depend on (N, D). A poorly chosen LR can mask the true scaling law in empirical sweeps.
Model shape: Kaplan et al. showed that at fixed N the architecture shape (depth vs. width) has marginal impact on L, so scale N rather than tuning shape.
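The N and D axes combine in Chinchilla's parametric loss surface, L(N, D) = E + A/N^α + B/D^β. A sketch using the constants reported in Hoffmann et al. (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28):

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric fit L(N, D) = E + A/N^alpha + B/D^beta.
    E is the irreducible loss; the other two terms vanish as N and D
    grow, so the loss approaches E asymptotically."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss decreases along both axes but never below E:
l_70b = chinchilla_loss(70e9, 1.4e12)
```

Note that minimizing this surface under C = 6·N·D with these published constants can imply tokens/parameter ratios well above 20, consistent with the Epoch AI replication mentioned under Evolution.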
Hardware requirements
Scaling laws are empirical observations about the (N, D, C) → L relationship. They do not depend on a specific hardware architecture — they hold as long as training FLOPs and loss can be measured.
In practice, fitting scaling laws requires running many training jobs at varying N and D, which depends on efficient LLM training hardware (H100/A100/B200/TPU). The notion of critical batch size itself comes from the data-parallel GPU training literature.
Chinchilla was trained on TPUs (Google). Scaling laws apply to TPU training just as much as to GPU training.