Scaling Laws (Kaplan / Chinchilla)
How it works
For language models, loss L scales as L(N) ~ N^(-α_N), L(D) ~ D^(-α_D), and L(C) ~ C^(-α_C), where N is parameter count, D is dataset size, C is training compute, and the exponents α are characteristic of the model and task. Researchers fit these power laws to experimental results at various N, D, and C, then extrapolate to larger scales.
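The fitting step is a least-squares line in log-log space. A minimal sketch with synthetic points (real fits use measured validation losses from training runs at each scale; the constants here are illustrative):

```python
import numpy as np

# Synthetic losses following L(N) = c * N^(-alpha); in practice these
# would be measured validation losses from runs at each model size N.
alpha_true, c = 0.076, 10.0
N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
L = c * N ** (-alpha_true)

# A power law is a straight line in log-log space:
# log L = log c - alpha * log N, so a linear fit recovers the exponent.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_fit = -slope

# Extrapolate to a scale beyond any fitted point (the risky step).
L_pred = np.exp(intercept) * (1e12) ** slope
```

The extrapolation in the last line is exactly the step that breaks down far outside the fitted range, as discussed under Implementation below.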
Problem solved
Lack of predictable compute-allocation principles: it was unclear whether it is better to train a large model briefly or a smaller model for longer, or how many parameters a given compute budget warrants.
Components
N (parameters): Number of learned weights in the model (excluding embeddings in Kaplan's original formulation). The primary axis of representational capacity.
D (dataset size): Number of tokens (or examples) in the training set. Defines the maximum information available for the model to learn from.
C (compute): Total training compute, typically expressed in FLOPs. For dense-attention transformers: C ≈ 6 · N · D.
L (loss): Cross-entropy loss (test/validation) as the dependent variable in scaling laws: L(N), L(D), L(C) follow power-law forms with an irreducible-loss asymptote.
α (scaling exponents): Empirically fitted exponents that control how fast loss decreases as N, D, or C grows. In Kaplan's original paper α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050 (details depend on the fit).
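The components above combine directly. A sketch tying C ≈ 6·N·D to a Kaplan-style single-variable loss law (the constants N_c ≈ 8.8e13 and α_N ≈ 0.076 are approximate values from Kaplan et al.'s reported fits, not exact):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """C ≈ 6·N·D for a dense transformer: roughly 2·N FLOPs per token
    for the forward pass and 4·N per token for the backward pass."""
    return 6.0 * n_params * n_tokens

def loss_from_params(n_params: float, n_c: float = 8.8e13,
                     alpha_n: float = 0.076) -> float:
    """Kaplan-style L(N) = (N_c / N)^alpha_N, valid when data and
    compute are not the bottleneck; constants are approximate fits."""
    return (n_c / n_params) ** alpha_n

# Example: a 70B-parameter model trained on 1.4T tokens.
c = train_flops(70e9, 1.4e12)  # ≈ 5.9e23 FLOPs
```

The symmetric forms L(D) = (D_c/D)^α_D and L(C) = (C_c/C)^α_C follow the same pattern with their own fitted constants.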
Implementation
Kaplan's paper suggested scaling N much faster than D (GPT-3 was the result). Chinchilla showed this conclusion stemmed from learning-rate schedules not matched to training duration and from undertraining; the actual compute-optimal allocation scales N and D roughly equally (≈ 20 tokens per parameter).
The exponents α are fit over a limited range of (N, D, C). Extrapolating by 2–3 orders of magnitude can be inaccurate, especially near the irreducible-loss asymptote.
Chinchilla optimizes training cost only. In production, inference cost matters too: for models served at scale it pays off to train smaller models for longer (as Llama and Mistral do).
The α exponents differ across modalities (vision, code, multimodal) and tasks (capability vs perplexity). Directly transferring Kaplan or Chinchilla numbers to other domains yields incorrect predictions.
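The ≈ 20 tokens/parameter rule of thumb can be turned into a small budget allocator. A sketch assuming C ≈ 6·N·D and a configurable tokens-per-parameter ratio:

```python
import math

def chinchilla_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Given a compute budget C and the constraint C = 6·N·D with
    D = r·N, we get C = 6·r·N^2, hence N* = sqrt(C / (6·r)) and
    D* = r·N*.  r = 20 is the Chinchilla rule of thumb."""
    n_opt = math.sqrt(c_flops / (6.0 * tokens_per_param))
    return n_opt, tokens_per_param * n_opt

# Hypothetical budget of 1e24 FLOPs:
n_star, d_star = chinchilla_optimal(1e24)
```

For 1e24 FLOPs this yields roughly 90B parameters and 1.8T tokens; raising `tokens_per_param` shifts the split toward a smaller, longer-trained model.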
Evolution
Kaplan et al. establish power-law relationships between compute, data, parameters, and loss.
Hoffmann et al. show GPT-3-era models were undertrained; optimal compute split is ~equal between N and D.
Researchers extend scaling laws to code, vision, multimodal, and reasoning tasks.
McCandlish et al. (OpenAI) describe the scaling of critical batch size via gradient noise scale — a methodological precursor to Kaplan.
For production-served models, training beyond Chinchilla-optimal pays off: smaller N, much larger D (e.g. 100+ tokens/parameter in Llama-3) lowers inference cost.
Independent replications (Epoch AI) showed Chinchilla's original fits may underestimate optimal D — the effective tokens/parameter ratio may exceed 20.
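The inference-aware trade-off above can be illustrated with lifetime FLOPs. A sketch in which the model sizes, token counts, and serving volume are all hypothetical:

```python
def lifetime_flops(n_params: float, train_tokens: float,
                   served_tokens: float) -> float:
    """Training ≈ 6·N·D; inference (forward pass only) ≈ 2·N FLOPs
    per generated token."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens

# Hypothetical comparison: a Chinchilla-style 70B model at 20 tokens/param
# vs a smaller 8B model trained far past that ratio (Llama-3 style).
served = 1e13  # tokens served over the deployment lifetime (assumption)
big = lifetime_flops(70e9, 1.4e12, served)    # ~20 tokens/param
small = lifetime_flops(8e9, 1.6e12, served)   # ~200 tokens/param
```

At this serving volume the smaller over-trained model costs several times fewer lifetime FLOPs, which is why training-compute-optimal is not deployment-optimal.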
Technical details
Hyperparameters (configurable axes)
N (parameters): Scales from ~10^6 (small probes) to ~10^12+ (frontier LLMs). Increasing N reduces loss as L(N) ~ N^(-α_N) when data and compute are not the bottleneck.
D (training tokens): Tokens in the training corpus. Chinchilla shows D should scale roughly linearly with N (≈ 20 tokens per parameter) for compute-optimal training.
C (compute budget): Total training compute. For a given C, the minimum loss is achieved at a specific (N*, D*) pair; Chinchilla gives N* ≈ D*/20.
Critical batch size: The batch size above which returns from greater data parallelism diminish. It also follows a power law in the loss L (McCandlish et al. 2018).
Learning-rate schedule: The optimal LR and its cooldown depend on (N, D). A poorly chosen LR can mask the true scaling law in empirical sweeps.
Model shape: Kaplan et al. showed that at fixed N the architecture shape (depth vs. width) has marginal impact on L, so scale N rather than tuning shape.
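The N and D axes combine in Chinchilla's parametric loss surface, L(N, D) = E + A/N^α + B/D^β. A sketch using the constants reported in Hoffmann et al. (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28):

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric fit L(N, D) = E + A/N^alpha + B/D^beta.
    E is the irreducible loss; the other two terms vanish as N and D
    grow, so the loss approaches E asymptotically."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss decreases along both axes but never below E:
l_70b = chinchilla_loss(70e9, 1.4e12)
```

Note that minimizing this surface under C = 6·N·D with these published constants can imply tokens/parameter ratios well above 20, consistent with the Epoch AI replication mentioned under Evolution.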
Hardware requirements
Scaling laws are empirical observations about the (N, D, C) → L relationship. They do not depend on a specific hardware architecture — they hold as long as training FLOPs and loss can be measured.
In practice, fitting scaling laws requires running many training jobs at varying N and D, which depends on efficient LLM training hardware (H100/A100/B200/TPU). The notion of critical batch size itself comes from the data-parallel GPU training literature.
Chinchilla was trained on TPUs (Google). Scaling laws apply to TPU training just as much as to GPU training.