
Scaling Laws (Kaplan / Chinchilla)

2020 · Active · Updated: 7 May 2026 · Published
Key innovation
Formalized empirical power-law relationships linking model performance to parameter count, data size, and compute budget, enabling performance prediction and optimal resource allocation.
Category
Training
Abstraction level
Pattern
Operation level
Training
Use cases
Language model training planning · Compute budget allocation · Model performance prediction · Architecture decisions

How it works

For language models, the loss L scales as a power law in each resource when the others are not the bottleneck: L(N) ∝ N^(-α_N), L(D) ∝ D^(-α_D), L(C) ∝ C^(-α_C), where the exponents α_N, α_D, α_C are characteristic of the model family and task. Researchers fit these power laws to runs at various (N, D, C) and extrapolate to larger scales.
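Below is a minimal fitting sketch, assuming SciPy's curve_fit and synthetic data generated from an assumed law; the constants are placeholders, not the values from Kaplan et al.

```python
# Minimal sketch: fit L(N) = E + A * N**(-alpha) to small-scale runs, then extrapolate.
# The data are synthetic placeholders generated from an assumed law, not real measurements.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, E, A, alpha):
    """Power law in parameter count with an irreducible-loss asymptote E."""
    return E + A * n ** (-alpha)

rng = np.random.default_rng(0)
N = np.logspace(6, 9, 8)                                             # 1M .. 1B parameters
L = scaling_law(N, 1.8, 12.0, 0.08) + rng.normal(0, 0.01, N.shape)   # assumed "true" law + noise

# Fit the three constants; sensible initial guesses help convergence.
(E, A, alpha), _ = curve_fit(scaling_law, N, L, p0=(2.0, 10.0, 0.08))
print(f"E ≈ {E:.2f}, A ≈ {A:.1f}, alpha ≈ {alpha:.3f}")

# Extrapolation is only trustworthy near the measured range (see the pitfalls below).
print(f"Predicted loss at N = 1e10: {scaling_law(1e10, E, A, alpha):.2f}")
```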

Problem solved

Lack of predictable compute allocation principles: it was unclear whether training a large model briefly versus a small model for longer is optimal, or how many parameters are needed for a given compute budget.

Components

Model parameter count (N) · Representational capacity axis

Number of learned weights in the model (excluding embeddings in Kaplan's original formulation). The primary axis of representational capacity.

Training dataset size (D) · Information capacity axis

Number of tokens (or examples) in the training set. Defines the maximum information available for the model to learn from.

Compute budget (C) · Resource axis

Total training compute, typically expressed in FLOPs. For dense-attention transformers: C ≈ 6 · N · D.

Loss (L) · Dependent (measured) variable

Cross-entropy loss (test/val) as the dependent variable in scaling laws: L(N), L(D), L(C) follow power-law forms with an irreducible-loss asymptote.

Power-law exponents (α_N, α_D, α_C) · Curve-shape parameters

Empirically fitted exponents that control how fast loss decreases as N, D, or C grows. In Kaplan's original paper α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050 (details depend on the fit).
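As a worked illustration of how these components combine, here is a small sketch using the joint Chinchilla-style parametric form L(N, D) = E + A/N^α + B/D^β together with C ≈ 6 · N · D. The constants are the commonly cited fit from Hoffmann et al. (2022) and are indicative only; note that this joint fit has its own exponents, distinct from the Kaplan α values quoted above.

```python
# Sketch: Chinchilla-style joint loss L(N, D) = E + A/N**alpha + B/D**beta,
# with the commonly cited constants from Hoffmann et al. (2022); treat them as indicative.
def predicted_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted pre-training loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def train_flops(N, D):
    """Approximate training compute for a dense transformer: C ≈ 6 * N * D."""
    return 6 * N * D

# Chinchilla-scale example: 70B parameters on 1.4T tokens.
N, D = 70e9, 1.4e12
print(f"Predicted loss ≈ {predicted_loss(N, D):.2f}")        # roughly 1.9
print(f"Training compute ≈ {train_flops(N, D):.1e} FLOPs")   # roughly 5.9e23
```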

Implementation

Implementation pitfalls
Using Kaplan-optimal allocation instead of Chinchilla · Critical

Kaplan's paper suggested scaling N much faster than D (GPT-3 was the result). Chinchilla showed this stemmed from suboptimal LR cooldown and undertraining; the actual compute-optimal allocation is roughly equal scaling of N and D (≈ 20 tokens per parameter).

Fix: Use Chinchilla-optimal allocation (~20 tokens/parameter) as a baseline. For deployment-cost-aware training, over-training (>>20:1) is rational: a smaller model trained on more tokens lowers inference cost (see the allocation sketch below).
Extrapolating scaling laws beyond the measurement range · High

The exponents α are fit over a limited range of (N, D, C). Extrapolating by 2–3 orders of magnitude can be inaccurate, especially near the irreducible-loss asymptote.

Fix: Measure scaling laws on overlapping ranges (small + mid models) and validate the fit on a held-out scale. Include the irreducible-loss term in the fitting function.
Confusing compute-optimal with deployment-optimal · Medium

Chinchilla optimizes training cost. In production, inference cost matters too — for models served at scale it pays off to train smaller models longer (Llama, Mistral).

Fix: Define the objective as training_cost + λ · inference_cost · usage_volume. For high-usage products, λ shifts the optimum toward smaller models trained on more data.
Assuming language scaling laws are universal · High

The α exponents differ across modalities (vision, code, multimodal) and tasks (capability vs perplexity). Directly transferring Kaplan or Chinchilla numbers to other domains yields incorrect predictions.

Fix: For a new domain, fit your own scaling laws on small models before any large run. Be cautious with capability benchmarks: they do not scale as smoothly as loss.
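A minimal allocation sketch for the first and third pitfalls, assuming C ≈ 6 · N · D, a ~20 tokens/parameter target, and roughly 2 · N FLOPs per generated token at inference; the budget, λ, and traffic figures are hypothetical.

```python
# Sketch: pick (N, D) for a fixed training budget, then compare against a smaller,
# over-trained model once inference traffic is priced in. All concrete numbers are hypothetical.
import math

def chinchilla_optimal(C, tokens_per_param=20.0):
    """Compute-optimal (N, D) under C = 6*N*D with D = tokens_per_param * N."""
    N = math.sqrt(C / (6 * tokens_per_param))
    return N, tokens_per_param * N

def lifetime_flops(N, D, inference_tokens, lam=1.0):
    """training_cost + lam * inference_cost, both in FLOPs (~2*N per generated token)."""
    return 6 * N * D + lam * 2 * N * inference_tokens

C = 1e24        # hypothetical training budget in FLOPs
served = 1e13   # hypothetical lifetime inference tokens

N_opt, D_opt = chinchilla_optimal(C)
N_small, D_small = N_opt / 2, C / (6 * (N_opt / 2))   # same training FLOPs, ~80 tokens/param

for name, (n, d) in {"compute-optimal": (N_opt, D_opt),
                     "over-trained": (N_small, D_small)}.items():
    print(f"{name}: N ≈ {n:.1e}, D ≈ {d:.1e}, "
          f"lifetime ≈ {lifetime_flops(n, d, served):.2e} FLOPs")
```

With enough serving volume the smaller, over-trained model wins on lifetime FLOPs even though its training allocation sits off the Chinchilla optimum; whether it also matches quality is the question the over-training results (Llama-style) address.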

Evolution

Original paper · arXiv 2020 · Jared Kaplan et al.
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
2018
Empirical batch-size scaling laws (precursor)

McCandlish et al. (OpenAI) describe the scaling of the critical batch size via the gradient noise scale, a methodological precursor to Kaplan.

2020
Scaling Laws for Neural Language Models (OpenAI)
Inflection point

Kaplan et al. establish power-law relationships between compute, data, parameters, and loss.

2022
Chinchilla scaling laws by Hoffmann et al.
Inflection point

Hoffmann et al. show GPT-3-era models were undertrained; compute-optimal training scales N and D roughly equally (~20 tokens per parameter).

2023
Scaling laws for specific domains and modalities

Researchers extend scaling laws to code, vision, multimodal, and reasoning tasks.

2024
Over-training era (Llama, Mistral, Gemma)
Inflection point

For production-served models, training beyond Chinchilla-optimal pays off: smaller N, much larger D (e.g. 100+ tokens/parameter in Llama-3) lowers inference cost.

2024
Chinchilla critique and refit (Epoch AI)

Independent replications (Epoch AI) showed Chinchilla's original fits may underestimate optimal D; the effective tokens/parameter ratio may exceed 20.

Technical details

Hyperparameters (configurable axes)

Parameter count (N) · Critical

Scales from ~10^6 (small probes) to ~10^12+ (frontier LLMs). Increasing N reduces loss as L(N) ∝ N^(-α_N) when data and compute are not the bottleneck.

Example values: 125M · 1.5B · 70B · 175B
Dataset size in tokens (D) · Critical

Tokens in the training corpus. Chinchilla shows D should scale roughly linearly with N (≈ 20 tokens per parameter) for compute-optimal training.

Example values: 300B · 1.4T · 15T+
Compute budget in FLOPs (C) · Critical

Total training compute. For a given C, minimum loss is achieved at a specific (N*, D*) pair; Chinchilla gives D* ≈ 20 · N* (see the sanity check below).

Example values: ~3e23 FLOP · ~5.7e23 FLOP
Critical batch size (B_crit) · High

Batch size above which the returns from greater data-parallelism diminish. It also follows a power law in the loss L (McCandlish et al., 2018).

Learning rate schedule · High

Optimal LR and its cooldown depend on (N, D). A poor LR can mask the true scaling law in empirical sweeps.

Architecture shape (depth/width) · Low

Kaplan et al. showed that at fixed N the architecture shape (depth vs width) has marginal impact on L — so scale N rather than tune shape.
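A quick sanity check of the example values above, assuming C ≈ 6 · N · D and the ~20 tokens/parameter rule of thumb; illustrative arithmetic only.

```python
# Check the listed example budgets against C ≈ 6*N*D, and see what the
# ~20 tokens/parameter rule of thumb would allocate for the same compute.
import math

def train_flops(N, D):
    return 6 * N * D

configs = {"GPT-3-scale (175B params, 300B tokens)": (175e9, 300e9),
           "Chinchilla-scale (70B params, 1.4T tokens)": (70e9, 1.4e12)}

for name, (N, D) in configs.items():
    C = train_flops(N, D)
    N_opt = math.sqrt(C / (6 * 20))   # D = 20*N and C = 6*N*D  =>  N = sqrt(C / 120)
    print(f"{name}: C ≈ {C:.1e} FLOPs, 20:1-optimal N ≈ {N_opt:.1e}, D ≈ {20 * N_opt:.1e}")
```

Under these assumptions the 70B / 1.4T configuration lands almost exactly on the 20:1 optimum, while the 175B / 300B configuration is far more parameter-heavy than the rule of thumb suggests, which is Hoffmann et al.'s central point.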

Hardware requirements

Primary

Scaling laws are empirical observations about the (N, D, C) → L relationship. They do not depend on a specific hardware architecture — they hold as long as training FLOPs and loss can be measured.

Good fit

In practice, fitting scaling laws requires running many training jobs at varying N and D, which calls for efficient LLM training hardware (H100/A100/B200/TPU). The critical-batch-size concept itself comes from the data-parallel GPU training literature.

Good fit

Chinchilla was trained on TPUs (Google). Scaling laws apply to TPU training just as much as to GPU training.