Formalized empirical power-law relationships linking model performance to parameter count, data size, and compute budget, enabling performance prediction and optimal resource allocation.
For language models, loss L scales as L(N) ~ N^(-α_N), L(D) ~ D^(-α_D), L(C) ~ C^(-α_C), where the exponents α are characteristic of the model and task. Researchers fit these power laws to experimental results at various N, D, and C and extrapolate to larger scales.
Lack of predictable compute-allocation principles: it was unclear whether it is better to train a large model briefly or a smaller model for longer, or how many parameters are needed for a given compute budget.
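As an illustration of the fitting procedure described above, the sketch below fits a saturating power law L(N) = E + (N_c/N)^α to a handful of (N, loss) points and extrapolates. The data points, starting values, and constants are made up for illustration; only the functional form (a power law plus an irreducible-loss term E) reflects how such fits are typically done.

```python
# Minimal sketch: fit L(N) = E + (N_c / N)**alpha to a few training runs and extrapolate.
# The (N, loss) pairs and initial guesses below are placeholders, not real measurements.
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_params(N, E, log10_Nc, alpha):
    """Power law in N with an irreducible-loss asymptote E; N_c parameterized in log10."""
    return E + (10.0 ** log10_Nc / N) ** alpha

N_runs = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])        # model sizes in the sweep
L_runs = np.array([5.70, 5.38, 5.05, 4.80, 4.51, 4.30])  # hypothetical eval losses

popt, _ = curve_fit(loss_vs_params, N_runs, L_runs, p0=[2.0, 13.0, 0.1], maxfev=20000)
E, log10_Nc, alpha = popt
print(f"E ~ {E:.2f}, N_c ~ {10**log10_Nc:.2e}, alpha_N ~ {alpha:.3f}")

# Extrapolation: validate on a held-out larger scale before trusting it.
print("predicted loss at N = 1e10:", round(loss_vs_params(1e10, *popt), 3))
```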
Representational capacity axis
Number of learned weights in the model (excluding embeddings in Kaplan's original formulation). The primary axis of representational capacity.
Information capacity axis
Number of tokens (or examples) in the training set. Defines the maximum information available for the model to learn from.
Resource axis
Total training compute, typically expressed in FLOPs. For dense-attention transformers: C ≈ 6 · N · D.
Dependent (measured) variable
Cross-entropy loss (test/val) as the dependent variable in scaling laws: L(N), L(D), L(C) follow power-law forms with an irreducible-loss asymptote.
Curve-shape parameters
Empirically fitted exponents that control how fast loss decreases as N, D, or C grows. In Kaplan's original paper, α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050 (details depend on the fit).
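For intuition, the short sketch below (illustrative only) plugs the quoted exponents into the single-variable power laws and uses the C ≈ 6 · N · D estimate from the compute entry above; all constants are taken from the text or assumed, not fitted.

```python
# How much does a 10x scale-up along one axis shrink the reducible loss under
# L ~ x**(-alpha)?  Exponents are the approximate Kaplan et al. values quoted above.
exponents = {"N": 0.076, "D": 0.095, "C": 0.050}
for axis, alpha in exponents.items():
    factor = 10 ** (-alpha)
    print(f"10x more {axis}: reducible loss x{factor:.3f} ({(1 - factor) * 100:.1f}% lower)")

# Training-compute estimate for a dense transformer via C ~ 6 * N * D.
N, D = 1.0e9, 2.0e10   # 1B parameters, 20B tokens (roughly 20 tokens per parameter)
print(f"C ~ {6 * N * D:.2e} FLOPs")
```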
Parameter count (N)
Scales from ~10^6 (small probes) to ~10^12+ (frontier LLMs). Increasing N reduces loss as L(N) ~ N^(-α_N) when data and compute are not the bottleneck.
Dataset size in tokens (D)
Tokens in the training corpus. Chinchilla shows D should scale roughly linearly with N (≈ 20 tokens per parameter) for compute-optimal training.
Compute budget in FLOPs (C)
Total training compute. For a given C, the minimum loss is achieved at a specific (N*, D*) pair; Chinchilla gives N* ≈ D*/20 (see the allocation sketch after this list).
Critical batch size (B_crit)
Batch size above which the returns from greater data parallelism diminish. It also follows a power law in the loss L (McCandlish et al., 2018).
Learning rate schedule
Optimal LR and its cooldown depend on (N, D). A poor LR can mask the true scaling law in empirical sweeps.
Architecture shape (depth/width)
Kaplan et al. showed that at fixed N the architecture shape (depth vs. width) has only a marginal impact on L, so scale N rather than tune shape.
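The allocation sketch referenced above: combining the Chinchilla heuristic D ≈ 20 · N with C ≈ 6 · N · D gives C ≈ 120 · N², so N* = sqrt(C / 120). A minimal sketch assuming exactly 20 tokens per parameter; real fits vary (see the Epoch AI refit below).

```python
# Rough compute-optimal sizing from a FLOPs budget under the Chinchilla heuristic.
# Assumes D ~= 20 * N and C ~= 6 * N * D, hence N* = sqrt(C / 120); illustration only.
import math

def chinchilla_allocation(flops_budget, tokens_per_param=20.0):
    n_star = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    d_star = tokens_per_param * n_star
    return n_star, d_star

for budget in (1e21, 1e23, 1e25):
    n_opt, d_opt = chinchilla_allocation(budget)
    print(f"C = {budget:.0e} FLOPs -> N* ~ {n_opt:.2e} params, D* ~ {d_opt:.2e} tokens")
```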
Kaplan's paper suggested scaling N much faster than D (GPT-3 was the result). Chinchilla showed this stemmed from suboptimal LR cooldown and undertraining; the actual compute-optimal allocation scales N and D roughly equally (≈ 20 tokens per parameter).
Use Chinchilla-optimal allocation (~20 tokens/parameter) as a baseline. For deployment-cost-aware training, over-training (>>20:1) is rational: a smaller model trained on more tokens lowers inference cost.
The exponents α are fit over a limited range of (N, D, C). Extrapolating by 2–3 orders of magnitude can be inaccurate, especially near the irreducible-loss asymptote.
Measure scaling laws on overlapping ranges (small + mid models) and validate the fit on a held-out scale. Include the irreducible-loss term in the fitting function.
Chinchilla optimizes training cost. In production, inference cost matters too: for models served at scale it pays off to train smaller models longer (Llama, Mistral).
Define the objective as training_cost + λ · inference_cost · usage_volume. For high-usage products, λ shifts the optimum toward smaller models trained on more data; see the sketch after this block.
The α exponents differ across modalities (vision, code, multimodal) and tasks (capability vs perplexity). Directly transferring Kaplan or Chinchilla numbers to other domains yields incorrect predictions.
For a new domain, fit your own scaling laws on small models before any large run. Be cautious with capability benchmarks: they do not scale as smoothly as loss.
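The sketch referenced above makes the deployment-aware objective concrete. It uses a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β with constants close to those reported by Hoffmann et al., solves for the tokens each candidate N would need to reach a target loss, and adds an inference term of roughly 2 · N FLOPs per generated token. The target loss, usage volume, λ, and candidate sizes are all hypothetical.

```python
# Sketch of deployment-aware sizing: minimize
#   total = training_FLOPs + lam * inference_FLOPs_per_token * expected_tokens_served
# over (N, D) pairs predicted to reach the same loss. Loss model: Chinchilla-style
# L(N, D) = E + A / N**a + B / D**b with constants close to Hoffmann et al.'s fit.
# Target loss, usage volume, lam, and candidate sizes are hypothetical.
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_needed(N, target_loss):
    """Training tokens D predicted to bring an N-parameter model to target_loss."""
    reducible = target_loss - E - A / N**a
    if reducible <= 0:
        return float("inf")   # this N cannot reach the target under the loss model
    return (B / reducible) ** (1.0 / b)

target_loss = 2.05
expected_tokens_served = 1e13      # lifetime inference volume (hypothetical)
lam = 1.0                          # weight on inference cost

for N in (3e9, 7e9, 13e9, 30e9, 70e9):
    D = tokens_needed(N, target_loss)
    train_flops = 6 * N * D                               # C ~ 6 * N * D
    serve_flops = lam * 2 * N * expected_tokens_served    # ~2 * N FLOPs per token served
    print(f"N = {N:.0e}: D = {D:.2e}, train = {train_flops:.2e}, "
          f"serve = {serve_flops:.2e}, total = {train_flops + serve_flops:.2e}")
# With heavy usage, the total is minimized by a smaller N trained on more tokens.
```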
GENESIS · Source paper
Scaling Laws for Neural Language Models (OpenAI)
Breakthrough: Kaplan et al. establish power-law relationships between compute, data, parameters, and loss.
Chinchilla scaling laws by Hoffmann et al.
Breakthrough: Hoffmann et al. show GPT-3-era models were undertrained; the optimal compute split scales N and D roughly equally.
Scaling laws for specific domains and modalities
Researchers extend scaling laws to code, vision, multimodal, and reasoning tasks.
Empirical batch-size scaling laws (precursor)
McCandlish et al. (OpenAI) describe the scaling of critical batch size via the gradient noise scale, a methodological precursor to Kaplan.
Over-training era (Llama, Mistral, Gemma)
Breakthrough: For production-served models, training beyond Chinchilla-optimal pays off: smaller N and much larger D (e.g. 100+ tokens/parameter in Llama 3) lower inference cost.
Chinchilla critique and refit (Epoch AI)
Independent replications (Epoch AI) showed Chinchilla's original fits may underestimate optimal D: the effective tokens/parameter ratio may exceed 20.
Scaling laws are empirical observations about the (N, D, C) → L relationship. They do not depend on a specific hardware architecture; they hold as long as training FLOPs and loss can be measured.
In practice, fitting scaling laws requires running many training jobs at varying N and D, which depends on efficient LLM training hardware (H100/A100/B200/TPU). The critical-batch-size analysis comes from the data-parallel GPU training literature.
Chinchilla was trained on TPUs (Google). Scaling laws apply to TPU training just as much as to GPU training.
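To connect a FLOPs budget to wall-clock hardware cost, a rough sketch follows. The per-GPU peak throughput and utilization are assumptions (roughly H100-class BF16 dense peak and a typical model-FLOPs utilization), not measured figures.

```python
# Rough GPU-hours estimate for a training budget. Peak throughput and MFU below are
# assumptions (approximately H100-class BF16 dense peak, typical utilization).
PEAK_FLOPS_PER_GPU = 1.0e15   # ~1 PFLOP/s per GPU (assumed)
MFU = 0.40                    # assumed model-FLOPs utilization

def gpu_hours(total_flops):
    effective_flops_per_gpu = PEAK_FLOPS_PER_GPU * MFU
    return total_flops / effective_flops_per_gpu / 3600.0

C = 6 * 70e9 * 1.4e12          # Chinchilla-scale budget via C ~ 6 * N * D
print(f"~{gpu_hours(C):,.0f} GPU-hours at the assumed 40% MFU")
```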
The Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so long-range dependencies are learned with a constant number of sequential steps between positions (rather than a number that grows with distance, as in RNNs). The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, quadratic attention complexity in sequence length (O(n²)), is an active research direction (FlashAttention, sliding-window attention, linear attention, SSMs).
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data (next tokens in text, masked words, future video frames, future robot states) without human labels. This unlocks virtually unlimited raw data: web crawls, code, books, YouTube video, robot telemetry. The result is a set of weights encoding "world knowledge", dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
Emergent abilities of large language models is an observation, formalized by Wei et al. (2022), that certain LLM capabilities (such as multi-step reasoning, zero-shot instruction following, modular arithmetic, or answering questions in low-resource languages) do not appear gradually with scale but emerge discontinuously, only after crossing a threshold of parameter count, training data, or compute FLOPs. Below the threshold, performance is random or near-zero; above it, performance jumps abruptly to substantially better than random. The phenomenon has been documented across more than 130 tasks from the BIG-Bench benchmark and other suites (MMLU, TruthfulQA). Canonical examples include chain-of-thought reasoning (~100B-parameter threshold for PaLM/GPT-3), InstructGPT-style instruction following, modular arithmetic, International Phonetic Alphabet transliteration, and multi-step question answering. In 2023, Schaeffer, Miranda, and Koyejo (NeurIPS 2023, "Are Emergent Abilities of Large Language Models a Mirage?") challenged emergence as a real, fundamental phenomenon. They showed that non-linear or discontinuous evaluation metrics (e.g. exact-match accuracy) artificially create the appearance of a jump; replacing them with continuous metrics (token edit distance, log-likelihood) reveals a smooth, predictable scaling curve. This critique is now central to the debate: some abilities are emergent only in a metric-dependent sense, while others (e.g. inductive reasoning) appear to show genuine phase discontinuities. The concept has critical practical significance: if emergence is real, certain abilities cannot be predicted or trained at smaller scale, forcing organizations to train large models "blindly." If emergence is a metric artifact, then scaling laws (Hoffmann et al., Chinchilla) are sufficient to predict the behavior of larger models.
Mixture of Experts (MoE) is an architecture in which a model is composed of multiple parallel sub-networks, the experts, along with a gating (routing) network that determines, for each input, which subset of experts to activate and how to combine their outputs. The gating network produces a weighting over experts; in the original soft formulation (Jacobs et al., 1991), all experts are weighted and summed. In the sparse formulation (Shazeer et al., 2017), only the top-k scoring experts are activated, and the remaining experts produce no output and incur no compute cost for that input. In the context of large language models, MoE is typically applied as a replacement for the feed-forward network (FFN) sub-layer within each Transformer block. Each token is routed to a small number of expert FFNs (commonly top-1 or top-2), with the router being a learned linear projection followed by a softmax. The outputs of the selected experts are weighted by the corresponding router scores and summed. A central challenge in sparse MoE is load balancing: without explicit regularization, the router tends to collapse onto a small set of preferred experts, leaving others undertrained. This is addressed via auxiliary load-balancing losses added to the training objective, which encourage a roughly uniform distribution of tokens across experts. Expert parallelism is the standard distributed training and inference strategy for large MoE models: each expert is placed on a separate device, so the total parameter count scales with the number of devices without increasing per-device memory or per-token FLOPs proportionally. The capacity factor controls the maximum number of tokens each expert can process per batch; tokens that overflow the capacity are either dropped or passed through a residual connection. Tuning the capacity factor is a critical practical consideration.
| Title | Publisher | Type |
|---|---|---|
| Scaling Laws for Neural Language Models | OpenAI | scientific article |
| Training Compute-Optimal Large Language Models (Chinchilla) | DeepMind | scientific article |
| An Empirical Model of Large-Batch Training (McCandlish et al.) | OpenAI | scientific article |
| Chinchilla's scaling law fits are not as accurate as they seem (Epoch AI) | Epoch AI | blog |