Pretraining
How it works
The model receives an input from which some information has been hidden or shifted and must recover it: the next token (GPT), masked tokens (BERT), the matching image–text pair (contrastive learning in CLIP), the next frame (world models). A loss function measures reconstruction/prediction quality. Training runs on GPU/TPU clusters for weeks or months over trillions of tokens. The pretrained model becomes a foundation that can be further fine-tuned, instruction-tuned, RLHF-aligned, or LoRA-adapted for specific applications.
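As an illustration of the most common objective, here is a minimal sketch of a next-token-prediction (causal LM) training loss, assuming PyTorch; `model` is a hypothetical decoder-only network that maps token ids to vocabulary logits.

```python
import torch.nn.functional as F

def causal_lm_loss(model, tokens):
    """tokens: LongTensor of shape (batch, seq_len) sampled from the unlabeled corpus."""
    inputs = tokens[:, :-1]    # the model sees tokens 0 .. T-2
    targets = tokens[:, 1:]    # and must predict tokens 1 .. T-1 (shifted by one)
    logits = model(inputs)     # (batch, seq_len - 1, vocab_size)
    # Cross-entropy between the predicted distribution and the actual next token.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# One optimization step: loss = causal_lm_loss(model, batch); loss.backward(); optimizer.step()
```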
Problem solved
Traditional supervised learning required enormous hand-labeled datasets per task, which did not scale. Self-supervised pretraining solves this by learning from raw unlabeled data — practically unlimited in supply — and transferring that knowledge to many downstream tasks with minimal supervised fine-tuning.
Components
Data: massive unlabeled dataset (web crawl, code, books, video, robot telemetry). Typical scale: 10¹²–10¹³ tokens for LLMs.
Objective: a predictive task that uses the structure of the data as the training signal: next-token prediction, masked language modeling, contrastive loss, next-frame prediction (see the contrastive-loss sketch after this list).
Architecture: most often a Transformer (encoder-only, decoder-only, or encoder-decoder); diffusion models are also used for generative image and video pretraining.
Compute: thousands of GPUs/TPUs running in parallel for weeks or months. Pretraining a GPT-4-class LLM typically requires 10²⁵+ FLOPs.
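To make the contrastive objective concrete, here is a hedged sketch of a CLIP-style loss in PyTorch; `image_emb` and `text_emb` are assumed to be L2-normalized embeddings of a batch of matching image–caption pairs, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) L2-normalized embeddings of matching pairs."""
    # Similarity of every image with every text in the batch.
    logits = image_emb @ text_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; both directions are pulled toward it.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```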
Implementation pitfalls
Benchmark contamination: benchmark data (MMLU, HellaSwag) leaking into the pretraining corpus artificially inflates evaluation scores (see the overlap-check sketch after this list).
Loss spikes: with large learning rates and fp16, the training loss can spike and corrupt the weights; recovery requires restarting from a checkpoint that may be days old.
Compute-suboptimal scaling: training too large a model on too little data (pre-Chinchilla) wastes compute and underperforms a smaller model trained on a larger corpus.
Data quality: raw web crawl contains duplicates, spam, low-quality pages, and toxic content; without filtering, the result is a model weaker than one trained on a 10× smaller clean corpus.
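As a purely illustrative example (not any lab's actual pipeline), a naive n-gram overlap check like the following can flag benchmark contamination in a corpus shard; the 13-gram window is an assumption borrowed from common practice.

```python
def ngrams(text, n=13):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document, benchmark_texts, n=13):
    """Flag a corpus document that shares any n-gram with an evaluation benchmark."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(text, n) for text in benchmark_texts)
```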
Evolution
2013: Mikolov et al. (word2vec) show that self-supervised pretraining (skip-gram, CBOW) yields general-purpose word representations.
2018: OpenAI GPT (causal LM) and Google BERT (masked LM) establish the standard recipe: large pretraining + small task-specific fine-tuning.
2020: GPT-3 (175B parameters, 300B tokens) demonstrates that pretrained knowledge alone solves many tasks without fine-tuning.
2021: OpenAI CLIP unifies images and text in one embedding space via contrastive pretraining on 400M pairs.
2022: DeepMind's Chinchilla shows that prior LLMs were undertrained; compute-optimal training needs roughly 20 tokens per parameter.
2023: Meta releases the weights of Llama 2, trained on 2T tokens, democratizing access to large pretrained models.
2023–2025: Pi-Zero (Physical Intelligence), Gemini Robotics, and RT-2 apply pretraining on multimodal + robot data as the foundation of VLAs.
2025: GPT-5, Gemini 3, Claude Opus 4, and Grok 4 reach scales requiring clusters of 100k+ H100/B200 GPUs.
Technical details
Hyperparameters (configurable axes)
Dataset size: number of tokens in the training corpus. Scale: 10⁹ (small models) to 10¹³+ (frontier LLMs).
Model size: number of model parameters. Chinchilla scaling laws suggest an optimal tokens-to-parameters ratio of about 20:1 (see the sizing sketch after this list).
Objective: choice of pretraining task: causal LM (GPT), masked LM (BERT), contrastive (CLIP), denoising (T5), next-frame (world models).
Compute budget: total floating-point operations. GPT-3 ≈ 3·10²³, GPT-4 ≈ 2·10²⁵, frontier 2025+ ≈ 10²⁶.
Data curation: deduplication, quality classification, and toxicity filtering pipeline. Determines the effective share of "useful tokens".
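A back-of-the-envelope sizing sketch under the approximations above: training compute C ≈ 6·N·D FLOPs and a compute-optimal ratio D ≈ 20·N, which gives N ≈ √(C/120). Both constants are the commonly cited heuristics, not exact values.

```python
import math

def chinchilla_optimal(compute_flops):
    """Return an approximately compute-optimal (parameters, tokens) pair."""
    params = math.sqrt(compute_flops / 120)   # from C = 6*N*D and D = 20*N
    tokens = 20 * params
    return params, tokens

n, d = chinchilla_optimal(3e23)               # roughly GPT-3-scale compute
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")   # ~50B params, ~1.0T tokens
```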
Execution paradigm
In standard dense pretraining, every parameter participates in each forward and backward pass and is updated at every step. Mixture-of-Experts (MoE) variants activate only a few experts per token, cutting per-token compute, although across a large batch most experts still receive gradient updates at each step (see the routing sketch below).
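A hedged sketch of the top-k routing that produces this sparse activation, assuming PyTorch; real MoE layers additionally use load-balancing losses, capacity limits, and renormalized gate weights.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, experts, router, k=2):
    """x: (tokens, d_model); experts: list of small feed-forward nets; router: nn.Linear."""
    gate = F.softmax(router(x), dim=-1)              # (tokens, n_experts)
    weights, idx = gate.topk(k, dim=-1)              # keep only the k best experts per token
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```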
Parallelism
Pretraining parallelizes along three axes: data parallelism, tensor parallelism, and pipeline parallelism. Gradient synchronization is the main bottleneck on very large clusters (see the sketch below).
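A minimal sketch of the data-parallel piece (assuming PyTorch with torch.distributed already initialized); tensor and pipeline parallelism additionally split individual layers and the layer stack across devices.

```python
import torch.distributed as dist

def allreduce_gradients(model):
    """Average gradients across data-parallel workers after backward();
    this synchronization step is the bottleneck mentioned above."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```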
Hardware requirements
GPUs: LLM pretraining is the dominant workload for H100/B200/GB200 GPUs; fp16/bf16/fp8 GEMM ops are their primary design target (see the mixed-precision sketch after this list).
TPUs: Google TPU v4/v5/Trillium are designed around pretraining Gemini and earlier models, with high systolic-array throughput and fast inter-chip interconnect (ICI).
CPUs: CPUs can train small R&D models, but frontier-scale pretraining is infeasible on CPUs due to limited tensor-op throughput.
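A hedged sketch of the bf16 mixed-precision step these accelerators are optimized for, assuming PyTorch on a CUDA device; `loss_fn` is a stand-in for any of the objectives above, and the clipping threshold is an assumption.

```python
import torch

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # Matrix multiplications run in bf16 on the tensor cores; bf16 keeps fp32's
    # exponent range, so it is less prone to the loss spikes fp16 can cause.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model, batch)
    loss.backward()   # gradients are computed in the (fp32) parameter dtype
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # illustrative threshold
    optimizer.step()
    return loss.item()
```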