Pretraining
How it works
The model receives an input from which some information has been hidden or shifted and must recover it: the next token (GPT), masked tokens (BERT), the matching image–text pair (contrastive learning in CLIP), the next frame (world models). A loss function measures reconstruction/prediction quality. Training runs on GPU/TPU clusters for weeks or months over trillions of tokens. The pretrained model becomes a foundation that can be further fine-tuned, instruction-tuned, RLHF-aligned, or LoRA-adapted for specific applications.
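As an illustration of the most common objective, here is a minimal sketch of a next-token-prediction (causal LM) training loss, assuming PyTorch; `model` is a hypothetical decoder-only network that maps token ids to vocabulary logits.

```python
import torch.nn.functional as F

def causal_lm_loss(model, tokens):
    """tokens: LongTensor of shape (batch, seq_len) sampled from the unlabeled corpus."""
    inputs = tokens[:, :-1]    # the model sees tokens 0 .. T-2
    targets = tokens[:, 1:]    # and must predict tokens 1 .. T-1 (shifted by one)
    logits = model(inputs)     # (batch, seq_len - 1, vocab_size)
    # Cross-entropy between the predicted distribution and the actual next token.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# One optimization step: loss = causal_lm_loss(model, batch); loss.backward(); optimizer.step()
```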
Problem solved
Traditional supervised learning required enormous hand-labeled datasets per task, which did not scale. Self-supervised pretraining solves this by learning from raw unlabeled data — practically unlimited in supply — and transferring that knowledge to many downstream tasks with minimal supervised fine-tuning.
Components
Data: massive unlabeled dataset (web crawl, code, books, video, robot telemetry). Typical scale: 10¹²–10¹³ tokens for LLMs.
Objective: a predictive task that uses the structure of the data as the training signal: next-token prediction, masked language modeling, contrastive loss, next-frame prediction (see the contrastive-loss sketch after this list).
Architecture: most often a Transformer (encoder-only, decoder-only, or encoder-decoder); diffusion models are also used for generative image and video pretraining.
Compute: thousands of GPUs/TPUs running in parallel for weeks or months. Pretraining a GPT-4-class LLM typically requires 10²⁵+ FLOPs.
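To make the contrastive objective concrete, here is a hedged sketch of a CLIP-style loss in PyTorch; `image_emb` and `text_emb` are assumed to be L2-normalized embeddings of a batch of matching image–caption pairs, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) L2-normalized embeddings of matching pairs."""
    # Similarity of every image with every text in the batch.
    logits = image_emb @ text_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; both directions are pulled toward it.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```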
Implementation pitfalls
Benchmark contamination: benchmark data (MMLU, HellaSwag) leaking into the pretraining corpus artificially inflates evaluation scores (see the overlap-check sketch after this list).
Loss spikes: with large learning rates and fp16, the training loss can spike and corrupt the weights; recovery requires restarting from a checkpoint that may be days old.
Compute-suboptimal scaling: training too large a model on too little data (pre-Chinchilla) wastes compute and underperforms a smaller model trained on a larger corpus.
Data quality: raw web crawl contains duplicates, spam, low-quality pages, and toxic content; without filtering, the result is a model weaker than one trained on a 10× smaller clean corpus.
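As a purely illustrative example (not any lab's actual pipeline), a naive n-gram overlap check like the following can flag benchmark contamination in a corpus shard; the 13-gram window is an assumption borrowed from common practice.

```python
def ngrams(text, n=13):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document, benchmark_texts, n=13):
    """Flag a corpus document that shares any n-gram with an evaluation benchmark."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(text, n) for text in benchmark_texts)
```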
Evolution
2013: Mikolov et al. (word2vec) show that self-supervised pretraining (skip-gram, CBOW) yields general-purpose word representations.
2018: OpenAI GPT (causal LM) and Google BERT (masked LM) establish the standard recipe: large pretraining + small task-specific fine-tuning.
2020: GPT-3 (175B parameters, 300B tokens) demonstrates that pretrained knowledge alone solves many tasks without fine-tuning.
2021: OpenAI CLIP unifies images and text in one embedding space via contrastive pretraining on 400M pairs.
2022: DeepMind's Chinchilla shows that prior LLMs were undertrained; compute-optimal training needs roughly 20 tokens per parameter.
2023: Meta releases the weights of Llama 2, trained on 2T tokens, democratizing access to large pretrained models.
2023–2025: Pi-Zero (Physical Intelligence), Gemini Robotics, and RT-2 apply pretraining on multimodal + robot data as the foundation of VLAs.
2025: GPT-5, Gemini 3, Claude Opus 4, and Grok 4 reach scales requiring clusters of 100k+ H100/B200 GPUs.
Technical details
Hyperparameters (configurable axes)
Dataset size: number of tokens in the training corpus. Scale: 10⁹ (small models) to 10¹³+ (frontier LLMs).
Model size: number of model parameters. Chinchilla scaling laws suggest an optimal tokens-to-parameters ratio of about 20:1 (see the sizing sketch after this list).
Objective: choice of pretraining task: causal LM (GPT), masked LM (BERT), contrastive (CLIP), denoising (T5), next-frame (world models).
Compute budget: total floating-point operations. GPT-3 ≈ 3·10²³, GPT-4 ≈ 2·10²⁵, frontier 2025+ ≈ 10²⁶.
Data curation: deduplication, quality classification, and toxicity filtering pipeline. Determines the effective share of "useful tokens".
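A back-of-the-envelope sizing sketch under the approximations above: training compute C ≈ 6·N·D FLOPs and a compute-optimal ratio D ≈ 20·N, which gives N ≈ √(C/120). Both constants are the commonly cited heuristics, not exact values.

```python
import math

def chinchilla_optimal(compute_flops):
    """Return an approximately compute-optimal (parameters, tokens) pair."""
    params = math.sqrt(compute_flops / 120)   # from C = 6*N*D and D = 20*N
    tokens = 20 * params
    return params, tokens

n, d = chinchilla_optimal(3e23)               # roughly GPT-3-scale compute
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")   # ~50B params, ~1.0T tokens
```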
Execution paradigm
In standard dense pretraining, every parameter participates in each forward and backward pass and is updated at every step. Mixture-of-Experts (MoE) variants activate only a few experts per token, cutting per-token compute, although across a large batch most experts still receive gradient updates at each step (see the routing sketch below).
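A hedged sketch of the top-k routing that produces this sparse activation, assuming PyTorch; real MoE layers additionally use load-balancing losses, capacity limits, and renormalized gate weights.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, experts, router, k=2):
    """x: (tokens, d_model); experts: list of small feed-forward nets; router: nn.Linear."""
    gate = F.softmax(router(x), dim=-1)              # (tokens, n_experts)
    weights, idx = gate.topk(k, dim=-1)              # keep only the k best experts per token
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```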
Parallelism
Pretraining parallelizes along three axes: data parallelism, tensor parallelism, and pipeline parallelism. Gradient synchronization is the main bottleneck on very large clusters (see the sketch below).
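A minimal sketch of the data-parallel piece (assuming PyTorch with torch.distributed already initialized); tensor and pipeline parallelism additionally split individual layers and the layer stack across devices.

```python
import torch.distributed as dist

def allreduce_gradients(model):
    """Average gradients across data-parallel workers after backward();
    this synchronization step is the bottleneck mentioned above."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```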
Hardware requirements
GPUs: LLM pretraining is the dominant workload for H100/B200/GB200 GPUs; fp16/bf16/fp8 GEMM ops are their primary design target (see the mixed-precision sketch after this list).
TPUs: Google TPU v4/v5/Trillium are designed around pretraining Gemini and earlier models, with high systolic-array throughput and fast inter-chip interconnect (ICI).
CPUs: CPUs can train small R&D models, but frontier-scale pretraining is infeasible on CPUs due to limited tensor-op throughput.
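A hedged sketch of the bf16 mixed-precision step these accelerators are optimized for, assuming PyTorch on a CUDA device; `loss_fn` is a stand-in for any of the objectives above, and the clipping threshold is an assumption.

```python
import torch

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # Matrix multiplications run in bf16 on the tensor cores; bf16 keeps fp32's
    # exponent range, so it is less prone to the loss spikes fp16 can cause.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model, batch)
    loss.backward()   # gradients are computed in the (fp32) parameter dtype
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # illustrative threshold
    optimizer.step()
    return loss.item()
```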