Training

Pretraining

2018 · Active · Published: 6 May 2026 · Updated: 6 May 2026
Key innovation
Training a model on massive unlabeled corpora using self-supervised objectives (e.g., next-token prediction, masked language modeling) to learn general-purpose representations before fine-tuning for specific tasks.
Category
Training
Abstraction level
Paradigm
Operation level
Data
Use cases
Foundation models (LLMs, VLMs, VLAs)
Pretrained text encoders (BERT, RoBERTa)
Generative language models (GPT)
Multimodal contrastive models (CLIP)
Robotics foundation models (Pi-Zero, Gemini Robotics, Ti0)
World models (action-conditioned video generation)
Pretrained audio/speech models (Wav2Vec, Whisper)

How it works

The model receives an input fragment with partially hidden or shifted information (next-token prediction in GPT, masked language modeling in BERT, contrastive learning in CLIP, next-frame prediction in world models). A loss function measures reconstruction/prediction quality. Training runs on GPU/TPU clusters for weeks or months over trillions of tokens. The pretrained model becomes a foundation that can be further fine-tuned, instruction-tuned, RLHF-aligned, or LoRA-adapted for specific applications.
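As a minimal sketch of this loop for the causal (next-token) objective, the PyTorch fragment below uses toy-sized stand-ins for the backbone and corpus; all sizes, the optimizer settings, and the random token batch are illustrative placeholders, not any production recipe.

```python
# Minimal next-token-prediction pretraining step (PyTorch).
# Toy sizes and a random "corpus" shard stand in for real data and a real backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch = 1000, 128, 64, 8

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(backbone.parameters()) + list(lm_head.parameters())
opt = torch.optim.AdamW(params, lr=3e-4)

tokens = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for a corpus shard

# Self-supervised targets come from the data itself: predict token t+1 from tokens <= t.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
causal_mask = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))
hidden = backbone(embed(inputs), mask=causal_mask)        # causal mask hides future tokens
logits = lm_head(hidden)                                  # (batch, seq_len-1, vocab)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()
opt.zero_grad()
print(f"next-token loss: {loss.item():.3f}")
```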

Problem solved

Traditional supervised learning required enormous hand-labeled datasets per task, which did not scale. Self-supervised pretraining solves this by learning from raw unlabeled data — practically unlimited in supply — and transferring that knowledge to many downstream tasks with minimal supervised fine-tuning.

Components

Raw data corpus · Source of data for self-supervised training

Massive unlabeled dataset (web crawl, code, books, video, robot telemetry). Typical scale: 10¹²–10¹³ tokens for LLMs.

Self-supervised objective · Loss function without human labels

Predictive task that uses the data structure as the training signal — next-token prediction, masked language modeling, contrastive loss, next-frame prediction.
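For the masked-LM flavor of this objective, a minimal sketch of how targets are derived from the data itself is shown below; the 15% masking rate and the mask token id are illustrative choices (real BERT additionally replaces some selected tokens with random or unchanged tokens).

```python
# Sketch of masked-language-modeling target construction (BERT-style, simplified).
import torch

vocab_size, mask_token_id, mask_prob = 1000, 0, 0.15
tokens = torch.randint(1, vocab_size, (4, 32))   # a small batch of token ids

# Choose positions to hide; only these positions contribute to the loss.
mask = torch.rand(tokens.shape) < mask_prob
inputs = tokens.clone()
inputs[mask] = mask_token_id                     # hide the selected tokens

labels = tokens.clone()
labels[~mask] = -100                             # -100 is ignored by F.cross_entropy

# A model would now predict logits over the vocabulary at every position, and
# F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1)) scores only masked ones.
print(f"masked positions: {int(mask.sum())} of {mask.numel()}")
```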

Base architecture · Backbone holding the representations learned during pretraining

Most often a Transformer (encoder-only, decoder-only, or encoder-decoder); diffusion models are also used for generative image and video pretraining.

Compute cluster · Training infrastructure

Thousands of GPUs/TPUs running in parallel for weeks/months. Pretraining a GPT-4-class LLM typically requires 10²⁵+ FLOPs.
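A rough way to sanity-check such figures is the common FLOPs ≈ 6 · N · D rule of thumb (N parameters, D training tokens); the numbers below are illustrative estimates, not reported vendor figures.

```python
# Back-of-the-envelope compute estimate via the ~6 * N * D rule of thumb.
def pretraining_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

print(f"175B params x 300B tokens ~ {pretraining_flops(175e9, 300e9):.1e} FLOPs")  # ~3e23 (GPT-3 scale)
print(f"70B params  x 15T tokens  ~ {pretraining_flops(70e9, 15e12):.1e} FLOPs")   # Llama-3-70B scale
print(f"1T params   x 15T tokens  ~ {pretraining_flops(1e12, 15e12):.1e} FLOPs")   # ~1e26, frontier scale
```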

Implementation

Implementation pitfalls
Data contamination · High

Benchmark data (MMLU, HellaSwag) leaking into the pretraining corpus artificially inflates evaluation scores.

Fix: Decontamination pipeline that removes benchmark n-grams from the training corpus; evaluate on fresh, held-out datasets created after training.
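A minimal sketch of such a pipeline, assuming simple whitespace tokenization and a 13-gram window (both illustrative choices; production pipelines are more elaborate):

```python
# Sketch of n-gram decontamination: drop training documents that share long n-grams
# with evaluation benchmarks.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(corpus: list[str], benchmarks: list[str], n: int = 13) -> list[str]:
    banned = set()
    for doc in benchmarks:
        banned |= ngrams(doc, n)
    # Keep only training documents that contain no banned n-gram.
    return [doc for doc in corpus if not (ngrams(doc, n) & banned)]
```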
Loss spikes and training instability · Critical

With large learning rates and fp16, the training loss can spike and corrupt the weights; recovery means restarting from a checkpoint that may be days old.

Fix: Mixed precision (bfloat16), gradient clipping, learning-rate warmup, frequent checkpointing, and gradient-statistics monitoring.
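A minimal sketch combining these measures in a single PyTorch loop; the model, batch, learning rate, and checkpoint interval are placeholders.

```python
# Stability measures in one loop: bf16 autocast, gradient clipping, LR warmup,
# periodic checkpointing, and gradient-norm monitoring. All sizes are illustrative.
import torch

model = torch.nn.Linear(512, 512)                 # stand-in for the real backbone
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
warmup_steps, ckpt_every = 1000, 1000

for step in range(1, 3001):
    # Linear warmup avoids the large early updates that often trigger loss spikes.
    for group in opt.param_groups:
        group["lr"] = 3e-4 * min(1.0, step / warmup_steps)

    x = torch.randn(32, 512)                      # stand-in batch
    with torch.autocast("cpu", dtype=torch.bfloat16):  # bf16 keeps fp32's exponent range
        loss = model(x).pow(2).mean()
    loss.backward()

    # Clip spiky gradients; the returned norm doubles as a monitoring signal.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    opt.zero_grad()

    if step % 500 == 0:
        print(f"step {step}: loss {loss.item():.3f}, grad norm {float(grad_norm):.3f}")
    if step % ckpt_every == 0:                    # frequent checkpoints limit lost work
        torch.save({"step": step, "model": model.state_dict(), "opt": opt.state_dict()},
                   f"ckpt_{step}.pt")
```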
Suboptimal tokens-to-parameters ratio · High

Training too large a model on too little data (pre-Chinchilla) wastes compute and underperforms a smaller model on a larger corpus.

Fix: Apply Chinchilla scaling laws (~20 tokens/param) or newer recipes (Llama 3 trained at >100 tokens/param for inference efficiency).
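A worked example of the heuristic: with compute C ≈ 6·N·D and a target ratio D ≈ 20·N, a fixed budget determines both sizes. The helper below is illustrative only.

```python
# Compute-optimal sizing under the ~20 tokens-per-parameter heuristic.
# C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 r)), D = r * N
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e23, 1e24, 1e25):
    n, d = chinchilla_optimal(budget)
    print(f"{budget:.0e} FLOPs -> ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```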
Low data quality · High

Raw web crawl contains duplicates, spam, low-quality pages, and toxic content. Without filtering, the resulting model can be weaker than one trained on a clean corpus 10× smaller.

Fix: Deduplication pipeline (MinHash, exact match), quality classification (FastText, Wikipedia-style classifier), and toxicity filtering.
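A self-contained MinHash sketch for near-duplicate detection over word shingles; the shingle length, 64 permutations, and 0.8 threshold are illustrative, and production pipelines add locality-sensitive hashing instead of the pairwise comparison used here.

```python
# MinHash-based near-duplicate detection over word shingles (pure Python sketch).
import hashlib

NUM_PERM = 64

def minhash(text: str, shingle_len: int = 5) -> list[int]:
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_len])
                for i in range(max(1, len(words) - shingle_len + 1))}
    sig = []
    for seed in range(NUM_PERM):
        # Seeded hash simulates one random permutation; keep the minimum per shingle set.
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def est_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # Fraction of matching minimum hashes estimates Jaccard similarity of shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

def dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    kept, signatures = [], []
    for doc in docs:
        sig = minhash(doc)
        if all(est_jaccard(sig, prev) < threshold for prev in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```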

Evolution

Original paper · 2018 · OpenAI Tech Report · Alec Radford
Improving Language Understanding by Generative Pre-Training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
2013
Word2Vec — pretraining of word embeddings
Inflection point

Mikolov et al. show that self-supervised pretraining (skip-gram, CBOW) yields general-purpose word representations.

2018
GPT-1 and BERT — pretraining + fine-tuning as a paradigm
Inflection point

OpenAI GPT (causal LM) and Google BERT (masked LM) establish the standard: large pretraining + small task-specific fine-tuning.

2020
GPT-3 — pretraining produces models capable of in-context learning
Inflection point

At 175B parameters trained on ~300B tokens, GPT-3 demonstrates that pretrained knowledge alone can solve many tasks without fine-tuning.

2021
CLIP — multimodal contrastive pretraining

OpenAI CLIP unifies image and text in one embedding space via contrastive pretraining on 400M pairs.

2022
Chinchilla — optimal tokens-to-parameters ratio
Inflection point

DeepMind shows that prior LLMs were undertrained — compute-optimal training needs roughly 20 tokens per parameter.

2023
Llama 2 — frontier-scale open-weight pretraining

Meta releases weights of a model trained on 2T tokens, democratizing access to large pretrained models.

2024
Robotics foundation models — pretraining for VLA
Inflection point

Pi-Zero (Physical Intelligence), Gemini Robotics, and RT-2 apply pretraining on multimodal + robot data as the foundation of VLAs.

2025
Frontier-scale pretraining — 10²⁶ FLOPs

GPT-5, Gemini 3, Claude Opus 4, and Grok 4 reach scales requiring clusters of 100k+ H100/B200 GPUs.

Technical details

Hyperparameters (configurable axes)

Corpus size (tokens) · Critical

Number of tokens in the training corpus. Scale: 10⁹ (small models) to 10¹³+ (frontier LLMs).

Model size (parameters) · Critical

Number of model parameters. Chinchilla scaling laws suggest an optimal tokens-to-params ratio of about 20:1.

Self-supervised objective type · Critical

Choice of task: causal LM (GPT), masked LM (BERT), contrastive (CLIP), denoising (T5), next-frame (world models).

Compute budget (FLOPs) · High

Total floating-point operations. GPT-3 ≈ 3·10²³, GPT-4 ≈ 2·10²⁵, frontier 2025+ ≈ 10²⁶.

Data quality filtering · High

Deduplication, quality-classification, and toxicity-filtering pipeline. Determines the effective share of "useful tokens" in the corpus.

Execution paradigm

Primary mode
dense

In standard pretraining, all parameters are updated at every step. MoE variants introduce sparse per-token activation, but over a full batch essentially every parameter still receives gradients, so pretraining remains effectively dense.

Activation pattern
all_paths_active

Parallelism

Parallelism level
fully_parallel

Pretraining scales by combining data, tensor, and pipeline parallelism. Gradient synchronization is the main bottleneck on very large clusters.
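A minimal sketch of the data-parallel component, simulating replicas in a single process; the gradient-averaging loop stands in for the all-reduce that distributed training frameworks perform across devices.

```python
# Data parallelism in miniature: each replica computes gradients on its own shard of the
# batch, then gradients are averaged (the "all-reduce" that dominates communication at
# scale). Real systems combine this with tensor and pipeline parallelism.
import copy
import torch

base = torch.nn.Linear(256, 256)
replicas = [copy.deepcopy(base) for _ in range(4)]   # stand-ins for 4 devices

batch = torch.randn(32, 256)
shards = batch.chunk(len(replicas))                  # one micro-batch per replica

for replica, shard in zip(replicas, shards):
    loss = replica(shard).pow(2).mean()
    loss.backward()                                  # local gradients only

# All-reduce: average each parameter's gradient across replicas so that every copy
# would take an identical optimizer step and stay in sync.
for params in zip(*(r.parameters() for r in replicas)):
    avg_grad = torch.stack([p.grad for p in params]).mean(dim=0)
    for p in params:
        p.grad = avg_grad.clone()
```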

Scope
training · across_devices

Hardware requirements

GPU · Primary

LLM pretraining is the dominant workload for H100/B200/GB200 GPUs — fp16/bf16/fp8 GEMM ops are their primary design target.

TPU · Primary

Google TPU v4/v5/Trillium are designed around pretraining Gemini and earlier models: high systolic-array throughput and Inter-Chip Interconnect (ICI).

CPU · Limited

CPUs can train small R&D models, but frontier-scale pretraining is infeasible on CPUs due to limited tensor-ops throughput.