RLHF
Components
First RLHF stage: supervised fine-tuning of the base model on a dataset of human-written demonstrations (prompt–response pairs). The resulting model π_SFT serves as the starting point for RL training and as the reference model for computing the KL penalty.
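A minimal sketch of the SFT step, assuming Hugging Face Transformers with a placeholder base model; real pipelines iterate over a demonstration dataset and usually mask the loss to response tokens only.

```python
# Minimal SFT sketch ("gpt2" is a placeholder base model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One demonstration pair; a real run loops over many (prompt, response) pairs.
prompt = "Explain RLHF in one sentence."
response = " RLHF fine-tunes a language model against human preference signals."
batch = tokenizer(prompt + response, return_tensors="pt")

# Standard next-token cross-entropy; in practice the prompt tokens are often
# masked out of the loss so only the response is learned.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```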
Scalar model r_φ(x, y) trained on pairwise human comparison data. Learns to predict which response a human would prefer and provides the reward signal for the RL stage. Typically uses a Bradley-Terry objective: minimizes −log σ(r_φ(x, y_w) − r_φ(x, y_l)), where y_w is the preferred and y_l the rejected response.
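The Bradley-Terry objective above reduces to a one-line pairwise loss. A sketch in PyTorch, assuming r_chosen and r_rejected are the scalar head outputs for y_w and y_l on the same prompt:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigma(r_phi(x, y_w) - r_phi(x, y_l)), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: a batch of three comparison pairs.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3, 2.0]),
                          torch.tensor([0.8, 0.9, 1.5]))
```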
Third RLHF stage: optimization of policy π_θ via PPO to maximize the reward model's score while penalizing deviation from the reference policy (SFT). Objective: r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β discourages reward hacking and keeps the policy close to the SFT distribution.
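In common implementations the KL term is applied per token using a sample-based estimate, with the RM score credited at the final token. A sketch under those assumptions (all names illustrative):

```python
import torch

def shaped_rewards(rm_score: torch.Tensor,     # scalar r_phi(x, y) for the full response
                   logp_policy: torch.Tensor,  # [T] log pi_theta(y_t | x, y_<t)
                   logp_ref: torch.Tensor,     # [T] log pi_SFT(y_t | x, y_<t)
                   beta: float = 0.1) -> torch.Tensor:
    kl_per_token = logp_policy - logp_ref      # sample-based KL estimate
    rewards = -beta * kl_per_token             # KL penalty at every token
    rewards[-1] = rewards[-1] + rm_score       # RM score only at the last token
    return rewards                             # fed into PPO advantage estimation
```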
Dataset collected from human annotators containing pairwise comparisons of model responses to the same prompt, with one response labeled preferred (y_w) and the other rejected (y_l). Used to train the reward model. Annotator quality and consistency directly affect reward model quality.
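An illustrative record format for one comparison (field names are assumptions, not a standard schema):

```python
record = {
    "prompt": "Summarize the article below.",
    "chosen": "A concise, faithful summary...",       # y_w, preferred
    "rejected": "A rambling, inaccurate summary...",  # y_l, rejected
    "annotator_id": "a-17",  # enables measuring inter-annotator agreement
}
```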
Implementation
The policy model may learn to generate responses that score highly on the reward model but are actually low quality: excessively long, repetitive, formulaic, or containing phrases the reward model learned to reward disproportionately.
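One cheap diagnostic, an illustration rather than a complete detector: if RM scores correlate strongly with response length, the policy may be optimizing length rather than quality.

```python
import numpy as np

def length_reward_correlation(responses: list[str], rewards: list[float]) -> float:
    # Pearson correlation between word count and RM score over a batch of rollouts.
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    return float(np.corrcoef(lengths, np.asarray(rewards, dtype=float))[0, 1])
```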
PPO training is sensitive to hyperparameters: learning rate, KL penalty β, batch size, PPO clipping range, and number of PPO epochs per batch. Small changes can cause training divergence or loss of language capabilities.
Different human annotators may have inconsistent preferences, introducing noise into the preference dataset and degrading reward model quality. Both the number of annotators and the clarity of the annotation guidelines affect the outcome.
RLHF can degrade model performance on standard NLP benchmarks (alignment tax): the model becomes more helpful and safe but may lose some raw language capabilities if β and LR are not tuned appropriately.
The RL stage requires loading four models simultaneously into GPU memory (policy, reference, reward model, value model). For 7B models this is ~56 GB of weights in fp16 alone, requiring advanced memory management techniques.
Evolution
The paper 'Deep reinforcement learning from human preferences' (Christiano et al., 2017) showed that human preferences between trajectory segments can effectively replace a hand-designed reward function in RL, enabling complex behavior learning in Atari and simulated robotics environments.
The paper 'Learning to summarize from human feedback' (Stiennon et al., 2020) extended RLHF to text summarization with GPT models, demonstrating the transfer of the technique from classic RL tasks to NLP tasks with language models.
The paper 'Training language models to follow instructions with human feedback' (Ouyang et al., 2022) presented the full RLHF pipeline (SFT → RM training → PPO) for GPT-3, creating InstructGPT. It showed that a 1.3B model trained with RLHF is preferred by human raters over the 175B GPT-3 model without RLHF.
OpenAI deployed RLHF in ChatGPT (released November 2022), the first widely used AI assistant trained with RLHF techniques. This triggered widespread adoption of RLHF by other labs (Anthropic, Google, Meta).
Rafailov et al. published DPO ('Direct Preference Optimization', arXiv:2305.18290, 2023), showing that the RLHF objective can be optimized directly via a single supervised loss on preference pairs, without a separate reward model or PPO loop.
Technical details
Hyperparameters (configurable axes)
Weight of the KL penalty in the PPO objective: Objective = r_φ(x,y) − β · KL(π_θ||π_SFT). Too small → reward hacking. Too large → minimal policy change from SFT.
The reward model typically shares the same architecture as the policy LLM, with a scalar output head instead of a language head. Reward model size affects preference signal quality.
Size of the preference dataset used to train the reward model. Directly impacts annotation cost and reward model quality.
Learning rate in the PPO stage. Too high → instability and reward hacking; too low → slow convergence.
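The knobs above gathered in one place; the defaults are illustrative starting points, not values from the source.

```python
from dataclasses import dataclass

@dataclass
class RLHFConfig:
    kl_coef: float = 0.1             # beta in r_phi - beta * KL
    ppo_learning_rate: float = 1e-6  # PPO-stage LR
    batch_size: int = 64             # rollouts per PPO update
    clip_range: float = 0.2          # PPO ratio-clipping epsilon
    ppo_epochs: int = 4              # optimization epochs per rollout batch
    preference_pairs: int = 50_000   # RM training set size (annotation budget)
```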
Compute bottleneck
The standard RLHF RL stage requires loading four models simultaneously into GPU memory: the active policy (π_θ), a frozen reference policy (π_SFT) for KL penalty computation, the reward model (r_φ), and a value/critic model for PPO advantage estimation. For 7B models, this means ~4×14 GB = ~56 GB of weights in fp16 alone, before optimizer states and activations.
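The arithmetic, made explicit (fp16 = 2 bytes per parameter; optimizer states, gradients, activations, and KV caches come on top):

```python
def fp16_weights_gb(params_billions: float) -> float:
    # parameters * 2 bytes, expressed in GB (1e9 bytes)
    return params_billions * 1e9 * 2 / 1e9

models = {"policy": 7.0, "reference": 7.0, "reward": 7.0, "value": 7.0}
total_gb = sum(fp16_weights_gb(b) for b in models.values())  # 4 * 14 GB = 56 GB
```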
Execution paradigm
RLHF is a training pipeline, not an inference paradigm. Each of its three stages uses a standard dense Transformer. The term "stage-dependent" refers to the fact that each stage has a distinct training objective: cross-entropy (SFT), binary cross-entropy on pairs (RM), and a KL-penalized policy gradient (RL).
Parallelism
Within each stage, data parallelism and model parallelism (tensor/pipeline parallelism) can be applied across multiple GPUs/TPUs. Rollout generation during the RL stage can be parallelized across multiple policy replicas.
Hardware requirements
RLHF requires efficient GEMM operations across four simultaneous Transformer models during the PPO stage, accelerated by Tensor Cores (NVIDIA A100, H100). On-policy rollout generation is computationally expensive and demands GPUs with large HBM capacity (40–80 GB).
TPU v4/v5 are used by Google for RLHF on Gemini and PaLM-RLHF models. They efficiently handle GEMM operations and can support all four models in a TPU Pod configuration.