RLHF
Components
First RLHF stage: supervised fine-tuning of the base model on a dataset of human-written demonstrations (prompt–response pairs). The resulting model π_SFT serves as the starting point for RL training and as the reference model for computing the KL penalty.
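A minimal sketch of the SFT step, assuming Hugging Face Transformers with a placeholder base model; real pipelines iterate over a demonstration dataset and usually mask the loss to response tokens only.

```python
# Minimal SFT sketch ("gpt2" is a placeholder base model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One demonstration pair; a real run loops over many (prompt, response) pairs.
prompt = "Explain RLHF in one sentence."
response = " RLHF fine-tunes a language model against human preference signals."
batch = tokenizer(prompt + response, return_tensors="pt")

# Standard next-token cross-entropy; in practice the prompt tokens are often
# masked out of the loss so only the response is learned.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```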
Scalar model r_φ(x, y) trained on pairwise human comparison data. Learns to predict which response a human would prefer and provides the reward signal for the RL stage. Typically uses a Bradley-Terry objective: minimizes −log σ(r_φ(x, y_w) − r_φ(x, y_l)), where y_w is the preferred and y_l the rejected response.
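The Bradley-Terry objective above reduces to a one-line pairwise loss. A sketch in PyTorch, assuming r_chosen and r_rejected are the scalar head outputs for y_w and y_l on the same prompt:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigma(r_phi(x, y_w) - r_phi(x, y_l)), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: a batch of three comparison pairs.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3, 2.0]),
                          torch.tensor([0.8, 0.9, 1.5]))
```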
Third RLHF stage: optimization of policy π_θ via PPO to maximize the reward model's score while penalizing deviation from the reference policy (SFT). Objective: r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β discourages reward hacking and keeps the policy close to the SFT distribution.
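In common implementations the KL term is applied per token using a sample-based estimate, with the RM score credited at the final token. A sketch under those assumptions (all names illustrative):

```python
import torch

def shaped_rewards(rm_score: torch.Tensor,     # scalar r_phi(x, y) for the full response
                   logp_policy: torch.Tensor,  # [T] log pi_theta(y_t | x, y_<t)
                   logp_ref: torch.Tensor,     # [T] log pi_SFT(y_t | x, y_<t)
                   beta: float = 0.1) -> torch.Tensor:
    kl_per_token = logp_policy - logp_ref      # sample-based KL estimate
    rewards = -beta * kl_per_token             # KL penalty at every token
    rewards[-1] = rewards[-1] + rm_score       # RM score only at the last token
    return rewards                             # fed into PPO advantage estimation
```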
Dataset collected from human annotators containing pairwise comparisons of model responses to the same prompt, with one response labeled preferred (y_w) and the other rejected (y_l). Used to train the reward model. Annotator quality and consistency directly affect reward model quality.
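An illustrative record format for one comparison (field names are assumptions, not a standard schema):

```python
record = {
    "prompt": "Summarize the article below.",
    "chosen": "A concise, faithful summary...",       # y_w, preferred
    "rejected": "A rambling, inaccurate summary...",  # y_l, rejected
    "annotator_id": "a-17",  # enables measuring inter-annotator agreement
}
```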
Implementation
The policy model may learn to generate responses that score highly on the reward model but are actually low quality: excessively long, repetitive, formulaic, or containing phrases the reward model learned to reward disproportionately.
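One cheap diagnostic, an illustration rather than a complete detector: if RM scores correlate strongly with response length, the policy may be optimizing length rather than quality.

```python
import numpy as np

def length_reward_correlation(responses: list[str], rewards: list[float]) -> float:
    # Pearson correlation between word count and RM score over a batch of rollouts.
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    return float(np.corrcoef(lengths, np.asarray(rewards, dtype=float))[0, 1])
```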
PPO training is sensitive to hyperparameters: learning rate, KL penalty β, batch size, PPO clipping range, and number of PPO epochs per batch. Small changes can cause training divergence or loss of language capabilities.
Different human annotators may have inconsistent preferences, introducing noise into the preference dataset and degrading reward model quality. Both the number of annotators and the clarity of the annotation guidelines affect the outcome.
RLHF can degrade model performance on standard NLP benchmarks (alignment tax): the model becomes more helpful and safe but may lose some raw language capabilities if β and LR are not tuned appropriately.
The RL stage requires loading four models simultaneously into GPU memory (policy, reference, reward model, value model). For 7B models this is ~56 GB of weights in fp16 alone, requiring advanced memory management techniques.
Evolution
The paper 'Deep reinforcement learning from human preferences' (Christiano et al., 2017) showed that human preferences between trajectory segments can effectively replace a hand-designed reward function in RL, enabling complex behavior learning in Atari and simulated robotics environments.
The paper 'Learning to summarize from human feedback' (Stiennon et al., 2020) extended RLHF to text summarization with GPT models, demonstrating the transfer of the technique from classic RL tasks to NLP tasks with language models.
The paper 'Training language models to follow instructions with human feedback' (Ouyang et al., 2022) presented the full RLHF pipeline (SFT → RM training → PPO) for GPT-3, creating InstructGPT. It showed that a 1.3B model trained with RLHF is preferred by human raters over the 175B GPT-3 model without RLHF.
OpenAI deployed RLHF in ChatGPT (released November 2022), the first widely used AI assistant trained with RLHF techniques. This triggered widespread adoption of RLHF by other labs (Anthropic, Google, Meta).
Rafailov et al. published DPO ('Direct Preference Optimization', arXiv:2305.18290, 2023), showing that the RLHF objective can be optimized directly via a single supervised loss on preference pairs, without a separate reward model or PPO loop.
Technical details
Hyperparameters (configurable axes)
Weight of the KL penalty in the PPO objective: Objective = r_φ(x,y) − β · KL(π_θ||π_SFT). Too small → reward hacking. Too large → minimal policy change from SFT.
The reward model typically shares the same architecture as the policy LLM, with a scalar output head instead of a language head. Reward model size affects preference signal quality.
Size of the preference dataset used to train the reward model. Directly impacts annotation cost and reward model quality.
Learning rate in the PPO stage. Too high → instability and reward hacking; too low → slow convergence.
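The knobs above gathered in one place; the defaults are illustrative starting points, not values from the source.

```python
from dataclasses import dataclass

@dataclass
class RLHFConfig:
    kl_coef: float = 0.1             # beta in r_phi - beta * KL
    ppo_learning_rate: float = 1e-6  # PPO-stage LR
    batch_size: int = 64             # rollouts per PPO update
    clip_range: float = 0.2          # PPO ratio-clipping epsilon
    ppo_epochs: int = 4              # optimization epochs per rollout batch
    preference_pairs: int = 50_000   # RM training set size (annotation budget)
```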
Compute bottleneck
The standard RLHF RL stage requires loading four models simultaneously into GPU memory: the active policy (π_θ), a frozen reference policy (π_SFT) for KL penalty computation, the reward model (r_φ), and a value/critic model for PPO advantage estimation. For 7B models, this means ~4×14 GB = ~56 GB of weights in fp16 alone, before optimizer states and activations.
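The arithmetic, made explicit (fp16 = 2 bytes per parameter; optimizer states, gradients, activations, and KV caches come on top):

```python
def fp16_weights_gb(params_billions: float) -> float:
    # parameters * 2 bytes, expressed in GB (1e9 bytes)
    return params_billions * 1e9 * 2 / 1e9

models = {"policy": 7.0, "reference": 7.0, "reward": 7.0, "value": 7.0}
total_gb = sum(fp16_weights_gb(b) for b in models.values())  # 4 * 14 GB = 56 GB
```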
Execution paradigm
RLHF is a training pipeline, not an inference paradigm. Each of its three stages uses a standard dense Transformer. The term "stage-dependent" refers to the fact that each stage has a distinct training objective: cross-entropy (SFT), binary cross-entropy on pairs (RM), and a KL-penalized policy gradient (RL).
Parallelism
Within each stage, data parallelism and model parallelism (tensor/pipeline parallelism) can be applied across multiple GPUs/TPUs. Rollout generation during the RL stage can be parallelized across multiple policy replicas.
Hardware requirements
RLHF requires efficient GEMM operations across four simultaneous Transformer models during the PPO stage, accelerated by Tensor Cores (NVIDIA A100, H100). On-policy rollout generation is computationally expensive and demands GPUs with large HBM capacity (40–80 GB).
TPU v4/v5 are used by Google for RLHF on Gemini and PaLM-RLHF models. They efficiently handle GEMM operations and can support all four models in a TPU Pod configuration.