
Reinforcement Learning from Human Feedback

Replaces hand-crafted reward functions with a reward model trained on human preference data, enabling complex behaviors aligned with human intent without explicitly specifying every reward criterion.

Category
Abstraction level
Operation level
01

SFT

Establishes an initial policy capable of following instructions at a baseline level before the preference signal is applied.

Modular

First RLHF stage: supervised fine-tuning of the base model on a dataset of human-written demonstrations (prompt–response pairs). The resulting model π_SFT serves as the starting point for RL training and as the reference model for computing the KL penalty.
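A minimal sketch of the SFT objective in PyTorch, assuming a causal LM that returns next-token logits; sft_loss and the fixed per-batch prompt_len are illustrative simplifications, not a reference implementation:

```python
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    # Cross-entropy over response tokens only: prompt tokens are masked out
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100            # -100 is ignored by cross_entropy
    shift_logits = logits[:, :-1, :]         # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```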

02

Reward Model (RM)

Transforms subjective human preferences into a scalar reward signal optimizable by an RL algorithm.

Modular

Scalar model r_φ(x, y) trained on pairwise human comparison data. Learns to predict which response a human would prefer and provides the reward signal for the RL stage. Typically uses a Bradley-Terry objective: minimizes -log σ(r(x, y_w) - r(x, y_l)), where y_w is the preferred and y_l the rejected response.
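A minimal sketch of that objective, assuming r_chosen and r_rejected are batches of scalar scores the reward model assigned to y_w and y_l (names are illustrative):

```python
import torch.nn.functional as F

def bradley_terry_loss(r_chosen, r_rejected):
    # Minimizes -log σ(r(x, y_w) - r(x, y_l)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```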

03

RL Fine-Tuning (PPO)

Optimizes the model policy to generate responses aligned with human preferences while maintaining generation stability.

Modular

Third RLHF stage: optimization of policy π_θ via PPO to maximize the reward model's score while penalizing deviation from the reference policy (SFT). Objective: r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β prevents reward hacking. A minimal sketch of this objective follows the algorithm variants below.

  • PPO (Proximal Policy Optimization)
  • A2C (Advantage Actor-Critic)
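A minimal sketch of the KL-penalized reward above, assuming summed per-token log-probabilities of the sampled response under both policies; practical implementations (e.g., TRL) often distribute the KL term over individual tokens instead:

```python
def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.05):
    # Monte-Carlo KL estimate for the sampled response y:
    #   sum_t [ log π_θ(y_t | x, y_<t) - log π_SFT(y_t | x, y_<t) ]
    kl = (logp_policy - logp_ref).sum(dim=-1)
    return rm_score - beta * kl              # the scalar PPO maximizes
```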
04

Human Preference Dataset

Encodes human preferences in a machine-learnable format so they can be distilled into the reward model.

Modular

Dataset collected from human annotators containing pairwise comparisons of model responses (y_w preferred over y_l) for the same prompts. Used to train the reward model. Annotator quality and consistency directly affect reward model quality; an illustrative record schema follows the format options below.

  • Pairwise comparisons (pair rankings)
  • Absolute ratings (Likert ratings)
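An illustrative record schema for pairwise comparisons (field names are assumptions; real pipelines add annotator IDs, timestamps, and quality flags):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # y_w: the response the annotator preferred
    rejected: str  # y_l: the response the annotator ranked lower

pair = PreferencePair(
    prompt="Explain RLHF in one sentence.",
    chosen="RLHF fine-tunes a language model against a reward model "
           "learned from human preference comparisons.",
    rejected="RLHF is a thing people do to models.",
)
```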
Bottleneck: Storing four models simultaneously in memory during the PPO stage

The standard RLHF RL stage requires loading four models simultaneously into GPU memory: the active policy (π_θ), a frozen reference policy (π_SFT) for KL penalty computation, the reward model (r_φ), and a value/critic model for PPO advantage estimation. For 7B models, this means ~4×14 GB = ~56 GB of weights in fp16 alone, before optimizer states and activations.
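A back-of-the-envelope check of that figure (2 bytes per fp16 parameter; the Adam-state estimate of ~12 bytes per parameter for the trainable policy is a standard mixed-precision rule of thumb):

```python
BYTES_PER_PARAM_FP16 = 2
models_b = {"policy": 7e9, "reference": 7e9, "reward": 7e9, "value": 7e9}

weights_gb = sum(models_b.values()) * BYTES_PER_PARAM_FP16 / 1e9
print(f"fp16 weights alone: ~{weights_gb:.0f} GB")     # ~56 GB
# The trainable policy additionally needs gradients (~14 GB in fp16) and
# fp32 Adam states (~12 bytes/param, ~84 GB), before any activations.
```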

Parallelism

Partially parallel

Within each stage, data parallelism and model parallelism (tensor/pipeline parallelism) can be applied across multiple GPUs/TPUs. Rollout generation during the RL stage can be parallelized across multiple policy replicas.

Paradigm

Dense

Stage-dependent

RLHF is a training pipeline, not an inference paradigm. Each of its three stages uses a standard dense Transformer. The term "stage-dependent" refers to the fact that each stage has a distinct training objective: cross-entropy (SFT), binary cross-entropy on pairs (RM), and a KL-penalized policy gradient (RL).

KL penalty coefficient (β)

Critical
  • 0.01 – 0.1: Typical range of β values in RLHF for LLMs. InstructGPT used values within this range.
  • 0.2: A higher β value, giving stronger regularization toward the SFT policy.

Weight of the KL penalty in the PPO objective: Objective = r_φ(x,y) − β · KL(π_θ||π_SFT). Too small → reward hacking. Too large → minimal policy change from SFT.
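A toy numeric illustration of the trade-off (the RM score and KL value are invented for this example):

```python
rm_score, kl = 2.0, 15.0          # hypothetical RM score and sequence KL
for beta in (0.01, 0.1, 0.2):
    print(beta, rm_score - beta * kl)
# 0.01 -> 1.85  (penalty barely constrains the policy: hacking risk)
# 0.1  -> 0.50
# 0.2  -> -1.00 (penalty dominates: policy stays pinned to SFT)
```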

Reward Model Architecture

Standard
  • Same size as the policy (e.g., 7B): Used in InstructGPT; keeps the reward model's language understanding on par with the policy's.
  • Smaller than the policy: Saves GPU memory; may reduce reward-signal quality.

The reward model typically shares the same architecture as the policy LLM, with a scalar output head instead of a language head. Reward model size affects preference signal quality.
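A sketch of such a scalar head, assuming access to the transformer's final hidden states and an attention mask; this mirrors a common pattern (e.g., TRL-style value heads) rather than any fixed API:

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states, attention_mask):
        # Score the hidden state of the last non-padding token per sequence
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden_states.size(0),
                                 device=hidden_states.device)
        last_hidden = hidden_states[batch_idx, last_idx]
        return self.score(last_hidden).squeeze(-1)  # one scalar reward each
```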

Number of preference comparison pairs

Standard
  • ~33,000: Number of comparison pairs used by OpenAI to train the reward model for InstructGPT (Ouyang et al., 2022).
  • ~500,000+: Scale used for larger models (e.g., Claude, Gemini).

Size of the preference dataset used to train the reward model. Directly impacts annotation cost and reward model quality.

PPO learning rate

Standard
  • 1e-6 – 1e-5: Typical range for RLHF LLM fine-tuning, significantly lower than pretraining learning rates.

Learning rate in the PPO stage. Too high → instability and reward hacking; too low → slow convergence.

Common pitfalls

Reward hacking – exploiting weaknesses in the reward model
HIGH

The policy model may learn to generate responses that score highly on the reward model but are actually low quality: excessively long, repetitive, formulaic, or containing phrases the reward model learned to reward disproportionately.

Applying a KL penalty (β) to constrain deviation from π_SFT. Regularly monitoring response quality on a human-evaluated test set. Limiting the number of PPO steps and tracking reward scale.
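One possible monitoring heuristic, with entirely illustrative thresholds; rising KL and ballooning response length are both classic hacking symptoms worth alarming on:

```python
def reward_hacking_alarm(mean_kl, mean_len, sft_mean_len,
                         kl_max=15.0, len_ratio_max=2.0):
    # Flag runs where the policy drifts too far from SFT or responses
    # grow suspiciously long relative to the SFT baseline
    return mean_kl > kl_max or mean_len > len_ratio_max * sft_mean_len
```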

PPO training instability
HIGH

PPO training is sensitive to hyperparameters: learning rate, KL penalty β, batch size, PPO clipping range, and number of PPO epochs per batch. Small changes can cause training divergence or loss of language capabilities.

Use well-established hyperparameter ranges (LR ~1e-6–1e-5, β ~0.01–0.1). Monitor reward, KL loss, and model-generated samples throughout training. Checkpoint the model regularly.

Annotator Inconsistency and Subjectivity
HIGH

Different human annotators may have inconsistent preferences, introducing noise into the preference dataset and degrading reward model quality. Both the number of annotators and clarity of annotation guidelines impact the outcome.

Precise annotation guidelines with examples. Filtering annotators based on inter-annotator agreement. Multiple annotations per example with aggregation. Applying additional quality control mechanisms (screening tests as used in InstructGPT).

Alignment tax – loss of base model capabilities
MEDIUM

RLHF can degrade model performance on standard NLP benchmarks (alignment tax): the model becomes more helpful and safe but may lose some raw language capabilities if β and LR are not tuned appropriately.

Applying PPO-ptx (mixing PPO updates with pre-training gradients, as in InstructGPT). Regularly evaluating on benchmarks both during and after RL training. Careful tuning of β.

Extremely high GPU memory requirements during the PPO stage
MEDIUM

The RL stage requires loading four models simultaneously into GPU memory (policy, reference, reward model, value model). For 7B models this is ~56 GB of weights in fp16 alone, requiring advanced memory management techniques.

Using libraries such as TRL + DeepSpeed ZeRO-3. Gradient checkpointing for the policy model. Offloading frozen models (reference, RM) to CPU when not actively used. Consider DPO as an alternative requiring only two models.

GENESIS · Source paper

Deep reinforcement learning from human preferences
2017 · NeurIPS 2017 (Advances in Neural Information Processing Systems 30) · Paul Christiano, Jan Leike, Tom B. Brown et al.
2017

Christiano et al. define RLHF in the context of deep RL (NeurIPS 2017)

breakthrough

The paper 'Deep reinforcement learning from human preferences' showed that human preferences over pairs of trajectory segments can effectively replace a hand-crafted reward function in RL, enabling agents to learn complex behaviors in Atari and simulated robotics environments.

2020

Stiennon et al. (OpenAI) apply RLHF to text summarization

The paper 'Learning to summarize from human feedback' extended RLHF to text summarization with GPT models, demonstrating the transfer of the technique from classic RL tasks to NLP tasks with language models.

2022

InstructGPT (Ouyang et al., NeurIPS 2022) – RLHF established as standard LLM alignment method

breakthrough

The paper 'Training language models to follow instructions with human feedback' presented the full RLHF pipeline (SFT → RM training → PPO) for GPT-3, creating InstructGPT. It showed that a 1.3B model trained with RLHF is preferred by humans over the 175B GPT-3 model without RLHF.

2022

ChatGPT (December 2022) – broad deployment of RLHF in consumer products

breakthrough

OpenAI deployed RLHF in ChatGPT, the first widely-used AI assistant trained with RLHF techniques. This triggered widespread adoption of RLHF by other labs (Anthropic, Google, Meta).

2023

Direct Preference Optimization (DPO) – mathematically equivalent alternative to RLHF without RL

breakthrough

Rafailov et al. published DPO (arXiv:2305.18290), showing that the RLHF objective can be optimized directly via a single supervised loss on preference pairs, without a separate reward model or PPO loop.
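The DPO loss itself, sketched with sequence-level log-probabilities under the policy and the frozen reference model (variable names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards are β-scaled log-probability ratios vs. the reference
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```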

GPU Tensor Cores · PRIMARY

RLHF requires efficient GEMM operations across four simultaneous Transformer models during the PPO stage, accelerated by Tensor Cores (NVIDIA A100, H100). On-policy rollout generation is computationally expensive and demands GPUs with large HBM capacity (40–80 GB).

Multi-node multi-GPU setups are standard for RLHF on models larger than 7B. Training RLHF on 70B+ models requires at least a dozen A100-80GB GPUs with efficient tensor parallelism and pipeline parallelism.

TPU · GOOD

TPU v4/v5 are used by Google for RLHF on Gemini and PaLM-RLHF models. They efficiently handle GEMM operations and can support all four models in a TPU Pod configuration.

Implementing RLHF on TPUs requires a JAX/Flax stack and adapting the PPO loop to the TPU environment.

BUILT ON

SFT

Supervised Fine-Tuning (SFT) is a post-training stage in which a pre-trained language model is further optimized on a labeled set of (prompt, response) pairs. Each pair contains an instruction or question and a reference response written by a human or filtered automatically. The model minimizes cross-entropy loss on the response tokens. SFT is the first stage of the RLHF pipeline (Ouyang et al., 2022) and is critical for teaching the model to follow instructions. SFT alone can significantly improve model usability without requiring reinforcement learning. The method is used in InstructGPT, ChatGPT, Llama-2-Chat, and many other models.


Commonly used with

SFT

See the SFT entry under BUILT ON above.

Instruction Tuning

Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions.

The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and the model size.

Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.
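An illustrative Alpaca-style record with its prompt template (the field names follow the three-field format named above; the exact template wording is an assumption):

```python
record = {
    "instruction": "Summarize the paragraph below in one sentence.",
    "input": "RLHF trains a reward model on pairwise human comparisons ...",
    "output": "RLHF replaces hand-written rewards with a learned preference model.",
}

TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

text = TEMPLATE.format(**record) + record["output"]
# During tokenization, labels for the prompt portion are set to -100 so
# cross-entropy is computed only on the response tokens.
```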
