Supervised Fine-Tuning
Supervised Fine-Tuning (SFT) adapts a large pre-trained language model to specific tasks and instruction-following behavior using a relatively small, labeled dataset of demonstrations.
The SFT dataset contains (prompt p, response y) pairs. The loss is the negative log-likelihood of the response tokens, L = -Σ_t log P(y_t | p, y_{<t}). The model is trained with gradient descent on these pairs, typically at a small learning rate. Parameter-efficient techniques such as LoRA or QLoRA are often used to reduce compute and memory costs. Data may come from human annotators (e.g. FLAN, Dolly) or be generated synthetically by a stronger model.
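The loss above can be sketched in a few lines of PyTorch. This is a minimal illustration, assuming the common convention (used e.g. by Hugging Face) of marking prompt positions with label -100 so they are excluded from the loss:

```python
# Minimal sketch of the SFT loss. Assumption: labels use -100 on prompt
# tokens so only response tokens contribute (a common convention).
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V); labels: (B, T), with -100 on prompt positions."""
    # Shift so that position t predicts the token at t+1, as in causal LM training.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# Toy example: batch of 1, sequence of 5 tokens, vocabulary of 10.
logits = torch.randn(1, 5, 10)
labels = torch.tensor([[-100, -100, 3, 7, 2]])  # first two tokens = prompt
loss = sft_loss(logits, labels)
```

Masking the prompt means gradient flows only through the demonstration response, matching L = -Σ log P(y_t | p, y_{<t}).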
Pre-trained models are good at text completion but not at following user instructions, answering questions in chat format, or generating safe and helpful responses.
Common pitfalls
Catastrophic forgetting (HIGH)
SFT on a narrow dataset can cause the model to forget previously learned capabilities. Use diverse datasets or regularization.
Overfitting on small SFT datasets (MEDIUM)
With too few examples or too many epochs, the model memorizes demonstrations rather than generalizing.
Data quality is critical (HIGH)
Noisy, inconsistent, or biased SFT data is directly reflected in model behavior. Quality > quantity.
GENESIS · Source paper: Training language models to follow instructions with human feedback

Pre-training + fine-tuning paradigm (GPT-1, BERT)
[breakthrough] Radford et al. and Devlin et al. establish the pre-train/fine-tune paradigm.
FLAN - SFT on instruction datasets
Wei et al. show that fine-tuning on diverse instruction datasets improves zero-shot performance.
InstructGPT - SFT as stage 1 of RLHF
[breakthrough] Ouyang et al. formalize SFT as the first stage before reward modeling and PPO.
LoRA and QLoRA - efficient SFT
Hu et al. (LoRA) and Dettmers et al. (QLoRA) enable SFT on consumer hardware by training only low-rank adapters.
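The low-rank adapter idea can be sketched directly. The class below is a hypothetical illustration (not the `peft` library API): the pretrained weight is frozen and only a low-rank update B·A is trained, with B initialized to zero so the adapter starts as a no-op:

```python
# Hypothetical LoRA sketch: frozen base weight plus trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pretrained path plus scaled low-rank correction (x @ A^T @ B^T).
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(16, 16), r=4)
x = torch.randn(2, 16)
out = layer(x)  # B is zero at init, so out equals the base output
```

Only A and B (r·(in + out) parameters per layer) receive gradients, which is what makes SFT feasible on consumer hardware.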
Commonly used with
RLHF
Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017) and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini.

The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.
2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically with a Bradley-Terry preference objective: loss = -E[log σ(r_φ(x, y_w) - r_φ(x, y_l))], where y_w is the preferred and y_l the rejected response.
3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL-divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)).

The KL penalty with coefficient β is critical to prevent reward hacking. During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory. A key limitation is reward hacking: since the reward model is only a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections, generating outputs that score highly on the reward model but are degenerate or low-quality.
The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
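The stage-2 preference objective reduces, for a single comparison, to -log σ(r(x, y_w) - r(x, y_l)). A minimal sketch (the function name is illustrative):

```python
# Bradley-Terry preference loss for one comparison:
# loss = -log sigmoid(r(x, y_w) - r(x, y_l)).
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss shrinks as the reward model widens the margin between the
# preferred and the rejected response:
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0)
```

At zero margin the loss is log 2; it decays toward zero as the preferred response is scored increasingly higher, which is exactly the gradient signal that trains r_φ.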
Instruction Tuning
Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions.

The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and with model size.

Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.
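The Alpaca three-field format mentioned above can be sketched as a simple template function. The exact wording below is an assumption for illustration, not the verbatim Alpaca release template, and the mask offset is in characters rather than tokens:

```python
# Sketch of the Alpaca three-field (instruction, input, output) format.
# Template wording is an assumption; real pipelines mask in token space.
def build_example(instruction: str, inp: str, output: str):
    if inp:
        prompt = (f"### Instruction:\n{instruction}\n\n"
                  f"### Input:\n{inp}\n\n"
                  f"### Response:\n")
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    # Loss is masked on everything before prompt_len: only the response
    # contributes to the cross-entropy objective.
    return prompt + output, len(prompt)

text, prompt_len = build_example("Translate to French.", "cheese", "fromage")
```

Everything up to `prompt_len` would receive label -100 (or equivalent) so that training optimizes only the output tokens.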
| Title | Publisher | Type |
|---|---|---|
| Training language models to follow instructions with human feedback | — | scientific article |