SFT
How it works
The SFT dataset contains (prompt p, response y) pairs. The loss is the negative log-likelihood of the response tokens, L = -sum_t log P(y_t | p, y_<t), and the model is trained on it with gradient descent, typically at a small learning rate. Parameter-efficient techniques such as LoRA or QLoRA are often used to reduce memory and compute costs. Data may be human-written or curated from existing datasets (e.g. FLAN, Dolly) or synthetically generated by a stronger model.
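A minimal sketch of this loss, assuming a Hugging Face causal LM (the model name and example pair are illustrative): prompt positions are masked so that only response tokens contribute to the loss.

```python
# Minimal SFT loss sketch; model name and example pair are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Summarize: The cat sat on the mat.\nSummary:"
response = " A cat sat on a mat."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
response_ids = tokenizer(response, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# Mask prompt positions with -100 so cross-entropy is computed only on response tokens,
# i.e. -log P(y_t | p, y_<t), averaged over response positions.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=input_ids, labels=labels).loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small learning rate
loss.backward()
optimizer.step()
```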
Problem solved
Pre-trained models are good at text completion but not at following user instructions, answering questions in chat format, or generating safe and helpful responses.
Implementation
SFT on a narrow dataset can cause catastrophic forgetting of previously learned capabilities. Mitigate with diverse data mixtures or regularization (see the data-mixing sketch after these notes).
With too few examples or too many epochs, the model memorizes demonstrations rather than generalizing.
Noisy, inconsistent, or biased SFT data is directly reflected in model behavior. Quality > quantity.
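A rough sketch of the data-mixing mitigation, assuming instruction and general-domain text live in local JSONL files with a "text" field (file names and the 80/20 ratio are hypothetical):

```python
# Sketch: mix instruction data with general-domain text to limit catastrophic forgetting.
# File names and the mixing ratio are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

instruct = load_dataset("json", data_files="instruct_sft.jsonl", split="train")   # {"text": ...}
general = load_dataset("json", data_files="general_replay.jsonl", split="train")  # {"text": ...}

mixed = interleave_datasets([instruct, general], probabilities=[0.8, 0.2], seed=0)
```

Keeping epochs low and monitoring held-out loss addresses the memorization point above.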
Evolution
Radford et al. (GPT) and Devlin et al. (BERT) establish the pre-train/fine-tune paradigm.
Wei et al. (FLAN) show that fine-tuning on diverse instruction datasets improves zero-shot performance on unseen tasks.
Ouyang et al. (InstructGPT) formalize SFT as the first step before reward modeling and PPO.
Hu et al. (LoRA) and Dettmers et al. (QLoRA) enable SFT on consumer hardware by training only low-rank adapters.
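A minimal LoRA sketch with the PEFT library (base model and hyperparameters are illustrative; QLoRA additionally loads the base model in 4-bit via bitsandbytes):

```python
# LoRA sketch: freeze the base model and train only low-rank adapter matrices.
# Base model, rank, and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2; e.g. ["q_proj", "v_proj"] for LLaMA-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```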