SFT
How it works
The SFT dataset contains (prompt p, response y) pairs. The loss is the negative log-likelihood of the response tokens, L = -sum_t log P(y_t | p, y_<t), and the model is trained on it with gradient descent, typically at a small learning rate. Parameter-efficient techniques such as LoRA or QLoRA are often used to reduce memory and compute costs. Data may be human-written or curated from existing datasets (e.g. FLAN, Dolly) or synthetically generated by a stronger model.
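A minimal sketch of this loss, assuming a Hugging Face causal LM (the model name and example pair are illustrative): prompt positions are masked so that only response tokens contribute to the loss.

```python
# Minimal SFT loss sketch; model name and example pair are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Summarize: The cat sat on the mat.\nSummary:"
response = " A cat sat on a mat."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
response_ids = tokenizer(response, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# Mask prompt positions with -100 so cross-entropy is computed only on response tokens,
# i.e. -log P(y_t | p, y_<t), averaged over response positions.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=input_ids, labels=labels).loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small learning rate
loss.backward()
optimizer.step()
```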
Problem solved
Pre-trained models are good at text completion but not at following user instructions, answering questions in chat format, or generating safe and helpful responses.
Implementation
SFT on a narrow dataset can cause catastrophic forgetting of previously learned capabilities. Mitigate with diverse data mixtures or regularization (see the data-mixing sketch after these notes).
With too few examples or too many epochs, the model memorizes demonstrations rather than generalizing.
Noisy, inconsistent, or biased SFT data is directly reflected in model behavior. Quality > quantity.
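A rough sketch of the data-mixing mitigation, assuming instruction and general-domain text live in local JSONL files with a "text" field (file names and the 80/20 ratio are hypothetical):

```python
# Sketch: mix instruction data with general-domain text to limit catastrophic forgetting.
# File names and the mixing ratio are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

instruct = load_dataset("json", data_files="instruct_sft.jsonl", split="train")   # {"text": ...}
general = load_dataset("json", data_files="general_replay.jsonl", split="train")  # {"text": ...}

mixed = interleave_datasets([instruct, general], probabilities=[0.8, 0.2], seed=0)
```

Keeping epochs low and monitoring held-out loss addresses the memorization point above.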
Evolution
Radford et al. (GPT) and Devlin et al. (BERT) establish the pre-train/fine-tune paradigm.
Wei et al. (FLAN) show that fine-tuning on diverse instruction datasets improves zero-shot performance on unseen tasks.
Ouyang et al. (InstructGPT) formalize SFT as the first step before reward modeling and PPO.
Hu et al. (LoRA) and Dettmers et al. (QLoRA) enable SFT on consumer hardware by training only low-rank adapters.
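A minimal LoRA sketch with the PEFT library (base model and hyperparameters are illustrative; QLoRA additionally loads the base model in 4-bit via bitsandbytes):

```python
# LoRA sketch: freeze the base model and train only low-rank adapter matrices.
# Base model, rank, and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2; e.g. ["q_proj", "v_proj"] for LLaMA-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```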