Instruction Tuning
Components
A curated collection of (instruction, optional input, output) triples covering diverse task types. The breadth of task clusters and template diversity are key factors in how well the tuned model generalizes to unseen tasks. Common formats include the Alpaca format and conversation-style chat templates.
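The Alpaca format mentioned above can be sketched as a small example. This is a minimal illustration assuming the field names from the public Alpaca dataset release (`instruction`, `input`, `output`); the example text itself is made up.

```python
# One training example in the Alpaca format: an (instruction, optional
# input, output) triple stored as a JSON-style record.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Instruction tuning fine-tunes a pretrained language model "
             "on tasks verbalized as natural language instructions.",
    "output": "Instruction tuning adapts a pretrained model to follow "
              "natural language instructions.",
}

# Examples with no input context leave the "input" field empty and are
# typically rendered with a shorter prompt template.
no_input_example = {
    "instruction": "Name three primary colors.",
    "input": "",
    "output": "Red, yellow, and blue.",
}
```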
The pretrained language model (typically a causal LM or seq2seq model) that serves as the starting point for fine-tuning. The base model's scale is a key factor: instruction tuning has been shown to provide greater generalization benefits at larger model sizes (Wei et al. 2021).
Standard token-level cross-entropy loss computed only on the target response tokens. Loss is masked (set to -100 or equivalent) on the instruction and input tokens, so the model only learns to predict the output. Training uses standard gradient descent with backpropagation through the model weights.
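The masking described above can be sketched framework-free. This is a minimal illustration of building a label sequence with the conventional `-100` ignore index on prompt positions; the token IDs are placeholders, not output of a real tokenizer, and `build_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # conventional ignore index for cross-entropy loss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response token IDs, masking the prompt
    positions so loss is computed only on the response tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Illustrative token IDs (not from any real tokenizer).
input_ids, labels = build_labels([101, 7592, 2129], [2024, 2017, 102])
# labels -> [-100, -100, -100, 2024, 2017, 102]
```

A framework's cross-entropy loss with an ignore index of -100 then skips the masked positions, so gradients flow only from the response tokens.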
A natural language prompt template that frames each training example as an instruction. FLAN used 10 distinct templates per dataset. Templates typically include: a verb-based task description, the input context (if any), and a prompt for the output. Template diversity is an important factor in zero-shot generalization.
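A template of this shape can be sketched as follows. The wording is illustrative only, not one of the actual FLAN templates, and `render_prompt` is a hypothetical helper.

```python
# A FLAN-style template: task description, optional input context,
# and a cue for the output. Placeholder wording, not a real FLAN template.
TEMPLATE_WITH_INPUT = (
    "{task_description}\n\n"
    "Input: {input}\n\n"
    "Answer:"
)

def render_prompt(task_description, input_text=None):
    """Frame a training example as a natural language instruction."""
    if input_text:
        return TEMPLATE_WITH_INPUT.format(
            task_description=task_description, input=input_text
        )
    # No input context: drop the Input block entirely.
    return f"{task_description}\n\nAnswer:"

prompt = render_prompt(
    "Classify the sentiment of the review as positive or negative.",
    "The battery life is fantastic.",
)
```

In FLAN-style training, several such templates per dataset are sampled so the model does not overfit to one phrasing.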
Implementation pitfalls
Instruction tuning on a small or narrow dataset can cause the model to lose pretrained capabilities (e.g., in-context learning, reasoning on tasks not represented in the fine-tuning distribution). This is a well-documented risk of SFT on small, low-diversity datasets.
Training on too few task types or task clusters limits zero-shot generalization. Wei et al. (2021) ablation studies showed that generalization on unseen tasks consistently improves as the number of task clusters in training increases.
The quality of instruction-response pairs directly determines model behavior. Noisy, inconsistent, or factually incorrect responses in training data cause the model to learn undesirable behaviors including hallucination.
If loss is computed on both instruction and response tokens, the model learns to predict the instruction tokens too, which wastes model capacity and can produce worse instruction-following behavior. Loss should only be computed on the target response tokens.
Inconsistent use of prompt templates, special tokens, or chat templates across training examples confuses the model and degrades instruction-following quality.
Evolution
Wei et al. from Google Research published 'Finetuned Language Models Are Zero-Shot Learners' (arXiv September 2021, ICLR 2022), introducing the term 'instruction tuning' and demonstrating that fine-tuning a 137B model on 62 NLP tasks verbalized as natural language instructions substantially improves zero-shot performance on unseen tasks.
Ouyang et al. (OpenAI) published 'Training language models to follow instructions with human feedback' (NeurIPS 2022), combining supervised instruction tuning with RLHF to produce InstructGPT models that are significantly preferred by human evaluators over base GPT-3 despite having far fewer parameters.
Chung et al. published 'Scaling Instruction-Finetuned Language Models' (arXiv October 2022), demonstrating that scaling the number of instruction tasks to 1,836, adding chain-of-thought data, and using mixed prompting strategies dramatically improves performance across PaLM and T5 model families. Flan-T5 checkpoints were publicly released.
Stanford CRFM released Alpaca (March 2023), an instruction-tuned LLaMA 7B model trained on ~52k instruction-following examples generated by OpenAI's text-davinci-003, demonstrating that effective instruction tuning is achievable at low cost for smaller open-source models using synthetic datasets.
Technical details
Hyperparameters (configurable axes)
The number and diversity of task types included in the instruction dataset. Ablation studies in Wei et al. (2021) and Chung et al. (2022) show that more task clusters systematically improve zero-shot generalization on unseen tasks.
The number of parameters in the pretrained base model. Wei et al. (2021) found that instruction tuning generalization benefits increase with model scale, with smaller models showing minimal improvement.
The total number of (instruction, output) examples used for fine-tuning. Instruction tuning can be effective with relatively small datasets (thousands to hundreds of thousands) compared to pretraining.
Whether chain-of-thought (CoT) examples are included in the instruction tuning mixture. Chung et al. (2022) found that including CoT data significantly improves reasoning capabilities and zero-shot CoT performance without degrading other benchmarks.
The step size for gradient updates during SFT. Instruction tuning typically uses smaller learning rates than pretraining to avoid catastrophic forgetting of pretrained knowledge.
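The configurable axes above can be collected into a single config sketch. The values are illustrative placeholders, not recommendations from any of the cited papers, and `SFTConfig` is a hypothetical name.

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    """Illustrative instruction-tuning hyperparameters.

    Field values are placeholders for the axes discussed above, not
    settings from Wei et al. (2021) or Chung et al. (2022).
    """
    num_task_clusters: int = 60      # breadth of the task mixture
    dataset_size: int = 50_000       # total (instruction, output) pairs
    include_cot: bool = True         # mix in chain-of-thought examples
    learning_rate: float = 2e-5      # smaller than typical pretraining LRs
    epochs: int = 3

cfg = SFTConfig()
```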
Parallelism
Instruction tuning is a standard supervised fine-tuning procedure. Training examples are independent and can be batched in parallel across GPUs/TPUs via data parallelism; for models too large for one device, the weights can additionally be sharded with tensor or pipeline parallelism. No sequential dependencies exist between training examples.
Hardware requirements
Instruction tuning is computationally equivalent to supervised fine-tuning of a large language model. GPUs with Tensor Cores are the dominant hardware for this workload, providing the mixed-precision (BF16/FP16) matrix multiplication throughput that transformer fine-tuning requires.
Large-scale instruction tuning experiments (e.g., Flan-PaLM 540B, Flan-T5) were conducted on TPU pods at Google. TPUs provide high throughput for large-scale SFT and are well-supported by JAX/Flax and PyTorch XLA.