Robots Atlas
Training

Instruction Tuning

Key innovation
Fine-tuning a large pretrained language model on NLP tasks framed as natural-language instructions significantly improves zero-shot performance on unseen tasks.
Category
Training
Abstraction level
Pattern
Operation level
Training / Post-training

Components

Instruction Dataset
Training dataset of (instruction, [input], expected output) examples covering diverse task types. Quality, variety, and number of tasks directly affect the model's generalization ability.

A curated collection of (instruction, optional input, output) triples covering diverse task types. The breadth of task clusters and template diversity are key factors in how well the tuned model generalizes to unseen tasks. Common formats include the Alpaca format and conversation-style chat templates.

Multi-task instruction dataset
A mixture of many NLP task types (e.g., translation, QA, summarization, classification) verbalized via instruction templates. Example: the FLAN collection (Wei et al. 2021, Chung et al. 2022).

Human demonstration dataset
A smaller dataset of human-written demonstrations of desired model behavior, covering open-ended tasks. Example: the SFT dataset in InstructGPT (Ouyang et al. 2022), ~13k examples.

Synthetic instruction dataset
Instructions and responses generated by a stronger LLM (e.g., GPT-4) to bootstrap instruction tuning at lower cost. Example: Stanford Alpaca (52k examples from text-davinci-003).
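The Alpaca format mentioned above can be sketched as follows. The prompt strings follow the public Stanford Alpaca release; the `render_prompt` helper is illustrative, not part of any library:

```python
# Alpaca-style prompt templates (from the public Stanford Alpaca release).
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def render_prompt(example: dict) -> str:
    """Render one (instruction, optional input, output) triple into the
    prompt the model conditions on; the output is the training target."""
    if example.get("input"):
        return ALPACA_WITH_INPUT.format(**example)
    return ALPACA_NO_INPUT.format(instruction=example["instruction"])

example = {
    "instruction": "Translate the sentence to French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
}
prompt = render_prompt(example)   # model input
target = example["output"]        # tokens the loss is computed on
```

Conversation-style datasets use a chat template instead of a flat prompt, but the same (context, target) split applies.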


Pretrained Base Model
The pre-trained language model that is fine-tuned on the instruction dataset. The quality and size of the base model determine the upper bound of instruction-tuning effectiveness.

The pretrained language model (typically a causal LM or seq2seq model) that serves as the starting point for fine-tuning. The base model's scale is a key factor: instruction tuning has been shown to provide greater generalization benefits at larger model sizes (Wei et al. 2021).


Supervised Fine-Tuning Objective
Training objective that minimizes cross-entropy loss on response tokens, with the loss masked on instruction tokens, so gradient updates come only from output-token predictions.

Standard token-level cross-entropy loss computed only on the target response tokens. Loss is masked (set to -100 or equivalent) on the instruction and input tokens, so the model only learns to predict the output. Training uses standard gradient descent with backpropagation through the model weights.
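A minimal, framework-free sketch of this masked objective. Real training code would use a deep-learning framework's cross-entropy with an ignore index; the `masked_cross_entropy` function and toy logits below are illustrative:

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Token-level cross-entropy averaged over unmasked (response) positions.

    logits: list of per-position lists of raw scores over the vocabulary.
    labels: list of target token ids; positions set to `ignore_index`
            (instruction/input tokens) contribute no loss.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # loss masked on instruction tokens
        log_z = math.log(sum(math.exp(s) for s in scores))  # log-sum-exp
        total += log_z - scores[label]                      # -log softmax[label]
        count += 1
    return total / count

# Two instruction positions (masked) followed by two response positions.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
labels = [-100, -100, 2, 0]
loss = masked_cross_entropy(logits, labels)
```

Only the last two positions contribute to `loss`; the first two are ignored exactly as an ignore-index mask would do in a standard framework.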

Instruction Template
Text format or template that converts examples from original datasets into natural-language instruction form. Template diversity improves model generalization.

A natural language prompt template that frames each training example as an instruction. FLAN used 10 distinct templates per dataset. Templates typically include: a verb-based task description, the input context (if any), and a prompt for the output. Template diversity is an important factor in zero-shot generalization.
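The templating idea can be sketched like this. The templates and the `verbalize` helper are hypothetical, loosely modeled on FLAN's hand-written per-dataset template sets:

```python
import random

# Hypothetical templates for one task (translation); the real FLAN
# collection used 10 hand-written templates per dataset.
TRANSLATION_TEMPLATES = [
    "Translate the following sentence to {language}:\n{text}",
    'How would you say "{text}" in {language}?',
    "{text}\n\nGive the {language} translation.",
]

def verbalize(example: dict, templates, rng=random):
    """Turn a raw dataset record into an instruction-tuning pair
    using a randomly chosen template (template diversity)."""
    template = rng.choice(templates)
    return {
        "instruction": template.format(**example),
        "output": example["translation"],
    }

pair = verbalize(
    {"text": "Good morning", "language": "French", "translation": "Bonjour"},
    TRANSLATION_TEMPLATES,
)
```

Sampling a different template per example is what injects the template diversity described above into the training mixture.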


Implementation

Implementation pitfalls
Catastrophic forgetting of pretrained knowledge (High)

Instruction tuning on a small or narrow dataset can cause the model to lose pretrained capabilities (e.g., in-context learning, reasoning on tasks not represented in the fine-tuning distribution). This is a well-documented risk of SFT on small, low-diversity datasets.

Fix: Use a sufficiently large and diverse instruction dataset covering many task types. Include a small proportion of pretraining-style data in the fine-tuning mixture (pretraining regularization, as in InstructGPT PPO-ptx). Use PEFT methods (LoRA) to limit parameter updates.
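The mixture idea from this fix can be sketched as follows; the `mixed_batches` helper and the 10% fraction are illustrative assumptions, not values from InstructGPT:

```python
import random
from itertools import islice

def mixed_batches(instruction_data, pretrain_data, ptx_fraction=0.1, rng=None):
    """Yield training examples with a small fraction drawn from a
    pretraining-style text pool, as a regularizer against forgetting.
    The 0.1 fraction is an illustrative choice, not a prescription."""
    rng = rng or random.Random(0)
    while True:
        if rng.random() < ptx_fraction:
            yield {"kind": "pretrain", "text": rng.choice(pretrain_data)}
        else:
            yield {"kind": "instruction", **rng.choice(instruction_data)}

stream = mixed_batches(
    [{"instruction": "Summarize the passage.", "output": "A short summary."}],
    ["A page of raw web text used for regularization."],
)
examples = list(islice(stream, 1000))
share = sum(e["kind"] == "pretrain" for e in examples) / len(examples)
```

Roughly 10% of the sampled examples come from the pretraining-style pool; the rest are normal instruction pairs.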
Insufficient task diversity (High)

Training on too few task types or task clusters limits zero-shot generalization. Wei et al. (2021) ablation studies showed that generalization on unseen tasks consistently improves as the number of task clusters in training increases.

Fix: Include examples from as many distinct task types as possible. Use multiple prompt templates per task to increase template diversity. Include chain-of-thought examples when reasoning is required.
Low-quality instruction data (Critical)

The quality of instruction-response pairs directly determines model behavior. Noisy, inconsistent, or factually incorrect responses in training data cause the model to learn undesirable behaviors including hallucination.

Fix: Use human-curated or carefully filtered instruction datasets. Apply quality filtering to synthetic datasets generated by LLMs. Prefer diverse, high-quality demonstrations over large quantities of lower-quality examples.
Incorrect loss masking on instruction tokens (Medium)

If loss is computed on both instruction and response tokens, the model learns to predict the instruction tokens too, which wastes model capacity and can produce worse instruction-following behavior. Loss should only be computed on the target response tokens.

Fix: Apply a loss mask (e.g., the -100 label index) to all instruction/input tokens so that gradient updates come only from response token predictions.
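A minimal sketch of this masking, assuming token ids are already available; `build_labels` is an illustrative helper, while the -100 ignore index matches the convention used by common training frameworks:

```python
def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response token ids into one training
    sequence and mask the prompt positions so loss is computed only
    on the response tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token ids for an instruction and its response.
input_ids, labels = build_labels([5, 17, 42], [99, 3])
# input_ids == [5, 17, 42, 99, 3]
# labels    == [-100, -100, -100, 99, 3]
```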
Inconsistent instruction template formatting (Medium)

Inconsistent use of prompt templates, special tokens, or chat templates across training examples confuses the model and degrades instruction-following quality.

Fix: Choose a single prompt template consistent with the target model's expected format (e.g., the model's official chat template) and apply it uniformly across all training examples.
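A minimal sketch of applying one template uniformly. The `<|user|>` / `<|assistant|>` tags are illustrative, not any model's official format; in practice, Hugging Face tokenizers expose a model's official format via `tokenizer.apply_chat_template`:

```python
# Illustrative chat template, applied identically to every training
# conversation; the special tags here are assumptions, not a real format.
def apply_template(messages):
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n")
    parts.append("<|assistant|>\n")  # prompt for the model's response
    return "".join(parts)

text = apply_template([{"role": "user", "content": "Summarize this article."}])
# text == "<|user|>\nSummarize this article.\n<|assistant|>\n"
```

The key property is uniformity: every example in the training set goes through the same function, so the model never sees conflicting delimiters.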

Evolution

Original paper · 2022 · ICLR 2022 · Jason Wei
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
2021
FLAN — first formal large-scale definition of instruction tuning (Wei et al.)
Inflection point

Wei et al. from Google Research published 'Finetuned Language Models Are Zero-Shot Learners' (arXiv September 2021, ICLR 2022), introducing the term 'instruction tuning' and demonstrating that fine-tuning a 137B model on 62 NLP tasks verbalized as natural language instructions substantially improves zero-shot performance on unseen tasks.

2022
InstructGPT — instruction tuning with human feedback (Ouyang et al., OpenAI)
Inflection point

Ouyang et al. (OpenAI) published 'Training language models to follow instructions with human feedback' (NeurIPS 2022), combining supervised instruction tuning with RLHF to produce InstructGPT models that are significantly preferred by human evaluators over base GPT-3 despite having far fewer parameters.

2022
Scaling instruction fine-tuning — Flan-T5 and Flan-PaLM (Chung et al.)
Inflection point

Chung et al. published 'Scaling Instruction-Finetuned Language Models' (arXiv October 2022), demonstrating that scaling the number of instruction tasks to 1,836, adding chain-of-thought data, and using mixed prompting strategies dramatically improves performance across PaLM and T5 model families. Flan-T5 checkpoints were publicly released.

2023
Stanford Alpaca — instruction tuning of open-source models using synthetic data

Stanford CRFM released Alpaca (March 2023), an instruction-tuned LLaMA 7B model trained on ~52k examples generated by OpenAI's text-davinci-003, demonstrating that effective instruction tuning is achievable with synthetic datasets at low cost for smaller open-source models.

Technical details

Hyperparameters (configurable axes)

Number and diversity of task types (Critical)

The number and diversity of task types included in the instruction dataset. Ablation studies in Wei et al. (2021) and Chung et al. (2022) show that more task clusters systematically improve zero-shot generalization on unseen tasks.

62 datasets / 12 task clusters (FLAN 2021): the original FLAN paper used 62 NLP datasets grouped into 12 task clusters
1,836 tasks (Flan 2022 / Flan-T5): scaling instruction fine-tuning (Chung et al. 2022)
Model scale (Critical)

The number of parameters in the pretrained base model. Wei et al. (2021) found that instruction tuning generalization benefits increase with model scale, with smaller models showing minimal improvement.

8B parameters: common scale for open instruction-tuned models (e.g., Llama-3-8B-Instruct)
137B parameters: scale used in the original FLAN experiments (LaMDA-PT)
Number of training examples (High)

The total number of (instruction, output) examples used for fine-tuning. Instruction tuning can be effective with relatively small datasets (thousands to hundreds of thousands) compared to pretraining.

~13,000: InstructGPT SFT dataset (Ouyang et al. 2022)
~52,000: Stanford Alpaca dataset
~1,000,000+: large-scale instruction datasets (e.g., the FLAN mixture)
Chain-of-thought data inclusion (High)

Whether chain-of-thought (CoT) examples are included in the instruction tuning mixture. Chung et al. (2022) found that including CoT data significantly improves reasoning capabilities and zero-shot CoT performance without degrading other benchmarks.

No CoT: standard task-instruction pairs only
With CoT examples: mix of standard instructions and step-by-step reasoning examples
Learning rate (High)

The step size for gradient updates during SFT. Instruction tuning typically uses smaller learning rates than pretraining to avoid catastrophic forgetting of pretrained knowledge.

1e-5 to 3e-5: typical range for full fine-tuning of large LLMs
2e-4 to 1e-3: typical range for LoRA/PEFT-based instruction tuning
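To illustrate why LoRA-based tuning updates so few parameters (and is commonly run at higher learning rates), here is the parameter-count arithmetic for one weight matrix; the 4096 dimension and rank 16 below are illustrative values, not from any specific paper:

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for one LoRA adapter on a d_out x d_in weight:
    low-rank factors A (rank x d_in) and B (d_out x rank) replace a full
    update of the frozen base matrix."""
    return rank * d_in + d_out * rank

full = 4096 * 4096                            # one full attention weight matrix
lora = lora_param_count(4096, 4096, rank=16)  # adapter parameters only
print(full // lora)  # -> 128: LoRA trains ~0.8% of this matrix's parameters
```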

Parallelism

Parallelism level
fully_parallel

Instruction tuning is a standard supervised fine-tuning procedure. Training examples are independent and can be processed in parallel across GPUs/TPUs using data parallelism and tensor parallelism. No sequential dependencies exist between training examples.

Scope
training

Hardware requirements

Primary

Instruction tuning is computationally equivalent to supervised fine-tuning of a large language model. GPU Tensor Cores are the dominant hardware for this workload, supporting mixed-precision (BF16/FP16) matrix multiplications required for transformer fine-tuning.

Good fit

Large-scale instruction tuning experiments (e.g., Flan-PaLM 540B, Flan-T5) were conducted on TPU pods at Google. TPUs provide high throughput for large-scale SFT and are well-supported by JAX/Flax and PyTorch XLA.