Training
RFT (Reinforcement Fine-Tuning)
2023 · Active · Published: 3 May 2026 · Updated: 3 May 2026
Key
innovation
Fine-tunes a pre-trained model on domain-specific tasks using reinforcement-learning rewards from objective graders, improving task accuracy without relying on general RLHF preference alignment.
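To make "reinforcement-learning rewards" concrete, the sketch below shows two programmatic graders of the kind RFT relies on instead of a learned preference model; the function names and task formats are illustrative assumptions, not any particular library's API.

```python
def exact_match_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the model's final answer matches a
    verifiable reference answer, else 0.0."""
    return 1.0 if response.strip() == reference.strip() else 0.0

def partial_credit_reward(response: str, expected_facts: list[str]) -> float:
    """Graded reward: the fraction of required facts present in the
    response, e.g. for tasks whose answers have checkable components."""
    hits = sum(1 for fact in expected_facts if fact in response)
    return hits / max(len(expected_facts), 1)
```

Because these graders are deterministic functions of the output, the reward is an objective measure of task success rather than an estimate of human preference.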
Category
Training
Abstraction level
Pattern
Use cases
Domain-specific model specialization
Scientific reasoning
Code generation optimization
Medical diagnosis assistance
How it works
The model is run on a set of domain-specific tasks, and each response is evaluated by an objective scorer. A policy-gradient update, computed from the reward signal with an algorithm such as PPO or GRPO, is applied to the model's weights. The process iterates until convergence. A toy version of this loop is sketched below.
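As a concrete sketch of this loop, the self-contained toy below replaces the language model with a softmax policy over four candidate answers and uses a GRPO-style group-relative baseline; the task, scorer, and hyperparameters are all illustrative assumptions, not a production trainer.

```python
import torch

torch.manual_seed(0)

# Toy "policy": logits over four candidate answers. A real RFT run
# updates a language model's weights instead of this vector.
answers = ["42", "17", "nope", "maybe"]
logits = torch.zeros(len(answers), requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

def scorer(answer: str) -> float:
    """Objective scorer: 1.0 if the response is verifiably correct."""
    return 1.0 if answer == "42" else 0.0

group_size = 8
for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    # 1. Sample a group of responses for the same task.
    idx = dist.sample((group_size,))
    # 2. Evaluate each response with the objective scorer.
    rewards = torch.tensor([scorer(answers[i]) for i in idx])
    # 3. GRPO-style baseline: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / rewards.std().clamp_min(1e-6)
    # 4. Policy gradient: raise log-probs of above-average responses.
    loss = -(adv * dist.log_prob(idx)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(answers[logits.argmax()])  # the policy converges to "42"
```

The group-relative baseline is what GRPO contributes here: normalizing rewards within each sampled group removes the need for the separate value network that PPO would train.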
Problem solved
Models aligned with general RLHF excel at following instructions but are not optimized for specific tasks with measurable outcomes. RFT closes the gap between general helpfulness and task-specific accuracy.