Training
RFT (Reinforcement Fine-Tuning)
2023 · Active · Published: 3 May 2026 · Updated: 3 May 2026
Key
innovation
Fine-tunes a pre-trained model on domain-specific tasks using reinforcement-learning rewards from objective graders, improving task accuracy without relying on general RLHF preference alignment.
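To make "reinforcement-learning rewards" concrete, the sketch below shows two programmatic graders of the kind RFT relies on instead of a learned preference model; the function names and task formats are illustrative assumptions, not any particular library's API.

```python
def exact_match_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the model's final answer matches a
    verifiable reference answer, else 0.0."""
    return 1.0 if response.strip() == reference.strip() else 0.0

def partial_credit_reward(response: str, expected_facts: list[str]) -> float:
    """Graded reward: the fraction of required facts present in the
    response, e.g. for tasks whose answers have checkable components."""
    hits = sum(1 for fact in expected_facts if fact in response)
    return hits / max(len(expected_facts), 1)
```

Because these graders are deterministic functions of the output, the reward is an objective measure of task success rather than an estimate of human preference.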
Category
Training
Abstraction level
Pattern
Use cases
Domain-specific model specialization
Scientific reasoning
Code generation optimization
Medical diagnosis assistance
How it works
The model is run on a set of domain-specific tasks, and each response is evaluated by an objective scorer. A policy-gradient update, computed from the reward signal with an algorithm such as PPO or GRPO, is applied to the model's weights. The process iterates until convergence. A toy version of this loop is sketched below.
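As a concrete sketch of this loop, the self-contained toy below replaces the language model with a softmax policy over four candidate answers and uses a GRPO-style group-relative baseline; the task, scorer, and hyperparameters are all illustrative assumptions, not a production trainer.

```python
import torch

torch.manual_seed(0)

# Toy "policy": logits over four candidate answers. A real RFT run
# updates a language model's weights instead of this vector.
answers = ["42", "17", "nope", "maybe"]
logits = torch.zeros(len(answers), requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

def scorer(answer: str) -> float:
    """Objective scorer: 1.0 if the response is verifiably correct."""
    return 1.0 if answer == "42" else 0.0

group_size = 8
for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    # 1. Sample a group of responses for the same task.
    idx = dist.sample((group_size,))
    # 2. Evaluate each response with the objective scorer.
    rewards = torch.tensor([scorer(answers[i]) for i in idx])
    # 3. GRPO-style baseline: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / rewards.std().clamp_min(1e-6)
    # 4. Policy gradient: raise log-probs of above-average responses.
    loss = -(adv * dist.log_prob(idx)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(answers[logits.argmax()])  # the policy converges to "42"
```

The group-relative baseline is what GRPO contributes here: normalizing rewards within each sampled group removes the need for the separate value network that PPO would train.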
Problem solved
Models aligned with general RLHF excel at following instructions but are not optimized for specific tasks with measurable outcomes. RFT closes the gap between general helpfulness and task-specific accuracy.