The framework operates at two levels. (1) Macro (TRACE): a trajectory-level reward r_traj = f(accuracy) − λ · tool_cost, where λ is adaptively increased during training by a Reference Tightening mechanism. This progressively reduces the tool-call count without restricting genuine multi-hop search. (2) Micro (On-Policy Distillation): for failed rollouts, an external teacher model generates step-by-step, token-level corrections, which are distilled into the agent with a KL loss. This provides a dense learning signal where outcome rewards are uninformative.
Multimodal agents trained solely on sparse outcome rewards struggle with credit assignment and are not optimized for inference efficiency, so they generate redundant tool-call rounds. Dual-Grained EA-RL addresses both problems at once through two-level optimization.
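The trajectory reward can be illustrated with a minimal sketch. The source does not define f or tool_cost, so the code below assumes f is the identity and tool_cost is the raw number of tool calls; the function name and the λ value are hypothetical.

```python
def trace_reward(accuracy: float, tool_calls: int, lam: float) -> float:
    # r_traj = f(accuracy) - lam * tool_cost.
    # Assumption: f is the identity and tool_cost is the raw call count;
    # neither is specified by the source.
    return accuracy - lam * tool_calls


# Two correct trajectories that differ only in how many tool calls they used.
r_short = trace_reward(accuracy=1.0, tool_calls=2, lam=0.05)
r_long = trace_reward(accuracy=1.0, tool_calls=6, lam=0.05)

# An incorrect trajectory that made no tool calls at all.
r_wrong = trace_reward(accuracy=0.0, tool_calls=0, lam=0.05)
```

Under these assumptions the shorter correct trajectory scores higher than the longer one, and both still score above the incorrect trajectory, which is the ordering the penalty is meant to preserve while λ stays small.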
Trajectory-level reward whose reference threshold is monotonically tightened during training to suppress superfluous tool calls.
Official
Injects dense token-level corrective signals from an external teacher model for failed rollouts.
Official
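A minimal sketch of the token-level signal, in plain Python. The source only says "KL loss", so the direction used here (student relative to teacher), the averaging over token positions, and all function names are assumptions made for illustration.

```python
import math


def reverse_kl(student: list[float], teacher: list[float]) -> float:
    # KL(student || teacher) for one token position.
    # Assumes both inputs are probability distributions over the same
    # vocabulary and that every teacher probability is strictly positive.
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0.0)


def distillation_loss(student_steps, teacher_steps, failed: bool) -> float:
    # Only failed rollouts receive the teacher signal; successful rollouts
    # are left to the outcome reward.
    if not failed:
        return 0.0
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy rollout with two token positions over a two-token vocabulary.
student = [[0.5, 0.5], [0.9, 0.1]]
teacher = [[0.5, 0.5], [0.1, 0.9]]

loss_failed = distillation_loss(student, teacher, failed=True)
loss_ok = distillation_loss(student, teacher, failed=False)
```

The first position contributes nothing because student and teacher agree; all of the loss comes from the second position, where they disagree. That per-position localization is what makes the signal dense compared with a single outcome reward.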
Adaptive update mechanism for λ: the TRACE reference is tightened after each epoch based on the agent's current efficiency.
Official
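The source does not give the update rule, so the following is only one plausible way to sketch a reference that tightens monotonically after each epoch. It assumes the reference is a tool-call count and that "current efficiency" means the epoch's mean tool calls per rollout; both readings, and the function name, are hypothetical.

```python
def tighten_reference(reference: float, epoch_tool_calls: list[int]) -> float:
    # Hypothetical ratchet: follow the agent's mean tool-call count for the
    # epoch, but never let the reference loosen again.
    if not epoch_tool_calls:
        return reference
    current = sum(epoch_tool_calls) / len(epoch_tool_calls)
    return min(reference, current)


reference = 8.0
history = []
# Tool-call counts per rollout for four consecutive epochs (made-up numbers).
for calls in ([7, 6, 8], [5, 6, 4], [6, 7, 6], [3, 4, 3]):
    reference = tighten_reference(reference, calls)
    history.append(reference)
```

In the third epoch the agent regresses to more tool calls, but the reference stays where it was, which is what "monotonically tightened" implies. An upward ratchet on λ could be written the same way with max in place of min.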
If λ grows too fast, the model may restrict legitimate multi-hop search, degrading accuracy.
On-Policy Distillation depends on an external teacher; a weak teacher may inject incorrect corrective signals.
The HyperEyes paper (arXiv:2605.07177) presents the framework as its central contribution, achieving +9.9% accuracy and 5.3× reduction in tool-call rounds over the strongest open-source agent.
Adaptive reference threshold tightening mechanism for the TRACE reward — λ is monotonically increased during training.
Weight of tool call cost in the TRACE reward. Adaptively tightened during training.
KL loss weight in On-Policy Distillation — balance between teacher learning and agent's own policy.
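How the two weights might enter a single training objective can be written as follows. This is a sketch, not the paper's stated loss: the symbol β for the KL weight, the direction of the KL term, and the sum over token positions t are assumptions.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{RL}}\bigl(\theta;\, r_{\mathrm{traj}}\bigr)
  + \beta \,\mathbb{1}[\text{rollout failed}]
    \sum_{t} D_{\mathrm{KL}}\bigl(\pi_\theta(\cdot \mid s_t) \,\|\, \pi_{\mathrm{teacher}}(\cdot \mid s_t)\bigr),
\qquad
r_{\mathrm{traj}} = f(\text{accuracy}) - \lambda \cdot \text{tool\_cost}
```

Read this way, λ trades accuracy against tool usage inside the reward, while β trades the agent's own policy-gradient signal against imitation of the teacher.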
RL training with rollouts and OPD (235B teacher) requires a GPU cluster: the 30B student and the 235B teacher together do not fit on a single machine.
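A back-of-envelope estimate makes the hardware claim concrete. The numbers below are assumptions, not figures from the source: bf16 weights (2 bytes per parameter) for the frozen teacher, the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam training of the student, and a hypothetical single node with 8 GPUs of 80 GB each. Activations and KV cache are ignored, so the real requirement would be higher.

```python
def memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    # simplifies to params_billion * bytes_per_param.
    return params_billion * bytes_per_param


# Assumption: frozen teacher served in bf16 (2 bytes per parameter).
teacher_gb = memory_gb(235.0, 2.0)

# Assumption: student trained with mixed-precision Adam, roughly
# 2 (weights) + 2 (gradients) + 12 (fp32 master copy and moments) bytes.
student_train_gb = memory_gb(30.0, 16.0)

total_gb = teacher_gb + student_train_gb

# Hypothetical single node: 8 GPUs with 80 GB each.
node_gb = 8 * 80.0

fits_on_one_node = total_gb <= node_gb
```

Under these assumptions the teacher weights alone take about 470 GB and student training about 480 GB, roughly 950 GB in total against 640 GB on the assumed node.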