Training

Dual-Grained EA-RL

2026 · Research · Published

Key innovation

Combines a trajectory-level reward (TRACE — adaptive tool-use cost efficiency) with dense token-level corrective signals (On-Policy Distillation) in a single RL framework, making inference efficiency a first-class training objective for multimodal search agents.

How it works

The framework operates at two levels:

(1) Macro (TRACE): a trajectory reward r_traj = f(accuracy) − λ · tool_cost, where λ is adaptively increased during training via a Reference Tightening mechanism. This enforces a progressive reduction in tool-call count without restricting genuine multi-hop search.

(2) Micro (On-Policy Distillation): for failed rollouts, an external teacher model generates step-by-step token-level corrections. These signals are distilled into the agent with a KL-divergence loss, providing dense learning signal where outcome rewards are uninformative.
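The two signals can be sketched in a few lines of plain Python. This is an illustrative sketch, not the HyperEyes implementation: it assumes f is the identity on a scalar accuracy score, that tool_cost is the raw tool-call count, and that the KL term is taken as KL(teacher || student) averaged over token positions. The source specifies none of these choices, and all function names are hypothetical.

```python
import math


def trajectory_reward(accuracy_score: float, tool_calls: int, lam: float) -> float:
    """r_traj = f(accuracy) - lam * tool_cost.

    Assumptions: f is the identity and tool_cost is the number of tool calls.
    """
    return accuracy_score - lam * tool_calls


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits: list[float], student_logits: list[float]) -> float:
    """KL(teacher || student) at a single token position (direction is an assumption)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_seq: list[list[float]],
                      student_seq: list[list[float]]) -> float:
    """Mean per-token KL over the positions of one failed rollout."""
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(kls) / len(kls)


# Toy usage: one correct rollout with 4 tool calls, and a 2-token failed rollout
# over a 3-word vocabulary.
reward = trajectory_reward(accuracy_score=1.0, tool_calls=4, lam=0.05)
loss = distillation_loss(
    teacher_seq=[[2.0, 0.1, -1.0], [0.0, 3.0, 0.5]],
    student_seq=[[0.5, 0.4, 0.2], [1.0, 1.0, 1.0]],
)
print(f"trajectory reward: {reward:.3f}, distillation loss: {loss:.3f}")
```

In a real training loop the KL term would be computed on model logits in a tensor library and applied only to failed rollouts, while the trajectory reward would feed the policy-gradient update for every rollout.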

Problem solved

Multimodal agents trained solely on sparse outcome rewards struggle with credit assignment and are not optimized for inference efficiency, so they generate redundant tool-call rounds. Dual-Grained EA-RL addresses both problems at once: the trajectory-level reward targets efficiency, and the token-level corrective signals target credit assignment.

Components

TRACE (Tool-use Reference-Adaptive Cost Efficiency) · Macro-level efficiency reward signal

Trajectory-level reward whose reference threshold is monotonically tightened during training to suppress superfluous tool calls.

Official

On-Policy Distillation · Micro-level dense credit-assignment learning signal

Injects dense token-level corrective signals from an external teacher model for failed rollouts.

Official

Reference Tightening Mechanism · Adaptive cost coefficient scheduler

Adaptively updates the λ threshold: the TRACE reference is tightened after each epoch based on the agent's current efficiency.
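One way to picture a per-epoch tightening rule is the sketch below. It is an assumption-laden illustration, not the rule from the paper: it takes the "reference" to be a tool-call budget, measures "current efficiency" as the mean tool-call count of correct rollouts in the epoch, and keeps the reference non-increasing. The function name, the floor parameter, and the choice of statistic are all hypothetical.

```python
from statistics import mean


def tighten_reference(current_ref: float,
                      correct_rollout_tool_calls: list[int],
                      floor: float = 1.0) -> float:
    """Return the reference for the next epoch.

    The reference never increases: it moves down to the observed mean
    tool-call count of correct rollouts when the agent has become more
    efficient, and stays put otherwise. `floor` keeps at least some
    budget for genuine multi-hop search (hypothetical safeguard).
    """
    if not correct_rollout_tool_calls:
        return current_ref  # no evidence this epoch, keep the reference
    observed = float(mean(correct_rollout_tool_calls))
    candidate = max(floor, observed)
    return min(current_ref, candidate)


# Toy usage: three epochs of tool-call counts from correct rollouts.
ref = 8.0
history = []
for epoch_counts in ([6, 7, 5], [9, 8], [4, 4, 5]):
    ref = tighten_reference(ref, epoch_counts)
    history.append(ref)
print(history)
```

The second epoch in the toy run is less efficient than the current reference, so the reference is left unchanged rather than loosened, which matches the monotonic tightening described for TRACE.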

Official

Implementation

Reference implementations

HyperEyes

Python · DeepExperience

Official

Implementation pitfalls

Over-aggressive reference tightening · High

If λ grows too fast, the model may restrict legitimate multi-hop search, degrading accuracy.

Fix: Monitor validation accuracy during tightening; use an adaptive rather than a linear schedule.
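A validation-gated schedule for λ might look like the following sketch. The gating rule, the parameter names, and every default value are assumptions chosen for illustration; the source only recommends monitoring validation accuracy and preferring an adaptive schedule.

```python
def update_lambda(lam: float,
                  val_acc: float,
                  best_val_acc: float,
                  step: float = 0.01,
                  tolerance: float = 0.01,
                  backoff: float = 0.5,
                  lam_max: float = 1.0) -> float:
    """Adaptive (validation-gated) update of the cost coefficient.

    Raise lam by `step` only while validation accuracy stays within
    `tolerance` of the best value seen so far; otherwise shrink lam by
    `backoff` so the agent can recover legitimate multi-hop search.
    All defaults are illustrative, not tuned values.
    """
    if val_acc >= best_val_acc - tolerance:
        return min(lam_max, lam + step)
    return lam * backoff


# Toy usage: accuracy holds for two epochs, then drops sharply.
lam, best = 0.05, 0.0
for val_acc in (0.80, 0.81, 0.72):
    best = max(best, val_acc)
    lam = update_lambda(lam, val_acc, best)
    print(f"val_acc={val_acc:.2f} -> lambda={lam:.3f}")
```

A linear schedule would keep raising λ through the third epoch; the gated version reacts to the accuracy drop instead.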

Dependence on a strong teacher model · Medium

On-Policy Distillation requires an external teacher — a weak teacher may inject incorrect corrective signals.

Fix: Use a teacher substantially stronger than the student, or filter its corrections by a confidence threshold.
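Confidence filtering could be sketched as below. The Correction record, the confidence measure (geometric-mean token probability under the teacher), and the 0.7 threshold are hypothetical choices made for this example; the source does not define how confidence is computed.

```python
import math
from dataclasses import dataclass


@dataclass
class Correction:
    """One teacher correction for a step of a failed rollout (hypothetical schema)."""
    step_index: int
    text: str
    token_logprobs: list[float]  # teacher log-probabilities of its own tokens


def confidence(correction: Correction) -> float:
    """Geometric-mean token probability, i.e. exp(mean log-prob)."""
    logprobs = correction.token_logprobs
    return math.exp(sum(logprobs) / len(logprobs))


def filter_corrections(corrections: list[Correction],
                       threshold: float = 0.7) -> list[Correction]:
    """Keep only corrections the teacher itself is confident about."""
    return [
        c for c in corrections
        if c.token_logprobs and confidence(c) >= threshold
    ]


# Toy usage: the second correction is low-confidence and is dropped.
candidates = [
    Correction(0, "search the image region first", [-0.05, -0.10, -0.02]),
    Correction(1, "call the OCR tool twice", [-1.20, -0.90, -2.00]),
]
kept = filter_corrections(candidates)
print([c.step_index for c in kept])
```

Only the corrections that pass the filter would then contribute to the distillation loss, which limits the damage a weak teacher can do.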

Evolution

Original paper · 2026 · arXiv 2026 · Guankai Li

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Guankai Li, Jiabin Chen, Yi Xu, Xichen Zhang, Yuan Lu

2026

Dual-Grained EA-RL introduced in HyperEyes system

Inflection point

The HyperEyes paper (arXiv:2605.07177) presents the framework as its central contribution, reporting +9.9% accuracy and a 5.3× reduction in tool-call rounds over the strongest open-source agent.