TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning
이 뉴스, 어떠셨어요?
한 번의 탭으로 반응을 남겨요 · 로그인 불필요
Abstract
Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions.
Standard GRPO uses the final verifier outcome as a uniform advantage over all action tokens.
This outcome signal is useful but structurally incomplete: it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts.
We propose TRIAGE, a role-typed credit assignment framework that adds a semantic role axis to outcome credit.
A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards.
This keeps verifier outcomes as the source of optimization direction while correcting the two main blind spots of outcome-only credit.
We further show that role-conditioned credit is the optimal segment-level correction expressible from role labels alone -- a projection of the per-segment advantage residual onto the role variable -- so that the fixed role constants reduce advantage estimation error whenever the judge is reliable, and we connect this to lower-variance policy gradients.
Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline.
Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor, while exploration credit provides a consistent secondary gain; on completed ALFWorld and WebShop rollouts, TRIAGE also reduces environment-facing turns by an additional $10.4\%$ and $14.8\%$ relative to GRPO.