Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
이 뉴스, 어떠셨어요?
한 번의 탭으로 반응을 남겨요 · 로그인 불필요
Abstract
Reinforcement learning from human feedback (RLHF) systems face a compounding alignment challenge: not only are learned reward models uncertain about unseen state-action pairs, but the human preference annotations they are trained on are themselves inconsistent, context-dependent, and noisy.
Existing approaches address these uncertainty sources in isolation - epistemic uncertainty is used to guide exploration, while preference uncertainty is absorbed during reward model training but discarded during policy optimization.
We introduce Uncertainty-Aware Reward Discounting (UARD), a principled framework that jointly models epistemic uncertainty in value estimation via ensemble disagreement and aleatoric uncertainty in human preference annotations via annotator variability, combining these signals through a confidence-adjusted Reliability Filter that adaptively modulates reward weighting during policy optimization.
We prove that this dynamic discounting preserves the contraction property of the Bellman operator, guaranteeing convergence to a unique fixed point, and provide an information-theoretic justification grounded in the Information Bottleneck principle.
Empirically, UARD reduces reward hacking incidents by up to 93.6% across discrete decision-making and continuous control benchmarks (MuJoCo) compared to nine baselines including DQN, Ensemble-DQN, CQL, CPO, TRPO, SAC, EDAC, SUNRISE, and PPO, while maintaining competitive task performance on well-specified rewards.
Under annotation noise ranging from 10% to 30% Gaussian perturbation, UARD retains near-zero safety violations compared to baselines' near-linear degradation.
These results demonstrate that treating uncertainty as an active component of the optimization objective - rather than a passive diagnostic signal - provides a principled pathway toward more reliable and aligned RL systems.