Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

arXiv CS.AI

CC BY

Abstract

Reinforcement learning from human feedback (RLHF) systems face a compounding alignment challenge: not only are learned reward models uncertain about unseen state-action pairs, but the human preference annotations they are trained on are themselves inconsistent, context-dependent, and noisy.

Existing approaches address these sources of uncertainty in isolation: epistemic uncertainty is used to guide exploration, while preference uncertainty is absorbed during reward model training but discarded during policy optimization.

We introduce Uncertainty-Aware Reward Discounting (UARD), a principled framework that jointly models epistemic uncertainty in value estimation via ensemble disagreement and aleatoric uncertainty in human preference annotations via annotator variability, combining these signals through a confidence-adjusted Reliability Filter that adaptively modulates reward weighting during policy optimization.
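The abstract does not give the functional form of the Reliability Filter, so the following is only a minimal sketch of the general idea under stated assumptions: epistemic uncertainty is taken as the standard deviation of an ensemble's value estimates, aleatoric uncertainty as the standard deviation of annotator labels, and the two are combined through an exponential confidence weight. The function names, the exponential form, and the `alpha` and `beta` coefficients are hypothetical and are not taken from the paper.

```python
import math
from statistics import pstdev


def reliability_weight(q_estimates, annotator_labels, alpha=1.0, beta=1.0):
    """Hypothetical confidence weight in (0, 1].

    q_estimates: value estimates for one state-action pair, one per ensemble member.
    annotator_labels: preference scores for the same pair, one per annotator.
    alpha, beta: assumed sensitivity coefficients (not from the paper).
    """
    epistemic = pstdev(q_estimates)       # ensemble disagreement
    aleatoric = pstdev(annotator_labels)  # annotator variability
    # Assumed form: the weight decays as either uncertainty grows.
    return math.exp(-(alpha * epistemic + beta * aleatoric))


def discounted_reward(reward, q_estimates, annotator_labels, alpha=1.0, beta=1.0):
    """Scale the learned reward by the confidence weight before the policy update."""
    return reward * reliability_weight(q_estimates, annotator_labels, alpha, beta)


# Ensemble and annotators agree: the reward passes through unchanged.
confident = discounted_reward(1.0, [2.0, 2.0, 2.0], [1.0, 1.0, 1.0])
# Ensemble and annotators disagree: the reward is scaled down.
uncertain = discounted_reward(1.0, [0.0, 2.0, 4.0], [0.0, 1.0, 1.0])
```

In this sketch a state-action pair that the ensemble and the annotators agree on keeps its full reward, while a disputed pair contributes less to the policy update. That is one way a policy could be discouraged from exploiting regions where the reward model is unreliable.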

We prove that this dynamic discounting preserves the contraction property of the Bellman operator, guaranteeing convergence to a unique fixed point, and provide an information-theoretic justification grounded in the Information Bottleneck principle.
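The proof itself is not reproduced here. As a rough illustration of why a contraction result of this kind is plausible, consider a simplified setting that is an assumption of this sketch rather than the paper's construction: a bounded reward $r$, a discount factor $\gamma \in [0, 1)$, and a confidence weight $w(s,a) \in [0, 1]$ that does not depend on the current value estimate $Q$.

```latex
% Assumed weighted Bellman optimality operator (illustrative form only):
(\mathcal{T}_w Q)(s,a) = w(s,a)\, r(s,a)
  + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ \max_{a'} Q(s',a') \right]

% The weighted reward term cancels in the difference, so for any Q_1, Q_2:
\left| (\mathcal{T}_w Q_1)(s,a) - (\mathcal{T}_w Q_2)(s,a) \right|
  = \gamma \left| \mathbb{E}_{s'}\!\left[ \max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a') \right] \right|
  \le \gamma\, \lVert Q_1 - Q_2 \rVert_\infty

% Taking the supremum over (s,a):
\lVert \mathcal{T}_w Q_1 - \mathcal{T}_w Q_2 \rVert_\infty \le \gamma\, \lVert Q_1 - Q_2 \rVert_\infty
```

Under these assumptions $\mathcal{T}_w$ is a $\gamma$-contraction in the sup norm, and the Banach fixed-point theorem gives a unique fixed point. If the weight depends on $Q$ itself, for example through the disagreement of a $Q$-ensemble, the reward term no longer cancels and further conditions are needed. The paper's proof presumably handles that case, but the abstract does not say how.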

Empirically, UARD reduces reward hacking incidents by up to 93.6% across discrete decision-making benchmarks and continuous-control (MuJoCo) benchmarks relative to nine baselines (DQN, Ensemble-DQN, CQL, CPO, TRPO, SAC, EDAC, SUNRISE, and PPO), while maintaining competitive task performance on well-specified rewards.

Under annotation noise ranging from 10% to 30% Gaussian perturbation, UARD retains near-zero safety violations, whereas the baselines degrade almost linearly.

These results demonstrate that treating uncertainty as an active component of the optimization objective, rather than as a passive diagnostic signal, provides a principled pathway toward more reliable and aligned RL systems.
