Semiparametric Off-Policy Inference for Optimal Policy Values under Possible Non-Uniqueness

arXiv Math


CC BY


Abstract

Off-policy evaluation (OPE) constructs confidence intervals for the value of a target policy using data generated under a different behavior policy. Most existing inference methods focus on fixed target policies and may fail when the target policy is estimated as optimal, particularly when the optimal policy is non-unique or nearly deterministic.
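As a point of reference for the fixed-target-policy case, the following is a minimal sketch of importance-sampling OPE with a normal-approximation confidence interval. It assumes a stateless, two-action setting with known behavior probabilities; the function names `is_ope_interval` and `simulate`, the reward means, and the target policy are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random
import statistics


def is_ope_interval(rewards, actions, behavior_probs, target_probs, z=1.96):
    """Importance-sampling value estimate for a FIXED target policy.

    behavior_probs[a] and target_probs[a] are action probabilities
    (no state, to keep the sketch short). Returns (estimate, (lower, upper)).
    """
    # Reweight each observed reward by the target/behavior probability ratio.
    weighted = [
        target_probs[a] / behavior_probs[a] * r
        for a, r in zip(actions, rewards)
    ]
    n = len(weighted)
    est = statistics.fmean(weighted)
    se = statistics.stdev(weighted) / math.sqrt(n)
    return est, (est - z * se, est + z * se)


def simulate(n, seed=0):
    """Draw n (action, reward) pairs under a uniform behavior policy.

    Assumed toy model: action 0 has mean reward 0, action 1 has mean reward 1,
    both with unit-variance Gaussian noise.
    """
    rng = random.Random(seed)
    behavior = [0.5, 0.5]
    means = [0.0, 1.0]
    actions = [0 if rng.random() < behavior[0] else 1 for _ in range(n)]
    rewards = [rng.gauss(means[a], 1.0) for a in actions]
    return rewards, actions, behavior


rewards, actions, behavior = simulate(20000)
target = [0.2, 0.8]  # fixed target policy; its true value is 0.2*0 + 0.8*1 = 0.8
estimate, (lower, upper) = is_ope_interval(rewards, actions, behavior, target)
print(estimate, lower, upper)
```

The interval here relies on the target policy being fixed in advance. Once the target is chosen from the same data as the apparent optimum, this simple construction is no longer guaranteed to be valid.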
We study inference for the value of optimal policies in Markov decision processes. In an auxiliary augmented transition-sampling experiment, we characterize the existence of the efficient influence function and show that non-regularity arises when competing optimal policies have distinct first-order gradients. For the actual i.i.d.-trajectory experiment, we derive the semiparametric efficiency bound and a uniformly weighted estimator that attains it under a unique optimum, while the sequential NSAVE procedure trades efficiency for stability and validity under non-uniqueness.
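The non-regularity under non-uniqueness can be seen in a much simpler setting than the paper's. The sketch below assumes a stateless problem with two actions, where the optimal value is max(mu0, mu1) and the plug-in estimator is the maximum of the two sample means. When the actions tie, the two optimal policies ("always action 0" and "always action 1") have different gradients with respect to (mu0, mu1), and the scaled plug-in error is no longer centered at zero. All names and parameter values are hypothetical, and the sample means are drawn directly from their Gaussian sampling distribution to keep the simulation short.

```python
import math
import random
import statistics


def scaled_plugin_errors(mu0, mu1, n, reps, seed=0):
    """Return sqrt(n) * (max of sample means - max of true means) over reps.

    Assumes unit-variance Gaussian rewards and n draws per action, so each
    sample mean is exactly N(mu, 1/n) and can be sampled directly.
    """
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)
    truth = max(mu0, mu1)
    errors = []
    for _ in range(reps):
        m0 = rng.gauss(mu0, sd)
        m1 = rng.gauss(mu1, sd)
        errors.append(math.sqrt(n) * (max(m0, m1) - truth))
    return errors


# Tie: both actions are optimal, so the optimal policy is non-unique.
bias_tie = statistics.fmean(scaled_plugin_errors(0.0, 0.0, n=400, reps=5000))
# Unique optimum: action 1 is better by a wide margin.
bias_unique = statistics.fmean(scaled_plugin_errors(0.0, 1.0, n=400, reps=5000))

print("mean scaled error, tie:", bias_tie)
print("mean scaled error, unique optimum:", bias_unique)
```

In the tie case the scaled error behaves like the maximum of two independent standard normals, whose mean is 1/sqrt(pi), roughly 0.56, so a symmetric normal interval around the plug-in estimate is miscentered. With a well-separated optimum the scaled error is approximately standard normal and the mean is close to zero.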
Motivated by this analysis, we propose a novel Nonparametric SequentiAl Value Evaluation (NSAVE) method, which yields martingale-based inference and retains a double-robustness property under policy-aligned nuisance estimation. We further develop a pointwise smoothing-based approximation under explicit first-stage rates, and a post-selection template with uniform coverage whenever its stated joint calibration condition is satisfied.
Simulation studies support the theoretical results. An application to the Drink Less micro-randomized trial provides confidence intervals for state-adaptive notification policies and their improvement over the randomized behavior policy.
