Mean-Field PhiBE: Continuous-Time Mean-Field Reinforcement Learning from Discrete-Time Data

arXiv Math

CC BY

Abstract

This paper addresses model-free continuous-time mean-field control in a setting where the population dynamics evolve continuously according to an unknown McKean-Vlasov stochastic differential equation, while only discrete-time transition data are available.
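For orientation, a controlled McKean-Vlasov SDE is often written in the generic form below. This is a sketch under assumed notation: the symbols $b$, $\sigma$, $\pi$, $a_t$, and $W_t$ are illustrative and not taken from the paper, whose precise setting may differ.

$$
dX_t = b(X_t, \mu_t, a_t)\,dt + \sigma(X_t, \mu_t, a_t)\,dW_t, \qquad \mu_t = \mathrm{Law}(X_t), \qquad a_t \sim \pi(\cdot \mid X_t, \mu_t).
$$

In this notation, the coefficients $b$ and $\sigma$ are unknown, and the learner only sees paired observations of the state and population at times $t$ and $t + \Delta t$.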

In the model-based formulation, policy evaluation is naturally described by a stationary Hamilton-Jacobi-Bellman equation on $\mathcal P_2(\mathbb R^d)$, but this equation involves the drift and diffusion coefficients of the controlled McKean-Vlasov dynamics, which are not identifiable when only discrete-time data are available.

On the other hand, a direct reduction to a discrete-time Bellman equation avoids the non-identifiability issue but loses the differential-equation structure.

To bridge these two viewpoints, we introduce Mean-Field PhiBE (MF-PhiBE), which incorporates discrete-time transition information into a continuous-time PDE on the Wasserstein space.

The MF-PhiBE replaces the unknown infinitesimal drift and covariance in the policy-evaluation equation by one-step estimators computed from data, while preserving the generator structure of the McKean-Vlasov HJB equation.
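To make the idea of a one-step estimator concrete, here is a minimal sketch in Python with NumPy. It assumes the simplest finite-difference form, where the drift is estimated by the mean increment divided by $\Delta t$ and the covariance by the mean outer product of increments divided by $\Delta t$. The function name `one_step_estimators` and the toy constant-coefficient dynamics are hypothetical and chosen for illustration; the paper's actual estimators, which also depend on the population distribution, may differ.

```python
import numpy as np


def one_step_estimators(x_now, x_next, dt):
    """Finite-difference estimates of drift and covariance from paired data.

    x_now, x_next: arrays of shape (n_samples, d) holding states observed at
    times t and t + dt. All samples are assumed to share the same starting
    state and population, so the averages approximate conditional moments.
    Returns (drift_hat, cov_hat) with shapes (d,) and (d, d).
    """
    dx = x_next - x_now
    # Mean increment per unit time approximates the drift.
    drift_hat = dx.mean(axis=0) / dt
    # Mean outer product of increments per unit time approximates sigma sigma^T.
    cov_hat = (dx[:, :, None] * dx[:, None, :]).mean(axis=0) / dt
    return drift_hat, cov_hat


# Toy data (assumption): constant drift b and diffusion matrix sigma, so the
# one-step transition is exactly Gaussian and can be sampled directly.
rng = np.random.default_rng(0)
dt = 0.1
b = np.array([1.0, -0.5])
sigma = np.array([[0.5, 0.0], [0.2, 0.3]])
n = 200_000

x_now = np.tile(np.array([0.3, -0.1]), (n, 1))
noise = rng.standard_normal((n, 2)) @ sigma.T
x_next = x_now + b * dt + np.sqrt(dt) * noise

drift_hat, cov_hat = one_step_estimators(x_now, x_next, dt)
print("drift estimate:", drift_hat)
print("covariance estimate:\n", cov_hat)
```

In this toy setting the covariance estimate converges to $\sigma\sigma^\top + \Delta t\, b b^\top$ rather than $\sigma\sigma^\top$, so it carries a bias of order $\Delta t$. That is in line with the first-order accuracy stated in the abstract, although the paper's error analysis concerns the induced value function rather than this particular estimator.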

We also derive a policy-gradient theorem for entropy-regularized randomized feedback policies, expressing the actor direction through an action-wise infinitesimal advantage and the score of the policy.

Combining these two ingredients yields a model-free actor-critic method.

We prove a first-order consistency estimate showing that the value induced by an optimal MF-PhiBE policy approximates the optimal continuous-time value with an error of order $\Delta t$.

In the linear-quadratic case, we show that our approximation achieves second-order accuracy using only one-step data.

Numerical experiments on an LQR benchmark and a crowd-aversion problem illustrate the proposed framework.
