Reinforcement learning for policymaking in epidemic control: A scoping review
by Oleksandr Bolshov, Dmytro Chumachenko

Background
Managing an epidemic demands policies that respond at the pace of the outbreak. Conventional rule-based interventions struggle to keep up, prompting interest in reinforcement learning (RL) for designing non-pharmaceutical interventions (NPIs). However, current evidence is fragmented across diverse models and reporting styles.

Objectives
To systematically map how RL is applied to epidemic NPI design; describe modeling choices, algorithm architectures, and evaluation practices; and identify trends and research gaps.

Methods
Peer-reviewed studies (2014–2025, English) that applied deep RL to select NPIs were retrieved from IEEE Xplore, ACM Digital Library, ScienceDirect, and Scopus, which were searched on December 23, 2025. Reference-list scanning supplemented the database results. Predefined data items (bibliographic details, epidemic and RL model characteristics, experiments, validation methods, outcomes) were charted and summarized descriptively.

Results
Of 512 retrieved records, 10 met the inclusion criteria, and three additional studies were identified via reference-list scanning, yielding 13 included studies. Five employed value-based methods, four policy-gradient methods, and four hybrid methods; one study additionally incorporated model-based planning. Six simulations relied on compartmental models, six on agent-based models, and one on a hybrid model. Action spaces were predominantly discrete restriction levels. Five studies used sequence-modeling techniques to incorporate temporal context into the state space. Eleven studies designed reward functions as a trade-off between pandemic severity and socio-economic cost. According to the reviewed studies, RL policies across various settings outperformed heuristic, rule-based, and historical baselines in reducing infections, deaths, or lockdown duration while limiting economic loss.

Conclusions
RL shows promise for adaptive epidemic control. However, comparison across studies is hampered by simplified economic costs, inconsistent calibration rigor, varied evaluation metrics, and limited analysis of uncertainty and policy robustness. Future work should establish common benchmark environments and reporting standards, incorporate empirically grounded economic and behavioral models, adopt uncertainty-aware and probabilistic RL, develop more sophisticated control spaces, investigate more advanced algorithms, and validate learned policies prospectively to enable real-world deployment.
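
To make the most commonly reported setup concrete (a compartmental simulator, discrete restriction levels, and a reward that trades epidemic severity against socio-economic cost), the following is a minimal illustrative sketch. It is not taken from any of the reviewed studies: the SIR dynamics, all parameter values, the cost table, and the names `SirEnv`, `run_episode`, `threshold_policy`, and `no_intervention` are assumptions chosen for illustration only.

```python
from dataclasses import dataclass

# Hypothetical discrete restriction levels (none, moderate, strict).
# Values are illustrative assumptions, not calibrated estimates.
CONTACT_REDUCTION = (0.0, 0.3, 0.6)  # fraction by which transmission is reduced
ECONOMIC_COST = (0.0, 0.4, 1.0)      # relative daily socio-economic cost


@dataclass
class SirEnv:
    """Toy SIR environment with population fractions s, i, r."""
    beta: float = 0.3          # baseline transmission rate per day (assumed)
    gamma: float = 0.1         # recovery rate per day (assumed)
    cost_weight: float = 0.02  # weight of economic cost in the reward (assumed)
    s: float = 0.99
    i: float = 0.01
    r: float = 0.0

    def step(self, action: int):
        """Advance one day under restriction level `action`.

        Returns the new state (s, i, r) and a scalar reward.
        """
        beta_eff = self.beta * (1.0 - CONTACT_REDUCTION[action])
        new_infections = beta_eff * self.s * self.i
        recoveries = self.gamma * self.i
        self.s -= new_infections
        self.i += new_infections - recoveries
        self.r += recoveries
        # Reward: negative weighted sum of severity and socio-economic cost.
        reward = -(new_infections + self.cost_weight * ECONOMIC_COST[action])
        return (self.s, self.i, self.r), reward


def run_episode(policy, days: int = 120):
    """Roll out `policy` for `days` steps; return (total reward, peak infected)."""
    env = SirEnv()
    total_reward = 0.0
    peak_infected = env.i
    for _ in range(days):
        action = policy(env.s, env.i, env.r)
        _, reward = env.step(action)
        total_reward += reward
        peak_infected = max(peak_infected, env.i)
    return total_reward, peak_infected


def threshold_policy(s: float, i: float, r: float) -> int:
    """Rule-based baseline: tighten restrictions as prevalence rises."""
    if i > 0.05:
        return 2
    if i > 0.01:
        return 1
    return 0


def no_intervention(s: float, i: float, r: float) -> int:
    """Baseline that never restricts."""
    return 0


if __name__ == "__main__":
    for name, policy in [("none", no_intervention), ("threshold", threshold_policy)]:
        total, peak = run_episode(policy)
        print(f"{name}: total reward = {total:.3f}, peak infected = {peak:.3f}")
```

In an RL study, a learned agent would take the place of `threshold_policy`, and the rule-based and no-intervention policies would serve as the kind of baselines described above. How the two baselines rank on total reward depends entirely on the assumed `cost_weight`, which illustrates why simplified economic costs make results hard to compare across studies.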