Paper Title
Semi-Markov Offline Reinforcement Learning for Healthcare
Paper Authors
Paper Abstract
Reinforcement learning (RL) tasks are typically framed as Markov Decision Processes (MDPs), assuming that decisions are made at fixed time intervals. However, many applications of great importance, including healthcare, do not satisfy this assumption, yet they are commonly modelled as MDPs after an artificial reshaping of the data. In addition, most healthcare (and similar) problems are offline by nature, allowing for only retrospective studies. To address both challenges, we begin by discussing the Semi-MDP (SMDP) framework, which formally handles actions of variable timings. We next present a formal way to apply SMDP modifications to nearly any given value-based offline RL method. We use this theory to introduce three SMDP-based offline RL algorithms, namely, SDQN, SDDQN, and SBCQ. We then experimentally demonstrate that only these SMDP-based algorithms learn the optimal policy in variable-time environments, whereas their MDP counterparts do not. Finally, we apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention and demonstrate similar results.
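To make the SMDP modification mentioned in the abstract concrete, the sketch below shows how a DQN-style bootstrap target might discount by gamma raised to the transition's elapsed time, rather than by a fixed per-step gamma. This is a minimal sketch based on the standard SMDP Bellman backup, not the paper's actual implementation; the function name, the use of PyTorch, and the treatment of the accumulated reward are assumptions.

```python
import torch

def smdp_dqn_target(reward, tau, next_obs, done, q_target_net, gamma=0.99):
    """Hypothetical SMDP-style bootstrap target for a DQN-like (SDQN) update.

    reward: reward accumulated over the variable-length transition
    tau:    elapsed (sojourn) time of the transition
    gamma:  per-unit-time discount factor
    """
    with torch.no_grad():
        # A fixed-interval MDP target discounts the next value by gamma ** 1
        # every step; the SMDP correction discounts by gamma ** tau, so
        # long-duration transitions are discounted more heavily.
        next_q = q_target_net(next_obs).max(dim=1).values
        return reward + (gamma ** tau) * (1.0 - done) * next_q
```

In principle, the same substitution of gamma ** tau for the fixed discount would carry over to the double-DQN and BCQ targets underlying SDDQN and SBCQ.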