Paper Title
Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning
Paper Authors
Paper Abstract
Off-policy evaluation of sequential decision policies from observational data is necessary in applications of batch reinforcement learning such as education and healthcare. In such settings, however, unobserved variables confound observed actions, rendering exact evaluation of new policies impossible, i.e., unidentifiable. We develop a robust approach that estimates sharp bounds on the (unidentifiable) value of a given policy in an infinite-horizon problem given data from another policy with unobserved confounding, subject to a sensitivity model. We consider stationary or baseline unobserved confounding and compute bounds by optimizing over the set of all stationary state-occupancy ratios that agree with a new partially identified estimating equation and the sensitivity model. We prove convergence to the sharp bounds as we collect more confounded data. Although checking set membership is a linear program, the support function is given by a difficult nonconvex optimization problem. We develop approximations based on nonconvex projected gradient descent and demonstrate the resulting bounds empirically.
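The abstract describes computing value bounds by projected gradient descent over stationary state-occupancy ratios constrained by a sensitivity model. Below is a minimal illustrative sketch, not the paper's implementation: it assumes the sensitivity model reduces to elementwise box constraints on the occupancy-ratio vector together with a normalization constraint under the observed state distribution, and the names `pgd_bound`, `value_fn`, and `grad_fn` are hypothetical placeholders for the user-supplied value estimate and its gradient.

```python
import numpy as np

# Illustrative sketch only: projected gradient ascent/descent over candidate
# state-occupancy ratios w. The sensitivity model is approximated here by box
# constraints [w_lo, w_hi], plus the normalization constraint d_obs @ w = 1,
# where d_obs is the (estimated) observed stationary state distribution.

def project(w, w_lo, w_hi, d_obs, n_iters=50):
    """Approximate projection onto box ∩ {w : d_obs @ w = 1} via alternating projections."""
    for _ in range(n_iters):
        w = np.clip(w, w_lo, w_hi)                        # project onto the box
        w = w + (1.0 - d_obs @ w) * d_obs / (d_obs @ d_obs)  # project onto the hyperplane
    return np.clip(w, w_lo, w_hi)

def pgd_bound(value_fn, grad_fn, w_init, w_lo, w_hi, d_obs,
              step=0.1, n_steps=500, maximize=True):
    """Nonconvex projected gradient ascent (upper bound) or descent (lower bound)
    on a policy-value estimate parameterized by the occupancy ratios w."""
    sign = 1.0 if maximize else -1.0
    w = project(np.asarray(w_init, dtype=float), w_lo, w_hi, d_obs)
    for _ in range(n_steps):
        w = project(w + sign * step * grad_fn(w), w_lo, w_hi, d_obs)
    return value_fn(w), w
```

Running the routine once with `maximize=True` and once with `maximize=False` yields an interval estimate; because the objective is nonconvex, restarting from several initial points and taking the widest resulting interval is the conservative choice.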