Paper Title
Off-Policy Exploitability-Evaluation in Two-Player Zero-Sum Markov Games
Paper Authors
Paper Abstract
Off-policy evaluation (OPE) is the problem of evaluating new policies using historical data obtained from a different policy. In the recent OPE context, most studies have focused on single-player cases rather than multi-player cases. In this study, we propose OPE estimators, constructed from the doubly robust and double reinforcement learning estimators, for two-player zero-sum Markov games. The proposed estimators estimate exploitability, which is often used as a metric for determining how close a policy profile (i.e., a tuple of policies) is to a Nash equilibrium in two-player zero-sum games. We prove exploitability estimation error bounds for the proposed estimators. We then propose methods to find the best candidate policy profile by selecting, from a given policy profile class, the policy profile that minimizes the estimated exploitability. We prove regret bounds for the policy profiles selected by our methods. Finally, we demonstrate the effectiveness and performance of the proposed estimators through experiments.
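For context, a minimal statement of the metric the abstract refers to, using standard notation rather than the paper's own: for a policy profile $\pi = (\pi^1, \pi^2)$, where $v^i(\pi)$ denotes player $i$'s expected return, exploitability is commonly defined as
\[
\mathrm{expl}(\pi) = \max_{\tilde{\pi}^1} v^1(\tilde{\pi}^1, \pi^2) + \max_{\tilde{\pi}^2} v^2(\pi^1, \tilde{\pi}^2).
\]
Because the game is zero-sum ($v^1(\pi) + v^2(\pi) = 0$), we have $\mathrm{expl}(\pi) \ge 0$, with equality if and only if $\pi$ is a Nash equilibrium; this is why selecting the policy profile that minimizes estimated exploitability yields the candidate closest to equilibrium.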