Paper Title


Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

Authors

Qiaomin Xie, Yudong Chen, Zhaoran Wang, Zhuoran Yang

Abstract


We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problems are considered. In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an $\tilde O(\sqrt{d^3 H^3 T} )$ upper bound on the duality gap and regret, where $d$ is the linear dimension, $H$ the horizon and $T$ the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash Equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a Coarse Correlated Equilibrium (CCE), which can be obtained efficiently. To our best knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right.
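The abstract notes that the optimistic policy is obtained by solving for a Coarse Correlated Equilibrium (CCE) of a general-sum matrix game whose payoff matrices are the upper and lower confidence bounds. A CCE of a two-player matrix game can be found by a feasibility linear program: find a joint distribution over action pairs under which no player gains by unilaterally deviating to a fixed action. The sketch below is not from the paper; it is a minimal illustration of that standard LP formulation, assuming hypothetical payoff matrices A (row player) and B (column player) and a hypothetical helper name solve_cce.

```python
import numpy as np
from scipy.optimize import linprog


def solve_cce(A, B):
    """Return a joint distribution pi over the m x n action pairs that is a
    coarse correlated equilibrium of the general-sum game (A, B), computed
    as a feasibility linear program. (Illustrative sketch, not the paper's code.)"""
    m, n = A.shape
    num_vars = m * n  # flattened joint distribution pi[i, j]

    ineqs = []
    # Row player: deviating to any fixed row i_dev must not increase payoff:
    # sum_{i,j} pi[i,j] * (A[i_dev, j] - A[i, j]) <= 0.
    for i_dev in range(m):
        coef = np.zeros((m, n))
        for i in range(m):
            coef[i, :] = A[i_dev, :] - A[i, :]
        ineqs.append(coef.ravel())
    # Column player: deviating to any fixed column j_dev must not increase payoff.
    for j_dev in range(n):
        coef = np.zeros((m, n))
        for j in range(n):
            coef[:, j] = B[:, j_dev] - B[:, j]
        ineqs.append(coef.ravel())

    A_ub = np.array(ineqs)
    b_ub = np.zeros(len(ineqs))
    # Probabilities are nonnegative and sum to one.
    A_eq = np.ones((1, num_vars))
    b_eq = np.array([1.0])

    res = linprog(c=np.zeros(num_vars), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    return res.x.reshape(m, n)


# Example: matching pennies payoffs for the row player, the negation for the
# column player; the CCE here coincides with the uniform Nash equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(solve_cce(A, -A))
```

Because the constraints are linear in the joint distribution, this LP is solvable in polynomial time, which is the computational advantage of targeting a CCE rather than a Nash equilibrium of the general-sum game.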
