Paper Title

Decentralized Policy Gradient for Nash Equilibria Learning of General-sum Stochastic Games

Authors

Yan Chen, Tao Li

Abstract

We study Nash equilibria learning in general-sum stochastic games with an unknown transition probability density function. Agents take actions at the current environment state, and their joint action influences the transition of the environment state as well as their immediate rewards. Each agent observes only the environment state and its own immediate reward, and does not observe the actions or immediate rewards of the other agents. We introduce the concepts of weighted asymptotic Nash equilibrium with probability 1 and in probability. For the case with exact pseudo-gradients, we design a two-loop algorithm based on the equivalence between Nash equilibria and variational inequality problems. In the outer loop, we sequentially update a constructed strongly monotone variational inequality by updating a proximal parameter, while in the inner loop we solve the constructed variational inequality with a single-call extra-gradient algorithm. We show that if the associated Minty variational inequality has a solution, then the designed algorithm converges to the k^{1/2}-weighted asymptotic Nash equilibrium. Further, for the case with unknown pseudo-gradients, we propose a decentralized algorithm in which a G(PO)MDP gradient estimator of the pseudo-gradient is obtained from Monte-Carlo simulations. Convergence to the k^{1/4}-weighted asymptotic Nash equilibrium in probability is achieved.
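
The two-loop structure described in the abstract can be sketched in a few lines. The code below is a minimal illustration, not the paper's algorithm: a generic monotone operator F stands in for the exact pseudo-gradient, and the names two_loop_proximal_extragradient, prox_weight, step, and project are illustrative assumptions. The outer loop adds a proximal term centered at the current iterate so that the regularized variational inequality is strongly monotone, and the inner loop solves it with a single-call (past) extra-gradient method that reuses the previous operator evaluation instead of calling the operator twice per iteration.

```python
import numpy as np

def two_loop_proximal_extragradient(F, x0, outer_iters=50, inner_iters=100,
                                    prox_weight=1.0, step=0.1,
                                    project=lambda x: x):
    """Sketch of a two-loop scheme: proximal regularization in the outer loop,
    single-call extra-gradient in the inner loop, for a monotone operator F."""
    x_outer = np.asarray(x0, dtype=float)
    for _ in range(outer_iters):
        # Regularized operator F_k(x) = F(x) + prox_weight * (x - x_outer)
        # is strongly monotone whenever F is monotone.
        def Fk(x):
            return F(x) + prox_weight * (x - x_outer)

        x = x_outer.copy()
        g_prev = Fk(x)                            # value reused by the single-call scheme
        for _ in range(inner_iters):
            x_half = project(x - step * g_prev)   # extrapolate with the *past* evaluation
            g_prev = Fk(x_half)                   # the single new operator call this iteration
            x = project(x - step * g_prev)        # main update
        x_outer = x                               # move the proximal center to the inner solution
    return x_outer

if __name__ == "__main__":
    # Toy monotone (but not strongly monotone) operator: F(x) = A x with A skew-symmetric.
    A = np.array([[0.0, 1.0], [-1.0, 0.0]])
    x_star = two_loop_proximal_extragradient(lambda x: A @ x, x0=np.array([1.0, 1.0]))
    print(x_star)  # approaches the solution x* = 0
```

Shifting the proximal center to the inner-loop solution after each outer iteration mirrors the sequential update of the proximal parameter mentioned in the abstract.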

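For the unknown pseudo-gradient case, the abstract relies on a G(PO)MDP gradient estimator computed from Monte-Carlo rollouts. The sketch below shows only the generic G(PO)MDP (likelihood-ratio) form with a toy single-state example; the names gpomdp_gradient, rollout, and score are assumptions for illustration and do not reflect the paper's decentralized implementation.

```python
import numpy as np

def gpomdp_gradient(rollout, score, num_trajectories=200, discount=0.99):
    """Monte-Carlo G(PO)MDP estimate of a policy gradient.

    Each discounted reward r_t is weighted by the cumulative score
    sum_{k<=t} grad log pi(a_k | s_k).  `rollout` and `score` are
    user-supplied callables; their signatures here are assumptions."""
    estimates = []
    for _ in range(num_trajectories):
        states, actions, rewards = rollout()
        cum_score, grad = 0.0, 0.0
        for t, r in enumerate(rewards):
            cum_score += score(states[t], actions[t])   # running sum of score functions
            grad += (discount ** t) * r * cum_score     # G(PO)MDP weighting of rewards
        estimates.append(grad)
    return np.mean(estimates, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = 0.0  # mean of a Gaussian policy over a scalar action

    def rollout(horizon=5):
        # Trivial single-state environment: reward = -(a - 2)^2 at every step.
        actions = theta + rng.standard_normal(horizon)
        rewards = [-(a - 2.0) ** 2 for a in actions]
        return [0] * horizon, actions, rewards

    # Score function of N(theta, 1): d/dtheta log pi(a) = a - theta.
    print(gpomdp_gradient(rollout, lambda s, a: a - theta))
```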