Paper Title

Learning to Play No-Press Diplomacy with Best Response Policy Iteration

Paper Authors

Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas C. Hudson, Nicolas Porcel, Marc Lanctot, Julien Pérolat, Richard Everett, Roman Werpachowski, Satinder Singh, Thore Graepel, Yoram Bachrach

Paper Abstract

Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker and StarCraft. The purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However, real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.
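To make the iteration scheme concrete, below is a minimal, runnable sketch of classic fictitious play on a toy zero-sum game (rock-paper-scissors) using NumPy. It illustrates the best-respond-to-the-empirical-average loop that the paper's policy iteration approximates, where the exact best response here would be replaced by an RL-trained approximate best response over Diplomacy's combinatorial, simultaneous-move action space. All names in this sketch (`PAYOFF`, `best_response`, `fictitious_play`) are illustrative and not taken from the paper.

```python
# Toy illustration of fictitious play: each player repeatedly best-responds to
# the opponent's empirical average strategy. This is NOT the authors'
# implementation; it only shows the iteration their method approximates with RL.
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (zero-sum, antisymmetric).
PAYOFF = np.array([
    [ 0, -1,  1],   # rock    vs rock/paper/scissors
    [ 1,  0, -1],   # paper
    [-1,  1,  0],   # scissors
])

def best_response(opponent_avg):
    """Exact best response: the pure action maximizing expected payoff against
    the opponent's empirical average strategy. Because the RPS payoff matrix is
    antisymmetric, the same function serves both players."""
    return int(np.argmax(PAYOFF @ opponent_avg))

def fictitious_play(num_iterations=10000):
    # Empirical action counts per player, initialized to a uniform prior.
    counts = [np.ones(3), np.ones(3)]
    for _ in range(num_iterations):
        avgs = [c / c.sum() for c in counts]
        # Each player best-responds to the other's historical average strategy.
        counts[0][best_response(avgs[1])] += 1
        counts[1][best_response(avgs[0])] += 1
    return [c / c.sum() for c in counts]

if __name__ == "__main__":
    # The empirical averages approach the uniform Nash equilibrium (~1/3 each).
    print(fictitious_play())
```

In the paper's setting, the exact `best_response` above is intractable, so it is replaced by an approximate best response operator learned with RL against policies sampled from earlier iterations.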
