Paper Title

Learning Markov Games with Adversarial Opponents: Efficient Algorithms and Fundamental Limits

Authors

Qinghua Liu, Yuanhao Wang, Chi Jin

Abstract

An ideal strategy in zero-sum games should not only grant the player an average reward no less than the value of the Nash equilibrium, but also exploit (adaptive) opponents when they are suboptimal. While most existing works on Markov games focus exclusively on the former objective, it remains open whether we can achieve both objectives simultaneously. To address this problem, this work studies no-regret learning in Markov games with adversarial opponents when competing against the best fixed policy in hindsight. Along this direction, we present a new complete set of positive and negative results: When the policies of the opponents are revealed at the end of each episode, we propose new efficient algorithms achieving $\sqrt{K}$-regret bounds when either (1) the baseline policy class is small or (2) the opponent's policy class is small. This is complemented with an exponential lower bound when neither condition is true. When the policies of the opponents are not revealed, we prove a statistical hardness result even in the most favorable scenario where both of the above conditions hold. Our hardness result is much stronger than existing hardness results, which either only involve computational hardness or require further restrictions on the algorithms.
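To make the learning objective concrete, the regret notion described in the abstract can be written, in standard episodic Markov game notation (the symbols below are our own shorthand for illustration, not taken verbatim from the paper), roughly as

$$ \mathrm{Regret}(K) \;=\; \max_{\pi^{\dagger} \in \Pi}\; \sum_{k=1}^{K} V^{\pi^{\dagger},\,\nu^{k}}(s_1) \;-\; \sum_{k=1}^{K} V^{\pi^{k},\,\nu^{k}}(s_1), $$

where $\pi^{k}$ is the learner's policy and $\nu^{k}$ the (possibly adversarial and adaptive) opponent's policy in episode $k$, $V^{\pi,\nu}(s_1)$ is the learner's expected return from the initial state under the policy pair $(\pi,\nu)$, and $\Pi$ is the baseline policy class. A $\sqrt{K}$-regret bound means this quantity grows at most on the order of $\sqrt{K}$ in the number of episodes $K$, up to problem-dependent factors.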
