Paper Title

Online Learning in Unknown Markov Games

Paper Authors

Yi Tian, Yuanhao Wang, Tiancheng Yu, Suvrit Sra

Paper Abstract

We study online learning in unknown Markov games, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable. We show that in this challenging setting, achieving sublinear regret against the best response in hindsight is statistically hard. We then consider a weaker notion of regret by competing with the \emph{minimax value} of the game, and present an algorithm that achieves a sublinear $\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes. This is the first sublinear regret bound (to our knowledge) for online learning in unknown Markov games. Importantly, our regret bound is independent of the size of the opponents' action spaces. As a result, even when the opponents' actions are fully observable, our regret bound improves upon existing analysis (e.g., Xie et al., 2020) by an exponential factor in the number of opponents.
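
To make the weaker regret notion concrete, here is one plausible formalization; the symbols $V^{\pi,\nu}$ (the learner's expected return when policy $\pi$ plays against opponent policy $\nu$) and $\pi^{k}$, $\nu^{k}$ (the policies used in episode $k$) are illustrative notation assumed for this sketch, not taken verbatim from the paper:

\[
\mathrm{Regret}(K) \;=\; K \cdot V^{*} \;-\; \sum_{k=1}^{K} V^{\pi^{k},\,\nu^{k}},
\qquad
V^{*} \;=\; \max_{\pi}\,\min_{\nu}\, V^{\pi,\nu},
\]

where $V^{*}$ is the \emph{minimax value} of the game. Under this reading, a sublinear bound such as $\tilde{\mathcal{O}}(K^{2/3})$ means the learner's average per-episode return approaches the minimax value as $K \to \infty$. By contrast, regret against the best response in hindsight would replace $K \cdot V^{*}$ with $\max_{\pi} \sum_{k=1}^{K} V^{\pi,\nu^{k}}$, a stronger benchmark that the abstract states is statistically hard to compete with in this setting.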
