Paper Title

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Paper Authors

Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg

Paper Abstract

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. In practice, however, feedback is often observed with a delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d^k$, where the delay $d^k$ can change across episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
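
To make the delay model and the bound comparison concrete, below is a minimal, hypothetical sketch (not from the paper) of the feedback protocol: each episode's cost feedback arrives only $d^k$ episodes later, the total delay is $D = \sum_k d^k$, and the new $\sqrt{K+D}$ rate is contrasted with the previous $(K+D)^{2/3}$ rate. The delay values here are toy choices for illustration only.

```python
import math
import random

# Toy illustration of the delayed bandit feedback model from the abstract:
# the feedback of episode k is revealed only at the end of episode k + d_k,
# with each delay d_k chosen by an oblivious adversary (here: random, for illustration).

K = 10_000                                           # number of episodes
delays = [random.randint(0, 50) for _ in range(K)]   # adversary's delays d_1, ..., d_K (toy choice)
D = sum(delays)                                      # total delay D = sum_k d_k

# Episode at whose end the feedback of episode k becomes available.
reveal_time = [k + delays[k] for k in range(K)]

# Orders of growth compared in the abstract (constants and log factors omitted).
new_bound = math.sqrt(K + D)          # near-optimal sqrt(K + D) regret
old_bound = (K + D) ** (2 / 3)        # previously best known (K + D)^{2/3} regret

print(f"D = {D}, sqrt(K+D) ~ {new_bound:.0f}, (K+D)^(2/3) ~ {old_bound:.0f}")
```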
