Near-optimal Regret Bounds for Stochastic Shortest Path

Paper title

Near-optimal Regret Bounds for Stochastic Shortest Path

Authors

Alon Cohen, Haim Kaplan, Yishay Mansour, Aviv Rosenberg

Abstract

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state with minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i.e., the transition function) and has to repeatedly play for a given number of episodes while reasoning about the problem's optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent's actions. Recently, Tarbouriech et al. (2019) studied this problem in the context of regret minimization and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost: we give an algorithm that guarantees a regret bound of $\widetilde{O}(B_\star |S| \sqrt{|A| K})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the set of states, $A$ is the set of actions and $K$ is the number of episodes. We additionally show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
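
To make the planning side of the problem concrete, the following is a minimal sketch of value iteration on a toy SSP instance whose transition function is known. It is not the learning algorithm from the paper, which has to cope with unknown dynamics; the two-state instance, the function name `ssp_value_iteration`, and the dictionary layout are all assumptions made for illustration.

```python
# Toy SSP with KNOWN dynamics (hypothetical instance, for illustration only).
# costs[s][a] is the instantaneous cost of action a in state s;
# transitions[s][a] maps each next state to its probability.
GOAL = "g"

costs = {
    0: {"a": 1.0, "b": 3.0},
    1: {"a": 1.0, "b": 5.0},
}
transitions = {
    0: {"a": {GOAL: 0.5, 0: 0.5}, "b": {GOAL: 1.0}},
    1: {"a": {0: 1.0}, "b": {GOAL: 1.0}},
}


def ssp_value_iteration(costs, transitions, goal, tol=1e-10, max_iters=10_000):
    """Approximate the optimal expected cost-to-go of every state."""
    values = {s: 0.0 for s in costs}
    values[goal] = 0.0  # reaching the goal ends the episode at no further cost
    for _ in range(max_iters):
        delta = 0.0
        for s in costs:
            # Bellman update: cheapest action by cost plus expected future cost.
            best = min(
                costs[s][a]
                + sum(p * values[s2] for s2, p in transitions[s][a].items())
                for a in costs[s]
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place (Gauss-Seidel style) update
        if delta < tol:
            break
    return values


values = ssp_value_iteration(costs, transitions, GOAL)
# The largest optimal cost-to-go is one valid choice of B_star for this toy instance.
b_star = max(values[s] for s in costs)
print(values, b_star)
```

In this instance the cheap action in state 0 reaches the goal only half the time, so its expected cost is 2, which still beats the certain cost of 3; state 1 then does best by moving to state 0, for a total of 3. The episode length under the optimal policy is random, which is the feature that distinguishes SSP from fixed-horizon RL models.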
