Paper Title
Logarithmic regret bounds for continuous-time average-reward Markov decision processes
Paper Authors
Paper Abstract
We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
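For orientation, here is a minimal sketch of the regret quantity being bounded, written in standard notation that is assumed here rather than taken from the paper: let \rho^* denote the optimal long-run average reward of the continuous-time MDP, and let r(s_t, a_t) denote the reward accrued at time t under the learner's policy. The average-reward regret over a horizon T is then typically defined as

    \mathrm{Reg}(T) \;=\; T\,\rho^{*} \;-\; \mathbb{E}\!\left[\int_{0}^{T} r(s_t, a_t)\,\mathrm{d}t\right],

and the abstract's claims amount to an instance-dependent lower bound on \mathrm{Reg}(T) of order \log T, together with a learning algorithm whose finite-time regret matches this logarithmic growth rate.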