Paper Title
Logarithmic regret bounds for continuous-time average-reward Markov decision processes
Paper Authors
Paper Abstract
We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
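For orientation, here is a minimal sketch of the regret quantity being bounded, written in standard notation that is assumed here rather than taken from the paper: let \rho^* denote the optimal long-run average reward of the continuous-time MDP, and let r(s_t, a_t) denote the reward accrued at time t under the learner's policy. The average-reward regret over a horizon T is then typically defined as

    \mathrm{Reg}(T) \;=\; T\,\rho^{*} \;-\; \mathbb{E}\!\left[\int_{0}^{T} r(s_t, a_t)\,\mathrm{d}t\right],

and the abstract's claims amount to an instance-dependent lower bound on \mathrm{Reg}(T) of order \log T, together with a learning algorithm whose finite-time regret matches this logarithmic growth rate.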