Paper Title

Provably More Efficient Q-Learning in the One-Sided-Feedback/Full-Feedback Settings

Paper Authors

Xiao-Yue Gong, David Simchi-Levi

Paper Abstract

Motivated by the episodic version of the classical inventory control problem, we propose a new Q-learning-based algorithm, Elimination-Based Half-Q-Learning (HQL), that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. We also provide a simpler variant of the algorithm, Full-Q-Learning (FQL), for the full-feedback setting. We establish that HQL incurs $\tilde{\mathcal{O}}(H^3\sqrt{T})$ regret and FQL incurs $\tilde{\mathcal{O}}(H^2\sqrt{T})$ regret, where $H$ is the length of each episode and $T$ is the total length of the horizon. The regret bounds are not affected by the possibly huge state and action space. Our numerical experiments demonstrate the superior efficiency of HQL and FQL, and the potential to combine reinforcement learning with richer feedback models.
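
To make the "one-sided feedback" mentioned in the abstract concrete, below is a minimal illustrative Python sketch of a single censored-demand inventory (newsvendor) step. It is not the paper's HQL or FQL algorithm; the unit revenue `r`, holding cost `h`, reward definition, and demand distribution are hypothetical choices for illustration. The point it shows: after ordering up to a level, only the censored sales are observed, yet the reward of every lower order-up-to level can be reconstructed exactly, which is the one-sided structure the abstract refers to.

```python
import numpy as np

# Illustrative sketch only (assumed setup, not the paper's HQL/FQL pseudocode):
# a single inventory period with lost sales and censored demand observations.
rng = np.random.default_rng(0)
r, h = 5.0, 1.0                        # assumed unit revenue / unit holding cost
levels = np.arange(0, 11)              # candidate order-up-to levels 0..10

def reward(y: int, demand: int) -> float:
    """True one-period reward of order-up-to level y under realized demand."""
    sales = min(demand, y)
    return r * sales - h * (y - sales)

def one_sided_feedback(chosen: int, demand: int) -> dict:
    """Rewards observable after choosing `chosen`: only the censored sales
    s = min(demand, chosen) are seen, but the reward of every level
    y <= chosen can be reconstructed exactly from s."""
    s = min(demand, chosen)            # censored observation
    return {int(y): r * min(s, y) - h * max(y - s, 0)
            for y in levels if y <= chosen}

demand = int(rng.poisson(6))           # hypothetical demand draw
chosen = 8
fb = one_sided_feedback(chosen, demand)

# Sanity check: reconstructed rewards match the true rewards on the observed side.
assert all(abs(fb[y] - reward(y, demand)) < 1e-12 for y in fb)
print(f"demand={demand}, observed sales={min(demand, chosen)}, "
      f"rewards recovered for {len(fb)} of {len(levels)} levels")
```

Under this (assumed) reading, an elimination-style method such as HQL can evaluate many actions per step from a single observation, which is consistent with the abstract's claim that the regret bounds do not depend on the size of the state and action space.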
