Paper Title
The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches
Paper Authors
Paper Abstract
In this paper, we study the problem of regret minimization for episodic Reinforcement Learning (RL) both in the model-free and the model-based setting. We focus on learning with general function classes and general model classes, and we derive results that scale with the eluder dimension of these classes. In contrast to the existing body of work that mainly establishes instance-independent regret guarantees, we focus on the instance-dependent setting and show that the regret scales logarithmically with the horizon $T$, provided that there is a gap between the best and the second best action in every state. In addition, we show that such a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost (also known as adaptivity complexity). In other words, these algorithms rarely switch their policy during the course of their execution. Finally, we complement our results with lower bounds which show that even in the tabular setting, we cannot hope for regret guarantees lower than $o(\log T)$.
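To make the quantities in the abstract concrete, the following is a minimal sketch of the standard episodic-MDP definitions they refer to; the notation ($K$ episodes of length $H$ with $T = KH$, value functions $V$, $Q$, and gap $\Delta$) is our own assumption and is not spelled out in the abstract itself.

```latex
% Cumulative regret over K episodes of length H (so T = KH), measured against
% an optimal policy \pi^*; \pi_k is the policy executed in episode k:
\mathrm{Regret}(T) \;=\; \sum_{k=1}^{K} \Big( V_1^{*}(s_1^k) - V_1^{\pi_k}(s_1^k) \Big)

% Suboptimality gap of action a in state s at step h, and the minimum positive
% gap whose existence the instance-dependent (logarithmic) bounds rely on:
\Delta_h(s,a) \;=\; V_h^{*}(s) - Q_h^{*}(s,a), \qquad
\Delta_{\min} \;=\; \min_{(h,s,a)\,:\,\Delta_h(s,a) > 0} \Delta_h(s,a) \;>\; 0
```

Under these (assumed) definitions, the gap condition in the abstract corresponds to $\Delta_{\min} > 0$, and the claimed instance-dependent guarantee is a regret bound scaling as $O(\log T)$ rather than the worst-case $O(\sqrt{T})$.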