Paper Title
Towards Tractable Optimism in Model-Based Reinforcement Learning
Paper Authors
Paper Abstract
The principle of optimism in the face of uncertainty is prevalent throughout sequential decision making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error). In the tabular setting, many state-of-the-art methods produce the required optimism through approaches which are intractable when scaling to deep RL. We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP. This formulation achieves a competitive regret bound: $\tilde{\mathcal{O}}( |\mathcal{S}|H\sqrt{|\mathcal{A}| T } )$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. We also explore how this trade-off changes in the deep RL setting, where we show empirically that estimation error is significantly more troublesome. However, we also show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
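To make the noise-augmented MDP idea concrete, here is a minimal Python sketch (not the paper's reference implementation) of how optimism can be produced by planning in a model whose estimated rewards are perturbed with Gaussian noise, rather than by searching over an explicit confidence set. The function name `plan_optimistic` and the count-based scaling of the noise are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch: optimism via planning in a Gaussian noise-augmented MDP.
# Estimated rewards are perturbed with Gaussian noise whose scale shrinks with
# visit counts, then standard finite-horizon value iteration is run on the
# augmented model. All names and the noise schedule are illustrative assumptions.
import numpy as np

def plan_optimistic(P_hat, R_hat, counts, H, noise_scale=1.0, rng=None):
    """Finite-horizon value iteration on a Gaussian noise-augmented MDP.

    P_hat:  (S, A, S) estimated transition probabilities
    R_hat:  (S, A)    estimated mean rewards
    counts: (S, A)    visit counts used to scale the optimism noise
    H:      planning horizon
    """
    rng = np.random.default_rng() if rng is None else rng
    S, A = R_hat.shape
    # Gaussian perturbation in place of an explicit confidence-set search:
    # rarely visited state-action pairs receive larger (potentially optimistic) noise.
    R_aug = R_hat + noise_scale * rng.standard_normal((S, A)) / np.sqrt(np.maximum(counts, 1))
    V = np.zeros(S)
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = R_aug + P_hat @ V          # backup through the *augmented* rewards
        V = Q[h].max(axis=1)
    policy = Q.argmax(axis=2)             # greedy policy w.r.t. the augmented MDP
    return policy, Q

# Usage on a toy 3-state, 2-action model:
S, A, H = 3, 2, 5
rng = np.random.default_rng(0)
P_hat = rng.dirichlet(np.ones(S), size=(S, A))
R_hat = rng.uniform(size=(S, A))
counts = rng.integers(1, 10, size=(S, A))
policy, Q = plan_optimistic(P_hat, R_hat, counts, H, rng=rng)
print(policy)
```

The point of the sketch is that the only machinery added on top of standard certainty-equivalent planning is the noise term on `R_hat`; this is what makes the noise-augmented formulation tractable to carry over to deep RL, where explicit confidence-set optimization is not.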