Paper Title
Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration
Paper Authors
Paper Abstract
There has been growing progress on theoretical analyses for provably efficient learning in MDPs with linear function approximation, but much of the existing work has made strong assumptions to enable exploration with conventional exploration frameworks. Typically these assumptions are stronger than what is needed to find good solutions in the batch setting. In this work, we show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value iteration-style algorithms, we can provide strong PAC guarantees on learning a near-optimal value function, provided that the linear space is sufficiently "explorable". We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function, which is revealed only once learning has completed. If this reward function is also estimated from the samples gathered during pure exploration, our results also provide same-order PAC guarantees on the performance of the resulting policy for this setting.
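The abstract refers to least-squares value iteration (LSVI) with linear features and to the inherent Bellman error, which roughly measures how far the linear value class is from being closed under the Bellman backup. As a purely illustrative aid, not the paper's algorithm, the sketch below shows the basic batch LSVI update: at each step of the horizon, Bellman targets are regressed onto the features with ridge regularization. The data layout (`features`, `next_features`, `rewards`) and the regularizer `lam` are hypothetical choices made only for this sketch.

```python
import numpy as np

def lsvi_batch(features, next_features, rewards, horizon, lam=1.0):
    """Minimal least-squares value iteration (LSVI) sketch on a fixed batch.

    features      : (n, d) array of phi(s_i, a_i) for logged transitions
    next_features : list of n arrays, each of shape (num_actions, d),
                    holding phi(s'_i, a') for every action a' at next state s'_i
    rewards       : (n,) array of observed rewards r_i
    horizon       : episode length H
    lam           : ridge regularization strength (illustrative default)

    Returns weight vectors theta_0, ..., theta_H with theta_H = 0.
    """
    n, d = features.shape
    thetas = [np.zeros(d) for _ in range(horizon + 1)]  # terminal value is zero
    gram = features.T @ features + lam * np.eye(d)      # regularized Gram matrix
    for h in reversed(range(horizon)):
        # Bellman targets: r_i + max_{a'} phi(s'_i, a')^T theta_{h+1}
        targets = rewards + np.array(
            [np.max(phi_next @ thetas[h + 1]) for phi_next in next_features]
        )
        # Ridge regression of the targets onto the features gives theta_h
        thetas[h] = np.linalg.solve(gram, features.T @ targets)
    return thetas
```

In the reward-free setting the abstract describes, the `rewards` term would be supplied only after exploration has finished, so the same regression could be rerun on the gathered samples for whichever linear reward function is eventually revealed.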