Paper Title

Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation

Paper Authors

Ilya Kostrikov, Ofir Nachum

Paper Abstract

In reinforcement learning, it is typical to use the empirically observed transitions and rewards to estimate the value of a policy via either model-based or Q-fitting approaches. Although straightforward, these techniques in general yield biased estimates of the true value of the policy. In this work, we investigate the potential for statistical bootstrapping to be used as a way to take these biased estimates and produce calibrated confidence intervals for the true value of the policy. We identify conditions - specifically, sufficient data size and sufficient coverage - under which statistical bootstrapping in this setting is guaranteed to yield correct confidence intervals. In practical situations, these conditions often do not hold, and so we discuss and propose mechanisms that can be employed to mitigate their effects. We evaluate our proposed method and show that it can yield accurate confidence intervals in a variety of conditions, including challenging continuous control environments and small data regimes.
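
Below is a minimal sketch of the kind of percentile-bootstrap procedure the abstract describes: resample the logged transitions with replacement, refit a (possibly biased) off-policy value estimator on each resample, and take empirical percentiles of the resulting estimates as a confidence interval. This is an illustration only, not the paper's implementation; the `fit_value_estimate` callable and all parameter names are hypothetical placeholders.

```python
import numpy as np

def bootstrap_confidence_interval(transitions, fit_value_estimate,
                                  num_bootstrap=200, alpha=0.05, rng=None):
    """Percentile-bootstrap confidence interval for a policy's value.

    `transitions` is a list of observed (s, a, r, s') tuples and
    `fit_value_estimate` is any (possibly biased) off-policy estimator,
    e.g. a model-based or fitted-Q routine, that maps a dataset to a
    scalar value estimate. Both names are placeholders, not the paper's API.
    """
    rng = rng or np.random.default_rng(0)
    n = len(transitions)
    estimates = []
    for _ in range(num_bootstrap):
        # Resample the dataset with replacement to form a bootstrap replicate.
        idx = rng.integers(0, n, size=n)
        resampled = [transitions[i] for i in idx]
        # Refit the value estimator on the resampled data.
        estimates.append(fit_value_estimate(resampled))
    # Empirical percentiles give a (1 - alpha) confidence interval.
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```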
