Paper Title
Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
Paper Authors
Paper Abstract
This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history generated by unknown behavioral policies. We study a regression-based fitted Q-iteration method and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample, instance-dependent error upper bound and a nearly matching minimax lower bound. The policy evaluation error depends sharply on a restricted $\chi^2$-divergence over the function class between the long-term distribution of the target policy and the distribution of the past data. This restricted $\chi^2$-divergence is both instance-dependent and function-class-dependent. It characterizes the statistical limit of off-policy evaluation. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
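To make the abstract's description concrete, below is a minimal sketch of regression-based fitted Q-iteration with linear function approximation. It assumes logged transitions summarized by a feature matrix phi_sa for the observed state-action pairs, their rewards, and features phi_next_pi for the next states paired with the target policy's actions; the function name, the ridge regularizer, and the synthetic data are illustrative assumptions and do not reproduce the paper's exact estimator or its confidence bound.

import numpy as np

def fitted_q_iteration(phi_sa, rewards, phi_next_pi, gamma=0.99,
                       num_iters=100, reg=1e-6):
    """Regression-based fitted Q-iteration with linear function approximation.

    phi_sa:      (n, d) features of logged state-action pairs (s_i, a_i).
    rewards:     (n,)   observed rewards r_i.
    phi_next_pi: (n, d) features of (s_i', pi(s_i')) under the target policy pi.
    Returns a weight vector w so that Q(s, a) is approximated by phi(s, a) @ w.
    """
    n, d = phi_sa.shape
    # Gram matrix for ridge regression; each iteration regresses the
    # bootstrapped targets r + gamma * Q_k(s', pi(s')) on phi(s, a).
    gram = phi_sa.T @ phi_sa + reg * np.eye(d)
    w = np.zeros(d)
    for _ in range(num_iters):
        targets = rewards + gamma * (phi_next_pi @ w)
        w = np.linalg.solve(gram, phi_sa.T @ targets)
    return w

# Usage with synthetic data (purely illustrative).
rng = np.random.default_rng(0)
n, d = 1000, 8
phi_sa = rng.normal(size=(n, d))
phi_next_pi = rng.normal(size=(n, d))
rewards = rng.normal(size=n)
w = fitted_q_iteration(phi_sa, rewards, phi_next_pi, gamma=0.9)
# Off-policy value estimate at an initial state-action feature vector phi_0:
phi_0 = rng.normal(size=d)
print("estimated Q value:", phi_0 @ w)

The iterates stay within the linear function class by construction, which is what allows the equivalence with a model-based estimator of the conditional mean embedding of the transition operator noted in the abstract.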