Paper Title

A Non-asymptotic Analysis of Non-parametric Temporal-Difference Learning

Paper Authors

Eloïse Berthier, Ziad Kobeissi, Francis Bach

Paper Abstract

Temporal-difference learning is a popular algorithm for policy evaluation. In this paper, we study the convergence of the regularized non-parametric TD(0) algorithm, in both the independent and Markovian observation settings. In particular, when TD is performed in a universal reproducing kernel Hilbert space (RKHS), we prove convergence of the averaged iterates to the optimal value function, even when it does not belong to the RKHS. We provide explicit convergence rates that depend on a source condition relating the regularity of the optimal value function to the RKHS. We illustrate this convergence numerically on a simple continuous-state Markov reward process.
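The setting described in the abstract can be simulated with a short script. Below is a minimal sketch of regularized non-parametric TD(0) with iterate averaging on a simple continuous-state Markov reward process; the random-walk dynamics on [0, 1), the Gaussian kernel, and the step-size and regularization constants are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9        # discount factor
lam = 1e-3         # regularization strength (assumed constant)
eta = 0.5          # constant step size (assumed)
bandwidth = 0.2    # Gaussian kernel bandwidth (assumed)
n_steps = 2000     # number of observed transitions

def kernel(x, y):
    """Gaussian kernel k(x, y) on the one-dimensional state space [0, 1)."""
    return np.exp(-(x - y) ** 2 / (2.0 * bandwidth ** 2))

def reward(s):
    """Smooth reward on [0, 1); any bounded reward would do for the sketch."""
    return np.cos(2.0 * np.pi * s)

centers = np.zeros(n_steps)    # kernel expansion points (visited states)
alpha = np.zeros(n_steps)      # coefficients of the current iterate V_t
alpha_bar = np.zeros(n_steps)  # coefficients of the averaged iterate

def value(s, coeffs, t):
    """Evaluate the kernel expansion sum_{i < t} coeffs[i] * k(centers[i], s)."""
    return float(coeffs[:t] @ kernel(centers[:t], s)) if t > 0 else 0.0

s = rng.uniform()
for t in range(n_steps):
    r = reward(s)
    # Markovian observations: a small random step, wrapped back into [0, 1).
    s_next = (s + 0.1 * rng.standard_normal()) % 1.0
    # TD error of the current iterate at the observed transition.
    delta = r + gamma * value(s_next, alpha, t) - value(s, alpha, t)
    # Regularized TD(0) update in the RKHS:
    #   V_{t+1} = (1 - eta * lam) * V_t + eta * delta * k(s_t, .)
    alpha[:t] *= 1.0 - eta * lam
    alpha[t] = eta * delta
    centers[t] = s
    # Incremental (Polyak) average of the iterates.
    alpha_bar[:t + 1] += (alpha[:t + 1] - alpha_bar[:t + 1]) / (t + 1)
    s = s_next

# Evaluate the averaged value function at a few test states.
for x in (0.0, 0.25, 0.5, 0.75):
    print(f"V_bar({x:.2f}) = {value(x, alpha_bar, n_steps):.3f}")
```

The kernel expansion grows by one term per observed transition, so each iterate is stored as a coefficient vector over the visited states; the averaged function is obtained by averaging these coefficient vectors, which corresponds to averaging the iterates in the RKHS.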
