Paper Title
A Nonparametric Off-Policy Policy Gradient
Paper Authors
Paper Abstract
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interaction with the environment is especially pronounced in many widely used policy gradient algorithms, which perform updates using on-policy samples. The price of such inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited. We address this issue by building on the general sample efficiency of off-policy algorithms. Using nonparametric regression and density estimation methods, we construct a nonparametric Bellman equation in a principled manner, which allows us to obtain a closed-form estimate of the value function and to analytically express the full policy gradient. We provide a theoretical analysis of our estimate, showing that it is consistent under mild smoothness assumptions, and empirically show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
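
To illustrate the kind of construction the abstract describes, the following is a minimal sketch of how a kernel-based Bellman equation can admit a closed-form value estimate; the kernel $\kappa$, the matrix $P_\pi$, and the notation below are illustrative assumptions and not necessarily the paper's exact estimator.

Given off-policy transitions $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{n}$ and a parametric policy $\pi_\theta$, define a normalized kernel transition matrix
\[
[P_\pi]_{ij} \;=\; \frac{\int \kappa\big((s'_i, a), (s_j, a_j)\big)\, \pi_\theta(a \mid s'_i)\, \mathrm{d}a}{\sum_{k=1}^{n} \int \kappa\big((s'_i, a), (s_k, a_k)\big)\, \pi_\theta(a \mid s'_i)\, \mathrm{d}a}.
\]
The sample-based Bellman equation then becomes finite-dimensional and solvable in closed form,
\[
\hat{v}_\pi \;=\; r + \gamma P_\pi \hat{v}_\pi
\quad\Longrightarrow\quad
\hat{v}_\pi \;=\; (I - \gamma P_\pi)^{-1} r,
\qquad r = (r_1, \dots, r_n)^\top .
\]
Because $P_\pi$ depends smoothly on $\theta$ through $\pi_\theta$, the gradient of the value estimate follows analytically, e.g.
\[
\nabla_\theta \hat{v}_\pi \;=\; \gamma\, (I - \gamma P_\pi)^{-1} \big(\nabla_\theta P_\pi\big)\, \hat{v}_\pi ,
\]
which is the sense in which the full policy gradient can be expressed without on-policy rollouts.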