Paper Title
Representations for Stable Off-Policy Reinforcement Learning
Paper Authors
Paper Abstract
Reinforcement learning with function approximation can be unstable and even divergent, especially when combined with off-policy learning and Bellman updates. In deep reinforcement learning, these issues have been dealt with empirically by adapting and regularizing the representation, in particular with auxiliary tasks. This suggests that representation learning may provide a means to guarantee stability. In this paper, we formally show that there are indeed nontrivial state representations under which the canonical TD algorithm is stable, even when learning off-policy. We analyze representation learning schemes that are based on the transition matrix of a policy, such as proto-value functions, along three axes: approximation error, stability, and ease of estimation. In the most general case, we show that a Schur basis provides convergence guarantees, but is difficult to estimate from samples. For a fixed reward function, we find that an orthogonal basis of the corresponding Krylov subspace is an even better choice. We conclude by empirically demonstrating that these stable representations can be learned using stochastic gradient descent, opening the door to improved techniques for representation learning with deep networks.
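The following is a minimal NumPy/SciPy sketch, not the authors' code, illustrating the objects named in the abstract: an orthonormal Krylov basis built from a policy's transition matrix and a fixed reward, a Schur basis of the same matrix, and a stability check for off-policy linear TD(0). The transition matrix P, reward r, off-policy weighting D, discount gamma, and the helper td_is_stable are placeholder assumptions chosen for illustration; the Hurwitz test on Phi^T D (gamma*P - I) Phi is one standard sufficient condition for the expected TD(0) update to converge, not necessarily the exact criterion used in the paper.

# Minimal sketch (assumed setup, not the authors' implementation).
import numpy as np
from scipy.linalg import schur, qr

rng = np.random.default_rng(0)
n, k, gamma = 20, 5, 0.9   # states, representation size, discount (placeholders)

# Random row-stochastic transition matrix P for the target policy and a reward vector r.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
r = rng.standard_normal(n)

# Diagonal weighting D from an (arbitrary) off-policy state distribution.
d = rng.random(n)
D = np.diag(d / d.sum())

def td_is_stable(Phi):
    # Expected TD(0) with features Phi follows theta <- theta + alpha * (b + A theta),
    # with A = Phi^T D (gamma*P - I) Phi; it converges for small step sizes when
    # every eigenvalue of A has negative real part (A is Hurwitz).
    A = Phi.T @ D @ (gamma * P - np.eye(n)) @ Phi
    return bool(np.all(np.linalg.eigvals(A).real < 0))

# Krylov representation: orthonormalize [r, Pr, P^2 r, ...] with a thin QR factorization.
K = np.column_stack([np.linalg.matrix_power(P, i) @ r for i in range(k)])
Phi_krylov, _ = qr(K, mode='economic')

# Schur representation: leading columns of the orthogonal factor of a real Schur decomposition of P.
_, Q = schur(P)
Phi_schur = Q[:, :k]

print("Krylov basis stable:", td_is_stable(Phi_krylov))
print("Schur basis stable:", td_is_stable(Phi_schur))

Both representations here are orthonormal by construction (QR and Schur factors), which matches the abstract's emphasis on orthogonal bases of the Krylov subspace and of Schur vectors; in practice such features would be learned from samples, for example with stochastic gradient descent, rather than computed from a known P.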