Paper Title
Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER
Paper Authors
Paper Abstract
We prove, under commonly used assumptions, the convergence of actor-critic reinforcement learning algorithms which simultaneously learn a policy function, the actor, and a value function, the critic. Both functions can be deep neural networks of arbitrary complexity. Our framework allows showing convergence of the well-known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER. For the convergence proof we employ recently introduced techniques from two time-scale stochastic approximation theory. Our results are valid for actor-critic methods that use episodic samples and that have a policy that becomes more greedy during learning. Previous convergence proofs assume linear function approximation, cannot treat episodic samples, or do not consider that policies become greedy. The latter is relevant since optimal policies are typically deterministic.
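To make the setting concrete, below is a minimal sketch of a two time-scale actor-critic update with a PPO-style clipped surrogate: the critic (value function) is trained on a fast time scale and the actor (policy) on a slow one, using episodic samples. This is an illustration under assumptions, not the paper's algorithm; the toy data, network sizes, constant learning rates, and placeholder returns are all invented for the example (two time-scale stochastic approximation theory additionally requires decreasing step sizes whose ratio tends to zero).

```python
# Illustrative two time-scale actor-critic sketch (assumed toy setup, not the paper's method).
import torch
import torch.nn as nn

n_states, n_actions = 4, 2
actor = nn.Sequential(nn.Linear(n_states, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(n_states, 32), nn.Tanh(), nn.Linear(32, 1))

# Two time scales: the critic uses a larger step size (fast component),
# the actor a smaller one (slow component).
actor_opt = torch.optim.SGD(actor.parameters(), lr=1e-4)    # slow
critic_opt = torch.optim.SGD(critic.parameters(), lr=1e-2)  # fast

def sample_episode(length=16):
    """Toy episodic data: random states, actions from the current policy, placeholder returns."""
    states = torch.randn(length, n_states)
    with torch.no_grad():
        dist = torch.distributions.Categorical(logits=actor(states))
        actions = dist.sample()
        old_log_probs = dist.log_prob(actions)
    returns = torch.randn(length)  # a real task would use discounted episodic returns
    return states, actions, old_log_probs, returns

clip_eps = 0.2
for step in range(1000):
    states, actions, old_log_probs, returns = sample_episode()

    # Critic update (fast time scale): regress values onto episodic returns.
    values = critic(states).squeeze(-1)
    critic_loss = ((values - returns) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update (slow time scale): PPO-style clipped surrogate with
    # advantages estimated from the critic.
    with torch.no_grad():
        advantages = returns - critic(states).squeeze(-1)
    dist = torch.distributions.Categorical(logits=actor(states))
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```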