Paper Title
DNA: Proximal Policy Optimization with a Dual Network Architecture
Paper Authors
Paper Abstract
This paper explores the problem of simultaneously learning a value function and policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is sub-optimal due to an order-of-magnitude difference in noise levels between the two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that policy-gradient noise levels can be decreased by using a lower-\textit{variance} return estimate, whereas the value-learning noise level decreases with a lower-\textit{bias} estimate. Together these insights inform an extension to Proximal Policy Optimization we call \textit{Dual Network Architecture} (DNA), which significantly outperforms its predecessor. DNA also exceeds the performance of the popular Rainbow DQN algorithm on four of the five environments tested, even under more difficult stochastic control settings.
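The abstract describes the core algorithmic idea only at a high level: separate policy and value networks, trained with differently tuned return estimates, tied together by a constrained distillation phase. A minimal PyTorch sketch of that idea is given below. The module names, hyperparameters, and the exact form of the losses are illustrative assumptions and are not taken from the paper's reference implementation.

```python
# Minimal sketch of a dual-network actor-critic setup with constrained
# distillation. All names and hyperparameters here are assumptions for
# illustration, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence


class PolicyNet(nn.Module):
    """Policy network with an auxiliary value head used only during distillation."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.pi = nn.Linear(hidden, n_actions)
        self.aux_value = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return Categorical(logits=self.pi(h)), self.aux_value(h).squeeze(-1)


class ValueNet(nn.Module):
    """Separate value network, trained independently of the policy network."""

    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def policy_loss(policy, obs, actions, old_log_probs, advantages, clip=0.2):
    # Standard PPO clipped objective. The advantages would come from a
    # lower-variance return estimate (e.g. GAE with a smaller lambda).
    dist, _ = policy(obs)
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    return -torch.min(ratio * advantages, clipped * advantages).mean()


def value_loss(value_net, obs, returns):
    # The value network regresses onto a lower-bias return estimate
    # (e.g. GAE with a larger lambda), learned separately from the policy.
    return F.mse_loss(value_net(obs), returns)


def distillation_loss(policy, obs, target_values, old_logits, beta=1.0):
    # Constrained distillation: pull the policy network's auxiliary value
    # head towards the value network's predictions, while a KL penalty to
    # the pre-distillation policy keeps the policy itself roughly unchanged.
    dist, aux_v = policy(obs)
    old_dist = Categorical(logits=old_logits)
    return F.mse_loss(aux_v, target_values) + beta * kl_divergence(old_dist, dist).mean()
```

In a training loop, each batch would then be used for three separate updates: the PPO step on the policy network, the regression step on the value network, and the constrained distillation step on the policy network's auxiliary head; the weighting between these phases is a design choice not fixed by the abstract.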