Paper Title

Action Noise in Off-Policy Deep Reinforcement Learning: Impact on Exploration and Performance

Authors

Jakob Hollenstein, Sayantan Auddy, Matteo Saveriano, Erwan Renaudo, Justus Piater

Abstract

Many Deep Reinforcement Learning (D-RL) algorithms rely on simple forms of exploration such as the additive action noise often used in continuous control domains. Typically, the scaling factor of this action noise is chosen as a hyper-parameter and is kept constant during training. In this paper, we focus on action noise in off-policy deep reinforcement learning for continuous control. We analyze how the learned policy is impacted by the noise type, the noise scale, and the scaling factor reduction schedule. We consider the two most prominent types of action noise, Gaussian and Ornstein-Uhlenbeck noise, and perform a vast experimental campaign by systematically varying the noise type and scale parameter, and by measuring variables of interest like the expected return of the policy and the state-space coverage during exploration. For the latter, we propose a novel state-space coverage measure $\operatorname{X}_{\mathcal{U}\text{rel}}$ that is more robust to estimation artifacts caused by points close to the state-space boundary than previously proposed measures. Larger noise scales generally increase state-space coverage. However, we found that increasing the space coverage using a larger noise scale is often not beneficial. On the contrary, reducing the noise scale over the training process reduces the variance and generally improves the learning performance. We conclude that the best noise type and scale are environment dependent, and based on our observations derive heuristic rules for guiding the choice of the action noise as a starting point for further optimization.
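To make the setup the abstract describes concrete, below is a minimal illustrative sketch (not code from the paper) of the two action-noise types it names, Gaussian and Ornstein-Uhlenbeck, together with one possible noise-scale reduction schedule (a linear decay); the class and function names, default parameters, and the action dimension are assumptions chosen only for illustration.

```python
import numpy as np


class GaussianActionNoise:
    """Uncorrelated zero-mean Gaussian noise added independently at every step."""

    def __init__(self, action_dim, sigma=0.1):
        self.action_dim = action_dim
        self.sigma = sigma

    def reset(self):
        pass  # stateless: nothing to reset between episodes

    def sample(self):
        return np.random.normal(0.0, self.sigma, size=self.action_dim)


class OrnsteinUhlenbeckActionNoise:
    """Temporally correlated noise: x <- x + theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, I)."""

    def __init__(self, action_dim, sigma=0.1, theta=0.15, dt=1e-2, mu=0.0):
        self.mu = np.full(action_dim, mu)
        self.sigma, self.theta, self.dt = sigma, theta, dt
        self.reset()

    def reset(self):
        self.x = np.copy(self.mu)  # restart the process at its mean each episode

    def sample(self):
        self.x = (self.x
                  + self.theta * (self.mu - self.x) * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape))
        return self.x


def linear_scale(step, total_steps, start_scale=0.5, end_scale=0.0):
    """One possible reduction schedule: linearly shrink the noise scale over training."""
    frac = min(step / total_steps, 1.0)
    return start_scale + frac * (end_scale - start_scale)


# Usage sketch: scale the sampled noise and add it to the deterministic policy action.
noise = OrnsteinUhlenbeckActionNoise(action_dim=4, sigma=1.0)
policy_action = np.zeros(4)  # placeholder for the policy's output
for step in range(10):
    scale = linear_scale(step, total_steps=10)
    action = np.clip(policy_action + scale * noise.sample(), -1.0, 1.0)
```

In this sketch, the constant-scale setting the abstract mentions corresponds to keeping `scale` fixed, while the reduction schedule corresponds to letting `linear_scale` decay it toward zero during training.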
