Paper Title
Long N-step Surrogate Stage Reward to Reduce Variances of Deep Reinforcement Learning in Complex Problems
Paper Authors
Paper Abstract
High variance in reinforcement learning has been shown to impede successful convergence and hurt task performance. As the reward signal plays an important role in learning behavior, multi-step methods have been considered to mitigate this problem and are believed to be more effective than single-step methods. However, there is a lack of comprehensive and systematic study of this important aspect to demonstrate the effectiveness of multi-step methods in solving highly complex continuous control problems. In this study, we introduce a new long $N$-step surrogate stage (LNSS) reward approach that effectively accounts for complex environment dynamics, whereas previous methods are usually feasible only for a limited number of steps. The LNSS method is simple, has low computational cost, and is applicable to value-based or policy-gradient reinforcement learning. We systematically evaluate LNSS in OpenAI Gym and the DeepMind Control Suite on several complex benchmark environments in which DRL methods have generally struggled to obtain good results. We demonstrate performance improvements by LNSS in terms of total reward, convergence speed, and coefficient of variation (CV). We also provide analytical insights into how LNSS exponentially reduces the upper bound on the variance of the Q value relative to the corresponding single-step method.
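To make the idea concrete, the sketch below illustrates one plausible form of an $N$-step surrogate stage reward: the discounted sum of the next $N$ stage rewards, normalized by the discount series so the result stays on the scale of a single-step reward. This is a minimal illustration under assumed conventions, not the authors' reference implementation; the function name lnss_surrogate_reward and the exact normalization are illustrative choices.

```python
import numpy as np

def lnss_surrogate_reward(rewards, gamma=0.99):
    """Illustrative N-step surrogate stage reward (assumed form).

    Collapses the next N stage rewards into one surrogate reward by taking
    their discounted sum and dividing by sum_{k=0}^{N-1} gamma^k, so a
    constant reward sequence maps back to the same constant value.
    """
    rewards = np.asarray(rewards, dtype=np.float64)   # r_t, ..., r_{t+N-1}
    discounts = gamma ** np.arange(len(rewards))      # 1, gamma, ..., gamma^{N-1}
    return float(np.dot(discounts, rewards) / discounts.sum())

# Example: the surrogate reward would replace r_t in the replay buffer,
# leaving the underlying value-based or policy-gradient learner unchanged.
window = [0.2, 0.0, 1.0, 0.5, 0.3]           # next N = 5 stage rewards
r_surrogate = lnss_surrogate_reward(window)  # scalar on the single-step scale
print(round(r_surrogate, 4))
```

Because the surrogate reward is just a reweighted average of future stage rewards, it can be computed once per transition when a rollout window is available and stored in place of the raw reward, which keeps the added computational cost low.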