Paper Title
Accelerating Reinforcement Learning with a Directional-Gaussian-Smoothing Evolution Strategy
Paper Authors
Paper Abstract
Evolution strategy (ES) has shown great promise in many challenging reinforcement learning (RL) tasks, rivaling other state-of-the-art deep RL methods. Yet there are two limitations in current ES practice that may hinder its further capabilities. First, most current methods rely on Monte Carlo-type gradient estimators to suggest the search direction, where the policy parameters are, in general, randomly sampled. Due to the low accuracy of such estimators, RL training can suffer from slow convergence and require more iterations to reach an optimal solution. Second, the landscape of the reward function can be deceptive and contain many local maxima, causing ES algorithms to converge prematurely and fail to explore other parts of the parameter space with potentially greater rewards. In this work, we employ a Directional-Gaussian-Smoothing Evolution Strategy (DGS-ES) to accelerate RL training, which is well suited to address these two challenges through its ability to i) provide gradient estimates with high accuracy, and ii) find nonlocal search directions that emphasize the large-scale variation of the reward function and disregard local fluctuations. Through several benchmark RL tasks demonstrated herein, we show that DGS-ES is highly scalable, possesses superior wall-clock time, and achieves reward scores competitive with other popular policy gradient and ES approaches.
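To make the gradient-estimation idea in the abstract concrete, below is a minimal Python sketch of a directional-Gaussian-smoothing gradient estimate, assuming the construction outlined above: the reward is smoothed along an orthonormal set of random directions and each smoothed directional derivative is approximated with Gauss-Hermite quadrature. The function and parameter names (reward, dgs_gradient, sigma, num_quad) are illustrative and are not taken from the authors' implementation.

import numpy as np

def dgs_gradient(reward, theta, sigma=1.0, num_quad=5, rng=None):
    """Sketch of a nonlocal gradient estimate of `reward` at parameters `theta`."""
    rng = np.random.default_rng() if rng is None else rng
    d = theta.size
    # Random orthonormal basis; each column is one smoothing direction.
    basis, _ = np.linalg.qr(rng.standard_normal((d, d)))
    # Gauss-Hermite nodes and weights for integrals against exp(-x^2).
    nodes, weights = np.polynomial.hermite.hermgauss(num_quad)
    grad = np.zeros(d)
    for i in range(d):
        xi = basis[:, i]
        # Directional derivative of the Gaussian-smoothed 1D slice
        # v -> E_{u~N(0,1)}[F(theta + sigma*(v + u)*xi)] at v = 0, computed via the
        # Gaussian score-function identity and Gauss-Hermite quadrature.
        vals = np.array([reward(theta + np.sqrt(2.0) * sigma * x * xi) for x in nodes])
        dir_deriv = np.sum(weights * np.sqrt(2.0) * nodes * vals) / (np.sqrt(np.pi) * sigma)
        grad += dir_deriv * xi
    return grad

# Toy usage: gradient ascent on a simple surrogate "reward".
if __name__ == "__main__":
    reward = lambda th: -np.sum(th ** 2)   # maximized at the origin
    theta = np.ones(4)
    for _ in range(50):
        theta += 0.1 * dgs_gradient(reward, theta, sigma=0.5)
    print(theta)  # converges toward the maximizer at 0

In an RL setting, `reward` would stand in for an episode-return estimate of the policy parameterized by `theta`; the reward evaluations across directions and quadrature nodes are independent, which is presumably where the scalability noted in the abstract comes from.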