Paper Title
ClipUp: A Simple and Powerful Optimizer for Distribution-based Policy Evolution
Paper Authors
Paper Abstract
Distribution-based search algorithms are an effective approach for evolutionary reinforcement learning of neural network controllers. In these algorithms, gradients of the total reward with respect to the policy parameters are estimated using a population of solutions drawn from a search distribution, and then used for policy optimization with stochastic gradient ascent. A common choice in the community is to use the Adam optimization algorithm to obtain adaptive behavior during gradient ascent, due to its success in a variety of supervised learning settings. As an alternative to Adam, we propose to enhance classical momentum-based gradient ascent with two simple techniques: gradient normalization and update clipping. We argue that the resulting optimizer, called ClipUp (short for "clipped updates"), is a better choice for distribution-based policy evolution because its working principles are simple and easy to understand, and its hyperparameters can be tuned more intuitively in practice. Moreover, it removes the need to re-tune hyperparameters if the reward scale changes. Experiments show that ClipUp is competitive with Adam despite its simplicity, and is effective on challenging continuous control benchmarks, including the Humanoid control task based on the Bullet physics simulator.
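To make the update rule concrete, below is a minimal NumPy sketch of one ClipUp step as described in the abstract: classical momentum on a normalized gradient, followed by a norm cap on the resulting update. This is an illustrative reading of the abstract, not the authors' reference implementation; the function name `clipup_step` and the default hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

def clipup_step(theta, grad, velocity,
                step_size=0.15, momentum=0.9, max_speed=0.3):
    """One illustrative ClipUp step (gradient *ascent* on estimated reward).

    theta:    current policy parameters (1-D array)
    grad:     estimated reward gradient w.r.t. theta (e.g. from a
              population of solutions drawn from the search distribution)
    velocity: running momentum buffer, same shape as theta

    Note: hyperparameter defaults here are hypothetical placeholders.
    """
    # 1. Gradient normalization: keep only the gradient's direction,
    #    which decouples the update magnitude from the reward scale.
    grad_norm = np.linalg.norm(grad)
    if grad_norm > 0.0:
        grad = grad / grad_norm

    # 2. Classical (heavy-ball) momentum on the normalized gradient.
    velocity = momentum * velocity + step_size * grad

    # 3. Update clipping: cap the velocity's norm at max_speed so a
    #    single step can never move the parameters too far.
    speed = np.linalg.norm(velocity)
    if speed > max_speed:
        velocity = velocity * (max_speed / speed)

    # Gradient ascent step.
    return theta + velocity, velocity
```

Because the gradient is normalized before it enters the momentum buffer, `step_size` and `max_speed` are expressed directly as distances in parameter space, which is what makes the hyperparameters intuitive to tune and insensitive to the scale of the reward.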