Paper Title

Sample-based Distributional Policy Gradient

Authors

Rahul Singh, Keuntaek Lee, Yongxin Chen

Abstract

Distributional reinforcement learning (DRL) is a recent reinforcement learning framework whose success has been supported by various empirical studies. It relies on the key idea of replacing the expected return with the return distribution, which captures the intrinsic randomness of the long-term rewards. Most of the existing literature on DRL focuses on problems with discrete action spaces and value-based methods. In this work, motivated by applications in robotics with continuous action space control settings, we propose the sample-based distributional policy gradient (SDPG) algorithm. It models the return distribution using samples via a reparameterization technique widely used in generative modeling and inference. We compare SDPG with distributed distributional deterministic policy gradients (D4PG), the state-of-the-art policy gradient method in DRL. We apply SDPG and D4PG to multiple OpenAI Gym environments and observe that our algorithm shows better sample efficiency as well as higher rewards on most tasks.
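The abstract's central idea, representing the return distribution by samples produced through reparameterization rather than by a fixed set of atoms, can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration of that idea, not the authors' implementation: the class name ReturnSampleGenerator, the network sizes, and the Gaussian base noise are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class ReturnSampleGenerator(nn.Module):
    """Hypothetical sketch: a generator that maps base noise to samples
    of the return distribution Z(s, a), conditioned on state and action.
    Architecture and dimensions are illustrative, not from the paper."""

    def __init__(self, state_dim, action_dim, noise_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + noise_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one return sample per noise draw
        )

    def forward(self, state, action, num_samples=32):
        # Reparameterization: draw noise from a fixed base distribution
        # and push it through the network, so gradients flow back to the
        # network parameters through the generated samples.
        batch = state.shape[0]
        noise = torch.randn(batch, num_samples, 1)
        sa = torch.cat([state, action], dim=-1)
        sa = sa.unsqueeze(1).expand(-1, num_samples, -1)
        return self.net(torch.cat([sa, noise], dim=-1)).squeeze(-1)

# Usage: the usual critic value Q(s, a) is recovered by averaging
# the generated return samples.
gen = ReturnSampleGenerator(state_dim=3, action_dim=1)
s, a = torch.randn(8, 3), torch.randn(8, 1)
z = gen(s, a)       # shape (8, 32): return samples per (s, a) pair
q = z.mean(dim=1)   # sample-based estimate of the expected return
```

Averaging the samples gives the scalar critic signal a deterministic policy gradient would use, while the full sample set retains the distributional information that atom-based methods such as D4PG encode with a fixed categorical support.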
