Paper Title

Off-Policy Reinforcement Learning with Loss Function Weighted by Temporal Difference Error

Authors

Bumgeun Park, Taeyoung Kim, Woohyeon Moon, Luiz Felipe Vecchietti, Dongsoo Har

Abstract

Training agents via off-policy deep reinforcement learning (RL) requires a large memory, named replay memory, that stores past experiences used for learning. These experiences are sampled, uniformly or non-uniformly, to create the batches used for training. When calculating the loss function, off-policy algorithms assume that all samples are of the same importance. In this paper, we hypothesize that training can be enhanced by assigning a different importance to each experience, based on its temporal-difference (TD) error, directly in the training objective. We propose a novel method that introduces a weighting factor for each experience when calculating the loss function at the learning stage. In addition to improving convergence speed when used with uniform sampling, the method can be combined with prioritization methods for non-uniform sampling. Combining the proposed method with prioritization methods improves sampling efficiency while increasing the performance of TD-based off-policy RL algorithms. The effectiveness of the proposed method is demonstrated by experiments in six environments of the OpenAI Gym suite. The experimental results show that the proposed method achieves a 33%~76% reduction in convergence time in three environments, and an 11% increase in returns and a 3%~10% increase in success rate for the other three environments.
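The abstract's core idea is to weight each sampled transition's contribution to the loss by a function of its TD error. The sketch below illustrates this in a PyTorch, DQN-style setting; the specific weighting scheme (normalized absolute TD error), the optional combination with importance-sampling weights, and the name `td_weighted_loss` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def td_weighted_loss(q_pred, q_target, is_weights=None, eps=1e-6):
    """Per-sample loss weighted by the magnitude of the TD error.

    q_pred, q_target: 1-D tensors of Q-value estimates and bootstrapped
    targets for one sampled batch. is_weights: optional importance-sampling
    weights (e.g., from prioritized replay) to combine with the TD-error
    weights. The weighting scheme here is an illustrative assumption.
    """
    td_error = (q_target - q_pred).detach()   # no gradient flows through the weights
    w = td_error.abs() + eps                  # larger TD error -> larger weight
    w = w / w.mean()                          # normalize so the mean weight is 1
    if is_weights is not None:
        w = w * is_weights                    # stack with prioritized-replay correction
    per_sample = F.smooth_l1_loss(q_pred, q_target, reduction="none")
    return (w * per_sample).mean()
```

With uniform sampling, `is_weights` can be left as None; when combined with a prioritization method, multiplying in the importance-sampling weights corresponds to the combined use of loss weighting and non-uniform sampling described in the abstract.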
