Paper Title

Goal-Conditioned Q-Learning as Knowledge Distillation

Authors

Alexander Levine, Soheil Feizi

Abstract


Many applications of reinforcement learning can be formalized as goal-conditioned environments, where, in each episode, there is a "goal" that affects the rewards obtained during that episode but does not affect the dynamics. Various techniques have been proposed to improve performance in goal-conditioned environments, such as automatic curriculum generation and goal relabeling. In this work, we explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation. In particular: the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals. We therefore apply Gradient-Based Attention Transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to the Q-function update. We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals, where the agent can attain a reward by achieving any one of a large set of objectives, all specified at test time. Finally, to provide theoretical support, we give examples of classes of environments where (under some assumptions) standard off-policy algorithms such as DDPG require at least O(d^2) replay buffer transitions to learn an optimal policy, while our proposed technique requires only O(d) transitions, where d is the dimensionality of the goal and state space. Code is available at https://github.com/alevine0/ReenGAGE.
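
To make the core idea concrete, the following is a minimal PyTorch sketch, not the authors' ReenGAGE implementation, of a DDPG-style critic loss augmented with a goal-gradient-matching term in the spirit of Gradient-Based Attention Transfer: in addition to regressing Q(s, a, g) toward the bootstrapped target, it penalizes the mismatch between dQ/dg and the gradient of the target value with respect to the goal. The Critic module, the target_actor interface, the weight lam, and all function names are illustrative assumptions rather than details taken from the paper or repository.

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Q(s, a, g): a small MLP critic over state, action, and goal (illustrative)."""

    def __init__(self, state_dim, action_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, g):
        return self.net(torch.cat([s, a, g], dim=-1)).squeeze(-1)


def critic_loss_with_goal_gradient_matching(critic, target_critic, target_actor,
                                            s, a, g, r, s_next, done,
                                            gamma=0.99, lam=1.0):
    """DDPG-style TD loss plus a gradient-matching term over the goal input.

    Besides regressing Q(s, a, g) toward the bootstrapped target y, this also
    penalizes ||dQ/dg - dy/dg||^2, transferring the target's sensitivity to the
    goal (in the spirit of Gradient-Based Attention Transfer).
    """
    g_cur = g.clone().requires_grad_(True)   # goal fed to the current critic
    g_tgt = g.clone().requires_grad_(True)   # goal fed to the target networks

    q = critic(s, a, g_cur)

    # Bootstrapped target, kept differentiable w.r.t. g_tgt so dy/dg exists.
    a_next = target_actor(s_next, g_tgt)
    y = r + gamma * (1.0 - done) * target_critic(s_next, a_next, g_tgt)

    td_loss = ((q - y.detach()) ** 2).mean()

    # Per-sample gradients w.r.t. the goal (samples are independent, so sum() works).
    dq_dg = torch.autograd.grad(q.sum(), g_cur, create_graph=True)[0]
    dy_dg = torch.autograd.grad(y.sum(), g_tgt)[0].detach()

    grad_loss = ((dq_dg - dy_dg) ** 2).sum(dim=-1).mean()
    return td_loss + lam * grad_loss
```

In this sketch the target-side gradient is detached so it acts purely as a regression target for dQ/dg, and lam trades off the standard TD error against the gradient-matching term; the paper's actual loss and hyperparameters may differ (see the linked repository).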
