Paper Title

EAGER: Asking and Answering Questions for Automatic Reward Shaping in Language-guided RL

Authors

Thomas Carta, Pierre-Yves Oudeyer, Olivier Sigaud, Sylvain Lamprier

Abstract

Reinforcement learning (RL) in long horizon and sparse reward tasks is notoriously difficult and requires a lot of training steps. A standard solution to speed up the process is to leverage additional reward signals, shaping them to better guide the learning process. In the context of language-conditioned RL, the abstraction and generalisation properties of the language input provide opportunities for more efficient ways of shaping the reward. In this paper, we leverage this idea and propose an automated reward shaping method where the agent extracts auxiliary objectives from the general language goal. These auxiliary objectives use a question generation (QG) and question answering (QA) system: they consist of questions leading the agent to try to reconstruct partial information about the global goal using its own trajectory. When it succeeds, it receives an intrinsic reward proportional to its confidence in its answer. This incentivizes the agent to generate trajectories which unambiguously explain various aspects of the general language goal. Our experimental study shows that this approach, which does not require engineer intervention to design the auxiliary objectives, improves sample efficiency by effectively directing exploration.
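
As a rough illustration of the mechanism the abstract describes, here is a minimal sketch of the QA-confidence intrinsic reward. The `qg_model` and `qa_model` interfaces and all names below are hypothetical assumptions for exposition, not the authors' actual implementation.

```python
# Minimal sketch of EAGER-style reward shaping (hypothetical interfaces).

def eager_intrinsic_reward(goal, trajectory, qg_model, qa_model):
    """Return the shaped intrinsic reward for one trajectory.

    Assumed (hypothetical) interfaces:
      qg_model(goal)           -> list of (question, target_answer) pairs
                                  extracted from the language goal
      qa_model(question, traj) -> (predicted_answer, confidence in [0, 1])
    """
    reward = 0.0
    for question, target_answer in qg_model(goal):
        predicted, confidence = qa_model(question, trajectory)
        if predicted == target_answer:
            # Reward is proportional to the QA system's confidence, so
            # trajectories that unambiguously reflect the goal score higher.
            reward += confidence
    return reward


# Toy usage with stub models, just to show the call shape.
goal = "put the red ball next to the blue box"
trajectory = ["pick red ball", "move to blue box", "drop red ball"]

def stub_qg(goal):
    return [("what object is moved?", "red ball"),
            ("where is it placed?", "next to the blue box")]

def stub_qa(question, traj):
    answers = {"what object is moved?": ("red ball", 0.9),
               "where is it placed?": ("next to the blue box", 0.7)}
    return answers[question]

print(eager_intrinsic_reward(goal, trajectory, stub_qg, stub_qa))  # ~1.6
```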
