Paper Title

IL-flOw: Imitation Learning from Observation using Normalizing Flows

Paper Authors

Wei-Di Chang, Juan Camilo Gamboa Higuera, Scott Fujimoto, David Meger, Gregory Dudek

Paper Abstract

We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only. Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods, which require updating the reward model during policy search and are known to be unstable and difficult to optimize. Our method, IL-flOw, recovers the expert policy by modelling state-state transitions and generating rewards with deep density estimators trained on the demonstration trajectories, avoiding the instability issues of adversarial methods. We demonstrate that using the state transition log-probability density as a reward signal for forward reinforcement learning translates to matching the trajectory distribution of the expert demonstrations, and experimentally show good recovery of the true reward signal as well as state-of-the-art results for imitation from observation on locomotion and robotic continuous control tasks.
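
The abstract describes the core recipe: fit a density model (a normalizing flow) to expert state transitions, then use the log-probability it assigns to a transition as a frozen reward for any standard RL algorithm. The sketch below is an illustrative interpretation of that idea, not the authors' implementation: it models the joint density of concatenated (s_t, s_{t+1}) pairs with a small RealNVP-style flow; names such as `TransitionRewardModel`, `fit`, and `expert_transitions` are hypothetical.

```python
# Illustrative sketch (assumed architecture, not the paper's code): a RealNVP-style
# flow over expert (s_t, s_{t+1}) pairs whose log-density serves as an RL reward.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """One affine coupling layer: half the dims condition a scale/shift of the rest."""

    def __init__(self, dim, hidden=128, flip=False):
        super().__init__()
        self.flip = flip
        half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - half)),
        )

    def forward(self, x):
        half = x.shape[1] // 2
        # Which half is transformed alternates across layers via `flip`.
        x1, x2 = (x[:, half:], x[:, :half]) if self.flip else (x[:, :half], x[:, half:])
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)               # keep scales bounded for stability
        y2 = x2 * torch.exp(log_s) + t
        y = torch.cat([y2, x1], 1) if self.flip else torch.cat([x1, y2], 1)
        return y, log_s.sum(dim=1)               # log|det Jacobian| of this layer


class TransitionRewardModel(nn.Module):
    """Normalizing flow over concatenated (s_t, s_{t+1}) pairs (hypothetical name)."""

    def __init__(self, state_dim, n_layers=4):
        super().__init__()
        dim = 2 * state_dim
        self.layers = nn.ModuleList(
            [AffineCoupling(dim, flip=(i % 2 == 1)) for i in range(n_layers)]
        )
        self.base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))

    def log_prob(self, s, s_next):
        z = torch.cat([s, s_next], dim=1)
        log_det = 0.0
        for layer in self.layers:
            z, ld = layer(z)
            log_det = log_det + ld
        return self.base.log_prob(z).sum(dim=1) + log_det

    def reward(self, s, s_next):
        # Log-density of the observed transition under the expert model,
        # used as a fixed reward signal during forward RL (no further updates).
        with torch.no_grad():
            return self.log_prob(s, s_next)


def fit(model, expert_transitions, epochs=50, lr=1e-3):
    """Maximum-likelihood training on expert (s_t, s_{t+1}) batches (hypothetical loader)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, s_next in expert_transitions:
            loss = -model.log_prob(s, s_next).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Because the density model is trained once on the demonstrations and then frozen, the reward does not change while the policy is being optimized, which is the decoupling from policy learning that the abstract contrasts with adversarial IRL methods.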
