Paper Title

Bridging the Imitation Gap by Adaptive Insubordination

Paper Authors

Luca Weihs, Unnat Jain, Iou-Jen Liu, Jordi Salvador, Svetlana Lazebnik, Aniruddha Kembhavi, Alexander Schwing

Paper Abstract

In practice, imitation learning is preferred over pure reinforcement learning whenever it is possible to design a teaching agent to provide expert supervision. However, we show that when the teaching agent makes decisions with access to privileged information that is unavailable to the student, this information is marginalized during imitation learning, resulting in an "imitation gap" and, potentially, poor results. Prior work bridges this gap via a progression from imitation learning to reinforcement learning. While often successful, gradual progression fails for tasks that require frequent switches between exploration and memorization. To better address these tasks and alleviate the imitation gap we propose 'Adaptive Insubordination' (ADVISOR). ADVISOR dynamically weights imitation and reward-based reinforcement learning losses during training, enabling on-the-fly switching between imitation and exploration. On a suite of challenging tasks set within gridworlds, multi-agent particle environments, and high-fidelity 3D simulators, we show that on-the-fly switching with ADVISOR outperforms pure imitation, pure reinforcement learning, as well as their sequential and parallel combinations.
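The abstract describes ADVISOR only at a high level: it dynamically weights an imitation loss and a reward-based reinforcement learning loss during training. As a rough illustration of that idea (not the authors' implementation), the PyTorch sketch below blends a cross-entropy imitation term with a simple policy-gradient term through a single weight w. How ADVISOR actually computes this weight per step is not specified in the abstract, so w is left as an input here, and all names, shapes, and numbers are hypothetical.

```python
# A minimal sketch, assuming a discrete action space and a single training step;
# this is NOT the ADVISOR algorithm itself, only the weighted blending of an
# imitation loss and an RL loss that the abstract refers to.
import torch
import torch.nn.functional as F


def blended_loss(policy_logits, expert_action, log_prob_taken, advantage, w):
    """Weighted combination of an imitation loss and an RL loss for one step.

    policy_logits  : (num_actions,) logits of the student policy at this state
    expert_action  : () long tensor, action chosen by the privileged teacher
    log_prob_taken : () log-probability of the action the student actually took
    advantage      : () advantage estimate used by the policy-gradient term
    w              : float in [0, 1]; w -> 1 favors imitation, w -> 0 favors RL
    """
    # Imitation term: cross-entropy between the student's policy and the teacher's action.
    imitation_loss = F.cross_entropy(policy_logits.unsqueeze(0),
                                     expert_action.unsqueeze(0))
    # RL term: vanilla policy-gradient surrogate (advantage treated as a constant).
    rl_loss = -(advantage.detach() * log_prob_taken)
    return w * imitation_loss + (1.0 - w) * rl_loss


# Illustrative usage with made-up numbers.
logits = torch.randn(4)                          # 4-action policy output
expert_a = torch.tensor(2)                       # teacher's action
log_prob = torch.log_softmax(logits, dim=-1)[1]  # student took action 1
adv = torch.tensor(0.7)                          # advantage estimate
loss = blended_loss(logits, expert_a, log_prob, adv, w=0.3)
```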
