Paper Title

Bayesian Nonparametrics for Offline Skill Discovery

Paper Authors

Valentin Villecroze, Harry J. Braviner, Panteha Naderian, Chris J. Maddison, Gabriel Loaiza-Ganem

Abstract

Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at https://github.com/layer6ai-labs/BNPO.
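To make the two key ingredients of the abstract concrete, here is a minimal sketch of (a) a stick-breaking construction, the standard Bayesian-nonparametric device for a distribution over an unbounded set of components, truncated at a cap, and (b) a Gumbel-softmax continuous relaxation for differentiable option selection. This is an illustration under assumed design choices, not the paper's actual implementation; the names `stick_breaking_weights`, `sample_option`, and the truncation cap of 10 are hypothetical.

```python
import torch
import torch.nn.functional as F

def stick_breaking_weights(logits):
    """Map unconstrained logits to option weights via stick-breaking:
    w_k = v_k * prod_{j<k} (1 - v_j), with v_k in (0, 1).
    A truncation cap stands in for the 'dynamically-changing' number of
    options: components far down the stick get negligible mass."""
    v = torch.sigmoid(logits)                          # stick fractions
    log_one_minus_v = torch.log1p(-v + 1e-8)
    # exclusive cumulative sum of log(1 - v_j) over j < k
    cum = torch.cumsum(log_one_minus_v, dim=-1) - log_one_minus_v
    return torch.exp(torch.log(v + 1e-8) + cum)

def sample_option(weights, temperature=0.5):
    """Differentiable (relaxed) sample of an option index using the
    Gumbel-softmax trick, one common continuous relaxation of a
    categorical variable."""
    u = torch.rand_like(weights)
    gumbel = -torch.log(-torch.log(u + 1e-8) + 1e-8)
    return F.softmax((torch.log(weights + 1e-8) + gumbel) / temperature,
                     dim=-1)

# Illustrative usage: cap the number of options at 10 and let the
# learned stick-breaking weights decide how many carry real mass.
logits = torch.randn(10, requires_grad=True)
w = stick_breaking_weights(logits)
z = sample_option(w)   # soft one-hot over options; gradients flow to logits
```

Because the relaxed sample `z` is a soft one-hot vector, gradients flow through the option choice back into the stick-breaking logits, which is what lets a variational objective be optimized end to end without fixing K up front.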
