MO2: Model-Based Offline Options

Paper Title

MO2: Model-Based Offline Options

Authors

Sasha Salter, Markus Wulfmeier, Dhruva Tirumala, Nicolas Heess, Martin Riedmiller, Raia Hadsell, Dushyant Rao

Abstract

The ability to discover useful behaviours from past experience and transfer them to new tasks is considered a core component of natural embodied intelligence. Inspired by neuroscience, discovering behaviours that switch at bottleneck states has long been sought after for inducing plans of minimum description length across tasks. Prior approaches have either supported only online, on-policy bottleneck state discovery, limiting sample efficiency, or only discrete state-action domains, restricting applicability. To address this, we introduce Model-Based Offline Options (MO2), an offline hindsight framework supporting sample-efficient bottleneck option discovery over continuous state-action spaces. Once bottleneck options are learnt offline over source domains, they are transferred online to improve exploration and value estimation on the transfer domain. Our experiments show that on complex long-horizon continuous control tasks with sparse, delayed rewards, MO2's properties are essential and lead to performance exceeding recent option learning methods. Additional ablations further demonstrate the impact on option predictability and credit assignment.
