Paper Title

Learning Dense Reward with Temporal Variant Self-Supervision

Authors

Yuning Wu, Jieliang Luo, Hui Li

Abstract

Rewards play an essential role in reinforcement learning. In contrast to rule-based game environments with well-defined reward functions, complex real-world robotic applications, such as contact-rich manipulation, lack explicit and informative descriptions that can directly be used as a reward. Previous efforts have shown that it is possible to algorithmically extract dense rewards directly from multimodal observations. In this paper, we aim to extend this effort by proposing a more efficient and robust way of sampling and learning. In particular, our sampling approach utilizes temporal variance to simulate the fluctuating state and action distribution of a manipulation task. We then propose a network architecture for self-supervised learning that better incorporates temporal information in latent representations. We test our approach in two experimental setups, namely joint-assembly and door-opening. Preliminary results show that our approach is effective and efficient in learning dense rewards, and the learned rewards lead to faster convergence than baselines.
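
To make the abstract's idea more concrete, below is a minimal, hypothetical PyTorch sketch of how a dense-reward model could be trained with temporal self-supervision: observations sampled later in a trajectory are ranked above observations sampled earlier, so the temporal ordering itself supplies the training labels. The names (DenseRewardModel, temporal_ranking_loss) and the pairwise ranking objective are illustrative assumptions for exposition, not the authors' actual architecture or temporal-variance sampling scheme.

# Hypothetical sketch: dense reward learned from temporal ordering.
# Not the paper's implementation; names and objective are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseRewardModel(nn.Module):
    """Encodes an observation into a latent vector and maps it to a scalar reward."""
    def __init__(self, obs_dim: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim), nn.ReLU(),
        )
        self.reward_head = nn.Linear(latent_dim, 1)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.reward_head(self.encoder(obs)).squeeze(-1)

def temporal_ranking_loss(model: DenseRewardModel,
                          earlier_obs: torch.Tensor,
                          later_obs: torch.Tensor) -> torch.Tensor:
    """Self-supervised objective: observations later in a trajectory should
    receive a higher reward than earlier ones, so temporal order provides
    the supervision (no manual reward annotation)."""
    r_early = model(earlier_obs)
    r_late = model(later_obs)
    # Pairwise logistic ranking loss on the reward difference.
    return F.softplus(r_early - r_late).mean()

if __name__ == "__main__":
    # Toy usage with random tensors standing in for (earlier, later)
    # observation pairs; in the paper these would be multimodal robot observations.
    model = DenseRewardModel(obs_dim=16)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    earlier = torch.randn(32, 16)
    later = torch.randn(32, 16)
    loss = temporal_ranking_loss(model, earlier, later)
    opt.zero_grad()
    loss.backward()
    opt.step()

In such a setup, the trained model's scalar output can be queried at every environment step as a dense reward signal for a downstream reinforcement-learning agent.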
