Paper Title
Generating Automatic Curricula via Self-Supervised Active Domain Randomization
Authors
Abstract
Goal-directed Reinforcement Learning (RL) traditionally considers an agent interacting with an environment, prescribing a real-valued reward to the agent proportional to the completion of some goal. Goal-directed RL has seen large gains in sample efficiency, due to the ease of reusing or generating new experience by proposing goals. One approach, self-play, allows an agent to "play" against itself by alternately setting and accomplishing goals, creating a learned curriculum through which the agent can learn to accomplish progressively more difficult goals. However, self-play has been limited to goal curriculum learning, that is, learning progressively harder goals within a single environment. Recent work on robotic agents has shown that varying the environment during training, for example with domain randomization, leads to more robust transfer. We therefore extend the self-play framework to jointly learn a goal curriculum and an environment curriculum, leading to an approach that learns the most fruitful domain randomization strategy with self-play. Our method, Self-Supervised Active Domain Randomization (SS-ADR), generates a coupled goal-task curriculum, where agents learn through progressively more difficult tasks and environment variations. By encouraging the agent to try tasks that are just outside of its current capabilities, SS-ADR builds a domain randomization curriculum that enables state-of-the-art results on various sim2real transfer tasks. Our results show that a curriculum that co-evolves the environment difficulty together with the difficulty of the goals set in each environment provides practical benefits in the goal-directed tasks tested.
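The coupled curriculum described in the abstract can be illustrated with a toy sketch. This is not the SS-ADR algorithm: there is no learned goal-setting policy, no RL, and no simulator. Everything here is a hypothetical stand-in chosen for illustration, including the names `Task`, `difficulty`, `propose_task`, and `run_curriculum`, the scalar `skill` model, and the `margin` and `gain` parameters. The sketch only shows the idea of proposing tasks just beyond the agent's current capability while varying both the goal and an environment parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    goal_distance: float  # goal difficulty (hypothetical: distance to the goal)
    friction: float       # environment parameter (hypothetical: randomized friction)


def difficulty(task: Task) -> float:
    # Assumed scalar difficulty: farther goals and higher friction are harder.
    return task.goal_distance * (1.0 + task.friction)


def propose_task(skill: float, candidates: list[Task], margin: float = 0.5) -> Task:
    # Pick the hardest task that is at most `margin` beyond the current skill,
    # i.e. a task "just outside" what the agent can already do.
    cap = skill * (1.0 + margin)
    feasible = [t for t in candidates if difficulty(t) <= cap]
    if not feasible:
        # Fall back to the easiest task if nothing is within reach.
        return min(candidates, key=difficulty)
    return max(feasible, key=difficulty)


def run_curriculum(skill: float = 1.0, rounds: int = 10, gain: float = 0.25):
    # Candidate tasks span both goals and environment settings.
    candidates = [Task(d, f) for d in (1, 2, 3, 4) for f in (0.0, 0.5, 1.0)]
    history = []
    for _ in range(rounds):
        task = propose_task(skill, candidates)
        # Toy learning rule: the agent improves only when the task exceeds its skill.
        if difficulty(task) > skill:
            skill += gain
        history.append((task, skill))
    return history


if __name__ == "__main__":
    for task, skill in run_curriculum():
        print(f"goal={task.goal_distance} friction={task.friction} "
              f"difficulty={difficulty(task)} skill_after={skill}")
```

In this toy model the proposed tasks grow harder along both axes as the agent's skill rises: goal distance and friction increase together, rather than only the goal changing within one fixed environment.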