Paper Title

VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement

Paper Authors

Erik Wijmans, Irfan Essa, Dhruv Batra

Paper Abstract

We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogeneous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of, and blurs the line between, synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). VER learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL). VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state-of-the-art distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0, VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO. Compared to SampleFactory (the current state-of-the-art AsyncOnRL), VER matches its speed on 1 GPU and is 70% faster (1.7x speedup) on 8 GPUs, with better sample efficiency. We leverage these speed-ups to train chained skills for GeometricGoal rearrangement tasks in the Home Assistant Benchmark (HAB). We find a surprising emergence of navigation in skills that do not ostensibly require any navigation. Specifically, the Pick skill involves a robot picking an object from a table. During training, the robot was always spawned close to the table and never needed to navigate. However, we find that if base movement is part of the action space, the robot learns to navigate and then pick an object in new environments with 50% success, demonstrating surprisingly high out-of-distribution generalization.
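To make the "variable experience" idea concrete, the toy Python sketch below illustrates (under our own simplifying assumptions, not the paper's implementation) how a fixed per-update budget of on-policy steps can be filled by rollouts of variable length: fast environments contribute long rollouts, slow environments contribute short ones, and no environment blocks waiting for the others. All names here (`collect_batch`, `step_time`, `TOTAL_STEPS_PER_UPDATE`) are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy illustration of variable experience rollout: each policy
# update consumes a fixed total number of on-policy steps, but individual
# environments contribute variable-length rollouts, so fast environments are
# never stalled behind slow ones. This is a sketch, not the paper's system.

NUM_ENVS = 8
TOTAL_STEPS_PER_UPDATE = 128  # fixed experience budget per policy update

# Simulated per-step cost: heterogeneous environments take vastly
# different amounts of time to produce one transition.
step_time = {env_id: random.uniform(0.5, 4.0) for env_id in range(NUM_ENVS)}


def collect_batch():
    """Greedily collect TOTAL_STEPS_PER_UPDATE transitions, always stepping
    whichever environment would finish its next step soonest (an event-driven
    stand-in for asynchronous workers). Returns steps contributed per env."""
    next_ready = dict(step_time)   # time at which each env's next step completes
    steps = defaultdict(int)
    for _ in range(TOTAL_STEPS_PER_UPDATE):
        env_id = min(next_ready, key=next_ready.get)  # fastest-available env
        steps[env_id] += 1
        next_ready[env_id] += step_time[env_id]       # schedule its next step
    return steps


if __name__ == "__main__":
    contributions = collect_batch()
    for env_id in sorted(contributions):
        print(f"env {env_id}: {contributions[env_id]:3d} steps "
              f"(per-step cost {step_time[env_id]:.2f})")
    # Fast environments end up with long rollouts, slow ones with short
    # rollouts, yet the learner sees exactly TOTAL_STEPS_PER_UPDATE
    # on-policy steps per update, with no synchronization barrier per env.
```

Running the sketch shows the per-environment step counts varying inversely with the simulated per-step cost, which is the behavior that lets VER avoid the stragglers that limit purely synchronous methods while still training only on on-policy data.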
