Paper Title
CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning
Paper Authors
Paper Abstract
Large-scale training has propelled significant progress in various sub-fields of AI such as computer vision and natural language processing. However, building robot learning systems at a comparable scale remains challenging. To develop robots that can perform a wide range of skills and adapt to new scenarios, efficient methods for collecting vast and diverse amounts of data on physical robot systems are required, as well as the capability to train high-capacity policies using such datasets. In this work, we propose a framework for scaling robot learning, with a specific focus on multi-task and multi-scene manipulation in kitchen environments, both in simulation and in the real world. Our proposed framework, CACTI, comprises four separately handled stages: data collection, data augmentation, visual representation learning, and imitation policy training, to enable scalability in robot learning. We make use of state-of-the-art generative models as part of the data augmentation stage, and use pre-trained out-of-domain visual representations to improve training efficiency. Experimental results demonstrate the effectiveness of our approach. On a real robot setup, CACTI enables efficient training of a single policy that can perform 10 manipulation tasks involving kitchen objects, and is robust to varying layouts of distractors. In a simulated kitchen environment, CACTI trains a single policy to perform 18 semantic tasks across 100 layout variations for each individual task. We will release the simulation task benchmark and augmented datasets in both real and simulated environments to facilitate future research.
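To make the stage boundaries concrete, below is a minimal sketch of the four-stage pipeline described in the abstract (data collection, generative data augmentation, frozen pre-trained visual encoding, and multi-task imitation policy training). All function names, data structures, and arguments are hypothetical placeholders for illustration; they are not the authors' actual API or implementation.

```python
# Hypothetical sketch of the four CACTI stages from the abstract.
# Function names and data structures are placeholders, not the authors' code.

def collect_demonstrations(tasks):
    """Stage 1: gather a modest set of expert demonstrations per task
    (e.g., teleoperation on the real robot or scripted policies in sim)."""
    return [{"task": t, "frames": [], "actions": []} for t in tasks]

def augment_with_generative_model(demos, num_variants=10):
    """Stage 2: expand visual diversity by re-rendering scene appearance
    (layouts, distractors) with an off-the-shelf generative model."""
    augmented = []
    for demo in demos:
        for _ in range(num_variants):
            # Placeholder: a generative edit of the observation frames.
            augmented.append({**demo, "frames": list(demo["frames"])})
    return demos + augmented

def encode_observations(demos, frozen_encoder):
    """Stage 3: map raw images to compact features using a pre-trained,
    out-of-domain visual encoder that stays frozen during training."""
    for demo in demos:
        demo["features"] = [frozen_encoder(frame) for frame in demo["frames"]]
    return demos

def train_multitask_policy(demos):
    """Stage 4: train a single policy by imitation on pooled
    (feature, task-id) -> action pairs across all tasks and scenes."""
    def policy(feature, task_id):
        return None  # placeholder for a learned action predictor
    return policy
```

The key design choice conveyed by the abstract is that the stages are decoupled: augmentation and representation learning can scale independently of costly on-robot data collection, and the downstream policy only consumes compact features rather than raw pixels.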