Paper Title
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
Paper Authors
Paper Abstract
Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question -- can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent. We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data significantly improves performance on VLN -- outperforming the prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful -- with their combination resulting in further positive synergistic effects.
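To make the core idea concrete, below is a minimal sketch (not the authors' released code) of scoring the compatibility between a tokenized instruction and a sequence of panoramic image features with a single-stream transformer; all module names, dimensions, and pooling choices here are illustrative assumptions rather than the actual VLN-BERT architecture.

# Hedged sketch: compatibility scoring between an instruction and panoramic views.
# Dimensions, pooling, and module layout are assumptions, not the paper's design.
import torch
import torch.nn as nn


class CompatibilityScorer(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, img_feat_dim=2048,
                 nhead=8, num_layers=4):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)   # instruction tokens
        self.img_proj = nn.Linear(img_feat_dim, d_model)       # panoramic view features
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.score_head = nn.Linear(d_model, 1)                 # scalar compatibility score

    def forward(self, token_ids, img_feats):
        # token_ids: (batch, num_tokens); img_feats: (batch, num_views, img_feat_dim)
        text = self.token_embed(token_ids)
        views = self.img_proj(img_feats)
        joint = torch.cat([text, views], dim=1)      # joint sequence over both modalities
        encoded = self.encoder(joint)
        pooled = encoded.mean(dim=1)                 # pool over the joint sequence
        return self.score_head(pooled).squeeze(-1)   # higher = better path-instruction match


if __name__ == "__main__":
    model = CompatibilityScorer()
    tokens = torch.randint(0, 30522, (2, 16))        # two instructions, 16 tokens each
    panoramas = torch.randn(2, 6, 2048)              # six panoramic views per candidate path
    print(model(tokens, panoramas).shape)            # torch.Size([2])

In this framing, pretraining on web image-text pairs and then fine-tuning on path-instruction data would amount to training such a scorer on progressively more task-specific pairs; the exact pretraining objectives and curriculum are described in the paper itself.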