Paper Title

Bridging the visual gap in VLN via semantically richer instructions

Authors

Joaquin Ossandón, Benjamin Earle, Álvaro Soto

Abstract

The Visual-and-Language Navigation (VLN) task requires understanding a textual instruction to navigate a natural indoor environment using only visual information. While this is a trivial task for most humans, it is still an open problem for AI models. In this work, we hypothesize that poor use of the available visual information is at the core of the low performance of current models. To support this hypothesis, we provide experimental evidence showing that state-of-the-art models are not severely affected when they receive just limited or even no visual data, indicating a strong overfitting to the textual instructions. To encourage a more suitable use of the visual information, we propose a new data augmentation method that fosters the inclusion of more explicit visual information in the generation of textual navigational instructions. Our main intuition is that current VLN datasets include textual instructions that are intended to inform an expert navigator, such as a human, but not a beginner visual navigational agent, such as a randomly initialized DL model. Specifically, to bridge the visual semantic gap of current VLN datasets, we take advantage of metadata available for the Matterport3D dataset that, among other things, includes information about object labels present in the scenes. Training a state-of-the-art model with the new set of instructions increases its success rate by 8% on unseen environments, demonstrating the advantages of the proposed data augmentation method.
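To make the core idea concrete, the sketch below illustrates one simple way an instruction could be enriched with object labels drawn from scene metadata, in the spirit of the augmentation the abstract describes. The function name, data layout, and phrasing template are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: append explicit visual cues (object labels from
# Matterport3D-style metadata) to a textual navigation instruction.
# The template "You will pass: ..." is an assumption for illustration.

def enrich_instruction(instruction, objects_along_path):
    """Augment a VLN instruction with object labels seen along the path."""
    if not objects_along_path:
        return instruction  # nothing to add without visual metadata
    cues = ", ".join(objects_along_path)
    return f"{instruction} You will pass: {cues}."

print(enrich_instruction(
    "Walk down the hallway and stop at the second door.",
    ["sofa", "lamp", "painting"],
))
```

A real pipeline would instead extract the labels of objects visible from each viewpoint on the ground-truth path and weave them into regenerated instructions, but the principle of making the visual semantics explicit in the text is the same.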
