Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

Paper title

Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

Authors

Shuquan Ye, Yujia Xie, Dongdong Chen, Yichong Xu, Lu Yuan, Chenguang Zhu, Jing Liao

Abstract

This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models. Despite the great success, we observe that existing VL-models still lack commonsense knowledge/reasoning ability (e.g., "Lemons are sour"), which is a vital component towards artificial general intelligence. Through our analysis, we find one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL-models from the data perspective. Rather than collecting a new VL training dataset, we propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE). It can be viewed as one type of data augmentation technique, which can inject commonsense knowledge into existing VL datasets on the fly during training. More specifically, we leverage the commonsense knowledge graph (e.g., ConceptNet) and create variants of text description in VL datasets via bidirectional sub-graph sequentialization. For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark. By conducting extensive experiments on some representative VL-models, we demonstrate that our DANCE technique is able to significantly improve the commonsense ability while maintaining the performance on vanilla retrieval tasks. The code and data are available at https://github.com/pleaseconnectwifi/DANCE
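
To make the idea of turning knowledge-graph triples into caption variants more concrete, the following is a minimal sketch and not the authors' implementation. The triples are hand-written toy data rather than real ConceptNet edges, and the templates and the function names `linearize` and `augment` are hypothetical, chosen only for illustration. "Bidirectional" is approximated here by phrasing a triple differently depending on whether the caption entity is its head or its tail.

```python
import random

# Toy (head, relation, tail) triples written by hand for illustration;
# they are NOT loaded from ConceptNet.
TRIPLES = [
    ("lemon", "HasProperty", "sour"),
    ("lemon", "IsA", "fruit"),
    ("lemonade", "MadeOf", "lemon"),
]

# Hypothetical templates. FORWARD is used when the entity is the head of a
# triple, BACKWARD when it is the tail.
FORWARD = {
    "HasProperty": "{h} which is {t}",
    "IsA": "{h} which is a kind of {t}",
    "MadeOf": "{h} which is made of {t}",
}
BACKWARD = {
    "HasProperty": "{t} which is a property of {h}",
    "IsA": "{t} which includes {h}",
    "MadeOf": "{t} which {h} is made of",
}


def linearize(entity):
    """Return one text phrase per triple that mentions `entity`."""
    phrases = []
    for h, r, t in TRIPLES:
        if h == entity:
            phrases.append(FORWARD[r].format(h=h, t=t))
        elif t == entity:
            phrases.append(BACKWARD[r].format(h=h, t=t))
    return phrases


def augment(caption, entity, rng):
    """Replace the first whole-word occurrence of `entity` with a random phrase."""
    words = caption.split()
    phrases = linearize(entity)
    if entity not in words or not phrases:
        return caption  # nothing to inject; keep the caption unchanged
    words[words.index(entity)] = rng.choice(phrases)
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment("a lemon on a table", "lemon", rng))
```

Calling `augment` inside a data-loading loop would produce a different variant each time a caption is sampled, which is one simple way to read "on the fly during training"; captions without a matching entity pass through unchanged.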
