Paper Title
Object-Centric Diagnosis of Visual Reasoning
Paper Authors
Paper Abstract
When answering questions about an image, a model not only needs to know what -- understanding the fine-grained contents (e.g., objects, relationships) in the image -- but also to tell why -- reasoning over grounded visual cues to derive the answer to the question. Over the last few years, we have seen significant progress on visual question answering. Though the growth in accuracy is impressive, it remains unclear whether these models are performing grounded visual reasoning or merely exploiting spurious correlations in the training data. Recently, a number of works have attempted to answer this question from perspectives such as grounding and robustness. However, most of them either focus on the language side or coarsely study pixel-level attention maps. In this paper, by leveraging the step-wise object grounding annotations provided in the GQA dataset, we first present a systematic object-centric diagnosis of visual reasoning on grounding and robustness, particularly on the vision side. Based on extensive comparisons across different models, we find that even models with high accuracy are not good at grounding objects precisely, nor are they robust to visual content perturbations. In contrast, symbolic and modular models achieve relatively better grounding and robustness, though at the cost of accuracy. To reconcile these different aspects, we further develop a diagnostic model, namely the Graph Reasoning Machine. Our model replaces the purely symbolic visual representation with a probabilistic scene graph and then applies teacher-forcing training to the visual reasoning module. The designed model improves performance on all three metrics over the vanilla neural-symbolic model while inheriting its transparency. Further ablation studies suggest that this improvement is mainly due to more accurate image understanding and proper intermediate reasoning supervision.
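The abstract names two ingredients: a probabilistic scene graph (each detected object keeps a soft class distribution rather than a hard symbol) and teacher-forcing of the step-wise reasoning module with the per-step object groundings annotated in GQA. The sketch below is a minimal, hypothetical illustration of how these two ideas could be combined in training code; it is not the authors' implementation, and all module and variable names (StepwiseReasoner, teacher_forced_loss, etc.) are assumptions made for the example.

```python
# Hypothetical sketch: probabilistic scene-graph inputs + teacher-forced
# step-wise grounding supervision. Not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StepwiseReasoner(nn.Module):
    """Scores each scene-graph object's relevance at one reasoning step."""

    def __init__(self, num_classes: int, obj_dim: int, instr_dim: int, hidden: int = 128):
        super().__init__()
        self.cls_proj = nn.Linear(num_classes, hidden)   # soft class distribution, not a hard symbol
        self.obj_proj = nn.Linear(obj_dim, hidden)       # object appearance features
        self.instr_proj = nn.Linear(instr_dim, hidden)   # encoded program step / instruction
        self.prev_proj = nn.Linear(1, hidden)            # grounding carried over from the previous step
        self.score = nn.Linear(hidden, 1)

    def forward(self, obj_probs, obj_feats, instr, prev_attn):
        # obj_probs: (N, C), obj_feats: (N, D), instr: (I,), prev_attn: (N,)
        h = self.cls_proj(obj_probs) + self.obj_proj(obj_feats)
        h = h + self.instr_proj(instr) + self.prev_proj(prev_attn.unsqueeze(-1))
        return self.score(torch.relu(h)).squeeze(-1)     # (N,) relevance logits per object


def teacher_forced_loss(reasoner, obj_probs, obj_feats, instrs, gt_groundings):
    """instrs: (T, I) program steps; gt_groundings: list of T binary masks over the N objects."""
    loss, prev = 0.0, torch.zeros(obj_feats.size(0))
    for t, instr in enumerate(instrs):
        logits = reasoner(obj_probs, obj_feats, instr, prev)
        target = gt_groundings[t].float()
        loss = loss + F.binary_cross_entropy_with_logits(logits, target)
        prev = target  # teacher forcing: feed the annotated grounding, not the model's prediction
    return loss / len(instrs)
```

Under this reading, teacher forcing supplies intermediate supervision at every reasoning step, which is consistent with the ablation finding that gains come from more accurate image understanding and proper intermediate reasoning supervision.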