Paper Title
IQ-VQA: Intelligent Visual Question Answering
Paper Authors
Paper Abstract
Even though there has been tremendous progress in the field of Visual Question Answering, today's models still tend to be inconsistent and brittle. To this end, we propose a model-independent cyclic framework that increases the consistency and robustness of any VQA architecture. We train our models to answer the original question, generate an implication based on the answer, and then also learn to answer the generated implication correctly. As part of the cyclic framework, we propose a novel implication generator which can generate implied questions from any question-answer pair. As a baseline for future work on consistency, we provide a new human-annotated VQA-Implications dataset. The dataset consists of ~30k questions containing three types of implications - Logical Equivalence, Necessary Condition, and Mutual Exclusion - constructed from the VQA v2.0 validation dataset. We show that our framework improves the consistency of VQA models by ~15% on the rule-based dataset and ~7% on the VQA-Implications dataset, and improves robustness by ~2%, without degrading their performance. In addition, we quantitatively show improvement in attention maps, which highlights a better multi-modal understanding of vision and language.
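The cyclic training described in the abstract (answer the original question, generate an implication, then answer the implication consistently) can be illustrated as a single training step. The sketch below is a minimal illustration, not the paper's actual implementation: `vqa_model`, `implication_generator`, their interfaces, and the loss weight `lam` are all hypothetical assumptions made for the example.

```python
import torch.nn.functional as F

def cyclic_training_step(vqa_model, implication_generator, batch, lam=0.5):
    """One step of a cyclic consistency framework (illustrative sketch).

    Assumed interfaces (not the paper's actual API):
      vqa_model(image, question)              -> answer logits
      implication_generator(question, answer) -> (implied_question, implied_answer)
    """
    image, question, answer = batch["image"], batch["question"], batch["answer"]

    # 1) Answer the original question.
    logits = vqa_model(image, question)
    loss_vqa = F.cross_entropy(logits, answer)

    # 2) Generate an implied question (e.g. logical equivalence, necessary
    #    condition, or mutual exclusion) from the question-answer pair.
    pred_answer = logits.argmax(dim=-1)
    implied_q, implied_a = implication_generator(question, pred_answer)

    # 3) Learn to answer the generated implication correctly as well.
    implied_logits = vqa_model(image, implied_q)
    loss_consistency = F.cross_entropy(implied_logits, implied_a)

    # Combined objective: original VQA loss plus a weighted consistency term.
    # The weighting scheme here is an assumption for illustration.
    return loss_vqa + lam * loss_consistency
```

Because both losses flow through the same `vqa_model`, gradients from the implication term push the model toward answers that remain consistent under rephrasing, which is the intuition behind the reported consistency gains.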