Paper Title
On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering
Paper Authors
Paper Abstract
Visual Question Answering (VQA) methods have made incredible progress, but suffer from a failure to generalize. This is visible in the fact that they are vulnerable to learning coincidental correlations in the data rather than deeper relations between image content and ideas expressed in language. We present a dataset that takes a step towards addressing this problem in that it contains questions expressed in two languages, and an evaluation process that co-opts a well understood image-based metric to reflect the method's ability to reason. Measuring reasoning directly encourages generalization by penalizing answers that are coincidentally correct. The dataset reflects the scene-text version of the VQA problem, and the reasoning evaluation can be seen as a text-based version of a referring expression challenge. Experiments and analysis are provided that show the value of the dataset.
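The evaluation idea sketched in the abstract — accepting an answer only when the model also grounds it in the right image evidence — can be illustrated with a small sketch. The function below is a hypothetical illustration under assumptions, not the paper's official scorer: the IoU formulation, the 0.5 threshold, and the exact-match answer comparison are assumptions chosen for clarity.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_based_score(pred_answer, pred_box, gt_answer, gt_box, iou_threshold=0.5):
    """Count a prediction as correct only when the answer string matches AND the
    predicted evidence box overlaps the ground-truth evidence region, so answers
    that are merely coincidentally correct receive no credit.
    (Threshold and string matching are illustrative assumptions.)"""
    answer_ok = pred_answer.strip().lower() == gt_answer.strip().lower()
    evidence_ok = iou(pred_box, gt_box) >= iou_threshold
    return 1.0 if (answer_ok and evidence_ok) else 0.0


if __name__ == "__main__":
    # A coincidentally correct answer pointing at the wrong scene text scores 0.
    print(evidence_based_score("exit", (0, 0, 10, 10), "exit", (100, 100, 150, 130)))      # 0.0
    # The same answer grounded in the correct region scores 1.
    print(evidence_based_score("exit", (98, 102, 148, 128), "exit", (100, 100, 150, 130)))  # 1.0
```

The design point this sketch captures is the one the abstract makes: by requiring localized evidence alongside the answer, the metric penalizes models that exploit coincidental correlations and rewards those that actually reason over the image content.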