Paper Title
Grounding Answers for Visual Questions Asked by Visually Impaired People
Paper Authors
Paper Abstract
Visual question answering is the task of answering questions about images. We introduce the VizWiz-VQA-Grounding dataset, the first dataset that visually grounds answers to visual questions asked by people with visual impairments. We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different. We then evaluate SOTA VQA and VQA-Grounding models and demonstrate that current SOTA algorithms often fail to identify the correct visual evidence for where the answer is located. These models regularly struggle when the visual evidence occupies a small fraction of the image, when the image is of higher quality, and when the visual question requires text-recognition skills. The dataset, evaluation server, and leaderboard can all be found at the following link: https://vizwiz.org/tasks-and-datasets/answer-grounding-for-vqa/.
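The abstract does not spell out how grounding quality is scored; as a minimal sketch of how "identifying the correct visual evidence" is typically quantified, the snippet below computes intersection-over-union (IoU) between a predicted answer-grounding region and a ground-truth region. The assumption that groundings are binary segmentation masks and the function name are ours, not taken from the paper.

```python
import numpy as np

def grounding_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between a predicted and a ground-truth binary grounding mask.

    Both inputs are assumed to be H x W arrays for the same image, where
    nonzero entries mark pixels belonging to the answer-grounding region.
    (Hypothetical helper; the actual evaluation server may differ.)
    """
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Neither mask marks any pixels: treat as a vacuous perfect match.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)
```

A higher IoU means the model's predicted region overlaps more closely with the annotated visual evidence; the failure cases noted above (small evidence regions, higher-quality images, text-recognition questions) would show up as low IoU scores under a measure like this.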