Paper Title

Knowledge-Based Visual Question Answering in Videos

Authors

Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima

Abstract

We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual, and temporal coherence reasoning with knowledge-based questions, which require experience gained from watching the series to be answered. Second, we propose a video understanding model that combines the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.
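To make the idea of combining visual, textual, and knowledge branches concrete, here is a minimal sketch of late-fusion scoring for multiple-choice video QA. The overlap-based scoring function, the fusion weights, and every name below are illustrative assumptions for exposition only; the abstract does not specify the authors' actual architecture, and this is not it.

# Minimal sketch of late-fusion scoring for multiple-choice video QA.
# The scoring heuristic and fusion weights are illustrative assumptions,
# not the KnowIT VQA authors' model.
from typing import List

def score_branch(question: str, answer: str, context: str) -> float:
    """Toy relevance score: word overlap between the question+answer
    and one context string (visual description, subtitles, or knowledge)."""
    qa_words = set((question + " " + answer).lower().split())
    ctx_words = set(context.lower().split())
    return len(qa_words & ctx_words) / max(len(qa_words), 1)

def answer_question(question: str,
                    candidates: List[str],
                    visual_context: str,
                    subtitles: str,
                    knowledge: str,
                    weights=(0.3, 0.3, 0.4)) -> str:
    """Fuse per-branch scores (visual, textual, knowledge) with a
    weighted sum and return the highest-scoring candidate answer."""
    w_vis, w_txt, w_kno = weights
    scores = []
    for cand in candidates:
        s = (w_vis * score_branch(question, cand, visual_context)
             + w_txt * score_branch(question, cand, subtitles)
             + w_kno * score_branch(question, cand, knowledge))
        scores.append(s)
    return candidates[scores.index(max(scores))]

if __name__ == "__main__":
    q = "Why does Sheldon knock three times?"
    cands = ["It is his compulsive habit", "He forgot his keys"]
    print(answer_question(q, cands,
                          visual_context="Sheldon knocks on the apartment door",
                          subtitles="Penny! Penny! Penny!",
                          knowledge="Sheldon has a compulsive habit of knocking three times"))

In this toy setup the knowledge branch carries the largest weight, mirroring the abstract's finding that incorporating show-specific knowledge is what drives the improvement over purely visual and textual cues.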
