Paper Title
Learning to Discretely Compose Reasoning Module Networks for Video Captioning
Paper Authors
Paper Abstract
Generating natural language descriptions for videos, i.e., video captioning, essentially requires step-by-step reasoning along the generation process. For example, to generate the sentence "a man is shooting a basketball", we need to first locate and describe the subject "man", next reason out that the man is "shooting", and then describe the object of the shooting, "basketball". However, existing visual reasoning methods designed for visual question answering are not suitable for video captioning, because captioning requires more complex visual reasoning on videos over both space and time, as well as dynamic module composition along the generation process. In this paper, we propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN), to equip the existing encoder-decoder framework with the above reasoning capacity. Specifically, our RMN employs 1) three sophisticated spatio-temporal reasoning modules, and 2) a dynamic and discrete module selector trained by a linguistic loss with a Gumbel approximation. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate that the proposed RMN outperforms state-of-the-art methods while providing an explicit and explainable generation process. Our code is available at https://github.com/tgc1997/RMN.
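The "dynamic and discrete module selector trained ... with a Gumbel approximation" mentioned in the abstract is the kind of mechanism commonly realized with the straight-through Gumbel-Softmax estimator, which samples a hard one-hot module choice in the forward pass while keeping gradients through a soft relaxation. Below is a minimal sketch under that assumption, written in PyTorch; the class name `GumbelModuleSelector`, the dimensions, and the three stand-in module outputs are all illustrative and not taken from the actual RMN codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelModuleSelector(nn.Module):
    """Sketch of a discrete module selector using the straight-through
    Gumbel-Softmax estimator. Names and sizes are hypothetical."""

    def __init__(self, hidden_dim: int, num_modules: int = 3, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_modules)  # logits over modules
        self.tau = tau  # Gumbel-Softmax temperature

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: decoder hidden state at the current generation step, (batch, hidden_dim)
        logits = self.scorer(h)
        # hard=True yields one-hot samples in the forward pass while
        # backpropagating through the differentiable soft relaxation.
        return F.gumbel_softmax(logits, tau=self.tau, hard=True)


# Usage sketch: pick one of three reasoning modules per decoding step.
if __name__ == "__main__":
    selector = GumbelModuleSelector(hidden_dim=512, num_modules=3)
    h = torch.randn(4, 512)  # placeholder decoder states
    weights = selector(h)    # one-hot selection weights, shape (4, 3)
    # Placeholder outputs standing in for the three reasoning modules:
    outs = torch.stack([torch.randn(4, 512) for _ in range(3)], dim=1)
    # One-hot weights pick exactly one module output per sample:
    fused = (weights.unsqueeze(-1) * outs).sum(dim=1)
    print(fused.shape)  # torch.Size([4, 512])
```

Because the one-hot choice is made per decoding step, the selector composes a different reasoning module for each generated word, which is what gives the generation process its explicit, explainable structure; the authoritative implementation is in the repository linked above.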