Paper Title

Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives

Authors

Shaoning Xiao, Long Chen, Kaifeng Gao, Zhao Wang, Yi Yang, Zhimeng Zhang, Jun Xiao

Abstract

Reasoning about causal and temporal event relations in videos is a new frontier for Video Question Answering (VideoQA). The major stumbling block to achieving this goal is the semantic gap between language and video, since the two modalities sit at different levels of abstraction. Existing efforts mainly focus on designing sophisticated architectures while relying on frame- or object-level visual representations. In this paper, we reconsider the multi-modal alignment problem in VideoQA from the feature and sample perspectives to achieve better performance. From the feature perspective, we break the video down into trajectories and are the first to leverage trajectory features in VideoQA to enhance the alignment between the two modalities. Moreover, we adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features. In addition, we find that VideoQA models depend heavily on language priors and often neglect visual-language interactions. Thus, from the sample perspective, we design two effective yet portable training augmentation strategies to strengthen the cross-modal correspondence ability of our model. Extensive experiments show that our method outperforms all state-of-the-art models on the challenging NExT-QA benchmark, demonstrating the effectiveness of the proposed approach.
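
The abstract only describes the hierarchical alignment idea at a high level. As a rough illustration of that idea (question features attending over trajectory-level and frame-level visual features and then being fused), a minimal PyTorch sketch is given below. The class name, dimensions, attention layout, and fusion scheme are assumptions made for illustration; this is not the paper's actual architecture.

```python
# Hypothetical sketch of coarse-to-fine cross-modal alignment for VideoQA.
# NOT the paper's implementation; all names and design choices are assumed.
import torch
import torch.nn as nn


class HierarchicalAligner(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Question tokens first attend to trajectory-level visual features ...
        self.traj_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # ... and also to frame-level features, giving two aligned views.
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fuse the two views back to the model dimension.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, question, traj_feats, frame_feats):
        # question:    (B, Lq, dim) token-level language features
        # traj_feats:  (B, Nt, dim) object-trajectory features
        # frame_feats: (B, Nf, dim) frame-level appearance features
        q_traj, _ = self.traj_attn(question, traj_feats, traj_feats)
        q_frame, _ = self.frame_attn(question, frame_feats, frame_feats)
        return self.fuse(torch.cat([q_traj, q_frame], dim=-1))


if __name__ == "__main__":
    model = HierarchicalAligner()
    q = torch.randn(2, 12, 256)      # 12 question tokens
    traj = torch.randn(2, 8, 256)    # 8 object trajectories
    frame = torch.randn(2, 16, 256)  # 16 sampled frames
    print(model(q, traj, frame).shape)  # torch.Size([2, 12, 256])
```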
