Paper Title

Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language

Authors

Hassan Akbari, Hamid Palangi, Jianwei Yang, Sudha Rao, Asli Celikyilmaz, Roland Fernandez, Paul Smolensky, Jianfeng Gao, Shih-Fu Chang

Abstract

Neuro-symbolic representations have proved effective in learning structured information in vision and language. In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method for learning relations between videos and their paired text descriptions. We refer to these relations as relative roles and leverage them to make each token role-aware using attention. This results in a more structured and interpretable architecture that incorporates modality-specific inductive biases for the captioning task. Intuitively, the model is able to learn spatial, temporal, and cross-modal relations in a given pair of video and text. The disentanglement achieved by our proposal gives the model more capacity to capture multi-modal structure, which results in higher-quality captions for videos. Our experiments on two established video captioning datasets verify the effectiveness of the proposed approach based on automatic metrics. We further conduct a human evaluation to measure the grounding and relevance of the generated captions and observe consistent improvements for the proposed model. The code and trained models can be found at https://github.com/hassanhub/R3Transformer
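
The central idea in the abstract is a learned dictionary of relation ("role") vectors that makes each token role-aware through attention before cross-modal fusion. Below is a minimal, hedged sketch of that idea in PyTorch; it is not the authors' R3Transformer implementation, and the names and parameters (RoleDictionary, RoleAwareCrossAttention, num_roles, d_model) are hypothetical, chosen only to illustrate one way such a mechanism could be wired up.

```python
# Illustrative sketch only: a learned "role" dictionary plus role-aware
# cross-modal attention. Not the authors' code; all names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoleDictionary(nn.Module):
    """Learned dictionary of role vectors; tokens are softly assigned to roles."""

    def __init__(self, num_roles: int, d_model: int):
        super().__init__()
        self.roles = nn.Parameter(torch.randn(num_roles, d_model) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, d_model)
        # Soft-assign each token to dictionary entries (roles) via scaled dot product.
        scores = tokens @ self.roles.t() / tokens.size(-1) ** 0.5   # (B, S, R)
        weights = F.softmax(scores, dim=-1)
        return weights @ self.roles                                  # (B, S, d_model)


class RoleAwareCrossAttention(nn.Module):
    """Cross-modal attention in which text queries are made role-aware first."""

    def __init__(self, d_model: int, num_roles: int, num_heads: int = 8):
        super().__init__()
        self.role_dict = RoleDictionary(num_roles, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # Fuse each text token with its inferred role, then attend over video features.
        roles = self.role_dict(text_tokens)
        queries = self.fuse(torch.cat([text_tokens, roles], dim=-1))
        out, _ = self.attn(queries, video_tokens, video_tokens)
        return out


if __name__ == "__main__":
    text = torch.randn(2, 12, 256)    # e.g. caption token embeddings
    video = torch.randn(2, 50, 256)   # e.g. frame/clip features
    layer = RoleAwareCrossAttention(d_model=256, num_roles=16)
    print(layer(text, video).shape)   # torch.Size([2, 12, 256])
```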
