Paper Title
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
Paper Authors
Paper Abstract
Multi-channel video-language retrieval requires models to understand information from different channels (e.g., video$+$question, video$+$speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models such as CLIP have been shown to be highly effective at aligning entities in images/videos and text, and contrastive text models such as SimCSE have recently been studied extensively for their strong ability to produce discriminative sentence embeddings. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, even outperforming the state of the art on the iVQA and How2QA datasets without additional training on millions of video-text data. Further analysis shows that this is because representing videos as text tokens captures the key visual information, and text tokens are naturally aligned with text models that are strong retrievers after contrastive pretraining. All of this empirical analysis establishes a solid foundation for future research on affordable and upgradable multimodal intelligence.
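To make the winning design point concrete, below is a minimal sketch, assuming off-the-shelf checkpoints, of the "discrete text tokens $+$ pretrained contrastive text model" combination described in the abstract. The model names (openai/clip-vit-base-patch32, princeton-nlp/sup-simcse-roberta-base), the toy concept vocabulary, and the helper names (video_to_text_tokens, retrieve) are illustrative assumptions rather than the paper's implementation; the sketch only shows the idea of mapping sampled video frames to text tokens with CLIP and then ranking textual candidates with a contrastive sentence encoder.

```python
import torch
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

# Illustrative (hypothetical) vocabulary of visual concepts; the real system
# would use a much larger tag/caption vocabulary.
CONCEPTS = ["a person cooking", "a dog running", "a car on a road", "people talking"]

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# SimCSE-style contrastive text encoder used as the retriever.
text_tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-roberta-base")
text_enc = AutoModel.from_pretrained("princeton-nlp/sup-simcse-roberta-base")


def video_to_text_tokens(frames, top_k=2):
    """Map sampled PIL frames to discrete text tokens via CLIP image-text scores."""
    inputs = clip_proc(text=CONCEPTS, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # logits_per_image: [num_frames, num_concepts]; average over frames.
    scores = out.logits_per_image.softmax(dim=-1).mean(dim=0)
    top = scores.topk(top_k).indices.tolist()
    return " ".join(CONCEPTS[i] for i in top)


def embed(texts):
    """Sentence embeddings from the contrastive text model ([CLS] pooling)."""
    batch = text_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = text_enc(**batch)
    return out.last_hidden_state[:, 0]


def retrieve(frames, question, candidate_answers):
    """Fuse video tokens with the question channel and rank candidates by cosine similarity."""
    query = video_to_text_tokens(frames) + " " + question
    q_emb = embed([query])                      # [1, d]
    c_emb = embed(candidate_answers)            # [N, d]
    sims = torch.nn.functional.cosine_similarity(q_emb, c_emb)
    return candidate_answers[sims.argmax().item()]
```

Usage would be `retrieve(frames, "What is the person doing?", ["cooking", "driving", "sleeping"])`, where `frames` is a list of PIL images sampled from the video. Because both channels are plain text at fusion time, the contrastive text model can be reused without any training on video-text data, which is the property the abstract highlights.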