Title
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Authors
Abstract
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between a coarse-grained feature and each fine-grained feature, and is able to filter out unnecessary fine-grained features, guided by the coarse-grained feature, during similarity calculation, thus improving the accuracy of retrieval. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. However, another challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices into instance-level similarity. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely-used video-text retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1). It outperforms the previous state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7%, and +3.8% relative improvement on these benchmarks, demonstrating the superiority of multi-grained contrast and AOSM.
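The abstract describes two ideas: computing similarities at multiple granularities (video-sentence, video-word, frame-sentence, frame-word) and aggregating them with softmax attention over the similarity scores rather than naive mean pooling. The toy NumPy sketch below illustrates that idea for a single video-sentence pair. All dimensions, the temperature value, and the final averaging of the four terms are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical toy features (random, for illustration only).
rng = np.random.default_rng(0)
d, n_frames, n_words = 8, 4, 5
frames = rng.standard_normal((n_frames, d))  # fine-grained video features
words = rng.standard_normal((n_words, d))    # fine-grained text features
video = frames.mean(axis=0)                  # coarse-grained video feature
sent = words.mean(axis=0)                    # coarse-grained text feature

# Multi-grained similarities:
s_vs = video @ sent       # coarse-coarse: scalar
s_vw = words @ video      # cross-grained: each word vs. the video, (n_words,)
s_fs = frames @ sent      # cross-grained: each frame vs. the sentence, (n_frames,)
s_fw = frames @ words.T   # fine-fine similarity matrix, (n_frames, n_words)

tau = 0.1  # assumed temperature for the attention weights

# Attention over similarity scores: softmax weights emphasise the most
# relevant frames/words, down-weighting unnecessary ones, instead of
# treating all frames and words equally as mean pooling would.
agg_vw = (softmax(s_vw / tau) * s_vw).sum()
agg_fs = (softmax(s_fs / tau) * s_fs).sum()
# Fine-grained matrix: attend over words within each frame, then over frames.
per_frame = (softmax(s_fw / tau, axis=1) * s_fw).sum(axis=1)
agg_fw = (softmax(per_frame / tau) * per_frame).sum()

# One simple choice for fusing the four terms into an instance-level
# similarity; the paper's exact aggregation may differ.
instance_sim = (s_vs + agg_vw + agg_fs + agg_fw) / 4
print(float(instance_sim))
```

Because the softmax weights increase with the similarity values, each attention-aggregated score is at least the plain mean of its scores, which is what lets salient frames and words dominate the instance-level similarity.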