Paper Title
Stacked Convolutional Deep Encoding Network for Video-Text Retrieval
Paper Authors
Paper Abstract
Existing dominant approaches for the cross-modal video-text retrieval task learn a joint embedding space to measure cross-modal similarity. However, these methods rarely explore the long-range dependencies among video frames or textual words, leading to insufficient textual and visual details. In this paper, we propose a stacked convolutional deep encoding network for the video-text retrieval task, which simultaneously encodes long-range and short-range dependencies in videos and texts. Specifically, the multi-scale dilated convolution (MSDC) block in our approach encodes short-range temporal cues between video frames or text words by adopting convolutional layers with different kernel sizes and dilation rates. A stacked structure is designed to expand the receptive field by repeatedly applying the MSDC block, thereby further capturing the long-range relations between these cues. Moreover, to obtain more robust textual representations, we fully exploit the powerful Transformer language model in two stages: a pretraining phase and a fine-tuning phase. Extensive experiments on two benchmark datasets (MSR-VTT and MSVD) show that our proposed method outperforms other state-of-the-art approaches.
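To make the encoder design concrete, below is a minimal sketch, assuming PyTorch, of how a multi-scale dilated convolution block and the stacked structure described in the abstract could look. The class names (MSDCBlock, StackedEncoder), kernel sizes, dilation rates, block count, and feature dimension are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch only: hyperparameters and class names are assumed,
# not taken from the paper.
import torch
import torch.nn as nn


class MSDCBlock(nn.Module):
    """Multi-scale dilated convolution block: parallel 1-D convolutions over the
    temporal axis with different kernel sizes and dilation rates, concatenated
    and projected back to the input dimension (short-range temporal cues)."""

    def __init__(self, dim, kernel_sizes=(3, 5), dilations=(1, 2)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            for d in dilations:
                pad = (k - 1) * d // 2  # keep the sequence length unchanged
                self.branches.append(
                    nn.Conv1d(dim, dim, kernel_size=k, dilation=d, padding=pad)
                )
        self.proj = nn.Conv1d(dim * len(self.branches), dim, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):                     # x: (batch, time, dim)
        x = x.transpose(1, 2)                 # -> (batch, dim, time) for Conv1d
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        out = self.act(self.proj(feats)) + x  # residual connection
        return out.transpose(1, 2)            # -> (batch, time, dim)


class StackedEncoder(nn.Module):
    """Stacks several MSDC blocks so the effective receptive field grows with
    depth, letting later blocks relate cues that are far apart in time."""

    def __init__(self, dim, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([MSDCBlock(dim) for _ in range(num_blocks)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


if __name__ == "__main__":
    frames = torch.randn(2, 30, 512)          # e.g., 30 frame features per video
    encoder = StackedEncoder(dim=512)
    print(encoder(frames).shape)              # torch.Size([2, 30, 512])
```

Each parallel branch preserves the sequence length, so branches with larger kernels or dilation rates see wider temporal contexts within one block, and stacking blocks compounds these receptive fields to cover long-range relations.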