Paper Title
Language-free Training for Zero-shot Video Grounding
Paper Authors
Paper Abstract
Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the corresponding time interval by understanding the text and video simultaneously. One of the most challenging issues is the extremely time-consuming and costly collection of annotations, which consist of video captions in natural language form and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network using only video data, without any annotations. Inspired by the recent language-free paradigm, i.e., training without language data, we train the network without forcing the generation of fake (pseudo) text queries in natural language form. Specifically, we propose to learn a video grounding model by selecting a temporal interval as a hypothetical correct answer and treating the visual feature selected by our method within that interval as a language feature, with the help of the well-aligned visual-language space of CLIP. Extensive experiments demonstrate the prominence of our language-free training framework, which outperforms the existing zero-shot video grounding method, and even several weakly-supervised approaches, by large margins on two standard datasets.
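To make the described training scheme concrete, the following is a minimal sketch of the language-free idea: sample a random temporal interval as a hypothetical ground-truth answer, pool the CLIP-like visual features inside that interval as a stand-in "language" feature (relying on CLIP's assumed visual-text alignment), and supervise a grounding head to regress that interval. All module names, dimensions, the pooling strategy, and the loss are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only; module names, dimensions, sampling, and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGroundingHead(nn.Module):
    """Hypothetical grounding head: fuses a (pseudo) query with frame features
    and regresses a normalized [start, end] interval."""
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Linear(dim * 2, dim)
        self.regress = nn.Linear(dim, 2)  # normalized start / end

    def forward(self, frame_feats, query_feat):
        # frame_feats: (B, T, D) CLIP-like frame embeddings
        # query_feat:  (B, D)    pseudo language feature
        q = query_feat.unsqueeze(1).expand_as(frame_feats)
        fused = F.relu(self.fuse(torch.cat([frame_feats, q], dim=-1)))
        pooled = fused.mean(dim=1)
        return self.regress(pooled).sigmoid()  # (B, 2), values in [0, 1]

def sample_pseudo_example(frame_feats):
    """Pick a random temporal interval as a hypothetical 'correct answer' and
    average the visual features inside it as a stand-in language feature."""
    B, T, _ = frame_feats.shape
    starts = torch.randint(0, T - 1, (B,))
    lengths = torch.randint(1, T // 2 + 1, (B,))
    ends = torch.clamp(starts + lengths, max=T - 1)
    pseudo_queries, targets = [], []
    for b in range(B):
        s, e = starts[b].item(), ends[b].item()
        pseudo_queries.append(frame_feats[b, s:e + 1].mean(dim=0))
        targets.append(torch.tensor([s / (T - 1), e / (T - 1)]))
    return torch.stack(pseudo_queries), torch.stack(targets)

# Toy training step; random tensors stand in for precomputed CLIP frame features.
model = SimpleGroundingHead(dim=512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
frame_feats = torch.randn(4, 64, 512)  # (batch, frames, feature dim)
pseudo_query, target_interval = sample_pseudo_example(frame_feats)
pred_interval = model(frame_feats, pseudo_query)
loss = F.l1_loss(pred_interval, target_interval)
loss.backward()
optimizer.step()
```

In practice, the pseudo query would come from a frozen CLIP visual encoder rather than random tensors, so that the pooled visual feature lives in the same space as real text embeddings at test time; the sketch above only illustrates the interval-sampling and pseudo-supervision loop.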