Paper Title

Learning Video Representations from Textual Web Supervision

Authors

Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross

Abstract

Videos on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We evaluate the model on several downstream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pre-training video representations. Specifically, it outperforms all existing methods for self-supervised and cross-modal video representation learning.
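The abstract describes training a model to pair each video with its associated text. The paper does not specify its objective here, but a common way to implement such cross-modal pairing is a symmetric contrastive (InfoNCE-style) loss over a joint embedding space. Below is a minimal PyTorch sketch under that assumption; the function name, the temperature value, and the loss form are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss pairing each video with its own text.

    video_emb, text_emb: (batch, dim) embeddings from video and text encoders.
    Hypothetical sketch; the paper's actual training objective may differ.
    """
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity between video i and text j, scaled by temperature.
    logits = video_emb @ text_emb.t() / temperature

    # The matching (video, text) pairs lie on the diagonal of the logits matrix.
    targets = torch.arange(video_emb.size(0), device=video_emb.device)

    # Symmetric cross-entropy: retrieve text from video and video from text.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2

# Example: a batch of 8 clip/caption embedding pairs of dimension 512.
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
loss = video_text_contrastive_loss(video_emb, text_emb)
```

Treating the other in-batch texts as negatives for each video (and vice versa) gives the model a pairing signal without any manual labels, which matches the paper's premise that freely available web text can supervise video representation learning.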
