Paper Title


VLG: General Video Recognition with Web Textual Knowledge

Authors

Jintao Lin, Zhaoyang Liu, Wenhai Wang, Wayne Wu, Limin Wang

Abstract


Video recognition in an open and dynamic world is quite challenging, as we need to handle different settings such as close-set, long-tail, few-shot, and open-set. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework. The core contribution of this paper is twofold. First, we build a comprehensive video recognition benchmark, Kinetics-GVR, including four sub-task datasets to cover the settings mentioned above. To facilitate research on GVR, we propose to utilize external textual knowledge from the Internet and provide multi-source text descriptions for all action classes. Second, inspired by the flexibility of language representation, we present a unified visual-linguistic framework (VLG) to solve GVR via an effective two-stage training paradigm. Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then a flexible bi-modal attention head is devised to collaborate on high-level semantic concepts under different settings. Extensive results show that our VLG obtains state-of-the-art performance under all four settings. The superior performance demonstrates the effectiveness and generalization ability of our proposed framework. We hope our work takes a step towards general video recognition and can serve as a baseline for future research. The code and models will be available at https://github.com/MCG-NJU/VLG.
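The abstract describes learning a shared feature space in which video representations can be matched against textual class descriptions. The following is a minimal sketch of that general idea (similarity-based classification in a shared embedding space), not the paper's actual VLG architecture or bi-modal attention head; all names and the toy embeddings are illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project vectors onto the unit sphere so dot products equal cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def classify_by_text(video_emb, class_text_embs):
    """Score one video embedding against per-class text embeddings that live
    in the same (shared) feature space; predict the most similar class."""
    v = l2_normalize(video_emb)        # shape (D,)
    t = l2_normalize(class_text_embs)  # shape (C, D), one row per class description
    sims = t @ v                       # cosine similarity to each class
    return int(np.argmax(sims)), sims

# Toy shared space: the video embedding is a noisy copy of class 1's text embedding,
# standing in for a video/text encoder pair trained to align the two modalities.
rng = np.random.default_rng(0)
texts = rng.normal(size=(3, 8))               # 3 classes, 8-dim embeddings
video = texts[1] + 0.05 * rng.normal(size=8)  # video lies near class-1 text
pred, sims = classify_by_text(video, texts)   # pred == 1
```

Because classes are represented by text embeddings rather than a fixed softmax layer, new classes can be added at inference time by embedding their descriptions, which is what makes this style of framework flexible across closed-set, long-tail, few-shot, and open-set settings.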
