Title

Frozen CLIP Models are Efficient Video Learners

Authors

Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

Abstract

Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos. This enables the video network to benefit from the pretrained image model. However, this requires substantial computation and memory resources for finetuning on videos and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL) -- an efficient framework for directly training high-quality video recognition models with frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn a query token to dynamically collect frame-level spatial features from the CLIP image encoder. Furthermore, we adopt a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps. We show that despite being efficient to train with a frozen backbone, our models learn high quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.
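The core idea in the abstract, a learned query token that cross-attends over frozen per-frame CLIP features to pool a video representation, can be illustrated with a minimal sketch. This is not the authors' implementation (which also includes multiple decoder layers and a local temporal module); all shapes, weights, and names below are illustrative toy choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pool(query, frame_feats, Wq, Wk, Wv):
    """One cross-attention step of a lightweight decoder: a learned
    query token gathers spatiotemporal information from frozen features.

    query:       (D,)       learned query token
    frame_feats: (T, N, D)  frozen CLIP features (T frames, N tokens each)
    Wq, Wk, Wv:  (D, D)     trainable projection matrices
    """
    T, N, D = frame_feats.shape
    kv = frame_feats.reshape(T * N, D)        # flatten all frames into one sequence
    q = query @ Wq                            # (D,)
    k = kv @ Wk                               # (T*N, D)
    v = kv @ Wv                               # (T*N, D)
    attn = softmax(q @ k.T / np.sqrt(D))      # (T*N,) attention over all tokens
    return attn @ v                           # (D,) pooled video representation

# Toy usage: the backbone features stay fixed; only the query and
# projections would be trained in a setup like this.
rng = np.random.default_rng(0)
T, N, D = 8, 50, 64                               # toy sizes, not the paper's
frame_feats = rng.standard_normal((T, N, D))      # stand-in for frozen CLIP output
query = rng.standard_normal(D)                    # learned query token
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

video_repr = cross_attention_pool(query, frame_feats, Wq, Wk, Wv)
```

Because the backbone is never updated, gradients only flow through the small decoder parameters, which is what makes training efficient relative to end-to-end finetuning.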
