Paper Title
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
Paper Authors
Paper Abstract
Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. However, existing methods focus on learning representations from scratch through reconstructing low-level features like raw pixel RGB values. In this paper, we propose masked video distillation (MVD), a simple yet effective two-stage masked feature modeling framework for video representation learning: firstly we pretrain an image (or video) model by recovering low-level features of masked patches, then we use the resulting features as targets for masked feature modeling. For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks. Visualization analysis also indicates different teachers produce different learned patterns for students. Motivated by this observation, we design a spatial-temporal co-teaching method for MVD. Specifically, we distill student models from both video teachers and image teachers by masked feature modeling. Extensive experimental results demonstrate that video transformers pretrained with spatial-temporal co-teaching outperform models distilled with a single teacher on a multitude of video datasets. Our MVD with vanilla ViT achieves state-of-the-art performance compared with previous supervised or self-supervised methods on several challenging video downstream tasks. For example, with the ViT-Large model, our MVD achieves 86.4% and 76.7% Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming VideoMAE by 1.2% and 2.4% respectively. When a larger ViT-Huge model is adopted, MVD achieves the state-of-the-art performance with 77.3% Top-1 accuracy on Something-Something-v2 and 41.1 mAP on AVA v2.2. Code will be available at \url{https://github.com/ruiwang2021/mvd}.
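To make the two-stage recipe concrete, below is a minimal sketch of the stage-2 distillation objective with spatial-temporal co-teaching: teachers pretrained in stage 1 by reconstructing low-level features of masked patches are frozen and provide target features, while the student regresses them at masked positions through two separate heads. The module names (`student.encoder`, `student.image_head`, `student.video_head`) and the plain MSE loss are illustrative assumptions, not the authors' implementation; the actual masking strategy, loss, and head design are defined in the paper and the repository linked above.

```python
# Minimal sketch of MVD's stage-2 spatial-temporal co-teaching objective.
# Module names and the MSE regression loss are assumptions for illustration.
import torch
import torch.nn.functional as F


def co_teaching_loss(student, image_teacher, video_teacher,
                     clip, mask, w_img=1.0, w_vid=1.0):
    """Distill a student video transformer from frozen image and video
    teachers by regressing their features at masked token positions.

    clip: (B, C, T, H, W) input video clip
    mask: (B, N) boolean tensor over patch tokens, True = masked
    """
    with torch.no_grad():
        # Stage-1 pretrained teachers see the full clip and give target features.
        img_targets = image_teacher(clip)   # (B, N, D_img), spatial features
        vid_targets = video_teacher(clip)   # (B, N, D_vid), spatio-temporal features

    # The student encodes the masked clip; two lightweight heads predict the
    # image-teacher and video-teacher features for every token position.
    latent = student.encoder(clip, mask)
    pred_img = student.image_head(latent)   # (B, N, D_img)
    pred_vid = student.video_head(latent)   # (B, N, D_vid)

    # Feature-regression losses are computed only on the masked positions.
    loss_img = F.mse_loss(pred_img[mask], img_targets[mask])
    loss_vid = F.mse_loss(pred_vid[mask], vid_targets[mask])
    return w_img * loss_img + w_vid * loss_vid
```

Weighting the two terms (`w_img`, `w_vid`) is one way to balance the stronger spatial representations transferred by the image teacher against the temporal dynamics captured by the video teacher, which is the trade-off the co-teaching design targets.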