Paper Title

Motion Sensitive Contrastive Learning for Self-supervised Video Representation

Authors

Jingcheng Ni, Nan Zhou, Jie Qin, Qian Wu, Junqi Liu, Boxun Li, Di Huang

Abstract

Contrastive learning has shown great potential in video representation learning. However, existing approaches fail to sufficiently exploit short-term motion dynamics, which are crucial to various downstream video understanding tasks. In this paper, we propose Motion Sensitive Contrastive Learning (MSCL), which injects the motion information captured by optical flow into RGB frames to strengthen feature learning. To achieve this, in addition to clip-level global contrastive learning, we develop Local Motion Contrastive Learning (LMCL) with frame-level contrastive objectives across the two modalities. Moreover, we introduce Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples. Extensive experiments on standard benchmarks validate the effectiveness of the proposed method. With the commonly used 3D ResNet-18 as the backbone, we achieve top-1 accuracies of 91.5% on UCF101 and 50.3% on Something-Something v2 for video classification, and a 65.6% top-1 recall on UCF101 for video retrieval, notably improving the state of the art.
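The abstract only names the frame-level cross-modal objective; as a rough illustration, the sketch below shows a minimal InfoNCE-style loss between per-frame RGB and optical-flow embeddings, where temporally aligned frames act as positives and all other frames in the batch act as negatives. The function name, tensor shapes, and temperature are assumptions for illustration and do not reproduce the paper's exact LMCL, FRA, or MDS formulations.

```python
import torch
import torch.nn.functional as F

def frame_level_cross_modal_nce(z_rgb, z_flow, temperature=0.1):
    """Minimal InfoNCE sketch over per-frame RGB/flow embeddings.

    z_rgb, z_flow: (B, T, D) frame-level features from the two modalities.
    The temporally aligned flow frame is the positive for each RGB frame;
    every other frame in the batch serves as a negative. This is an
    illustrative approximation, not the authors' exact LMCL objective.
    """
    B, T, D = z_rgb.shape
    q = F.normalize(z_rgb.reshape(B * T, D), dim=-1)   # queries from RGB frames
    k = F.normalize(z_flow.reshape(B * T, D), dim=-1)  # keys from optical-flow frames
    logits = q @ k.t() / temperature                   # (B*T, B*T) cosine-similarity logits
    targets = torch.arange(B * T, device=q.device)     # diagonal = aligned (positive) pairs
    return F.cross_entropy(logits, targets)

# Example usage with random features (batch of 4 clips, 8 frames, 128-dim embeddings)
loss = frame_level_cross_modal_nce(torch.randn(4, 8, 128), torch.randn(4, 8, 128))
```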
