Paper Title

Spatiotemporal Contrastive Video Representation Learning

Paper Authors

Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, Yin Cui

Paper Abstract

We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. We also propose a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time. On Kinetics-600, a linear classifier trained on the representations learned by CVRL achieves 70.4% top-1 accuracy with a 3D-ResNet-50 (R3D-50) backbone, outperforming ImageNet supervised pre-training by 15.7% and SimCLR unsupervised pre-training by 18.8% using the same inflated R3D-50. The performance of CVRL can be further improved to 72.9% with a larger R3D-152 (2x filters) backbone, significantly closing the gap between unsupervised and supervised video representation learning. Our code and models will be available at https://github.com/tensorflow/models/tree/master/official/.
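
The sketch below is not the authors' released code; it is a minimal illustration, in TensorFlow, of two ideas described in the abstract: temporally consistent spatial augmentation (one set of random crop/flip parameters sampled per clip and applied to every frame) and a simplified InfoNCE-style contrastive loss over clip embeddings. Function names, the 224x224 crop size, and the temperature value are illustrative assumptions, and the loss omits details of the full NT-Xent formulation.

```python
# Illustrative sketch only; not the CVRL reference implementation.
import tensorflow as tf


def temporally_consistent_augment(clip, crop_size=224):
    """Apply the SAME random crop and flip to every frame of a [T, H, W, C] clip.

    Assumes eager execution, a statically known shape, and frames larger than
    `crop_size` in both spatial dimensions.
    """
    t, h, w, c = clip.shape
    # Sample crop offsets once per clip (not per frame) to keep frames consistent.
    offset_h = tf.random.uniform([], 0, h - crop_size + 1, dtype=tf.int32)
    offset_w = tf.random.uniform([], 0, w - crop_size + 1, dtype=tf.int32)
    clip = tf.image.crop_to_bounding_box(clip, offset_h, offset_w, crop_size, crop_size)
    # One flip decision shared by all frames in the clip.
    if tf.random.uniform([]) < 0.5:
        clip = tf.image.flip_left_right(clip)
    return clip


def contrastive_loss(z1, z2, temperature=0.1):
    """Simplified InfoNCE loss for two batches of clip embeddings [N, D].

    z1[i] and z2[i] are embeddings of two augmented clips from the same video
    (positives); all other pairs in the batch act as negatives.
    """
    z1 = tf.math.l2_normalize(z1, axis=1)
    z2 = tf.math.l2_normalize(z2, axis=1)
    logits = tf.matmul(z1, z2, transpose_b=True) / temperature  # [N, N] similarities
    labels = tf.range(tf.shape(z1)[0])  # positives lie on the diagonal
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True))
```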
