Paper Title
Transfer-learning for video classification: Video Swin Transformer on multiple domains
Paper Authors
Paper Abstract
The computer vision community has seen a shift from convolution-based to pure-transformer architectures for both image and video tasks. Training a transformer from scratch for these tasks usually requires a lot of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification which achieves state-of-the-art results in accuracy and efficiency on several datasets. In this paper, we aim to understand whether VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something, using a transfer-learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show 85\% top-1 accuracy on FCVID without retraining the whole model, which matches the state of the art for that dataset, and 21\% accuracy on Something-Something. The experiments also suggest that the performance of VST decreases, on average, as video duration increases, which appears to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are of the same type as the classes used to train the model. We observed this effect when we performed transfer-learning from Kinetics-400 to FCVID, where both datasets target mostly objects. On the other hand, if the classes are not of the same type, then the accuracy after the transfer-learning approach is expected to be poor. We observed this effect when we performed transfer-learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
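To make the transfer-learning setup concrete, the sketch below shows one common way to adapt a Kinetics-400-pretrained Video Swin Transformer to a new dataset without retraining the whole model: freeze the pretrained backbone and train only a freshly initialized classification head. This is a minimal illustration consistent with the abstract, not the authors' actual training code; it assumes torchvision's `swin3d_b` model, its `head` attribute, and placeholder hyperparameters (learning rate, clip shape, batch size).

```python
# Illustrative sketch (not the authors' code): head-only transfer learning
# with a Video Swin Transformer pretrained on Kinetics-400.
import torch
import torch.nn as nn
from torchvision.models.video import swin3d_b, Swin3D_B_Weights

NUM_TARGET_CLASSES = 239  # number of classes in the target dataset (FCVID has 239 categories)

# Load a Video Swin Transformer (Swin3D-B) with Kinetics-400 pretrained weights.
model = swin3d_b(weights=Swin3D_B_Weights.KINETICS400_V1)

# Freeze the pretrained backbone so it is not retrained.
for param in model.parameters():
    param.requires_grad = False

# Replace the 400-way Kinetics head with a new head for the target domain.
model.head = nn.Linear(model.head.in_features, NUM_TARGET_CLASSES)

# Only the new head's parameters are optimized.
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a dummy clip batch: (batch, channels, frames, height, width).
clips = torch.randn(1, 3, 16, 224, 224)
labels = torch.randint(0, NUM_TARGET_CLASSES, (1,))

logits = model(clips)          # (batch, NUM_TARGET_CLASSES)
loss = criterion(logits, labels)
loss.backward()                # gradients flow only into the new head
optimizer.step()
```

Because only the head requires gradients, the optimizer state and backward-pass memory are much smaller than when training the full model from scratch, which is the kind of saving the memory figure quoted in the abstract refers to.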