Paper Title
Transfer-learning for video classification: Video Swin Transformer on multiple domains
Paper Authors
Paper Abstract
The computer vision community has seen a shift from convolution-based to pure-transformer architectures for both image and video tasks. Training a transformer from scratch for these tasks usually requires a lot of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification which achieves state-of-the-art results in accuracy and efficiency on several datasets. In this paper, we aim to understand whether VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something, using a transfer-learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show 85\% top-1 accuracy on FCVID without retraining the whole model, which matches the state of the art for that dataset, and 21\% accuracy on Something-Something. The experiments also suggest that the performance of VST decreases, on average, as video duration increases, which appears to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are of the same type as the classes used to train the model. We observed this effect when we performed transfer-learning from Kinetics-400 to FCVID, where both datasets target mostly objects. On the other hand, if the classes are not of the same type, then the accuracy after the transfer-learning approach is expected to be poor. We observed this effect when we performed transfer-learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
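To make the transfer-learning setup concrete, the sketch below shows one common way to adapt a Kinetics-400-pretrained Video Swin Transformer to a new dataset without retraining the whole model: freeze the pretrained backbone and train only a freshly initialized classification head. This is a minimal illustration consistent with the abstract, not the authors' actual training code; it assumes torchvision's `swin3d_b` model, its `head` attribute, and placeholder hyperparameters (learning rate, clip shape, batch size).

```python
# Illustrative sketch (not the authors' code): head-only transfer learning
# with a Video Swin Transformer pretrained on Kinetics-400.
import torch
import torch.nn as nn
from torchvision.models.video import swin3d_b, Swin3D_B_Weights

NUM_TARGET_CLASSES = 239  # number of classes in the target dataset (FCVID has 239 categories)

# Load a Video Swin Transformer (Swin3D-B) with Kinetics-400 pretrained weights.
model = swin3d_b(weights=Swin3D_B_Weights.KINETICS400_V1)

# Freeze the pretrained backbone so it is not retrained.
for param in model.parameters():
    param.requires_grad = False

# Replace the 400-way Kinetics head with a new head for the target domain.
model.head = nn.Linear(model.head.in_features, NUM_TARGET_CLASSES)

# Only the new head's parameters are optimized.
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a dummy clip batch: (batch, channels, frames, height, width).
clips = torch.randn(1, 3, 16, 224, 224)
labels = torch.randint(0, NUM_TARGET_CLASSES, (1,))

logits = model(clips)          # (batch, NUM_TARGET_CLASSES)
loss = criterion(logits, labels)
loss.backward()                # gradients flow only into the new head
optimizer.step()
```

Because only the head requires gradients, the optimizer state and backward-pass memory are much smaller than when training the full model from scratch, which is the kind of saving the memory figure quoted in the abstract refers to.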