Paper Title
3D CNNs with Adaptive Temporal Feature Resolutions
Paper Authors
Paper Abstract
While state-of-the-art 3D Convolutional Neural Networks (CNNs) achieve very good results on action recognition datasets, they are computationally very expensive and require many GFLOPs. While the GFLOPs of a 3D CNN can be decreased by reducing the temporal feature resolution within the network, there is no setting that is optimal for all input clips. In this work, we therefore introduce a differentiable Similarity Guided Sampling (SGS) module, which can be plugged into any existing 3D CNN architecture. SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together. As a result, the temporal feature resolution is no longer static but varies for each input video clip. By integrating SGS as an additional layer within current 3D CNNs, we can convert them into much more efficient 3D CNNs with adaptive temporal feature resolutions (ATFR). Our evaluations show that the proposed module improves the state of the art by reducing the computational cost (GFLOPs) by half while preserving or even improving accuracy. We evaluate our module by adding it to multiple state-of-the-art 3D CNNs and testing on various datasets such as Kinetics-600, Kinetics-400, mini-Kinetics, Something-Something V2, UCF101, and HMDB51.
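To make the grouping idea concrete, below is a minimal sketch of similarity-guided temporal grouping. This is not the paper's SGS implementation: the spatial-average embedding, the cosine similarity between adjacent frames, and the hard threshold `tau` are all simplifying assumptions (the actual module is differentiable and learns the similarity). It only illustrates how merging similar temporal features yields a per-clip adaptive temporal resolution.

```python
# A minimal, illustrative sketch (assumed design, not the authors' SGS module):
# adjacent temporal feature maps whose pooled embeddings are highly similar
# are merged, so the output temporal length varies per input clip.
import torch
import torch.nn.functional as F


def group_similar_frames(x: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Merge temporally adjacent feature maps whose spatially pooled
    embeddings have cosine similarity >= tau.

    x: features of one clip, shape (C, T, H, W).
    Returns a tensor of shape (C, T', H, W) with T' <= T.
    """
    c, t, h, w = x.shape
    # Per-frame embedding: global average pool over the spatial dimensions.
    emb = x.mean(dim=(2, 3)).t()           # (T, C)
    emb = F.normalize(emb, dim=1)          # unit-norm rows
    # Cosine similarity between consecutive frames.
    sim = (emb[:-1] * emb[1:]).sum(dim=1)  # (T-1,)

    groups, current = [], [0]
    for i in range(t - 1):
        if sim[i] >= tau:                  # similar -> extend current group
            current.append(i + 1)
        else:                              # dissimilar -> start a new group
            groups.append(current)
            current = [i + 1]
    groups.append(current)

    # Represent each group by the mean of its member frames.
    pooled = torch.stack(
        [x[:, idx, :, :].mean(dim=1) for idx in groups], dim=1
    )                                      # (C, T', H, W)
    return pooled


# Example: a clip with 16 temporal feature maps is reduced adaptively.
feats = torch.randn(256, 16, 7, 7)
reduced = group_similar_frames(feats, tau=0.9)
print(feats.shape, "->", reduced.shape)
```

Because the number of groups depends on the content of each clip, the temporal length after this step differs from clip to clip, which is what makes the downstream computation (and thus the GFLOPs) adaptive rather than fixed.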