Paper Title
2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition
Paper Authors
Paper Abstract
3D convolutional networks are prevalent for video recognition. While achieving excellent recognition performance on standard benchmarks, they operate on a sequence of frames with 3D convolutions and are therefore computationally demanding. Exploiting the large variations among different videos, we introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine which frames and which convolution layers to use in a 3D network. These policies are derived by a lightweight two-head selection network conditioned on each input video clip. Only the frames and convolutions selected by the selection network are then used in the 3D model to generate predictions. The selection network is optimized with policy gradient methods to maximize a reward that encourages correct predictions under limited computation. We conduct experiments on three video recognition benchmarks and demonstrate that our method achieves accuracy comparable to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets. We also show that the learned policies are transferable and that Ada3D is compatible with different backbones and modern clip selection approaches. Our qualitative analysis indicates that our method allocates fewer 3D convolutions and frames to "static" inputs, yet uses more for motion-intensive clips.
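To make the mechanism concrete, below is a minimal PyTorch sketch of how the two-head selection network and the REINFORCE-style objective described in the abstract could be wired up. The trunk architecture, feature dimension, the reward form (correctness minus a λ-weighted computation cost), and all names (`SelectionNet`, `reinforce_loss`, `lam`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Ada3D's selection network and policy-gradient loss.
# Assumptions: a tiny 2D CNN trunk, Bernoulli policies per frame and per 3D
# conv layer, and reward = 1[correct] - lam * normalized compute cost.
import torch
import torch.nn as nn


class SelectionNet(nn.Module):
    """Lightweight policy network with two heads: one Bernoulli logit per
    input frame and one per 3D convolution layer of the backbone."""

    def __init__(self, num_frames, num_conv_layers, feat_dim=16):
        super().__init__()
        # Small shared 2D trunk over a downsampled clip (assumed design;
        # the abstract only states the selection network is lightweight).
        self.trunk = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.frame_head = nn.Linear(feat_dim, num_frames)
        self.conv_head = nn.Linear(feat_dim, num_conv_layers)

    def forward(self, clip):
        # clip: (B, T, 3, H, W) -> per-frame features, averaged over time
        b, t = clip.shape[:2]
        feats = self.trunk(clip.flatten(0, 1)).view(b, t, -1).mean(dim=1)
        return self.frame_head(feats), self.conv_head(feats)


def reinforce_loss(frame_logits, conv_logits, correct, cost, lam=0.1):
    """Sample binary usage masks and score them with a reward that trades
    off correctness against computation; `cost` is the per-sample
    normalized compute of the selected sub-network (assumed reward form)."""
    frame_dist = torch.distributions.Bernoulli(logits=frame_logits)
    conv_dist = torch.distributions.Bernoulli(logits=conv_logits)
    frame_mask = frame_dist.sample()
    conv_mask = conv_dist.sample()
    log_prob = (frame_dist.log_prob(frame_mask).sum(dim=1)
                + conv_dist.log_prob(conv_mask).sum(dim=1))
    reward = correct.float() - lam * cost
    loss = -(reward.detach() * log_prob).mean()  # REINFORCE estimator
    return loss, frame_mask, conv_mask


# Usage: masks gate which frames enter the 3D backbone and which of its
# 3D conv layers execute; `correct` and `cost` come from that masked pass.
net = SelectionNet(num_frames=16, num_conv_layers=12)
clip = torch.randn(2, 16, 3, 64, 64)  # downsampled input clips
frame_logits, conv_logits = net(clip)
```

In this reading, a deselected 3D convolution layer would fall back to cheaper behavior (e.g., identity or a 2D convolution, as the paper's title suggests), though the abstract does not specify the exact gating mechanism.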