Paper Title
Temporal Distinct Representation Learning for Action Recognition
Paper Authors
Abstract
Motivated by the previous success of the Two-Dimensional Convolutional Neural Network (2D CNN) on image recognition, researchers have endeavored to leverage it to characterize videos. However, one limitation of applying 2D CNNs to video analysis is that different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information extraction, especially of spatial semantics, hence neglecting the critical variations among frames. In this paper, we attempt to tackle this issue in two ways. 1) Design a sequential channel filtering mechanism, the Progressive Enhancement Module (PEM), to excite the discriminative channels of features from different frames step by step, thus avoiding repeated information extraction. 2) Create a Temporal Diversity Loss (TD Loss) to force the kernels to concentrate on and capture the variations among frames rather than image regions with similar appearance. Our method is evaluated on the benchmark temporal reasoning datasets Something-Something V1 and V2, where it improves over the best competitor by 2.4% and 1.3%, respectively. In addition, it improves over the 2D-CNN-based state-of-the-art methods on the large-scale Kinetics dataset.
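To make the two ideas concrete, below is a minimal PyTorch sketch based only on the abstract, not on the authors' released code. It assumes PEM behaves like a recurrent squeeze-and-excitation gate whose running memory suppresses channels already excited by earlier frames, and it models the TD Loss as a cosine-similarity penalty between channel-wise feature maps of adjacent frames; all class, function, and variable names here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveEnhancementSketch(nn.Module):
    """Hypothetical sketch of a PEM-style sequential channel filter.

    Each frame's channel gate is damped by a running memory of channels
    already excited in earlier frames, so later frames are encouraged to
    pick out channels not yet used (avoiding repeated extraction).
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        memory = x.new_zeros(b, c)            # channels excited so far, in [0, 1]
        outputs = []
        for i in range(t):
            frame = x[:, i]                    # (b, c, h, w)
            pooled = frame.mean(dim=(2, 3))    # global average pooling -> (b, c)
            raw_gate = torch.sigmoid(self.fc2(F.relu(self.fc1(pooled))))
            gate = raw_gate * (1.0 - memory)   # suppress previously used channels
            memory = torch.maximum(memory, gate)  # remember excited channels
            outputs.append(frame * gate.view(b, c, 1, 1))
        return torch.stack(outputs, dim=1)


def temporal_diversity_loss(features: torch.Tensor) -> torch.Tensor:
    """Hypothetical TD-Loss sketch: penalize cosine similarity between the
    feature maps of adjacent frames, per channel, so kernels respond to
    inter-frame variation rather than appearance shared across frames."""
    # features: (batch, time, channels, height, width)
    b, t, c, h, w = features.shape
    flat = features.reshape(b, t, c, h * w)
    cur, nxt = flat[:, :-1], flat[:, 1:]
    sim = F.cosine_similarity(cur, nxt, dim=-1)  # (b, t-1, c)
    return sim.mean()
```

In this reading, the TD term would be added to the classification loss with a small weight, and the PEM gate would sit after a 2D-CNN stage; both choices are guesses consistent with the abstract rather than details it states.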