Paper Title
Motion-driven Visual Tempo Learning for Video-based Action Recognition
Authors
Abstract
Action visual tempo characterizes the dynamics and the temporal scale of an action, which helps distinguish human actions that share high similarities in visual dynamics and appearance. Previous methods capture visual tempo either by sampling raw videos at multiple rates, which requires a costly multi-layer network to handle each rate, or by hierarchically sampling backbone features, which relies heavily on high-level features that miss fine-grained temporal dynamics. In this work, we propose a Temporal Correlation Module (TCM) that can be easily embedded into current action recognition backbones in a plug-and-play manner to extract action visual tempo from low-level backbone features at a single layer. Specifically, our TCM contains two main components: a Multi-scale Temporal Dynamics Module (MTDM) and a Temporal Attention Module (TAM). MTDM applies a correlation operation to learn pixel-wise fine-grained temporal dynamics for both fast tempo and slow tempo. TAM adaptively emphasizes expressive features and suppresses inessential ones by analyzing global information across various tempos. Extensive experiments on several action recognition benchmarks, e.g., Something-Something V1 & V2, Kinetics-400, UCF-101, and HMDB-51, demonstrate that the proposed TCM improves the performance of existing video-based action recognition models by a large margin. The source code is publicly released at https://github.com/yzfly/TCM.
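The two ideas in the abstract — pixel-wise correlation between frames sampled at different temporal strides (MTDM) and a tempo-level attention that weights the resulting responses (TAM) — can be illustrated with a minimal, framework-free sketch. This is not the paper's actual implementation: the function names, the scalar per-"pixel" features, the elementwise-product correlation, and the mean-response softmax attention are all simplifying assumptions made here for illustration only.

```python
import math

def tempo_correlation(frames, stride):
    """Toy pixel-wise correlation between frames `stride` apart.

    frames: list of equal-length lists of floats (one value per 'pixel').
    A small stride mimics fast tempo (fine-grained dynamics); a large
    stride mimics slow tempo (coarser dynamics). Returns one correlation
    map (elementwise product, a stand-in for a real correlation op) per
    frame pair.
    """
    maps = []
    for t in range(len(frames) - stride):
        a, b = frames[t], frames[t + stride]
        maps.append([x * y for x, y in zip(a, b)])
    return maps

def tempo_attention(maps_per_tempo):
    """Toy tempo attention: softmax weights over tempos, computed from
    each tempo's mean correlation response (a stand-in for analyzing
    global information across tempos)."""
    scores = []
    for maps in maps_per_tempo:
        total = sum(sum(m) for m in maps)
        count = sum(len(m) for m in maps)
        scores.append(total / max(1, count))
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy clip: 4 frames, 3 "pixels" each.
clip = [[1.0, 0.0, 2.0],
        [1.0, 1.0, 2.0],
        [0.0, 1.0, 2.0],
        [0.0, 0.0, 2.0]]

fast = tempo_correlation(clip, stride=1)   # fast tempo: adjacent frames
slow = tempo_correlation(clip, stride=2)   # slow tempo: frames 2 apart
weights = tempo_attention([fast, slow])    # attention weights sum to 1
```

In the actual module, the correlation is computed over spatial neighborhoods of multi-channel feature maps inside a CNN backbone, and the attention is learned; the sketch only conveys the multi-stride sampling and tempo-weighting structure.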