Temporal Action Localization with Multi-temporal Scales

Paper Title

Temporal Action Localization with Multi-temporal Scales

Authors

Gao, Zan; Cui, Xinglei; Zhuo, Tao; Cheng, Zhiyong; Liu, An-An; Wang, Meng; Chen, Shenyong

Abstract

Temporal action localization, which aims to localize and classify actions in untrimmed videos, plays an important role in video analysis. Previous methods often predict actions on a feature space of a single temporal scale. However, temporal features at a low-level scale lack sufficient semantics for action classification, while features at a high-level scale cannot provide rich details of the action boundaries. To address this issue, we propose to predict actions on a feature space of multi-temporal scales. Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales. In addition, to establish the long temporal scale of the entire video, we use a spatial-temporal transformer encoder to capture the long-range dependencies of video frames. The refined features with long-range dependencies are then fed into a classifier for coarse action prediction. Finally, to further improve prediction accuracy, we propose a frame-level self-attention module that refines the classification and boundaries of each action instance. Extensive experiments show that the proposed method outperforms state-of-the-art approaches on the THUMOS14 dataset and achieves comparable performance on the ActivityNet1.3 dataset. Compared with A2Net (TIP20, Avg{0.3:0.7}), Sub-Action (CSVT2022, Avg{0.1:0.5}), and AFSD (CVPR21, Avg{0.3:0.7}) on the THUMOS14 dataset, the proposed method achieves improvements of 12.6%, 17.4%, and 2.2%, respectively.
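The idea of passing semantics from high-level (coarse) temporal scales down to low-level (fine) scales can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of one scalar feature per time step, average pooling for downsampling, nearest-neighbour upsampling, and additive fusion are all assumptions chosen for illustration, and the sketch assumes the sequence length is divisible by 2^(num_levels - 1).

```python
from typing import List

Track = List[float]  # hypothetical: one scalar feature per time step


def downsample(features: Track) -> Track:
    """Halve the temporal resolution by averaging adjacent pairs of steps."""
    return [(features[i] + features[i + 1]) / 2.0
            for i in range(0, len(features) - 1, 2)]


def upsample(features: Track, length: int) -> Track:
    """Nearest-neighbour upsampling: each coarse step covers two finer steps."""
    return [features[min(i // 2, len(features) - 1)] for i in range(length)]


def build_pyramid(features: Track, num_levels: int) -> List[Track]:
    """Level 0 is the finest temporal scale; each next level is 2x coarser."""
    pyramid = [list(features)]
    for _ in range(num_levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def refine_top_down(pyramid: List[Track]) -> List[Track]:
    """Add each coarser (already refined) level into the next finer level."""
    refined = [list(level) for level in pyramid]
    for level in range(len(refined) - 2, -1, -1):
        coarse = upsample(refined[level + 1], len(refined[level]))
        refined[level] = [f + c for f, c in zip(refined[level], coarse)]
    return refined


# Toy input: an "action" occupies time steps 2 to 5 of an 8-step sequence.
frame_features = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
pyramid = build_pyramid(frame_features, num_levels=3)
refined = refine_top_down(pyramid)
print([len(level) for level in refined])  # [8, 4, 2]
```

After refinement, the finest level keeps its full temporal resolution, which matters for boundary localization, while also carrying context aggregated at the coarser scales. A real model would operate on feature vectors and use learned layers instead of plain averaging and addition.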
