Paper Title

A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector

Paper Authors

Lin Sui, Chen-Lin Zhang, Lixin Gu, Feng Han

Paper Abstract

Spatial-temporal action detection is a vital part of video understanding. Current spatial-temporal action detection methods mostly use an object detector to obtain person candidates and then classify these candidates into different action categories. These so-called two-stage methods are heavy and hard to deploy in real-world applications. Some existing methods build one-stage pipelines instead, but the vanilla one-stage pipeline suffers a large performance drop, and extra classification modules are needed to achieve comparable performance. In this paper, we explore a simple and effective pipeline to build a strong one-stage spatial-temporal action detector. The pipeline is composed of two parts. One is a simple end-to-end spatial-temporal action detector: the proposed detector makes only minor architectural changes to current proposal-based detectors and does not add extra action classification modules. The other is a novel labeling strategy that utilizes the unlabeled frames in sparsely annotated data. We name our model SE-STAD. The proposed SE-STAD achieves an mAP boost of around 2% and a FLOPs reduction of around 80%. Our code will be released at https://github.com/4paradigm-CV/SE-STAD.
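To make the one-stage idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of a joint detection head that predicts person boxes and multi-label action scores from the same shared features, rather than attaching a separate action classification module. The class name, feature dimension, and number of action classes (80, as in AVA-style benchmarks) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointDetectionHead(nn.Module):
    """Toy joint head for a one-stage detector: one branch regresses
    person boxes, the other predicts multi-label action scores from
    the same shared spatio-temporal features, so no separate action
    classification module is required."""

    def __init__(self, feat_dim: int = 256, num_actions: int = 80):
        super().__init__()
        self.box_branch = nn.Linear(feat_dim, 4)        # (cx, cy, w, h)
        self.action_branch = nn.Linear(feat_dim, num_actions)

    def forward(self, feats: torch.Tensor):
        # feats: (num_proposals, feat_dim) pooled per-proposal features
        boxes = self.box_branch(feats)
        # Sigmoid, not softmax: a person can perform several actions at once.
        action_scores = torch.sigmoid(self.action_branch(feats))
        return boxes, action_scores

# Usage: 100 proposals with 256-d features.
head = JointDetectionHead()
boxes, scores = head(torch.randn(100, 256))
print(boxes.shape, scores.shape)  # torch.Size([100, 4]) torch.Size([100, 80])
```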
