Paper Title

Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition

Authors

Yi-Fan Song, Zhang Zhang, Caifeng Shan, Liang Wang

Abstract

One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, State-Of-The-Art (SOTA) models for this task tend to be exceedingly sophisticated and over-parameterized, and their low efficiency in model training and inference has obstructed development in the field, especially for large-scale action datasets. In this work, we propose an efficient but strong baseline based on the Graph Convolutional Network (GCN), which aggregates three main improvements: early fused Multiple Input Branches (MIB), a Residual GCN (ResGCN) with a bottleneck structure, and a Part-wise Attention (PartAtt) block. Firstly, the MIB is designed to enrich informative skeleton features while retaining compact representations at an early fusion stage. Then, inspired by the success of the ResNet architecture in Convolutional Neural Networks (CNN), a ResGCN module is introduced to alleviate computational costs and reduce learning difficulty during training while maintaining model accuracy. Finally, the PartAtt block is proposed to discover the most essential body parts over a whole action sequence and to obtain more explainable representations for different skeleton action sequences. Extensive experiments on two large-scale datasets, NTU RGB+D 60 and 120, validate that the proposed baseline slightly outperforms other SOTA models while requiring far fewer parameters during training and inference, e.g., up to 34 times fewer than DGNN, one of the best SOTA methods.
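To make the two architectural ideas above concrete, below is a minimal PyTorch sketch of a bottleneck residual GCN block and a part-wise attention block. The class names, the reduction ratio of 4, the temporal kernel size of 9, the single-partition adjacency, and the body-part grouping in the usage example are illustrative assumptions for exposition, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResGCNBottleneck(nn.Module):
    """Bottleneck residual block over skeleton graphs (sketch).

    Input x: (N, C, T, V) -- batch, channels, frames, joints.
    A: (V, V) normalized adjacency (single partition, for simplicity).
    """
    def __init__(self, in_channels, out_channels, reduction=4):
        super().__init__()
        mid = out_channels // reduction  # bottleneck width
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # spatial graph convolution: 1x1 feature transform, then joint mixing by A
        self.gcn = nn.Conv2d(mid, mid, kernel_size=1)
        # temporal convolution over the frame axis
        self.tcn = nn.Sequential(
            nn.Conv2d(mid, mid, kernel_size=(9, 1), padding=(4, 0)),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(
            nn.Conv2d(mid, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels))
        self.residual = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, A):
        y = self.reduce(x)
        y = torch.einsum('nctv,vw->nctw', self.gcn(y), A)  # mix joints via A
        y = self.expand(self.tcn(y))
        return self.relu(y + self.residual(x))

class PartAtt(nn.Module):
    """Part-wise attention: one weight per body part, shared over the sequence.

    `parts` is a list of joint-index lists assumed to partition the joint set.
    """
    def __init__(self, channels, parts, reduction=4):
        super().__init__()
        self.parts = parts
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        # x: (N, C, T, V); pool each part over all frames and its joints
        weights = [self.mlp(x[:, :, :, p].mean(dim=(2, 3))) for p in self.parts]
        w = torch.softmax(torch.stack(weights, dim=-1), dim=-1)  # (N, C, P)
        out = torch.zeros_like(x)
        for i, p in enumerate(self.parts):
            # broadcast each part's channel-wise weight back to its joints
            out[:, :, :, p] = x[:, :, :, p] * w[:, :, i].unsqueeze(-1).unsqueeze(-1)
        return out

# Illustrative usage: 25 NTU RGB+D joints split into 5 hypothetical parts.
parts = [list(range(i, i + 5)) for i in range(0, 25, 5)]
block = ResGCNBottleneck(64, 128)
att = PartAtt(128, parts)
x = torch.randn(2, 64, 50, 25)  # (batch, channels, frames, joints)
A = torch.eye(25)               # placeholder normalized adjacency
y = att(block(x, A))            # -> (2, 128, 50, 25)
```

The design choice mirrored here is that the attention weights are computed once per sequence and shared across all frames, so each body part receives a single importance score per channel; this is what makes the resulting representation easy to inspect across different actions.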
