Title
HAA4D: Few-Shot Human Atomic Action Recognition via 3D Spatio-Temporal Skeletal Alignment
Authors
Abstract
Human actions involve complex pose variations, and their 2D projections can be highly ambiguous. Thus 3D spatio-temporal or 4D (i.e., 3D+T) human skeletons, which are photometric and viewpoint invariant, are an excellent alternative to 2D+T skeletons/pixels for improving action recognition accuracy. This paper proposes a new 4D dataset, HAA4D, which consists of more than 3,300 RGB videos in 300 human atomic action classes. HAA4D is clean, diverse, and class-balanced, with each class viewpoint-balanced through the use of 4D skeletons; as few as one 4D skeleton per class is sufficient for training a deep recognition model. Further, the choice of atomic actions makes annotation even easier, because each video clip lasts for only a few seconds. All training and testing 3D skeletons in HAA4D are globally aligned to the same global space using a deep alignment model, making each skeleton face the negative z-direction. Such alignment makes matching skeletons more stable by reducing intra-class variation, and thus fewer training samples per class are needed for action recognition. Given the high diversity and skeletal alignment in HAA4D, we construct the first baseline few-shot 4D human atomic action recognition network without bells and whistles, which produces comparable or higher performance than relevant state-of-the-art techniques relying on embedded space encoding without explicit skeletal alignment, using the same small number of training samples of unseen classes.
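The abstract's global alignment step (rotating every skeleton so it faces the negative z-direction) can be illustrated with plain geometry. The sketch below is not the paper's deep alignment model; it is a minimal classical baseline, assuming the facing direction is estimated from the torso plane and that the joint indices for the pelvis and shoulders are supplied by the caller (HAA4D's actual joint convention is not specified here).

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix sending unit vector a to unit vector b (Rodrigues' formula)."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):
        # Opposite vectors: rotate 180 degrees about any axis orthogonal to a.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

def align_to_negative_z(skeleton, l_shoulder, r_shoulder, pelvis):
    """Rotate a (J, 3) skeleton so its estimated facing direction is -z.

    Facing is taken as the normal of the torso plane: the cross product of
    the shoulder axis and the spine vector (a simplifying assumption)."""
    centered = skeleton - skeleton[pelvis]            # put the pelvis at the origin
    shoulder_axis = centered[r_shoulder] - centered[l_shoulder]
    spine = (centered[l_shoulder] + centered[r_shoulder]) / 2.0 - centered[pelvis]
    facing = np.cross(shoulder_axis, spine)           # torso-plane normal
    R = rotation_between(facing, np.array([0.0, 0.0, -1.0]))
    return centered @ R.T                             # apply the same rotation to all joints
```

Applying the same canonicalizing rotation per frame removes camera-viewpoint variation before matching, which is the intuition behind the paper's claim that alignment reduces intra-class variation and hence the number of training samples needed.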