Paper Title
Attention-Driven Body Pose Encoding for Human Activity Recognition
Paper Authors
Paper Abstract
This article proposes a novel attention-based body-pose encoding for human activity recognition that yields an enriched, learned representation of body pose. The enriched data complements the 3D body-joint position data and improves model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this encoding, the approach exploits 1) a spatial stream that encodes the spatial relationships between body joints at each time point, capturing the spatial distribution of the joints, and 2) a temporal stream that learns the temporal variation of each body joint over the entire sequence, yielding a temporally enhanced representation. Afterwards, these two pose streams are fused with a multi-head attention mechanism. We also capture contextual information from the RGB video stream using an Inception-ResNet-V2 model combined with multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body-pose stream to give a novel end-to-end deep model for effective human activity recognition.
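The following is a minimal PyTorch sketch of the two-stream pose encoding described in the abstract: a spatial stream that embeds the joint configuration at each time step, a temporal stream that embeds each joint's trajectory over the sequence, and a multi-head attention layer that fuses them. All module names, feature dimensions, and the choice of which stream supplies the queries versus the keys/values are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of the spatial/temporal pose streams fused via
# multi-head attention. Dimensions and the query/key-value assignment
# are assumptions for illustration only.
import torch
import torch.nn as nn

class TwoStreamPoseFusion(nn.Module):
    """Encodes a 3D body-joint sequence with a spatial stream and a
    temporal stream, then fuses the two with multi-head attention."""

    def __init__(self, num_joints=25, seq_len=64, d_model=128, num_heads=8):
        super().__init__()
        # Spatial stream: at each time step, embed the configuration of
        # all joints (num_joints x 3 coordinates) into one d_model vector.
        self.spatial_embed = nn.Linear(num_joints * 3, d_model)
        # Temporal stream: for each joint, embed its trajectory over the
        # whole sequence (seq_len x 3 coordinates) into one d_model vector.
        self.temporal_embed = nn.Linear(seq_len * 3, d_model)
        # Multi-head attention fuses the streams.
        self.fusion = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, joints):
        # joints: (batch, seq_len, num_joints, 3)
        b, t, j, c = joints.shape
        # Spatial tokens, one per time step: (batch, seq_len, d_model)
        spatial = self.spatial_embed(joints.reshape(b, t, j * c))
        # Temporal tokens, one per joint: (batch, num_joints, d_model)
        temporal = self.temporal_embed(
            joints.permute(0, 2, 1, 3).reshape(b, j, t * c))
        # Fuse: spatial queries attend over temporal keys/values
        # (an assumed assignment; the reverse is equally plausible).
        fused, _ = self.fusion(query=spatial, key=temporal, value=temporal)
        fused = self.norm(fused + spatial)  # residual connection
        return fused.mean(dim=1)            # sequence-level pose feature

# Usage: a batch of 2 sequences, 64 frames, 25 joints, 3D coordinates.
pose_feat = TwoStreamPoseFusion()(torch.randn(2, 64, 25, 3))
print(pose_feat.shape)  # torch.Size([2, 128])
```

In the full model described above, this fused pose feature would then be concatenated with the contextual feature produced by the RGB branch (Inception-ResNet-V2 followed by multi-head attention and a bidirectional LSTM) before the final activity classifier.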