Paper Title

Multi-modal Fusion for Single-Stage Continuous Gesture Recognition

Paper Authors

Harshala Gammulle, Simon Denman, Sridha Sridharan, Clinton Fookes

Paper Abstract

Gesture recognition is a much-studied research area which has myriad real-world applications including robotics and human-machine interaction. Current gesture recognition methods have focused on recognising isolated gestures, and existing continuous gesture recognition methods are limited to two-stage approaches where independent models are required for detection and classification, with the performance of the latter being constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF), that can detect and classify multiple gestures in a video via a single model. This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation step to detect individual gestures. To achieve this, we introduce a multi-modal fusion mechanism to support the integration of important information that flows from multi-modal inputs, and is scalable to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map uni-modal features and the fused multi-modal features respectively. To further enhance performance, we propose a mid-point based loss function that encourages smooth alignment between the ground truth and the prediction, helping the model to learn natural gesture transitions. We demonstrate the utility of our proposed framework, which can handle variable-length input videos, and outperforms the state-of-the-art on three challenging datasets: EgoGesture, IPN Hand, and ChaLearn LAP Continuous Gesture Dataset (ConGD). Furthermore, ablation experiments show the importance of different components of the proposed framework.
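
The abstract describes the TMMF pipeline only at a high level, so the following is a minimal, hedged PyTorch sketch of that structure: one Unimodal Feature Mapping (UFM) model per input modality, a fusion step that scales with the number of modalities, and a Multi-modal Feature Mapping (MFM) model that produces frame-wise predictions, so a single model both detects and classifies gestures. The layer choices (temporal convolutions, concatenation-based fusion), dimensions, and class count are placeholders rather than the paper's actual design, and the mid-point based loss is omitted.

```python
# Illustrative sketch only: the UFM, fusion and MFM internals below are simple
# placeholders (1-D temporal convolutions, concatenation-based fusion), not the
# architecture from the paper.
import torch
import torch.nn as nn


class UnimodalFeatureMapper(nn.Module):
    """Placeholder UFM: maps one modality's per-frame features to a shared dimension."""
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):              # x: (batch, in_dim, time)
        return self.net(x)             # -> (batch, hidden_dim, time)


class TMMFSketch(nn.Module):
    """Single-stage continuous recogniser: one UFM per modality, a fusion step,
    then a placeholder MFM emitting one class score per frame (gesture classes
    plus a non-gesture class), so detection and classification happen together."""
    def __init__(self, modality_dims, hidden_dim, num_classes):
        super().__init__()
        self.ufms = nn.ModuleList(
            [UnimodalFeatureMapper(d, hidden_dim) for d in modality_dims]
        )
        # Concatenation + 1x1 convolution stands in for the paper's fusion
        # mechanism; the interface (any number of modalities in, one fused
        # temporal stream out) is what matters here.
        self.fusion = nn.Conv1d(hidden_dim * len(modality_dims), hidden_dim, kernel_size=1)
        self.mfm = nn.Sequential(      # placeholder MFM
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, num_classes, kernel_size=1),
        )

    def forward(self, streams):        # streams: list of (batch, dim_m, time) tensors
        mapped = [ufm(x) for ufm, x in zip(self.ufms, streams)]
        fused = self.fusion(torch.cat(mapped, dim=1))
        return self.mfm(fused)         # (batch, num_classes, time): frame-wise logits


# Usage on a variable-length clip with two hypothetical modalities
# (e.g. pre-extracted RGB and depth frame features of dimension 512).
rgb = torch.randn(1, 512, 300)
depth = torch.randn(1, 512, 300)
model = TMMFSketch(modality_dims=[512, 512], hidden_dim=256, num_classes=84)
logits = model([rgb, depth])           # (1, 84, 300) -> one prediction per frame
```

Frame-wise logits of this shape could be trained with a standard per-frame cross-entropy loss; the paper's mid-point based loss additionally encourages smooth alignment between predictions and ground truth around gesture transitions.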
