Paper title
No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration
Paper authors
Paper abstract
Recognizing who is speaking in a crowded scene is a key challenge towards the understanding of the social interactions going on within. Detecting speaking status from body movement alone opens the door for the analysis of social scenes in which personal audio is not obtainable. Video and wearable sensors make it possible to recognize speaking in an unobtrusive, privacy-preserving way. When considering the video modality in action recognition problems, a bounding box is traditionally used to localize and segment out the target subject, so that the action taking place within it can then be recognized. However, cross-contamination, occlusion, and the articulated nature of the human body make this approach challenging in a crowded scene. Here, we leverage articulated body poses both for subject localization and in the subsequent speech detection stage. We show that selecting local features around pose keypoints has a positive effect on generalization performance while also significantly reducing the number of local features considered, making for a more efficient method. Using two in-the-wild datasets with different viewpoints of subjects, we investigate the role of cross-contamination in this effect. We additionally make use of acceleration measured through wearable sensors for the same task, and present a multimodal approach combining both methods.
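The abstract's core idea of keeping only local features near pose keypoints, rather than everything inside a bounding box, can be illustrated with a minimal NumPy sketch. The keypoint coordinates, frame size, and radius below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def keypoint_feature_mask(keypoints, frame_shape, radius=8):
    """Boolean mask keeping only pixels within `radius` of any pose keypoint.

    Illustrative only: instead of considering every location inside a
    subject's bounding box, a method like the one described can restrict
    local features to regions around the articulated body joints.
    """
    h, w = frame_shape
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=bool)
    for kx, ky in keypoints:  # (x, y) joint locations
        mask |= (xs - kx) ** 2 + (ys - ky) ** 2 <= radius ** 2
    return mask

# Toy comparison: keypoint filtering vs. taking the whole bounding box.
frame_shape = (48, 32)                    # (height, width) of a subject crop
keypoints = [(10, 12), (16, 20), (22, 30)]  # hypothetical joint positions
mask = keypoint_feature_mask(keypoints, frame_shape, radius=5)
bbox_area = frame_shape[0] * frame_shape[1]
print(mask.sum(), bbox_area)  # far fewer candidate locations than the full box
```

Restricting feature extraction this way both shrinks the candidate set and, per the abstract, reduces cross-contamination from other people overlapping the subject's bounding box.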