Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Paper Title

Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Authors

Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, Yuejie Zhang

Abstract

Weakly-supervised audio-visual violence detection aims to identify snippets containing multimodal violent events using only video-level labels. Many prior works perform audio-visual integration and interaction in an early or intermediate manner, yet overlook modality heterogeneity in the weakly-supervised setting. In this paper, we analyze the modality asynchrony and undifferentiated instances phenomena of the multiple instance learning (MIL) procedure, and further investigate their negative impact on weakly-supervised audio-visual learning. To address these issues, we propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy. Specifically, we leverage a lightweight two-stream network to generate audio and visual bags, in which unimodal background, violent, and normal instances are clustered into semi-bags in an unsupervised way. Audio and visual violent semi-bag representations are then assembled as positive pairs, and violent semi-bags are combined with background and normal instances in the opposite modality as contrastive negative pairs. Furthermore, a self-distillation module is applied to transfer unimodal visual knowledge to the audio-visual model, which alleviates noise and closes the semantic gap between unimodal and multimodal features. Experiments show that our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset. Results also demonstrate that our proposed approach can be used as a plug-in module to enhance other networks. Code is available at https://github.com/JustinYuu/MACIL_SD.
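
The cross-modal pairing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify the loss form. The sketch assumes an InfoNCE-style objective with cosine similarity, mean pooling to form a semi-bag representation, and a temperature of 0.1. The function names (`mean_pool`, `cosine`, `contrastive_loss`) and the toy feature values are hypothetical and chosen only for illustration.

```python
import math


def mean_pool(instances):
    """Average equal-length feature vectors into one semi-bag vector (assumed pooling)."""
    dim = len(instances[0])
    return [sum(vec[i] for vec in instances) / len(instances) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumption): low when the anchor is close to
    the positive and far from every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D features, invented for illustration only.
audio_violent = mean_pool([[1.0, 0.0], [0.8, 0.2]])   # audio violent semi-bag
visual_violent = mean_pool([[0.9, 0.1], [1.0, 0.0]])  # visual violent semi-bag
visual_background = [0.0, 1.0]                        # opposite-modality negative
visual_normal = [0.1, 0.9]                            # opposite-modality negative

# Positive pair: audio and visual violent semi-bags.
# Negatives: background and normal instances from the opposite (visual) modality.
loss = contrastive_loss(audio_violent, visual_violent,
                        [visual_background, visual_normal])
print(f"audio-to-visual contrastive loss: {loss:.6f}")
```

A symmetric visual-to-audio term would be computed the same way with the roles of the two modalities swapped. How the terms are weighted and combined with the MIL and self-distillation objectives is described in the paper rather than in the abstract.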
