Paper title
Unsupervised active speaker detection in media content using cross-modal information
Paper authors
Paper abstract
We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies. Advances in machine learning have enabled impressive performance in identifying individuals from speech and facial images. We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task such that the active speaker's face and the underlying speech identify the same person (character). We represent each speech segment by its speaker identity distances from all other speech segments, which captures a relative identity structure for the video. We then assign an active speaker's face to each speech segment, chosen from the concurrently appearing faces, such that the resulting set of active speaker faces displays a similar relative identity structure. Furthermore, we propose a simple and effective approach to handle speech segments in which the speaker is off-screen. We evaluate the proposed system on three benchmark datasets (the Visual Person Clustering dataset, the AVA-active speaker dataset, and the Columbia dataset), which consist of videos from entertainment and broadcast media, and show performance competitive with state-of-the-art fully supervised methods.