Paper Title
Disentangled Speech Embeddings using Cross-modal Self-supervision
Paper Authors
Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
Paper Abstract
The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads 'in the wild', and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance.
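To make the two-stream idea concrete, below is a minimal sketch in Python/PyTorch: a shared low-level audio trunk feeds a content head that keeps the temporal axis (linguistic content varies frame to frame) and an identity head that pools over time (speaker identity should stay constant across an utterance), plus a symmetric InfoNCE-style cross-modal loss for matching audio embeddings against face embeddings. The module sizes, the names TwoStreamAudioEncoder and cross_modal_nce, and the exact loss form are illustrative assumptions for this sketch, not the paper's released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAudioEncoder(nn.Module):
    def __init__(self, n_mels=40, content_dim=256, identity_dim=256):
        super().__init__()
        # Shared low-level features common to both representations.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=(2, 1), padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=(2, 2), padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
        )
        # Content head: collapses the frequency axis but preserves time,
        # since linguistic content is a time-varying factor.
        self.content_head = nn.Conv2d(128, content_dim,
                                      kernel_size=(n_mels // 4, 1))
        # Identity head: average-pools over both axes, since speaker
        # identity is a time-invariant factor.
        self.identity_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, identity_dim))

    def forward(self, log_mel):            # (B, 1, n_mels, T) log-mel input
        shared = self.trunk(log_mel)       # (B, 128, n_mels//4, ~T//2)
        content = self.content_head(shared).squeeze(2)   # (B, content_dim, ~T//2)
        identity = F.normalize(self.identity_head(shared), dim=-1)  # (B, identity_dim)
        return content, identity

def cross_modal_nce(audio_emb, face_emb, temperature=0.07):
    # Symmetric InfoNCE-style loss: matched audio/face pairs sit on the
    # batch diagonal as positives; every other pairing is a negative.
    logits = audio_emb @ face_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

Under this sketch's assumptions, disentanglement falls out of which cross-modal signal supervises which head: the content stream is matched against temporally aligned visual features at a fine time scale (synchrony), while the identity stream is matched against face identity embeddings over longer windows, pulling the two streams toward time-varying and time-invariant factors respectively.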