Paper title
MuSE: Multi-modal target speaker extraction with visual cues
Paper authors
Paper abstract
A speaker extraction algorithm relies on a speech sample of the target speaker as a reference to focus its attention. Such reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip movement also serves as an informative cue. Motivated by this idea, we study a novel technique that uses speech-lip visual cues to extract the reference target speech directly from the mixture speech at inference time, without the need for pre-recorded reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence. MuSE not only outperforms other competitive baselines in terms of SI-SDR and PESQ, but also shows consistent improvement in cross-dataset evaluations.
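For readers unfamiliar with the evaluation metric named in the abstract, SI-SDR (scale-invariant signal-to-distortion ratio) projects the estimated signal onto the target and measures the energy ratio between the projected target component and the residual, in dB. Below is a minimal NumPy sketch of this standard definition; it is background material, not code from the paper, and the function name and `eps` guard are my own:

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (standard definition)."""
    # Zero-mean both signals, as the scale-invariant formulation assumes.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimally scaled
    # target component; everything orthogonal to it counts as distortion.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.sum(s_target ** 2) + eps) /
                         (np.sum(e_noise ** 2) + eps))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is why the metric is preferred over plain SDR when extraction networks output signals at an arbitrary gain.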