Paper Title
Multi-modal Multi-channel Target Speech Separation
Paper Authors
Paper Abstract
Target speech separation refers to extracting a target speaker's voice from the overlapped audio of simultaneous talkers. Previously, the use of the visual modality for target speech separation has demonstrated great potential. This work proposes a general multi-modal framework for target speech separation that utilizes all the available information about the target speaker, including his/her spatial location, voice characteristics, and lip movements. Under this framework, we also investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of the multiple modalities at the embedding level. This method first factorizes the mixture audio into a set of acoustic subspaces, then leverages the target speaker's information from the other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system is evaluated under conditions in which one of the modalities is temporarily missing, invalid, or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released), spatialized by simulated room impulse responses (RIRs). Experimental results illustrate that our proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while still supporting real-time processing.
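A rough sketch of how the factorized attention-based fusion described in the abstract might be realized is shown below. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names, feature dimensions, and the exact attention form (a softmax over subspaces driven by the concatenated directional, speaker, and lip embeddings) are all hypothetical choices made for the sketch.

```python
# Hypothetical sketch of a factorized attention-based fusion module in PyTorch.
# All names, dimensions, and the attention formulation are assumptions; the
# abstract only specifies the high-level idea (subspace factorization plus a
# learnable attention scheme driven by the target speaker's cues).
import torch
import torch.nn as nn


class FactorizedAttentionFusion(nn.Module):
    """Factorize mixture features into acoustic subspaces, then re-weight the
    subspace embeddings with attention derived from the target speaker's
    spatial, voice, and lip-movement embeddings."""

    def __init__(self, mix_dim=512, num_subspaces=8, sub_dim=64,
                 dir_dim=32, spk_dim=128, lip_dim=256):
        super().__init__()
        # One linear projection per acoustic subspace (the "factorization").
        self.subspace_proj = nn.ModuleList(
            [nn.Linear(mix_dim, sub_dim) for _ in range(num_subspaces)])
        # Map the concatenated target-speaker cues to one attention score
        # per subspace (a learnable attention scheme over subspaces).
        self.attn = nn.Sequential(
            nn.Linear(dir_dim + spk_dim + lip_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_subspaces))

    def forward(self, mix, direction, speaker, lip):
        # mix:       (batch, frames, mix_dim)  mixture acoustic features
        # direction: (batch, frames, dir_dim)  spatial / directional features
        # speaker:   (batch, spk_dim)          target voice embedding
        # lip:       (batch, frames, lip_dim)  lip-movement embedding
        frames = mix.size(1)
        speaker = speaker.unsqueeze(1).expand(-1, frames, -1)
        # Stack subspace embeddings: (batch, frames, num_subspaces, sub_dim)
        subs = torch.stack([proj(mix) for proj in self.subspace_proj], dim=2)
        # Attention weights over subspaces from the target-speaker cues.
        cues = torch.cat([direction, speaker, lip], dim=-1)
        weights = torch.softmax(self.attn(cues), dim=-1)  # (batch, frames, S)
        # Enhance and merge the subspaces: (batch, frames, sub_dim)
        fused = (weights.unsqueeze(-1) * subs).sum(dim=2)
        return fused
```

Under this reading, robustness to a missing or corrupted modality would amount to the attention weights down-weighting subspaces tied to the unreliable cue; how the paper actually handles that case is detailed in the full text, not in this sketch.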