Paper Title
Real-time binaural speech separation with preserved spatial cues
Paper Authors
Paper Abstract
Deep learning speech separation algorithms have achieved great success in improving the quality and intelligibility of separated speech from mixed audio. Most previous methods focused on generating a single-channel output for each of the target speakers, hence discarding the spatial cues needed for the localization of sound sources in space. However, preserving the spatial information is important in many applications that aim to accurately render the acoustic scene such as in hearing aids and augmented reality (AR). Here, we propose a speech separation algorithm that preserves the interaural cues of separated sound sources and can be implemented with low latency and high fidelity, therefore enabling a real-time modification of the acoustic scene. Based on the time-domain audio separation network (TasNet), a single-channel time-domain speech separation system that can be implemented in real-time, we propose a multi-input-multi-output (MIMO) end-to-end extension of TasNet that takes binaural mixed audio as input and simultaneously separates target speakers in both channels. Experimental results show that the proposed end-to-end MIMO system is able to significantly improve the separation performance and keep the perceived location of the modified sources intact in various acoustic scenes.
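To make the multi-input-multi-output idea concrete, below is a minimal PyTorch sketch of a MIMO TasNet-style model: a binaural (two-channel) mixture goes in, and per-speaker binaural waveforms come out, so each separated source keeps a left and right channel. This is not the authors' implementation; the class name `MimoTasNetSketch`, the layer sizes, and the simplified dilated-conv separator (standing in for the full TCN separator of TasNet) are illustrative assumptions.

```python
# Hypothetical MIMO TasNet-style sketch: binaural mixture in,
# per-speaker binaural waveforms out. All hyperparameters are placeholders.
import torch
import torch.nn as nn


class MimoTasNetSketch(nn.Module):
    def __init__(self, n_speakers=2, enc_dim=128, win=16, hidden=256, n_layers=4):
        super().__init__()
        self.n_speakers = n_speakers
        self.enc_dim = enc_dim
        self.win = win
        self.stride = win // 2
        # Learnable analysis filterbank applied to each ear's waveform.
        self.encoder = nn.Conv1d(1, enc_dim, kernel_size=win,
                                 stride=self.stride, bias=False)
        # The separator sees both ears' encoded features jointly, so it can
        # exploit interaural information; a small dilated-conv stack stands in
        # for the temporal convolutional network used in TasNet.
        blocks = []
        for i in range(n_layers):
            blocks += [
                nn.Conv1d(2 * enc_dim if i == 0 else hidden, hidden,
                          kernel_size=3, dilation=2 ** i, padding=2 ** i),
                nn.PReLU(),
            ]
        self.separator = nn.Sequential(*blocks)
        # One mask per (ear, speaker) pair over the encoder features.
        self.mask_conv = nn.Conv1d(hidden, 2 * n_speakers * enc_dim, kernel_size=1)
        # Learnable synthesis filterbank (shared across ears and speakers).
        self.decoder = nn.ConvTranspose1d(enc_dim, 1, kernel_size=win,
                                          stride=self.stride, bias=False)

    def forward(self, mixture):
        # mixture: (batch, 2, samples) binaural waveform
        B, C, T = mixture.shape
        feats = self.encoder(mixture.reshape(B * C, 1, T))            # (B*2, N, frames)
        frames = feats.shape[-1]
        joint = feats.reshape(B, C * self.enc_dim, frames)            # stack both ears
        masks = torch.sigmoid(self.mask_conv(self.separator(joint)))  # (B, 2*S*N, frames)
        masks = masks.reshape(B, C, self.n_speakers, self.enc_dim, frames)
        feats = feats.reshape(B, C, 1, self.enc_dim, frames)
        masked = (masks * feats).reshape(B * C * self.n_speakers, self.enc_dim, frames)
        wavs = self.decoder(masked)                                    # (B*2*S, 1, samples)
        wavs = wavs.reshape(B, C, self.n_speakers, -1)[..., :T]
        return wavs.transpose(1, 2)                                    # (B, S, 2, samples)


if __name__ == "__main__":
    model = MimoTasNetSketch()
    binaural_mix = torch.randn(1, 2, 16000)  # 1 s of 16 kHz binaural audio
    out = model(binaural_mix)
    print(out.shape)  # torch.Size([1, 2, 2, 16000]): (batch, speaker, ear, samples)
```

Because each speaker is estimated in both channels from the same jointly processed features, interaural time and level differences can in principle be preserved in the outputs; the short analysis window also keeps the algorithmic latency low, in line with the real-time goal stated in the abstract.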