Two-stage model and optimal SI-SNR for monaural multi-speaker speech separation in noisy environment

Paper title

Two-stage model and optimal SI-SNR for monaural multi-speaker speech separation in noisy environment

Authors

Ma, Chao; Li, Dongmei; Jia, Xupeng

Abstract

In daily listening environments, speech is always distorted by background noise, room reverberation and interfering speakers. With the development of deep learning approaches, much progress has been made on monaural multi-speaker speech separation. Nevertheless, most studies in this area focus on a simple laboratory problem setup in which background noise and room reverberation are not considered. In this paper, we propose a two-stage model based on Conv-TasNet to deal with the notable effects of noise and interfering speakers separately, where enhancement and separation are conducted sequentially using deep dilated temporal convolutional networks (TCN). In addition, we develop a new objective function named optimal scale-invariant signal-to-noise ratio (OSI-SNR), which is better than the original SI-SNR under all circumstances. By jointly training the two-stage model with OSI-SNR, our algorithm substantially outperforms one-stage separation baselines.
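For readers unfamiliar with the baseline objective, the following is a minimal sketch of the original SI-SNR as it is commonly defined in the speech separation literature. It is not taken from the paper: the function name `si_snr` and the `eps` stabilizer are assumptions made for illustration, and the abstract does not give the OSI-SNR formula, so OSI-SNR is not implemented here.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB for two 1-D signals (hypothetical helper).

    Common definition: project the zero-mean estimate onto the zero-mean
    reference, treat the projection as the target and the residual as noise.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove the mean so the measure ignores any DC offset.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Optimal scaling of the reference toward the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target

    # eps (an assumption) guards against division by zero and log of zero.
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


if __name__ == "__main__":
    ref = np.array([1.0, 0.0, -1.0, 0.0])
    est = ref + 0.1 * np.array([0.0, 1.0, 0.0, -1.0])
    # Rescaling the estimate should leave the value essentially unchanged.
    print(si_snr(est, ref), si_snr(3.0 * est, ref))
```

In training, the negative of this quantity is typically used as the loss, so that maximizing SI-SNR corresponds to minimizing the loss.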
