Paper title
Neural Vocoder Feature Estimation for Dry Singing Voice Separation
Paper authors
Paper abstract
Singing voice separation (SVS) is the task of separating singing voice audio from its mixture with instrumental audio. Previous SVS studies have mainly employed spectrogram masking, which requires a large dimensionality to predict the binary masks. In addition, they focused on extracting a vocal stem that retains the wet sound, including the reverberation effect, which may hinder the reusability of the isolated singing voice. This paper addresses these issues by predicting the mel-spectrogram of the dry singing voice from the mixed audio as the neural vocoder feature and synthesizing the singing voice waveform with the neural vocoder. We experimented with two separation methods: one predicts binary masks in the mel-spectrogram domain, and the other directly predicts the mel-spectrogram. Furthermore, we add a singing voice detector to identify singing voice segments over time more explicitly. We measured model performance in terms of audio, dereverberation, separation, and overall quality. The results show that our proposed model outperforms state-of-the-art singing voice separation models in both objective and subjective evaluations, except in audio quality.
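To make the masking variant described in the abstract concrete, here is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation: the function names, the toy array values, the 0.5 threshold, and the idea of using the detector output to silence non-vocal frames are all hypothetical choices made for illustration. The abstract says only that a binary mask is predicted in the mel-spectrogram domain and that a singing voice detector identifies vocal segments over time.

```python
import numpy as np

def separate_by_mask(mix_mel, mask):
    """Masking variant: keep the mixture's mel bins where the binary mask is 1.

    Assumes a non-negative (linear-magnitude) mel-spectrogram of shape
    (n_mels, n_frames); the mask has the same shape.
    """
    return mix_mel * mask

def gate_with_detector(vocal_mel, voice_prob, threshold=0.5, floor=0.0):
    """Hypothetical use of a voice detector: silence frames judged non-vocal.

    voice_prob holds one probability per frame, shape (n_frames,).
    """
    active = voice_prob >= threshold
    # Broadcast the per-frame decision across all mel bins.
    return np.where(active[None, :], vocal_mel, floor)

# Toy data: 4 mel bins x 3 frames (values are made up).
mix_mel = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0],
                    [7.0, 8.0, 9.0],
                    [10.0, 11.0, 12.0]])
mask = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 1, 0],
                 [0, 0, 1]], dtype=float)
voice_prob = np.array([0.9, 0.2, 0.7])  # frame 1 is treated as non-vocal

masked = separate_by_mask(mix_mel, mask)
gated = gate_with_detector(masked, voice_prob)
```

In the direct-prediction variant, a model would output the vocal mel-spectrogram itself instead of `mask`, so `separate_by_mask` would not be needed. In both cases the resulting mel-spectrogram would then be passed to a neural vocoder to synthesize the waveform, a step this sketch leaves out.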