Paper title
Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers
Paper authors
Paper abstract
Building a voice conversion system for noisy target speakers, such as users who provide noisy samples or speakers whose data is found on the Internet, is a challenging task because training on contaminated speech clearly degrades conversion performance. In this paper, we build on our recently proposed Glow-WaveGAN and propose a noise-independent speech representation learning approach for high-quality voice conversion for noisy target speakers. Specifically, we learn a latent feature space in which the target distribution modeled by the conversion model comes exactly from the distribution modeled by the waveform generator. On this premise, we further make the latent features noise-invariant. To this end, we introduce a noise-controllable WaveGAN, whose encoder learns a noise-independent acoustic representation directly from the waveform and whose decoder performs noise control in the hidden space through a FiLM module. For the conversion model, we use a flow-based model to learn the distribution of noise-independent but speaker-related latent features from phoneme posteriorgrams. Experimental results demonstrate that the proposed model achieves high speech quality and speaker similarity in voice conversion for noisy target speakers.
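The abstract names a FiLM module as the mechanism for noise control in the decoder but gives no details. As a rough illustration of the general FiLM idea (feature-wise linear modulation: a conditioning vector is mapped to a per-channel scale and shift that are applied to hidden features), a minimal pure-Python sketch might look like the following. The function names, the one-dimensional noise code (0.0 for clean, 1.0 for noisy), the shapes, and the weights are all hypothetical and are not taken from the paper.

```python
# Minimal FiLM (feature-wise linear modulation) sketch in pure Python.
# Every name, shape, and weight below is a hypothetical illustration;
# none of it is the architecture described in the paper.

def linear(weights, bias, x):
    """Affine map: `weights` is a list of rows, `x` is a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(hidden, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each channel of `hidden` based on `cond`.

    hidden: list of frames, each a list of per-channel features.
    cond:   conditioning vector (assumed here to be a noise-control code).
    """
    gamma = linear(w_gamma, b_gamma, cond)  # one scale per channel
    beta = linear(w_beta, b_beta, cond)     # one shift per channel
    return [[g * h + b for g, h, b in zip(gamma, frame, beta)]
            for frame in hidden]

# Two frames with two channels each (toy hidden features).
hidden = [[1.0, 2.0], [3.0, 4.0]]

# Toy parameters: with cond = [0.0] the scale is 1 and the shift is 0,
# so the hidden features pass through unchanged.
w_gamma, b_gamma = [[0.5], [-0.5]], [1.0, 1.0]
w_beta, b_beta = [[0.1], [0.2]], [0.0, 0.0]

clean = film(hidden, [0.0], w_gamma, b_gamma, w_beta, b_beta)
noisy = film(hidden, [1.0], w_gamma, b_gamma, w_beta, b_beta)

print(clean)
print(noisy)
```

In this toy setup the same hidden features yield different outputs depending only on the conditioning code, which is the property a decoder would need in order to switch noise on or off without changing the encoder's representation. In a trained model the scale and shift projections would be learned rather than fixed by hand.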