Paper Title
Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech
Paper Authors
Paper Abstract
We propose a novel training algorithm for a multi-speaker neural text-to-speech (TTS) model based on multi-task adversarial training. A conventional generative adversarial network (GAN)-based training algorithm significantly improves the quality of synthetic speech by reducing the statistical difference between natural and synthetic speech. However, the algorithm does not guarantee the generalization performance of the trained TTS model when synthesizing voices of unseen speakers who are not included in the training data. Our algorithm alternately trains two deep neural networks: a multi-task discriminator and a multi-speaker neural TTS model (i.e., the generator of the GAN). The discriminator is trained not only to distinguish between natural and synthetic speech but also to verify whether the speaker of the input speech is existent or non-existent (i.e., newly generated by interpolating seen speakers' embedding vectors). Meanwhile, the generator is trained to minimize the weighted sum of the speech reconstruction loss and the adversarial loss for fooling the discriminator, which achieves high-quality multi-speaker TTS even if the target speaker is unseen. Experimental evaluation shows that our algorithm improves the quality of synthetic speech more effectively than the conventional GANSpeech algorithm.
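The abstract describes two coupled objectives: a discriminator with a real/fake task plus a speaker-existence verification task, and a generator minimizing a weighted sum of reconstruction and adversarial losses. The following is a minimal PyTorch-style sketch of how such losses could be wired together; it is not the authors' implementation, and the network sizes, feature shapes, and the weight `lambda_adv` are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of multi-task adversarial losses.
import torch
import torch.nn as nn

class MultiTaskDiscriminator(nn.Module):
    """Shared trunk with two heads: natural-vs-synthetic and existent-vs-non-existent speaker."""
    def __init__(self, mel_dim=80, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(mel_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        )
        self.real_fake_head = nn.Linear(hidden, 1)   # natural vs. synthetic speech
        self.spk_exist_head = nn.Linear(hidden, 1)   # seen speaker vs. interpolated embedding

    def forward(self, mel):                          # mel: (batch, frames, mel_dim)
        h = self.trunk(mel).mean(dim=1)              # average-pool over time frames
        return self.real_fake_head(h), self.spk_exist_head(h)

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(disc, mel_real, mel_fake, spk_exists):
    # spk_exists: (batch, 1) tensor, 1.0 for seen speakers, 0.0 for interpolated ones.
    rf_real, ex_real = disc(mel_real)
    rf_fake, _ = disc(mel_fake.detach())
    loss_rf = bce(rf_real, torch.ones_like(rf_real)) + bce(rf_fake, torch.zeros_like(rf_fake))
    loss_ex = bce(ex_real, spk_exists)               # speaker-existence verification task
    return loss_rf + loss_ex

def generator_loss(disc, mel_fake, mel_target, lambda_adv=0.1):
    # Weighted sum of reconstruction and adversarial losses; lambda_adv is an assumed weight.
    rf_fake, _ = disc(mel_fake)
    loss_recon = nn.functional.l1_loss(mel_fake, mel_target)
    loss_adv = bce(rf_fake, torch.ones_like(rf_fake))
    return loss_recon + lambda_adv * loss_adv
```

In this sketch the two discriminator heads share one trunk, so the existence-verification task acts as an auxiliary signal alongside the usual real/fake task; how the paper actually combines or weights the two tasks is not specified in the abstract.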