Paper Title
AdaVocoder: Adaptive Vocoder for Custom Voice
Paper Authors
Paper Abstract
Custom voice aims to build a personal speech synthesis system by adapting a source speech synthesis model to a target speaker using only a few target recordings. The usual solution is to combine an adaptive acoustic model with a robust vocoder. However, training a robust vocoder typically requires a multi-speaker dataset covering various age groups and timbres so that the trained vocoder generalizes to unseen speakers. Collecting such a multi-speaker dataset is difficult, and its distribution always mismatches that of the target speaker's data. This paper proposes an adaptive vocoder for custom voice from another, novel perspective to address these problems. The adaptive vocoder mainly uses a cross-domain consistency loss to mitigate the overfitting that GAN-based neural vocoders encounter in few-shot transfer learning. We construct two adaptive vocoders, AdaMelGAN and AdaHiFi-GAN. First, we pre-train the source vocoder models on the AISHELL3 and CSMSC datasets, respectively. Then, we fine-tune them on the internal dataset VXI-children with only a few adaptation samples. Empirical results show that a high-quality custom voice system can be built by combining an adaptive acoustic model with an adaptive vocoder.
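The abstract does not spell out the form of the cross-domain consistency loss. As a rough illustration only, the sketch below follows the common cross-domain correspondence formulation from few-shot GAN adaptation: for the same batch of inputs, the pairwise-similarity distribution of features from the fine-tuned (target) model is encouraged to match that of the frozen source model via a KL divergence. All function names and the use of cosine similarity over flat feature vectors are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, numerically stabilised."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cosine_sim_matrix(feats):
    """Pairwise cosine similarities for a (batch, dim) feature matrix."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

def cross_domain_consistency_loss(src_feats, tgt_feats):
    """KL divergence between the pairwise-similarity distributions of
    source-model and target-model features for the same input batch.
    A lower value means the fine-tuned model preserves the relative
    structure of the source model's feature space (hypothetical sketch).
    """
    n = src_feats.shape[0]
    sim_src = cosine_sim_matrix(src_feats)
    sim_tgt = cosine_sim_matrix(tgt_feats)
    # Drop self-similarity (the diagonal) before normalising each row.
    mask = ~np.eye(n, dtype=bool)
    p = softmax(sim_src[mask].reshape(n, n - 1))  # source distribution (fixed)
    q = softmax(sim_tgt[mask].reshape(n, n - 1))  # target distribution (trained)
    return float(np.sum(p * (np.log(p) - np.log(q))) / n)
```

In training, this term would be added to the usual GAN losses during fine-tuning; identical source and target features yield a loss of zero, and the KL term is always non-negative.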