Paper Title

Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement

Paper Authors

Wei Song, Yanghao Yue, Ya-jie Zhang, Zhengchen Zhang, Youzheng Wu, Xiaodong He

Paper Abstract

Disentanglement of a speaker's timbre and style is very important for style transfer in multi-speaker multi-style text-to-speech (TTS) scenarios. With timbres and styles disentangled, a TTS system can synthesize expressive speech for a given speaker in any style that has been seen in the training corpus. However, current research on timbre and style disentanglement still has shortcomings. Existing methods either require single-speaker multi-style recordings, which are difficult and expensive to collect, or use complex networks and complicated training procedures, which make it difficult to reproduce the results and to control the style-transfer behavior. To improve the effectiveness of timbre and style disentanglement, and to remove the reliance on single-speaker multi-style corpora, this paper proposes a simple but effective timbre and style disentanglement method. The FastSpeech2 network is employed as the backbone, with explicit duration, pitch, and energy trajectories representing the style. Each speaker's data is treated as a separate and isolated style, and a speaker embedding and a style embedding are added to the FastSpeech2 network to learn disentangled representations. Utterance-level pitch and energy normalization is utilized to improve the decoupling effect. Experimental results demonstrate that the proposed model can synthesize speech in any style seen during training with high style similarity while maintaining very high speaker similarity.
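The abstract mentions two concrete mechanisms: utterance-level pitch/energy normalization and separate speaker/style embeddings added to the FastSpeech2 backbone. The following Python (PyTorch) snippet is a minimal illustrative sketch of these two ideas only; all class, function, and parameter names are assumptions for illustration and are not taken from the paper's implementation.

```python
# Hypothetical sketch of two ideas from the abstract:
# (1) utterance-level pitch/energy normalization, and
# (2) adding separate speaker and style embeddings to a FastSpeech2-style
#     encoder output before the variance adaptor.
# Module and parameter names are illustrative, not from the paper's code.

import numpy as np
import torch
import torch.nn as nn


def normalize_utterance(track: np.ndarray) -> np.ndarray:
    """Z-score a pitch or energy trajectory within one utterance, removing
    speaker-dependent level/range so mostly the style-related contour remains."""
    mean, std = track.mean(), track.std()
    return (track - mean) / (std + 1e-8)


class SpeakerStyleConditioning(nn.Module):
    """Adds a speaker embedding and a separate style embedding to the
    phoneme encoder output (broadcast over the time axis)."""

    def __init__(self, n_speakers: int, n_styles: int, d_model: int = 256):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        self.style_emb = nn.Embedding(n_styles, d_model)

    def forward(self, encoder_out, speaker_id, style_id):
        # encoder_out: (batch, time, d_model)
        spk = self.speaker_emb(speaker_id).unsqueeze(1)  # (batch, 1, d_model)
        sty = self.style_emb(style_id).unsqueeze(1)      # (batch, 1, d_model)
        return encoder_out + spk + sty


# At inference, any seen speaker id can be paired with any seen style id
# to attempt style transfer.
cond = SpeakerStyleConditioning(n_speakers=10, n_styles=10)
enc = torch.randn(2, 37, 256)
out = cond(enc, torch.tensor([3, 7]), torch.tensor([1, 4]))
print(out.shape)  # torch.Size([2, 37, 256])
```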
