WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training

Paper title

WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training

Authors

Zewang Zhang, Yibin Zheng, Xinhui Li, Li Lu

Abstract

This paper introduces a robust singing voice synthesis (SVS) system that efficiently produces very natural and realistic singing voices by leveraging an adversarial training strategy. On the one hand, we design simple but generic random area conditional discriminators to help supervise the acoustic model, which effectively avoids over-smoothed spectrogram prediction and improves the expressiveness of SVS. On the other hand, we subtly combine the spectrogram with the frame-level linearly-interpolated F0 sequence as the input to the neural vocoder, which is then optimized with the help of multiple adversarial conditional discriminators in the waveform domain and multi-scale distance functions in the frequency domain. Experimental results and ablation studies show that, compared with our previous auto-regressive work, our new system can efficiently produce high-quality singing voices when fine-tuned on different singing datasets ranging from several minutes to a few hours. A large number of synthesized songs with different timbres are available online at https://zzw922cn.github.io/wesinger2, and we highly recommend that readers listen to them.
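
As an illustration of the "frame-level linearly-interpolated F0 sequence" mentioned in the abstract, the following is a minimal Python sketch. It assumes that unvoiced frames are marked with F0 = 0 and are filled by a straight line between the neighbouring voiced frames, with the edges holding the nearest voiced value. The abstract does not specify this convention, and the function name `interpolate_f0` is hypothetical rather than taken from the paper.

```python
from typing import List


def interpolate_f0(f0: List[float]) -> List[float]:
    """Fill unvoiced frames (F0 == 0) by linear interpolation.

    Assumption (not from the paper): 0 marks an unvoiced frame, and
    leading/trailing unvoiced frames copy the nearest voiced value.
    """
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        return list(f0)  # no voiced frame to interpolate from

    out = list(f0)
    # Edges: hold the first and last voiced values.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]

    # Interior gaps: straight line between the surrounding voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span
            out[i] = (1 - w) * f0[left] + w * f0[right]
    return out


# Toy contour in Hz: frames 0, 2, 3 and 5 are unvoiced.
contour = [0.0, 220.0, 0.0, 0.0, 250.0, 0.0]
filled = interpolate_f0(contour)
```

The resulting contour is continuous across unvoiced regions, so it could be stacked frame by frame with the spectrogram as vocoder input. How the actual system represents and combines these features is not described in the abstract.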
