Paper Title

Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

Authors

Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi

Abstract

We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve naturalness and similarity to unseen target speakers of synthetic speech. Additionally, we find that listeners can discern between a 16kHz and 24kHz sampling rate, and that WaveRNN produces output waveforms of a comparable quality to WaveNet, with a faster inference time.
