Paper Title
TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis
Paper Authors
Paper Abstract
In this paper, we propose a text-to-speech (TTS)-driven data augmentation method for improving the quality of a non-autoregressive (AR) TTS system. Recently proposed non-AR models, such as FastSpeech 2, have successfully enabled fast speech synthesis. However, their quality is not satisfactory, especially when the amount of training data is insufficient. To address this problem, we propose an effective data augmentation method using a well-designed AR TTS system. In this method, a large-scale synthetic corpus, consisting of text-waveform pairs with phoneme durations, is generated by the AR TTS system and then used to train the target non-AR model. Perceptual listening test results showed that the proposed method significantly improved the quality of the non-AR TTS system. In particular, we augmented five hours of a training database to 179 hours of a synthetic one. Using these databases, our TTS system, consisting of a FastSpeech 2 acoustic model with a Parallel WaveGAN vocoder, achieved a mean opinion score of 3.74, which is 40% higher than that achieved by the conventional method.
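The augmentation pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ar_synthesize` is a hypothetical stand-in for the well-designed AR TTS system, and the phonemizer, durations, and waveform contents are dummy placeholders. The point is the data flow: unpaired texts go through the AR system to produce text-waveform pairs with phoneme durations, which are then pooled with the real recordings to train the non-AR model.

```python
# Hypothetical sketch of TTS-driven data augmentation (names are illustrative).
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    waveform: list   # speech samples (dummy values here)
    durations: list  # per-phoneme durations in frames


def ar_synthesize(text: str) -> Utterance:
    """Stand-in for the trained AR TTS system: returns a synthetic waveform
    together with phoneme durations (in a real system, extracted from the
    model's attention alignments)."""
    phonemes = text.split()                    # placeholder phonemizer
    durations = [5 for _ in phonemes]          # dummy per-phoneme frame counts
    waveform = [0.0] * (sum(durations) * 256)  # dummy audio, hop size 256
    return Utterance(text, waveform, durations)


def augment(texts, real_corpus):
    """Build the synthetic corpus from unpaired texts and merge it with the
    original recordings; the combined set trains the non-AR model."""
    synthetic = [ar_synthesize(t) for t in texts]
    return real_corpus + synthetic


# Two unpaired texts become two fully aligned training examples.
corpus = augment(["hello world", "data augmentation"], real_corpus=[])
```

In the paper's setup, this generation step is what scales five hours of recordings to 179 hours of aligned training data; the explicit durations in each synthetic pair are what let FastSpeech 2 train without an external aligner.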