Paper Title
End-to-End Text-to-Speech using Latent Duration based on VQ-VAE
Paper Authors
Paper Abstract
Explicit duration modeling is key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework with explicit duration modeling that incorporates duration into TTS as a discrete latent variable and enables joint optimization of all modules from scratch. We formulate our method as a conditional VQ-VAE to handle discrete duration within a variational autoencoder and provide a theoretical explanation to justify the approach. In our framework, a connectionist temporal classification (CTC)-based forced aligner acts as the approximate posterior, and a text-to-duration model serves as the prior in the variational autoencoder. We evaluated the proposed method with a listening test, comparing it against TTS methods based on soft attention and on explicit duration modeling. The results show that our systems were rated between the soft-attention-based methods (Transformer-TTS, Tacotron2) and the explicit-duration-modeling-based method (FastSpeech).
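For orientation, here is a minimal sketch of the conditional ELBO implied by the abstract, assuming text $x$, speech $y$, and discrete durations $d$ as the latent variable (the symbols and factorization are our reading of the setup, not notation taken from the paper):

$$\log p(y \mid x) \;\geq\; \mathbb{E}_{q(d \mid y, x)}\big[\log p(y \mid d, x)\big] \;-\; D_{\mathrm{KL}}\big(q(d \mid y, x) \,\|\, p(d \mid x)\big)$$

Under this reading, $q(d \mid y, x)$ plays the role the abstract assigns to the CTC-based forced aligner (approximate posterior), $p(d \mid x)$ corresponds to the text-to-duration model (prior), and $p(y \mid d, x)$ to the speech decoder; in the VQ-VAE setting, $d$ is quantized against a discrete codebook rather than sampled from a continuous distribution.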