Paper Title
End-to-End Text-to-Speech using Latent Duration based on VQ-VAE
Paper Authors
Paper Abstract
Explicit duration modeling is key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework with explicit duration modeling that incorporates duration into TTS as a discrete latent variable and enables joint optimization of all modules from scratch. We formulate our method as a conditional VQ-VAE to handle discrete duration within a variational autoencoder and provide a theoretical explanation to justify the approach. In our framework, a connectionist temporal classification (CTC)-based forced aligner acts as the approximate posterior, and a text-to-duration model serves as the prior in the variational autoencoder. We evaluated the proposed method with a listening test, comparing it against TTS methods based on soft attention and on explicit duration modeling. The results show that our systems were rated between the soft-attention-based methods (Transformer-TTS, Tacotron2) and the explicit-duration-modeling-based method (FastSpeech).
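For orientation, here is a minimal sketch of the conditional ELBO implied by the abstract, assuming text $x$, speech $y$, and discrete durations $d$ as the latent variable (the symbols and factorization are our reading of the setup, not notation taken from the paper):

$$\log p(y \mid x) \;\geq\; \mathbb{E}_{q(d \mid y, x)}\big[\log p(y \mid d, x)\big] \;-\; D_{\mathrm{KL}}\big(q(d \mid y, x) \,\|\, p(d \mid x)\big)$$

Under this reading, $q(d \mid y, x)$ plays the role the abstract assigns to the CTC-based forced aligner (approximate posterior), $p(d \mid x)$ corresponds to the text-to-duration model (prior), and $p(y \mid d, x)$ to the speech decoder; in the VQ-VAE setting, $d$ is quantized against a discrete codebook rather than sampled from a continuous distribution.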