Paper Title

End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue

Authors

Kentaro Mitsui, Tianyu Zhao, Kei Sawada, Yukiya Hono, Yoshihiko Nankaku, Keiichi Tokuda

Abstract

Recent text-to-speech (TTS) systems have achieved quality comparable to that of human speech; however, their application to spoken dialogue has not been widely studied. This study aims to realize a TTS system that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages. In the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained; this model introduces an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking style representation from speech is trained jointly with the TTS model. In the second stage, a style predictor is trained to predict the speaking style to be synthesized from the dialogue history. During inference, the speaking style representation predicted by the style predictor is passed to VAE/GMVAE-VITS, so that speech can be synthesized in a style appropriate to the dialogue context. Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
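As a rough illustration of what an utterance-level latent variable means in a VAE, the sketch below shows the generic textbook formulation: a style encoder is assumed to output a mean and log-variance for one utterance, a latent is drawn with the reparameterization trick, and a KL term regularizes it toward a standard normal prior. This is not the authors' implementation. The function names, the 4-dimensional latent, and the numeric values are all hypothetical, and the GMVAE variant (which would use a Gaussian mixture prior) is not shown.

```python
import math
import random


def sample_style_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


# Hypothetical style-encoder output for one utterance (made-up values).
mu = [0.3, -1.2, 0.0, 0.8]
log_var = [-0.5, 0.1, 0.0, -2.0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
z = sample_style_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

In a two-stage setup like the one described, a separately trained predictor would presumably supply the latent at inference time in place of the encoder's sample; how that is done in the paper is not stated in the abstract.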
