Paper Title
Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis
Paper Authors
Paper Abstract
This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases in order to capture diverse prosodic representations. Therefore, a trade-off arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes into a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted during the previous step. Both qualitative and quantitative evaluations are used to demonstrate the effectiveness of the proposed approach. Audio samples are available on our demo page.
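The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch: stage one is assumed to have already extracted phoneme-level posterior latents from speech, and stage two trains a separate prior network that pools text encodings into an utterance-level vector and uses it to predict those token-level latents. All dimensions, the mean-pooling choice, and the linear layers here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): T phonemes per utterance,
# D_text text-encoding size, D_utt utterance-level bottleneck, D_lat latent size.
T, D_text, D_utt, D_lat = 20, 64, 16, 8

# Stage 1 (assumed already done): phoneme-level posterior latents z_post
# extracted by a token-level reference encoder from ground-truth speech.
z_post = rng.normal(size=(T, D_lat))

# Stage 2: prior network. Text encodings -> utterance-level representation
# (mean pooling + projection) -> broadcast back to predict token-level latents.
text_enc = rng.normal(size=(T, D_text))
W_utt = rng.normal(size=(D_text, D_utt)) * 0.1          # utterance-level projection
W_prior = rng.normal(size=(D_text + D_utt, D_lat)) * 0.1  # latent predictor

u = np.tanh(text_enc.mean(axis=0) @ W_utt)               # utterance-level vector
cond = np.concatenate([text_enc, np.tile(u, (T, 1))], axis=1)
z_prior = cond @ W_prior                                 # predicted token latents

# Because the prior is trained separately, this loss never back-propagates
# into the stage-1 posterior encoder, which is what sidesteps the
# diversity-vs-disentanglement trade-off described above.
mse = float(np.mean((z_prior - z_post) ** 2))
print(z_prior.shape, mse)
```

The key design point the sketch mirrors is the decoupling: the rich token-level latent space is fixed before the prior network ever sees it, so enlarging that space for prosodic diversity cannot leak coarse-grained information back into its training objective.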