EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

Paper Title

EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

Authors

Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

Abstract

Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the value of the specified emotion and \textit{Neutral} is set to $α$ and $1-α$ respectively. The $α$ here represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, diverse speech with specified emotion intensity can be generated by sampling in the reverse denoising process.
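
The soft-label idea in the abstract can be illustrated with a small sketch. The code below is not taken from the paper's implementation; it assumes a toy linear softmax classifier (weights `W`, bias `b`) standing in for the emotion classifier, and all function names are hypothetical. It builds a soft label with weight $α$ on the specified emotion and $1-α$ on Neutral, then forms the guidance term as the label-weighted sum of the gradients of the class log-probabilities, which is one natural way to generalize one-hot classifier guidance.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label(num_classes, emotion_idx, neutral_idx, alpha):
    """Soft label: alpha on the target emotion, 1 - alpha on Neutral."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    label = [0.0] * num_classes
    label[neutral_idx] += 1.0 - alpha
    label[emotion_idx] += alpha
    return label


def log_prob_grads(W, b, x):
    """Gradients of log p(k | x) w.r.t. x for every class k.

    Assumption: a toy linear softmax classifier with logits W x + b,
    for which d/dx log p(k | x) = W[k] - sum_j p(j | x) * W[j].
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk
              for row, bk in zip(W, b)]
    p = softmax(logits)
    dim = len(x)
    mean_w = [sum(p[j] * W[j][i] for j in range(len(W))) for i in range(dim)]
    return [[W[k][i] - mean_w[i] for i in range(dim)] for k in range(len(W))]


def soft_label_guidance(W, b, x, label):
    """Label-weighted sum of per-class log-probability gradients."""
    grads = log_prob_grads(W, b, x)
    return [sum(label[k] * grads[k][i] for k in range(len(label)))
            for i in range(len(x))]


if __name__ == "__main__":
    # Hypothetical 3-class setup: 0 = Neutral, 1 = Happy, 2 = Sad.
    W = [[0.0, 0.0], [1.0, -1.0], [0.5, 2.0]]
    b = [0.0, 0.1, -0.2]
    x = [0.3, -0.7]  # stand-in for a noisy latent at one denoising step
    for alpha in (0.0, 0.5, 1.0):
        label = soft_label(3, emotion_idx=1, neutral_idx=0, alpha=alpha)
        print(alpha, soft_label_guidance(W, b, x, label))
```

With $α = 1$ the guidance reduces to ordinary one-hot classifier guidance toward the specified emotion, and with $α = 0$ it points toward Neutral; intermediate values interpolate linearly between the two gradients. In a diffusion sampler, a term of this kind would typically be added to the model's score estimate at each reverse step, but the scaling and the classifier used there are details this sketch does not attempt to reproduce.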
