Paper Title
Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis
Authors
Abstract
Prosody modeling is an essential component of modern text-to-speech (TTS) frameworks. By explicitly providing prosody features to the TTS model, the style of synthesized utterances can be controlled. However, predicting natural and reasonable prosody at inference time is challenging. In this work, we analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings and propose a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on word-level prosody features. The proposed method outperforms competing methods in terms of audio quality and prosody naturalness in both our objective and subjective evaluations.
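The hierarchical conditioning described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the mean pooling of phoneme states into a word representation, the single linear layers, the concatenation-based conditioning, and all names (`make_linear`, `apply_linear`, `predict_hierarchical_prosody`) are assumptions made only to show the data flow, in which word-level prosody is predicted first and each phoneme-level prediction then takes its word's prosody as an extra input.

```python
import random


def make_linear(in_dim, out_dim, seed):
    """Create a toy linear layer (weight, bias) with random weights (hypothetical helper)."""
    rnd = random.Random(seed)
    weight = [[rnd.gauss(0.0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def apply_linear(vec, layer):
    """Compute vec @ weight + bias for a single input vector."""
    weight, bias = layer
    return [
        sum(v * weight[i][j] for i, v in enumerate(vec)) + bias[j]
        for j in range(len(bias))
    ]


def predict_hierarchical_prosody(phone_hidden, phones_per_word, word_layer, phone_layer):
    """Predict word-level prosody, then phoneme-level prosody conditioned on it.

    phone_hidden:    list of per-phoneme hidden vectors (lists of floats)
    phones_per_word: number of phonemes in each word, summing to len(phone_hidden)
    """
    if any(count <= 0 for count in phones_per_word):
        raise ValueError("every word must contain at least one phoneme")
    if sum(phones_per_word) != len(phone_hidden):
        raise ValueError("phones_per_word must sum to the number of phonemes")

    word_prosody, phone_prosody = [], []
    start = 0
    for count in phones_per_word:
        span = phone_hidden[start:start + count]
        # Assumption: a word is represented by the mean of its phoneme states.
        word_vec = [sum(column) / count for column in zip(*span)]
        word_feat = apply_linear(word_vec, word_layer)
        word_prosody.append(word_feat)
        for hidden in span:
            # Conditioning: the word-level prosody is concatenated to each
            # phoneme state before the phoneme-level prediction.
            phone_prosody.append(apply_linear(hidden + word_feat, phone_layer))
        start += count
    return word_prosody, phone_prosody


if __name__ == "__main__":
    hidden_dim, prosody_dim = 4, 2
    word_layer = make_linear(hidden_dim, prosody_dim, seed=0)
    phone_layer = make_linear(hidden_dim + prosody_dim, prosody_dim, seed=1)
    # Five phonemes grouped into two words of 2 and 3 phonemes.
    phone_hidden = [[0.1 * i + 0.01 * j for j in range(hidden_dim)] for i in range(5)]
    words, phones = predict_hierarchical_prosody(phone_hidden, [2, 3], word_layer, phone_layer)
    print(len(words), len(phones))
```

In this sketch the phoneme-level predictor cannot run until the word-level prediction is available, which is the dependency the hierarchical architecture introduces. A trained system would replace the random linear layers with learned predictors.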