Paper title
MaskedSpeech: Context-aware Speech Synthesis with Masking Strategy
Authors
Abstract
Humans often speak in a continuous manner, which leads to coherent and consistent prosody across neighboring utterances. However, most state-of-the-art speech synthesis systems consider only the information within each sentence and ignore contextual semantic and acoustic features. This makes them inadequate for generating high-quality paragraph-level speech, which requires high expressiveness and naturalness. To synthesize natural and expressive speech for a paragraph, this paper proposes MaskedSpeech, a context-aware speech synthesis system that considers both contextual semantic and contextual acoustic features. Inspired by the masking strategy used in speech editing research, the acoustic features of the current sentence are masked out, concatenated with those of the contextual speech, and used as additional model input. The phoneme encoder takes the concatenated phoneme sequence of neighboring sentences as input and learns fine-grained semantic information from the contextual text. Furthermore, cross-utterance coarse-grained semantic features are employed to improve prosody generation. The model is trained to reconstruct the masked acoustic features, aided by both the contextual semantic and the contextual acoustic features. Experimental results demonstrate that the proposed MaskedSpeech significantly outperforms the baseline system in terms of naturalness and expressiveness.
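The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one previous and one following context sentence, zero-valued masking, mel-spectrogram frames stored as plain lists of floats, and an L1 reconstruction loss restricted to the masked frames; none of these details are stated in the abstract, and the names build_masked_input and masked_l1_loss are hypothetical.

```python
from typing import List, Tuple

# One acoustic frame, e.g. a mel-spectrogram frame (assumed representation).
Frame = List[float]


def build_masked_input(
    prev_mel: List[Frame],
    cur_mel: List[Frame],
    next_mel: List[Frame],
    mask_value: float = 0.0,  # assumption: masked frames are filled with zeros
) -> Tuple[List[Frame], List[bool]]:
    """Concatenate context frames with a masked copy of the current sentence.

    Returns the additional model input and a per-frame flag that is True
    where the frame was masked and therefore has to be reconstructed.
    """
    n_mels = len(cur_mel[0]) if cur_mel else 0
    # Replace every frame of the current sentence with the mask value.
    masked_cur = [[mask_value] * n_mels for _ in cur_mel]
    frames = [list(f) for f in prev_mel] + masked_cur + [list(f) for f in next_mel]
    is_masked = (
        [False] * len(prev_mel) + [True] * len(cur_mel) + [False] * len(next_mel)
    )
    return frames, is_masked


def masked_l1_loss(
    pred: List[Frame], target: List[Frame], is_masked: List[bool]
) -> float:
    """Mean absolute error over masked frames only (assumed loss choice)."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, is_masked):
        if m:
            total += sum(abs(a - b) for a, b in zip(p, t))
            count += len(t)
    return total / count if count else 0.0


if __name__ == "__main__":
    prev = [[1.0, 2.0], [3.0, 4.0]]
    cur = [[5.0, 6.0]]
    nxt = [[7.0, 8.0]]
    frames, is_masked = build_masked_input(prev, cur, nxt)
    print(frames)
    print(is_masked)
```

In this sketch the context frames pass through unchanged, so a model receiving them can condition on the surrounding prosody, while the loss only scores the frames it had to fill in.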