Paper title
HMM-based data augmentation for E2E systems for building conversational speech synthesis systems
Paper authors
Paper abstract
This paper proposes an approach to building a high-quality text-to-speech (TTS) system for technical domains using data augmentation. An end-to-end (E2E) system is trained on speech synthesized by a hidden Markov model (HMM) based system and then fine-tuned with studio-recorded TTS data to improve the timbre of the synthesized voice. The motivation is that word skips and repetitions are usually absent in HMM systems because they model the duration distribution of phonemes accurately. Context-dependent pentaphone modeling, together with tree-based clustering and state-tying, handles unseen contexts and out-of-vocabulary words. A language model is also employed to further reduce synthesis errors. Subjective evaluations indicate that speech produced by the proposed system, which combines complementary attributes of the HMM and E2E frameworks, is superior to the baseline E2E synthesis approach in terms of intelligibility. Further analysis highlights the efficacy of the proposed approach in low-resource scenarios.