MoBoAligner: A Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search

Paper Title

MoBoAligner: a Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search

Authors

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou

Abstract

To speed up inference in neural speech synthesis, non-autoregressive models have recently received increasing attention. In non-autoregressive models, additional durations of text tokens are required to make a hard alignment between the encoder and the decoder. The duration-based alignment plays a crucial role, since it controls the correspondence between text tokens and spectrum frames and determines the rhythm and speed of the synthesized audio. To obtain better duration-based alignment and improve the quality of non-autoregressive speech synthesis, we propose a novel neural alignment model named MoBoAligner. Given pairs of text and mel spectrum, MoBoAligner tries to identify the boundaries of text tokens in the given mel spectrum frames based on token-frame similarity in the neural semantic space, within an end-to-end framework. With these boundaries, durations can be extracted and used to train non-autoregressive TTS models. Compared with durations extracted by TransformerTTS, MoBoAligner improves the MOS of the non-autoregressive TTS model (3.74 compared with FastSpeech's 3.44). In addition, MoBoAligner is task-specific and lightweight, reducing the number of parameters by 45% and the training time by 30%.
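
The step from token boundaries to durations described in the abstract can be illustrated with a minimal sketch. It assumes that each token's boundary is given as an exclusive end-frame index and that the boundaries are non-decreasing. The function names `boundaries_to_durations` and `expand_by_duration` are hypothetical and are not taken from the paper. The second function shows how such durations are commonly used for hard alignment in a duration-based model: each token's representation is repeated once per frame assigned to it.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def boundaries_to_durations(boundaries: Sequence[int]) -> List[int]:
    """Convert monotonic end-frame boundaries (exclusive) into per-token durations.

    Assumption for this sketch: boundaries[i] is the index of the first frame
    that no longer belongs to token i, and frame indexing starts at 0.
    """
    durations = []
    previous_end = 0
    for end in boundaries:
        if end < previous_end:
            raise ValueError("boundaries must be non-decreasing")
        durations.append(end - previous_end)
        previous_end = end
    return durations


def expand_by_duration(tokens: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each token as many times as its duration (hard alignment)."""
    if len(tokens) != len(durations):
        raise ValueError("tokens and durations must have the same length")
    expanded: List[T] = []
    for token, duration in zip(tokens, durations):
        expanded.extend([token] * duration)
    return expanded


# Toy example: three tokens aligned to nine mel frames.
example_boundaries = [3, 5, 9]
example_durations = boundaries_to_durations(example_boundaries)
example_frames = expand_by_duration(["h", "e", "y"], example_durations)
```

In this toy example the durations sum to the last boundary, so the expanded sequence has exactly one entry per mel frame. How the boundaries themselves are found is the part the paper's model addresses and is not sketched here.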
