Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Paper title

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Authors

Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon

Abstract

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2, at synthesis with comparable speech quality. We further show that our model can be easily extended to a multi-speaker setting.
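The abstract describes using dynamic programming to find the most probable monotonic alignment between text and latent speech frames. The following is a minimal sketch of one such dynamic program, not necessarily the authors' exact procedure. It assumes a precomputed matrix `log_lik`, where `log_lik[i][j]` is the log-likelihood of latent frame `j` under the distribution associated with text token `i`. It also assumes that every token must cover at least one frame and that no token is skipped. The function name and the input layout are hypothetical and chosen only for illustration.

```python
import math


def monotonic_alignment_search(log_lik):
    """Return align, where align[j] is the token index assigned to frame j.

    log_lik[i][j] is an assumed, precomputed log-likelihood of frame j
    under token i. The result is monotonic and surjective: token indices
    never decrease and increase by at most one between adjacent frames.
    """
    n_tokens = len(log_lik)
    n_frames = len(log_lik[0])
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")

    # q[i][j]: best total log-likelihood of aligning frames 0..j to
    # tokens 0..i, with frame j assigned to token i.
    q = [[-math.inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        # Token i cannot be reached before frame i, so i <= j.
        for i in range(min(j, n_tokens - 1) + 1):
            stay = q[i][j - 1]                       # frame j-1 on same token
            advance = q[i - 1][j - 1] if i > 0 else -math.inf  # previous token
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    align = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        align[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return align


# Toy example: 3 tokens, 4 frames; the middle token best explains two frames.
example = [
    [0, -5, -5, -5],
    [-5, 0, 0, -5],
    [-5, -5, -5, 0],
]
print(monotonic_alignment_search(example))  # → [0, 1, 1, 2]
```

The table has one entry per (token, frame) pair, so the search costs time proportional to the number of tokens times the number of frames. The alignment it returns is hard: each frame is assigned to exactly one token.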
