Paper Title
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Paper Authors
Paper Abstract
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in the output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, and 2) the durations extracted from the teacher model are not accurate enough, and the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth targets instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch, and energy from the speech waveform, take them directly as conditional inputs during training, and use predicted values during inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of fully end-to-end inference. Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at https://speechresearch.github.io/fastspeech2/.
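The abstract's core idea is to condition the model on duration, pitch, and energy: ground-truth values (extracted from the recorded waveform) during training, and predictor outputs during inference. Below is a minimal PyTorch sketch of that conditioning scheme, not the released FastSpeech 2 code; the predictor depth, hidden size, number of quantization bins, bin range, and the `length_regulate` helper are illustrative assumptions.

```python
# Minimal sketch of variance conditioning as described in the abstract
# (assumed hyperparameters; not the official FastSpeech 2 implementation).
import torch
import torch.nn as nn


class VariancePredictor(nn.Module):
    """Two 1D-conv layers predicting one scalar (duration/pitch/energy) per position."""

    def __init__(self, hidden: int = 256, kernel: int = 3, dropout: float = 0.5):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                               # x: (batch, time, hidden)
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h).squeeze(-1)                  # (batch, time)


class VarianceAdaptor(nn.Module):
    """Adds pitch/energy information onto the encoder output and predicts durations."""

    def __init__(self, hidden: int = 256, n_bins: int = 256):
        super().__init__()
        self.duration_predictor = VariancePredictor(hidden)
        self.pitch_predictor = VariancePredictor(hidden)
        self.energy_predictor = VariancePredictor(hidden)
        # Quantize pitch/energy into bins and embed them so they can be added back
        # to the hidden sequence as conditional inputs (bin range is a placeholder).
        self.register_buffer("bins", torch.linspace(0.0, 1.0, n_bins - 1))
        self.pitch_embed = nn.Embedding(n_bins, hidden)
        self.energy_embed = nn.Embedding(n_bins, hidden)

    def forward(self, x, gt_pitch=None, gt_energy=None):
        dur_pred = self.duration_predictor(x)
        pitch_pred = self.pitch_predictor(x)
        energy_pred = self.energy_predictor(x)
        # Training: condition on ground-truth pitch/energy extracted from the waveform.
        # Inference: fall back to the predictors' own outputs.
        pitch = gt_pitch if gt_pitch is not None else pitch_pred
        energy = gt_energy if gt_energy is not None else energy_pred
        x = x + self.pitch_embed(torch.bucketize(pitch, self.bins))
        x = x + self.energy_embed(torch.bucketize(energy, self.bins))
        return x, dur_pred, pitch_pred, energy_pred


def length_regulate(hidden_seqs, durations):
    """Expand each phoneme hidden state by its integer duration (one utterance per list item)."""
    return [h.repeat_interleave(d, dim=0) for h, d in zip(hidden_seqs, durations)]


if __name__ == "__main__":
    adaptor = VarianceAdaptor()
    enc_out = torch.randn(2, 17, 256)            # encoder output: 2 utterances, 17 phonemes each
    out, dur, pitch, energy = adaptor(enc_out)   # inference-style call (no ground truth given)
    frames = length_regulate(out, dur.round().clamp(min=1).long())
    print(out.shape, dur.shape, frames[0].shape)
```

At training time, `gt_pitch` and `gt_energy` would hold values extracted from the recorded speech and losses would be applied to all three predictors; the paper's own extraction and alignment pipeline is not reproduced here.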