Paper Title
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Paper Authors
Paper Abstract
Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real time on a single V100 GPU. We further show the generality of HiFi-GAN to mel-spectrogram inversion of unseen speakers and to end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times faster than real time on CPU with quality comparable to an autoregressive counterpart.
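The periodicity claim in the abstract corresponds to the paper's multi-period discriminator, which folds the 1D waveform into a 2D grid of width p so that samples spaced a period apart line up in one convolutional axis. Below is a minimal PyTorch-style sketch of that idea, assuming illustrative channel widths, kernel sizes, and the hypothetical class name PeriodDiscriminatorSketch; it is not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodDiscriminatorSketch(nn.Module):
    """Illustrative period-based discriminator: fold a 1D waveform into a
    2D grid of shape (T/p, p) so that samples spaced p apart align along
    one axis, then apply 2D convolutions over that grid."""
    def __init__(self, period: int):
        super().__init__()
        self.period = period
        # Channel widths, kernel sizes, and strides here are assumptions
        # for illustration, not the paper's architecture.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, 32, kernel_size=(5, 1), stride=(3, 1), padding=(2, 0)),
            nn.Conv2d(32, 128, kernel_size=(5, 1), stride=(3, 1), padding=(2, 0)),
            nn.Conv2d(128, 256, kernel_size=(5, 1), stride=(3, 1), padding=(2, 0)),
        ])
        self.out = nn.Conv2d(256, 1, kernel_size=(3, 1), padding=(1, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T). Pad T up to a multiple of the period, then fold.
        b, c, t = x.shape
        if t % self.period != 0:
            pad = self.period - (t % self.period)
            x = F.pad(x, (0, pad), mode="reflect")
            t = t + pad
        x = x.view(b, c, t // self.period, self.period)  # (B, 1, T/p, p)
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.1)
        return self.out(x)  # per-patch real/fake scores

# Usage sketch: one discriminator per period. The paper uses several
# distinct prime periods so the folded views overlap as little as possible.
if __name__ == "__main__":
    wave = torch.randn(1, 1, 22050)  # one second at 22.05 kHz
    scores = [PeriodDiscriminatorSketch(p)(wave) for p in (2, 3, 5, 7, 11)]
    print([s.shape for s in scores])
```

The design point the sketch isolates is the reshape itself: a plain 1D convolution over the waveform mixes neighboring samples, whereas folding by p lets each 2D filter compare samples exactly one period apart, which is what makes periodic structure directly visible to the discriminator.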