Paper Title
Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0
Paper Authors
Abstract
Neural network-based text-to-speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron2, FastSpeech, FastPitch) usually generate a Mel-spectrogram from text and then synthesize speech with a vocoder (e.g., WaveNet, WaveGlow, HiFiGAN). Compared with traditional parametric approaches (e.g., STRAIGHT and WORLD), end-to-end models based on neural vocoders suffer from slow inference, and the synthesized speech is often not robust and lacks controllability. In this work, we propose a novel updated vocoder built on a simple signal model that is easy to train and from which waveforms are easy to generate. We use a Gaussian-Markov model for robust learning of the spectral envelope, and wavelet-based statistical signal processing to characterize and decompose F0 features. The vocoder retains the fine spectral envelope and achieves high controllability of natural speech. Experimental results demonstrate that the proposed vocoder achieves better naturalness of reconstructed speech than the conventional STRAIGHT vocoder, slightly better than WaveNet, and somewhat worse than WaveRNN.