Singing Voice Synthesis with Vibrato Modeling and Latent Energy Representation

Paper Title

Singing Voice Synthesis with Vibrato Modeling and Latent Energy Representation

Authors

Yingjie Song, Wei Song, Wei Zhang, Zhengchen Zhang, Dan Zeng, Zhi Liu, Yang Yu

Abstract

This paper proposes an expressive singing voice synthesis system that introduces explicit vibrato modeling and a latent energy representation. Vibrato is an inherent characteristic of human singing and is therefore essential to the naturalness of synthesized singing. We introduce a deep learning-based vibrato model that controls the likeliness, rate, depth, and phase of vibrato, where the vibrato likeliness represents the probability that vibrato is present. Because existing singing corpora contain no annotated labels for vibrato likeliness, we adopt a novel labeling method that labels it automatically. In addition, the power spectrogram of audio contains rich information that can improve the expressiveness of singing, so we propose an autoencoder-based latent energy bottleneck feature for expressive singing voice synthesis. Experimental results on the open dataset NUS48E show that both vibrato modeling and the latent energy representation can significantly improve the expressiveness of the singing voice. Audio samples are available on the demo website.
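To make the four vibrato parameters named in the abstract concrete, the following is a minimal sketch of how likeliness, rate, depth, and phase could shape a pitch contour. It is not the paper's deep learning model: the sinusoidal form, the function name `apply_vibrato`, the use of cents for depth, and all parameter names are assumptions chosen for illustration only.

```python
import math

def apply_vibrato(f0_hz, frame_rate_hz, rate_hz, depth_cents, phase_rad, likeliness):
    """Modulate an F0 contour with a sinusoidal vibrato (illustrative only).

    f0_hz:       per-frame base pitch in Hz
    likeliness:  per-frame values in [0, 1]; here they simply scale the depth,
                 which is an assumption and not the paper's formulation
    """
    contour = []
    for n, (f0, p) in enumerate(zip(f0_hz, likeliness)):
        t = n / frame_rate_hz  # time of frame n in seconds
        # Pitch deviation in cents, gated by the vibrato likeliness.
        cents = p * depth_cents * math.sin(2.0 * math.pi * rate_hz * t + phase_rad)
        # 1200 cents correspond to one octave (a factor of 2 in frequency).
        contour.append(f0 * 2.0 ** (cents / 1200.0))
    return contour

# Hypothetical example: a 2 s note at 220 Hz, 100 frames per second,
# with vibrato present only in the second half of the note.
base = [220.0] * 200
likeliness = [0.0] * 100 + [1.0] * 100
contour = apply_vibrato(base, 100.0, 5.5, 50.0, 0.0, likeliness)
```

In this sketch the first half of the note stays flat because its likeliness is zero, while the second half oscillates around 220 Hz at 5.5 Hz with a peak deviation of 50 cents.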

Paper Title