Paper Title
ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
Paper Authors
Paper Abstract
Recent developments in neural speech synthesis and vocoding have sparked a renewed interest in voice conversion (VC). Beyond timbre transfer, achieving controllability on para-linguistic parameters such as pitch and speed is critical in deploying VC systems in many application scenarios. Existing studies, however, either only provide utterance-level global control or lack interpretability on the controls. In this paper, we propose ControlVC, the first neural voice conversion system that achieves time-varying controls on pitch and speed. ControlVC uses pre-trained encoders to compute pitch and linguistic embeddings from the source utterance and speaker embeddings from the target utterance. These embeddings are then concatenated and converted to speech using a vocoder. It achieves speed control through TD-PSOLA pre-processing on the source utterance, and achieves pitch control by manipulating the pitch contour before feeding it to the pitch encoder. Systematic subjective and objective evaluations are conducted to assess the speech quality and controllability. Results show that, on non-parallel and zero-shot conversion tasks, ControlVC significantly outperforms two other self-constructed baselines on speech quality, and it can successfully achieve time-varying pitch and speed control.
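The abstract's two central steps, manipulating the pitch contour before encoding and concatenating the embeddings, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `apply_pitch_control` and `concat_embeddings`, the semitone parameterization of the control curve, the convention that unvoiced frames carry an F0 of 0, and the assumption that the pitch and linguistic embeddings are frame-aligned while the speaker embedding is a single vector repeated over time are all hypothetical choices made for illustration.

```python
import numpy as np


def apply_pitch_control(f0, semitone_shift):
    """Apply a time-varying pitch shift to a frame-level F0 contour.

    f0             : shape (T,), F0 in Hz per frame; 0 marks unvoiced frames
                     (an assumed convention, not taken from the paper).
    semitone_shift : shape (T,), desired shift in semitones for each frame.
    """
    f0 = np.asarray(f0, dtype=float)
    shift = np.asarray(semitone_shift, dtype=float)
    if f0.shape != shift.shape:
        raise ValueError("f0 and semitone_shift must have the same shape")
    # One semitone corresponds to a frequency ratio of 2 ** (1 / 12).
    factor = 2.0 ** (shift / 12.0)
    # Unvoiced frames stay at 0 so the voicing decision is unchanged.
    return np.where(f0 > 0, f0 * factor, 0.0)


def concat_embeddings(pitch_emb, ling_emb, spk_emb):
    """Concatenate frame-level embeddings with one utterance-level speaker vector.

    pitch_emb : shape (T, Dp)
    ling_emb  : shape (T, Dl)
    spk_emb   : shape (Ds,), repeated for every frame (assumed scheme).
    Returns an array of shape (T, Dp + Dl + Ds).
    """
    num_frames = pitch_emb.shape[0]
    if ling_emb.shape[0] != num_frames:
        raise ValueError("pitch and linguistic embeddings must be frame-aligned")
    spk_tiled = np.tile(spk_emb[None, :], (num_frames, 1))
    return np.concatenate([pitch_emb, ling_emb, spk_tiled], axis=-1)


# Example: raise the pitch by an octave on frames 2 and 3 only.
f0 = np.array([100.0, 0.0, 200.0, 200.0])
shift = np.array([0.0, 12.0, 12.0, -12.0])
shifted = apply_pitch_control(f0, shift)

# Example: combine random stand-in embeddings for a 4-frame utterance.
rng = np.random.default_rng(0)
combined = concat_embeddings(
    rng.normal(size=(4, 2)), rng.normal(size=(4, 3)), rng.normal(size=5)
)
```

Because the control curve is specified per frame, the same function covers both utterance-level control (a constant curve) and time-varying control (any other curve), which is the distinction the abstract draws against earlier systems. Speed control through TD-PSOLA operates on the waveform itself and is not sketched here.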