Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer

Paper title

Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer

Paper author

Nercessian, Shahan

Abstract

In this paper, we propose a differentiable WORLD synthesizer and demonstrate its use in end-to-end audio style transfer tasks such as (singing) voice conversion and the DDSP timbre transfer task. Accordingly, our baseline differentiable synthesizer has no model parameters, yet it yields adequate synthesis quality. We can extend the baseline synthesizer by appending lightweight black-box postnets which apply further processing to the baseline output in order to improve fidelity. An alternative differentiable approach considers extraction of the source excitation spectrum directly, which can improve naturalness albeit for a narrower class of style transfer applications. The acoustic feature parameterization used by our approaches has the added benefit that it naturally disentangles pitch and timbral information so that they can be modeled separately. Moreover, as there exists a robust means of estimating these acoustic features from monophonic audio sources, it allows for parameter loss terms to be added to an end-to-end objective function, which can help convergence and/or further stabilize (adversarial) training.
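
The abstract's last point, that parameter loss terms can be added to the end-to-end objective, can be illustrated with a minimal sketch. It is not the paper's formulation: the L1 distance, the weight of 0.1, the feature names `f0`, `sp` and `ap` (the fundamental frequency, spectral envelope and aperiodicity that WORLD analysis produces), and the function names are all assumptions made for illustration.

```python
import numpy as np


def l1(pred, ref):
    """Mean absolute error between two arrays of the same shape."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.mean(np.abs(pred - ref)))


def combined_objective(audio_loss, pred_feats, ref_feats, weight=0.1):
    """Add a weighted parameter loss to an existing audio-domain loss.

    pred_feats / ref_feats: dicts of acoustic features keyed by name
    (hypothetical keys "f0", "sp", "ap"). The reference features are
    assumed to come from a WORLD-style analysis of the target audio.
    """
    # One L1 term per acoustic feature, summed (an assumed choice).
    param_loss = sum(l1(pred_feats[k], ref_feats[k]) for k in ref_feats)
    return audio_loss + weight * param_loss


# Toy example: two frames, three frequency bins.
ref = {
    "f0": np.array([100.0, 200.0]),
    "sp": np.zeros((2, 3)),
    "ap": np.ones((2, 3)),
}
pred = dict(ref, f0=ref["f0"] + 10.0)  # only the pitch track is off
loss = combined_objective(0.5, pred, ref, weight=0.1)
```

In an actual training setup the same terms would be computed with a framework that supports automatic differentiation, so that gradients flow through the synthesizer back to the model predicting the features.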
