VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics

Paper Title

VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics

Authors

Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Shogo Seki

Abstract

In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel waveform generation method, VoiceGrad is based upon the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator, a fully convolutional network with a U-Net structure designed to predict the gradient of the log density of the speech feature sequences of multiple speakers, and performs VC by using annealed Langevin dynamics to iteratively update an input feature sequence towards the nearest stationary point of the target distribution based on the trained score approximator network. Thanks to the nature of this concept, VoiceGrad enables any-to-many VC, a VC scenario in which the speaker of input speech can be arbitrary, and allows for non-parallel training, which requires no parallel utterances or transcriptions.
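
The annealed Langevin update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the trained U-Net score approximator is replaced by a hypothetical analytic score for a toy Gaussian "target" distribution, and the names `geometric_sigmas`, `toy_score`, and `annealed_langevin`, the geometric noise schedule, and the step-size rule (step size proportional to the squared noise level, as in common formulations of annealed Langevin dynamics) are all assumptions made for illustration.

```python
import math
import random


def geometric_sigmas(sigma_max, sigma_min, num_levels):
    """Noise levels decreasing geometrically from sigma_max to sigma_min.

    Assumes num_levels >= 2.
    """
    ratio = sigma_min / sigma_max
    return [sigma_max * ratio ** (i / (num_levels - 1)) for i in range(num_levels)]


def toy_score(x, sigma, mu=2.0, s=0.5):
    """Analytic score of N(mu, s^2) smoothed by Gaussian noise of std sigma.

    Hypothetical stand-in for a trained score approximator: the smoothed
    density is N(mu, s^2 + sigma^2), so the gradient of its log density is
    -(x - mu) / (s^2 + sigma^2) in every dimension.
    """
    var = s * s + sigma * sigma
    return [-(xi - mu) / var for xi in x]


def annealed_langevin(x0, score_fn, sigmas, steps_per_level=50, eps=0.005, rng=None):
    """Move x0 towards the target distribution using annealed Langevin dynamics.

    x0 plays the role of the input feature sequence (flattened to a list).
    For each noise level, from largest to smallest, the update is
        x <- x + (alpha / 2) * score(x, sigma) + sqrt(alpha) * z,  z ~ N(0, 1),
    with alpha = eps * (sigma / sigma_min)^2 (an assumed step-size rule).
    """
    rng = rng or random.Random(0)
    x = list(x0)
    sigma_min = sigmas[-1]
    for sigma in sigmas:
        alpha = eps * (sigma / sigma_min) ** 2
        for _ in range(steps_per_level):
            score = score_fn(x, sigma)
            x = [
                xi + 0.5 * alpha * gi + math.sqrt(alpha) * rng.gauss(0.0, 1.0)
                for xi, gi in zip(x, score)
            ]
    return x


if __name__ == "__main__":
    # "Source" features start far from the toy target mean of 2.0.
    source = [-3.0] * 200
    converted = annealed_langevin(source, toy_score, geometric_sigmas(1.0, 0.1, 10))
    print("mean after conversion:", sum(converted) / len(converted))
```

In this sketch the chain starts from the input features rather than from pure noise, mirroring the abstract's description of updating an input feature sequence towards the target distribution. In an actual system the score function would also be conditioned on the target speaker, which the toy example omits.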

Paper Title