Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion

Paper title

Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion

Authors

Ding Ma, Lester Phillip Violeta, Kazuhiro Kobayashi, Tomoki Toda

Abstract

Sequence-to-sequence (seq2seq) voice conversion (VC) models have greater potential than conventional VC models for converting electrolaryngeal (EL) speech to normal speech (EL2SP). However, EL2SP based on seq2seq VC requires a sufficiently large amount of parallel data for model training, and it suffers significant performance degradation when the training data is insufficient. To address this issue, we propose a novel two-stage strategy to optimize the performance of EL2SP based on seq2seq VC when only a small parallel dataset is available. Unlike previous studies that rely on high-quality data augmentation, we first combine a large amount of imperfect synthetic parallel data of EL and normal speech with the original dataset for VC training. Then, a second training stage is conducted with the original parallel dataset only. The results show that the proposed method progressively improves the performance of EL2SP based on seq2seq VC.
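
The two-stage schedule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the pair representation, the function name `two_stage_train`, and the `train_fn` callback are all hypothetical, and the abstract does not specify the model, optimizer, or number of training steps, so those are left to the callback.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: (EL utterance id, normal-speech utterance id).
Pair = Tuple[str, str]


def two_stage_train(
    original: Sequence[Pair],
    synthetic: Sequence[Pair],
    train_fn: Callable[[List[Pair], str], None],
) -> None:
    """Run stage 1 on synthetic + original pairs, then stage 2 on original pairs only."""
    # Stage 1: large, imperfect synthetic parallel data combined with the original data.
    train_fn(list(synthetic) + list(original), "stage1")
    # Stage 2: continue training with the original parallel dataset only.
    train_fn(list(original), "stage2")


# Usage with a stand-in trainer (an assumption) that only records what it receives.
log = []


def record(pairs: List[Pair], stage: str) -> None:
    log.append((stage, len(pairs)))


two_stage_train(
    original=[("el_001", "sp_001"), ("el_002", "sp_002")],
    synthetic=[("syn_el_%03d" % i, "syn_sp_%03d" % i) for i in range(10)],
    train_fn=record,
)
```

In a real system `train_fn` would update the same seq2seq VC model in both calls, so that stage 2 starts from the stage 1 parameters; here it only logs the stage name and the number of pairs it was given.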
