Paper Title

Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models

Authors

Saurabh Kataria, Jesús Villalba, Najim Dehak

Abstract

Deep learning based speech denoising still suffers from the challenge of improving the perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses. A perceptual loss discourages distortion of certain speech properties, and we analyze it using six large-scale pre-trained models: speaker classification, acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+, wav2vec 2.0). We first build a strong baseline (w/o PERL) using Conformer Transformer Networks on the popular enhancement benchmark VCTK-DEMAND. Using the auxiliary models one at a time, we find the acoustic event model and the self-supervised model PASE+ to be most effective. Our best model (PERL-AE) uses only the acoustic event model (utilizing AudioSet) to outperform state-of-the-art methods on major perceptual metrics. To explore whether denoising can leverage the full framework, we use all networks, but find that our seven-loss formulation suffers from the challenges of Multi-Task Learning. Finally, we report a critical observation that state-of-the-art Multi-Task weight learning methods cannot outperform hand tuning, perhaps due to challenges of domain mismatch and weak complementarity of losses.
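The core idea of the perceptual loss described above — freeze several pre-trained networks and penalize distortion between enhanced and clean speech in their feature spaces, alongside a waveform-level term — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: random projections stand in for the pre-trained models (acoustic event classifier, PASE+, etc.), and the loss weights are hypothetical hand-tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_frozen_extractor(dim_in, dim_out, seed):
    # Stand-in for a frozen pre-trained feature extractor; in PERL these
    # would be large networks such as an acoustic event model or PASE+.
    w = np.random.default_rng(seed).standard_normal((dim_in, dim_out))
    return lambda x: np.tanh(x @ w)

extractors = [make_frozen_extractor(160, 32, s) for s in range(3)]
weights = [1.0, 0.5, 0.5]  # hand-tuned per-loss weights (illustrative)

def perceptual_ensemble_loss(enhanced, clean):
    # Waveform-level reconstruction term
    loss = np.mean(np.abs(enhanced - clean))
    # Feature-matching terms: penalize distortion of the speech
    # properties captured by each frozen auxiliary network
    for w, f in zip(weights, extractors):
        loss += w * np.mean(np.abs(f(enhanced) - f(clean)))
    return loss

clean = rng.standard_normal(160)
noisy = clean + 0.1 * rng.standard_normal(160)
print(perceptual_ensemble_loss(noisy, clean))  # positive; 0.0 for a perfect output
```

In training, `enhanced` would be the denoiser's output and the total loss would be backpropagated through the denoiser only, with the auxiliary networks kept frozen; the paper's finding is that how the per-loss weights are set (hand tuning vs. learned Multi-Task weighting) matters.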
