Paper Title

Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models

Authors

Saurabh Kataria, Jesús Villalba, Najim Dehak

Abstract

Deep learning based speech denoising still suffers from the challenge of improving the perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses. A perceptual loss discourages distortion of certain speech properties, and we analyze it using six large-scale pre-trained models: speaker classification, acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+, wav2vec 2.0). We first build a strong baseline (w/o PERL) using Conformer Transformer Networks on the popular enhancement benchmark VCTK-DEMAND. Using the auxiliary models one at a time, we find the acoustic event model and the self-supervised model PASE+ to be most effective. Our best model (PERL-AE) uses only the acoustic event model (utilizing AudioSet) to outperform state-of-the-art methods on major perceptual metrics. To explore whether denoising can leverage the full framework, we use all networks, but find that our seven-loss formulation suffers from the challenges of Multi-Task Learning. Finally, we report a critical observation that state-of-the-art Multi-Task weight learning methods cannot outperform hand tuning, perhaps due to challenges of domain mismatch and weak complementarity of losses.
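The core idea of the perceptual loss described above — freeze several pre-trained networks and penalize distortion between enhanced and clean speech in their feature spaces, alongside a waveform-level term — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: random projections stand in for the pre-trained models (acoustic event classifier, PASE+, etc.), and the loss weights are hypothetical hand-tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_frozen_extractor(dim_in, dim_out, seed):
    # Stand-in for a frozen pre-trained feature extractor; in PERL these
    # would be large networks such as an acoustic event model or PASE+.
    w = np.random.default_rng(seed).standard_normal((dim_in, dim_out))
    return lambda x: np.tanh(x @ w)

extractors = [make_frozen_extractor(160, 32, s) for s in range(3)]
weights = [1.0, 0.5, 0.5]  # hand-tuned per-loss weights (illustrative)

def perceptual_ensemble_loss(enhanced, clean):
    # Waveform-level reconstruction term
    loss = np.mean(np.abs(enhanced - clean))
    # Feature-matching terms: penalize distortion of the speech
    # properties captured by each frozen auxiliary network
    for w, f in zip(weights, extractors):
        loss += w * np.mean(np.abs(f(enhanced) - f(clean)))
    return loss

clean = rng.standard_normal(160)
noisy = clean + 0.1 * rng.standard_normal(160)
print(perceptual_ensemble_loss(noisy, clean))  # positive; 0.0 for a perfect output
```

In training, `enhanced` would be the denoiser's output and the total loss would be backpropagated through the denoiser only, with the auxiliary networks kept frozen; the paper's finding is that how the per-loss weights are set (hand tuning vs. learned Multi-Task weighting) matters.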
