PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

Paper Title

PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

Authors

Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, Arvindh Krishnaswamy

Abstract

Neural network applications generally benefit from larger-sized models, but for current speech enhancement models, larger scale networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method helps increase the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings. A new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality. Ablation experiments and objective and human opinion metrics show the benefits of the proposed improvements.
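As a rough illustration of the frequency-positional embedding idea mentioned in the abstract, the sketch below appends a fixed per-bin feature vector to a spectrogram as extra input channels, so that early convolutional layers can tell frequency bins apart. The abstract does not specify the embedding's form: the cosine features, the embedding size of 10, the `(channels, freq, time)` layout, and the function names are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def frequency_positional_embedding(num_bins, dim=10):
    """Return a (dim, num_bins) array of cosine features, one column per bin.

    Assumption: cosines at doubling rates over the normalized bin index.
    The abstract does not state the actual form used in PoCoNet.
    """
    f = np.arange(num_bins) / num_bins        # normalized bin index in [0, 1)
    scales = 2.0 ** np.arange(dim)            # 1, 2, 4, ...
    return np.cos(np.pi * scales[:, None] * f[None, :])

def add_frequency_channels(spec, dim=10):
    """Concatenate the embedding to a (channels, freq, time) input as extra channels."""
    _, num_bins, num_frames = spec.shape
    emb = frequency_positional_embedding(num_bins, dim)      # (dim, freq)
    emb = np.repeat(emb[:, :, None], num_frames, axis=2)     # (dim, freq, time)
    return np.concatenate([spec, emb], axis=0)

# Hypothetical input: 2 channels (e.g. real and imaginary STFT parts),
# 256 frequency bins, 100 time frames.
spec = np.random.randn(2, 256, 100)
out = add_frequency_channels(spec)
print(out.shape)  # (12, 256, 100)
```

The embedding is constant over time and depends only on the bin index, which is why it can simply be tiled along the time axis before concatenation.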
