Resource-Efficient Speech Mask Estimation for Multi-Channel Speech Enhancement

Paper Title

Resource-Efficient Speech Mask Estimation for Multi-Channel Speech Enhancement

Authors

Lukas Pfeifenberger, Matthias Zöhrer, Günther Schindler, Wolfgang Roth, Holger Fröning, Franz Pernkopf

Abstract

While machine learning techniques are traditionally resource-intensive, we are currently witnessing an increased interest in hardware- and energy-efficient approaches. This need for resource-efficient machine learning is primarily driven by the demand for embedded systems and their usage in ubiquitous computing and IoT applications. In this article, we provide a resource-efficient approach for multi-channel speech enhancement based on Deep Neural Networks (DNNs). In particular, we use reduced-precision DNNs to estimate a speech mask from noisy, multi-channel microphone observations. This speech mask is used to obtain either the Minimum Variance Distortionless Response (MVDR) or the Generalized Eigenvalue (GEV) beamformer. In the extreme case of binary weights and reduced-precision activations, a significant reduction of execution time and memory footprint is possible, while still obtaining an audio quality almost on par with single-precision DNNs and a slightly larger Word Error Rate (WER) for single-speaker scenarios using the WSJ0 speech corpus.
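
To make the mask-to-beamformer step in the abstract concrete, the following is a minimal NumPy sketch of how a speech mask could be turned into MVDR weights. It is not the authors' implementation. The mask is assumed to be given as an array (in the paper it comes from a reduced-precision DNN), the reference-channel MVDR formulation w = (Φ_NN⁻¹ Φ_SS) u / tr(Φ_NN⁻¹ Φ_SS) is an assumption, since the abstract does not say which variant is used, and the function names `masked_psd`, `mvdr_weights`, and `apply_beamformer` are hypothetical.

```python
import numpy as np

def masked_psd(Y, mask):
    """Mask-weighted spatial covariance (PSD) matrix per frequency bin.

    Y:    complex STFT of the microphone signals, shape (F, T, C)
    mask: real-valued mask in [0, 1], shape (F, T)
    Returns an array of shape (F, C, C).
    """
    weighted = Y * mask[..., None]
    # Sum over time of mask * y * y^H for each frequency bin.
    psd = np.einsum("ftc,ftd->fcd", weighted, Y.conj())
    norm = np.maximum(mask.sum(axis=1), 1e-10)
    return psd / norm[:, None, None]

def mvdr_weights(psd_speech, psd_noise, ref_channel=0, eps=1e-8):
    """MVDR weights in the reference-channel form (an assumed variant).

    psd_speech, psd_noise: shape (F, C, C)
    Returns weights of shape (F, C).
    """
    C = psd_noise.shape[-1]
    # Small diagonal loading keeps the noise PSD invertible.
    psd_noise = psd_noise + eps * np.eye(C)
    numerator = np.linalg.solve(psd_noise, psd_speech)  # Phi_NN^-1 Phi_SS
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    # Selecting a column is the same as multiplying by a one-hot vector u.
    return numerator[..., ref_channel] / (trace[:, None] + eps)

def apply_beamformer(w, Y):
    """Apply w^H y per time-frequency bin; returns shape (F, T)."""
    return np.einsum("fc,ftc->ft", w.conj(), Y)
```

Under these assumptions, a speech mask `m` of shape (F, T) would be used as `mvdr_weights(masked_psd(Y, m), masked_psd(Y, 1.0 - m))`, treating `1 - m` as the noise mask; the enhanced STFT is then `apply_beamformer(w, Y)`.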
