Paper Title

Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments

Paper Authors

Kouhei Sekiguchi, Aditya Arie Nugraha, Yicheng Du, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii

Paper Abstract

This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset that helps a user understand conversations made in real noisy echoic environments (e.g., cocktail party). One may use a state-of-the-art blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) that works well in various environments thanks to its unsupervised nature. Its heavy computational cost, however, prevents its application to real-time processing. In contrast, a supervised beamforming method that uses a deep neural network (DNN) for estimating spatial information of speech and noise readily fits real-time processing, but suffers from drastic performance degradation in mismatched conditions. Given such complementary characteristics, we propose a dual-process robust online speech enhancement method based on DNN-based beamforming with FastMNMF-guided adaptation. FastMNMF (back end) is performed in a mini-batch style and the noisy and enhanced speech pairs are used together with the original parallel training data for updating the direction-aware DNN (front end) with backpropagation at a computationally-allowable interval. This method is used with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a speaker, which can be detected from video or selected by a user's hand gesture or eye gaze, in a streaming manner and spatially showing the transcriptions with an AR technique. Our experiment showed that the word error rate was improved by more than 10 points with the run-time adaptation using only twelve minutes of observation.
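
The dual-process design described in the abstract reduces to a simple control loop: a lightweight direction-aware DNN (front end) processes each incoming frame in real time, while the unsupervised FastMNMF separation (back end) runs over buffered mini-batches and produces enhanced speech used as pseudo-targets to fine-tune the front end at a fixed interval. The following is a minimal, hypothetical sketch of that loop; `MaskNet`, `fastmnmf_enhance`, the buffer handling, and all shapes are illustrative assumptions, not the authors' implementation (the actual system estimates beamforming filters from multichannel STFTs, applies WPE dereverberation, and mixes the original parallel training data into each adaptation step).

```python
# Hypothetical sketch of the dual-process adaptation loop from the abstract.
# MaskNet, fastmnmf_enhance, and the configuration constants are illustrative
# assumptions, not the authors' code.
import torch
import torch.nn as nn

N_FFT, N_MICS, ADAPT_INTERVAL = 512, 4, 100  # assumed configuration
N_FREQ = N_FFT // 2 + 1

class MaskNet(nn.Module):
    """Toy stand-in for the direction-aware DNN (front end).

    Maps one multichannel STFT frame (magnitudes, flattened over mics)
    to a speech-presence mask per frequency bin.
    """
    def __init__(self, n_freq=N_FREQ, n_mics=N_MICS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq * n_mics, 256), nn.ReLU(),
            nn.Linear(256, n_freq), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def fastmnmf_enhance(noisy_batch):
    """Placeholder for mini-batch FastMNMF (back end).

    In the paper this unsupervised blind source separation step yields
    enhanced speech as pseudo-targets; here it is stubbed with an
    identity over the first channel so the sketch runs end to end.
    """
    return noisy_batch[..., 0]

model = MaskNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
buffer = []  # noisy frames awaiting back-end processing

for t in range(1000):  # streaming loop over incoming frames
    frame = torch.rand(1, N_FREQ * N_MICS)  # stand-in for a |STFT| frame
    with torch.no_grad():
        mask = model(frame)  # front end: real-time mask estimation
        # (in the real system, the mask would drive an MVDR-style beamformer)

    buffer.append(frame)
    if len(buffer) >= ADAPT_INTERVAL:  # computationally allowable interval
        noisy = torch.cat(buffer, dim=0)
        target = fastmnmf_enhance(noisy.view(len(buffer), -1, N_MICS))
        # Fine-tune the DNN on FastMNMF pseudo-targets via backpropagation;
        # the paper additionally mixes in the original parallel training data.
        optimizer.zero_grad()
        enhanced = model(noisy) * noisy[:, :N_FREQ]
        loss = nn.functional.mse_loss(enhanced, target)
        loss.backward()
        optimizer.step()
        buffer.clear()
```

The key design point the sketch tries to capture is the decoupling of time scales: the front end must keep up with the audio stream, so only the back-end separation and the backpropagation step, both tolerant of latency, are deferred to the mini-batch interval.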
