Paper Title

Anchored Speech Recognition with Neural Transducers

Paper Authors

Desh Raj, Junteng Jia, Jay Mahadeokar, Chunyang Wu, Niko Moritz, Xiaohui Zhang, Ozlem Kalinli

Paper Abstract


Neural transducers have achieved human level performance on standard speech recognition benchmarks. However, their performance significantly degrades in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio. Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., wake-words) to recognize device-directed speech while ignoring interfering background speech. In this paper, we investigate anchored speech recognition to make neural transducers robust to background speech. We extract context information from the anchor segment with a tiny auxiliary network, and use encoder biasing and joiner gating to guide the transducer towards the target speech. Moreover, to improve the robustness of context embedding extraction, we propose auxiliary training objectives to disentangle lexical content from speaking style. We evaluate our methods on synthetic LibriSpeech-based mixtures comprising several SNR and overlap conditions; they improve relative word error rates by 19.6% over a strong baseline, when averaged over all conditions.
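The abstract's pipeline (a tiny auxiliary network summarizes the anchor segment into a context embedding, which then biases the encoder and gates the joiner) can be sketched with plain arrays. This is a minimal illustration under assumed shapes and randomly initialized projections; the function names, mean-pooling summarizer, and additive/multiplicative forms are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # model dimension (assumed for illustration)

def anchor_encoder(anchor_frames):
    """Stand-in for the tiny auxiliary network: pool the anchor
    (e.g. wake-word) frames into a single context embedding."""
    return anchor_frames.mean(axis=0)  # (D,)

def encoder_bias(encoder_out, context, w_bias):
    """Encoder biasing (assumed additive form): add a projected
    context vector to every encoder frame."""
    return encoder_out + context @ w_bias  # broadcast over time

def joiner_gate(joiner_out, context, w_gate):
    """Joiner gating (assumed multiplicative form): scale joiner
    activations by a sigmoid gate derived from the context."""
    gate = 1.0 / (1.0 + np.exp(-(context @ w_gate)))  # (D,) in (0, 1)
    return joiner_out * gate  # broadcast over (T, U) lattice

anchor = rng.standard_normal((5, D))     # anchor-segment frames
enc = rng.standard_normal((20, D))       # encoder output, T frames
joint = rng.standard_normal((20, 4, D))  # joiner output, (T, U, D)

ctx = anchor_encoder(anchor)
enc_biased = encoder_bias(enc, ctx, rng.standard_normal((D, D)))
joint_gated = joiner_gate(joint, ctx, rng.standard_normal((D, D)))
print(enc_biased.shape, joint_gated.shape)
```

The key property the sketch shows is that one fixed-size embedding, computed once from the anchor, conditions every time step of the transducer, so the model can suppress frames whose characteristics do not match the anchor speaker.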
