Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

Paper Title

Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

Authors

Bo Wu, Meng Yu, Lianwu Chen, Yong Xu, Chao Weng, Dan Su, Dong Yu

Abstract

Speech enhancement techniques based on deep learning have brought significant improvements in speech quality and intelligibility. Nevertheless, a large gain in speech quality as measured by objective metrics, such as perceptual evaluation of speech quality (PESQ), does not necessarily lead to improved speech recognition performance, because the enhancement stage can introduce speech distortion. In this paper, a frequency-domain model based on a multi-channel dilated convolutional network is presented to enhance the target speaker in far-field, noisy, and multi-talker conditions. We study three approaches towards distortionless waveforms for overlapped speech recognition: estimating a complex ideal ratio mask with an infinite range, incorporating an fbank loss in multi-objective learning, and fine-tuning the enhancement model with an acoustic model. Experimental results demonstrate the effectiveness of all three approaches in reducing speech distortion and improving recognition accuracy. In particular, the jointly tuned enhancement model works very well with other standalone acoustic models on real test data.
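
The first of the three approaches in the abstract, a complex ideal ratio mask with an infinite range, can be illustrated with a small NumPy sketch. This is not the authors' implementation: it only computes the oracle mask as the complex ratio of a target spectrogram to a mixture spectrogram, without compressing or clipping the values, which is one common reading of "infinite range". The function names, the `eps` stabilizer, and the random arrays standing in for STFT frames are all assumptions made for illustration; in the paper a network would estimate such a mask from multi-channel input.

```python
import numpy as np

def complex_ideal_ratio_mask(target_stft, mixture_stft, eps=1e-12):
    """Oracle complex ratio mask M = S / Y per time-frequency bin.

    No compression or clipping is applied, so the real and imaginary
    parts are unbounded. `eps` is a hypothetical stabilizer that guards
    against division by zero.
    """
    denom = mixture_stft.real ** 2 + mixture_stft.imag ** 2 + eps
    mask_real = (mixture_stft.real * target_stft.real
                 + mixture_stft.imag * target_stft.imag) / denom
    mask_imag = (mixture_stft.real * target_stft.imag
                 - mixture_stft.imag * target_stft.real) / denom
    return mask_real + 1j * mask_imag

def apply_mask(mask, mixture_stft):
    """Complex multiplication of mask and mixture yields the estimate."""
    return mask * mixture_stft

# Random complex arrays stand in for (frequency, time) STFT frames.
rng = np.random.default_rng(0)
shape = (8, 6)
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interference = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = target + interference

mask = complex_ideal_ratio_mask(target, mixture)
estimate = apply_mask(mask, mixture)
```

Because the mask is left unbounded, applying the oracle mask to the mixture recovers the target spectrogram up to the small error introduced by `eps`; a bounded or compressed mask would instead saturate in bins where the target is much stronger than the mixture.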

Paper Title