Frustratingly Easy Noise-aware Training of Acoustic Models

Paper title

Frustratingly Easy Noise-aware Training of Acoustic Models

Authors

Desh Raj, Jesus Villalba, Daniel Povey, Sanjeev Khudanpur

Abstract

Environmental noise and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it requires many-fold data augmentation, resulting in increased training time. In this paper, we propose utterance-level noise vectors for noise-aware training of acoustic models in hybrid ASR. Our noise vectors are obtained by combining the means of speech frames and silence frames in the utterance, where the speech/silence labels may be obtained from a GMM-HMM model trained for ASR alignments, such that no extra computation is required beyond averaging of feature vectors. We show through experiments on AMI and Aurora-4 that this simple adaptation technique can result in 6-7% relative WER improvement. We implement several embedding-based adaptation baselines proposed in the literature, and show that our method outperforms them on both datasets. Finally, we extend our method to the online ASR setting by using frame-level maximum likelihood for the mean estimation.
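
The noise-vector computation described in the abstract can be illustrated with a minimal sketch. It assumes that "combining" the two means is done by concatenation, that per-frame speech/silence labels are already available (for example from GMM-HMM alignments), and that an empty class falls back to a zero vector. The function names `mean_vector` and `noise_vector` are hypothetical and are not taken from the paper or from any toolkit.

```python
from typing import List, Sequence


def mean_vector(frames: Sequence[Sequence[float]], dim: int) -> List[float]:
    """Average a set of feature vectors; return zeros if the set is empty (assumption)."""
    if not frames:
        return [0.0] * dim
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def noise_vector(features: Sequence[Sequence[float]],
                 is_speech: Sequence[bool]) -> List[float]:
    """Build an utterance-level noise vector from per-frame features.

    features:  one feature vector per frame (all of the same dimension).
    is_speech: one label per frame, True for speech and False for silence.
    Returns the speech-frame mean followed by the silence-frame mean
    (concatenation is an assumption about how the means are combined).
    """
    if not features:
        raise ValueError("utterance has no frames")
    if len(features) != len(is_speech):
        raise ValueError("need exactly one speech/silence label per frame")
    dim = len(features[0])
    speech = [f for f, s in zip(features, is_speech) if s]
    silence = [f for f, s in zip(features, is_speech) if not s]
    return mean_vector(speech, dim) + mean_vector(silence, dim)


# Toy utterance: two speech frames followed by two silence frames.
feats = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.5], [0.0, 1.5]]
labels = [True, True, False, False]
vec = noise_vector(feats, labels)  # 4 values: 2-dim speech mean, then 2-dim silence mean
```

In a setup like this, the same vector would presumably be supplied alongside every frame of the utterance as an auxiliary input to the acoustic model; how it is fed to the network is not specified in the abstract.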
