Paper Title
Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition
Paper Authors
Paper Abstract
Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems for normal speech. Their practical application to disordered speech recognition is often limited by the difficulty of collecting such specialist data from impaired speakers. This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training before being cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features. Mixture density network (MDN) based neural A2A inversion models were used. A cross-domain feature adaptation network was also used to reduce the acoustic mismatch between the TORGO and UASpeech data. On both tasks, systems incorporating the A2A-generated articulatory features consistently outperformed the baseline hybrid DNN/TDNN, CTC, and Conformer-based end-to-end systems constructed using acoustic features only. The best multi-modal system, incorporating the video modality and the cross-domain articulatory features as well as data augmentation and learning hidden unit contributions (LHUC) speaker adaptation, produced the lowest published word error rate (WER) of 24.82% on the 16 dysarthric speakers of the benchmark UASpeech task.
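To make the MDN-based A2A inversion concrete, below is a minimal illustrative sketch in PyTorch of a mixture density network that maps acoustic frames to articulatory features and is trained with the mixture negative log-likelihood. All dimensions, layer sizes, and names (`MDNInversion`, `mdn_nll`) are assumptions for illustration, not the paper's actual configuration or the cross-domain adaptation setup.

```python
# Illustrative sketch only: an MDN mapping acoustic frames to articulatory
# features, in the spirit of the A2A inversion described in the abstract.
# Layer sizes, feature dimensions, and training details are assumed.
import torch
import torch.nn as nn

class MDNInversion(nn.Module):
    def __init__(self, acoustic_dim=40, articulatory_dim=12, hidden=256, n_mix=4):
        super().__init__()
        self.n_mix, self.out_dim = n_mix, articulatory_dim
        self.trunk = nn.Sequential(
            nn.Linear(acoustic_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Per-frame mixture weights, means, and log standard deviations.
        self.pi = nn.Linear(hidden, n_mix)
        self.mu = nn.Linear(hidden, n_mix * articulatory_dim)
        self.log_sigma = nn.Linear(hidden, n_mix * articulatory_dim)

    def forward(self, x):
        h = self.trunk(x)
        log_pi = torch.log_softmax(self.pi(h), dim=-1)            # (B, K)
        mu = self.mu(h).view(-1, self.n_mix, self.out_dim)        # (B, K, D)
        sigma = torch.exp(self.log_sigma(h)).view(-1, self.n_mix, self.out_dim)
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, y):
    """Negative log-likelihood of articulatory targets y under the mixture."""
    y = y.unsqueeze(1)                        # (B, 1, D), broadcast over mixtures
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(y).sum(-1)       # (B, K) per-component log density
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()

# Usage on dummy data: predict articulatory targets from acoustic frames.
model = MDNInversion()
x, y = torch.randn(8, 40), torch.randn(8, 12)
loss = mdn_nll(*model(x), y)
loss.backward()
```

In this kind of setup, the mixture output (rather than a single regression head) lets the model represent the one-to-many ambiguity of acoustic-to-articulatory mapping; at inference one would typically take the mean of the most probable component as the generated articulatory feature.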