Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

Paper Title

Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

Authors

Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng

Abstract

Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross-adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs, which were then rescored by the speaker-adapted Conformer system using a 2-way cross-system score interpolation. In cross-adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system, or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system, obtained using multi-pass rescoring, produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the standalone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
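
The rescoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the 2-way cross-system score interpolation, so this example assumes a simple linear interpolation of log-domain scores with a single weight. The names `Hypothesis`, `rescore_nbest` and `weight`, as well as the toy scores, are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    hybrid_score: float     # assumed: log-domain score from the hybrid CNN-TDNN system
    conformer_score: float  # assumed: log-domain score from the Conformer system


def rescore_nbest(nbest, weight=0.5):
    """Sort an N-best list by an interpolated score, best hypothesis first.

    Assumption: the combined score is
        (1 - weight) * hybrid_score + weight * conformer_score
    The paper may use a different interpolation form or extra terms.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    def combined(hyp):
        return (1.0 - weight) * hyp.hybrid_score + weight * hyp.conformer_score

    return sorted(nbest, key=combined, reverse=True)


# Toy N-best list with made-up scores (higher is better).
nbest = [
    Hypothesis("i think so", hybrid_score=-12.0, conformer_score=-9.5),
    Hypothesis("i think sew", hybrid_score=-11.5, conformer_score=-14.0),
    Hypothesis("i thinks so", hybrid_score=-13.0, conformer_score=-11.0),
]

best = rescore_nbest(nbest, weight=0.5)[0]
print(best.text)
```

In this toy list the hybrid system alone (`weight=0.0`) prefers "i think sew", while the interpolated score selects "i think so", which shows how a second system can overturn the first-pass ranking. In practice the weight would be tuned on held-out data.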
