An investigation of phone-based subword units for end-to-end speech recognition

Paper title

An investigation of phone-based subword units for end-to-end speech recognition

Authors

Weiran Wang, Guangsen Wang, Aadyot Bhatnagar, Yingbo Zhou, Caiming Xiong, Richard Socher

Abstract

Phones and their context-dependent variants have been the standard modeling units for conventional speech recognition systems, while characters and subwords have demonstrated their effectiveness for end-to-end recognition systems. We investigate the use of phone-based subwords, in particular, byte pair encoder (BPE), as modeling units for end-to-end speech recognition. In addition, we developed multi-level language model-based decoding algorithms based on a pronunciation dictionary. Apart from the lexicon, which is easily available, our system avoids the need for additional expert knowledge or processing steps from conventional systems. Experimental results show that phone-based BPEs tend to yield more accurate recognition systems than their character-based counterparts. Further improvement can be obtained with a novel one-pass joint beam search decoder, which efficiently combines phone- and character-based BPE systems. For Switchboard, our phone-based BPE system achieves 6.8%/14.4% word error rate (WER) on the Switchboard/CallHome portion of the test set, while joint decoding achieves 6.3%/13.3% WER. On Fisher + Switchboard, joint decoding leads to 4.9%/9.5% WER, setting new milestones for telephony speech recognition.
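To make the idea of phone-based BPE concrete, the following is a minimal sketch of learning BPE merges over phone sequences rather than character sequences. It is not the authors' implementation: the toy lexicon `LEXICON`, the function names `learn_phone_bpe` and `merge_pair`, the `_` joiner, and the choice to count every lexicon entry once (instead of weighting by word frequency in a training corpus) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mini pronunciation lexicon (word -> phone sequence).
LEXICON = {
    "cat": ["K", "AE", "T"],
    "cab": ["K", "AE", "B"],
    "bat": ["B", "AE", "T"],
    "at": ["AE", "T"],
}


def merge_pair(units, pair):
    """Replace each adjacent occurrence of `pair` in `units` with one merged unit."""
    merged = []
    i = 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            merged.append(units[i] + "_" + units[i + 1])
            i += 2
        else:
            merged.append(units[i])
            i += 1
    return tuple(merged)


def learn_phone_bpe(lexicon, num_merges):
    """Greedily learn up to `num_merges` BPE merges over phone sequences.

    Assumption: every lexicon entry counts once; a real system would
    typically weight pairs by word frequency in the training transcripts.
    Returns (ordered list of merged pairs, segmented lexicon).
    """
    segmented = {word: tuple(phones) for word, phones in lexicon.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for units in segmented.values():
            pair_counts.update(zip(units, units[1:]))
        if not pair_counts:
            break  # every word is already a single unit
        # Most frequent pair; ties are broken by sorted order for determinism.
        best = max(sorted(pair_counts), key=pair_counts.get)
        merges.append(best)
        segmented = {w: merge_pair(u, best) for w, u in segmented.items()}
    return merges, segmented


if __name__ == "__main__":
    merges, segmented = learn_phone_bpe(LEXICON, num_merges=1)
    print(merges)
    print(segmented)
```

With this toy lexicon, the pair (`AE`, `T`) occurs in three entries and is merged first, so "cat" becomes the two units `K` and `AE_T`. The resulting multi-phone units play the role that character-based BPE tokens play in a conventional end-to-end system.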
