Paper Title
Improving Uyghur ASR systems with decoders using morpheme-based language models
Paper Authors
Paper Abstract
Uyghur is a minority language, and resources for its Automatic Speech Recognition (ASR) research have always been insufficient. THUYG-20 is currently the only open-sourced dataset of Uyghur speech. The state-of-the-art result on its clean, noiseless speech test task has not been updated since the first release, which reveals a large gap between ASR development for mainstream languages and for Uyghur. In this paper, we try to bridge the gap by thoroughly optimizing the ASR systems and by developing a morpheme-based decoder, MLDG-Decoder (Morpheme Lattice Dynamically Generating Decoder for Uyghur DNN-HMM systems), which has long been missing. We have open-sourced the decoder. The MLDG-Decoder employs an algorithm, named "on-the-fly composition with FEBABOS", that allows back-off states and transitions to play the role of a relay station in on-the-fly composition. The algorithm enables the dynamically generated graph to constrain the morpheme sequences in the lattices as effectively as a static, fully composed graph does when a 4-gram morpheme-based Language Model (LM) is used. We have trained deeper and wider neural network acoustic models and experimented with three kinds of decoding schemes. The experimental results show that decoding based on the static, fully composed graph reduces the state-of-the-art Word Error Rate (WER) on the clean, noiseless speech test task in THUYG-20 to 14.24%. The MLDG-Decoder achieves a WER of 14.54% while keeping memory consumption reasonable. With the open-sourced MLDG-Decoder, readers can easily reproduce the experimental results in this paper.
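
The abstract's key algorithmic idea, letting back-off states and transitions act as a relay station during on-the-fly composition, can be illustrated with a small sketch. The Python toy below is a minimal illustration under stated assumptions: all state names, symbols, weights, and data structures are made up for demonstration and are not the MLDG-Decoder's actual implementation or the FEBABOS algorithm itself. It shows how an LM-side lookup relays through back-off arcs until a matching arc is found, so a dynamically generated graph can score morpheme sequences without precomposing a static graph.

    # Minimal sketch of back-off arcs acting as relay stations during
    # on-the-fly composition. Illustrative only: these names and weights
    # are assumptions, not the MLDG-Decoder's real data structures.

    BACKOFF = "<bo>"  # hypothetical label marking a back-off (failure) arc

    class NGramFst:
        """A toy n-gram LM automaton: state -> {symbol: (next_state, weight)}.

        Back-off arcs are stored under the BACKOFF symbol; following one
        adds the back-off penalty and retries the lookup from a
        lower-order history state.
        """

        def __init__(self, arcs):
            self.arcs = arcs

        def step(self, state, symbol):
            """Consume `symbol` from `state`, relaying through back-off
            arcs until a matching arc is found. Returns (next_state,
            total_weight), or None if the symbol is unknown even at the
            lowest order."""
            weight = 0.0
            while True:
                out = self.arcs.get(state, {})
                if symbol in out:
                    nxt, w = out[symbol]
                    return nxt, weight + w
                if BACKOFF not in out:
                    return None  # symbol unseen at every order
                # The back-off state acts as a relay station: pay the
                # back-off penalty and retry from the lower-order state.
                state, bo_w = out[BACKOFF]
                weight += bo_w

    # Toy bigram LM over morphemes (negative log weights, made-up numbers).
    lm = NGramFst({
        "<s>":   {"kitab": ("kitab", 1.2), BACKOFF: ("<uni>", 0.7)},
        "kitab": {"lar":   ("lar", 0.4),   BACKOFF: ("<uni>", 0.9)},
        "lar":   {BACKOFF: ("<uni>", 0.5)},
        "<uni>": {"kitab": ("kitab", 2.0), "lar": ("lar", 2.3), "im": ("im", 2.5)},
    })

    # On-the-fly expansion of one lattice path: the decoder asks the LM
    # only for the arcs it needs instead of precomposing the full graph.
    state, total = "<s>", 0.0
    for morpheme in ["kitab", "lar", "im"]:  # e.g. a path "kitab+lar+im"
        state, w = lm.step(state, morpheme)
        total += w
    print(f"LM cost of the morpheme path: {total:.2f}")

In the real decoder the analogous relaying happens inside WFST composition during lattice generation; the sketch only mirrors the control flow of following back-off arcs as relays until a matching transition is found.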