Paper Title

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Paper Authors

Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-yiin Chang, Parisa Haghani

Paper Abstract

Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging such differences in the right-contexts and a streaming implementation of statistics pooling, the proposed method can achieve accurate streaming LID prediction with little extra test-time cost. Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average of 96.2% LID prediction accuracy and the same second-pass WER as that obtained by including the oracle LID in the input.
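The per-frame LID prediction described in the abstract relies on a streaming (causal) form of statistics pooling, so each frame's language posterior depends only on frames seen so far. Below is a minimal sketch of that idea in PyTorch. The class name `StreamingStatsPoolingLID`, the dimensions, and the wiring are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: causal statistics pooling feeding a per-frame LID classifier.
import torch
import torch.nn as nn


class StreamingStatsPoolingLID(nn.Module):
    def __init__(self, enc_dim: int, num_langs: int):
        super().__init__()
        # The per-frame classifier sees the current frame plus running mean and std.
        self.classifier = nn.Linear(3 * enc_dim, num_langs)

    def forward(self, enc_frames: torch.Tensor) -> torch.Tensor:
        # enc_frames: (T, enc_dim) encoder outputs for one utterance, in time order.
        T, _ = enc_frames.shape
        # Causal cumulative statistics: at frame t, pool over frames 0..t only,
        # so no right-context is needed and the prediction remains streamable.
        steps = torch.arange(1, T + 1, dtype=enc_frames.dtype).unsqueeze(1)
        cum_sum = torch.cumsum(enc_frames, dim=0)
        cum_sq = torch.cumsum(enc_frames ** 2, dim=0)
        mean = cum_sum / steps
        var = (cum_sq / steps - mean ** 2).clamp_min(1e-6)
        pooled = torch.cat([enc_frames, mean, var.sqrt()], dim=-1)
        return self.classifier(pooled)  # (T, num_langs) per-frame LID logits


# Usage: 100 causal encoder frames of dimension 512, 9 language locales.
lid_head = StreamingStatsPoolingLID(enc_dim=512, num_langs=9)
logits = lid_head(torch.randn(100, 512))
print(logits.shape)  # torch.Size([100, 9])
```

In the cascaded-encoder setting, such a head would sit on top of the causal first-pass encoder, while the second pass can revise its hypothesis with longer right-context; the cumulative pooling above adds only a few element-wise operations per frame, consistent with the "little extra test-time cost" claim.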
