Paper Title


Toward Streaming ASR with Non-Autoregressive Insertion-based Model

Authors

Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi

Abstract


Neural end-to-end (E2E) models have become a promising technique for realizing practical automatic speech recognition (ASR) systems. When building such a system, one important issue is the segmentation of audio to deal with streaming input or long recordings. After audio segmentation, an ASR model with a small real-time factor (RTF) is preferable because it lowers the latency of the system. Recently, E2E ASR based on non-autoregressive models has become a promising approach since it can decode an $N$-length token sequence in fewer than $N$ iterations. We propose a system that concatenates audio segmentation and non-autoregressive ASR to realize high-accuracy, low-RTF ASR. As the non-autoregressive ASR, an insertion-based model is used. In addition, instead of concatenating separate models for segmentation and ASR, we introduce a new architecture that realizes audio segmentation and non-autoregressive ASR with a single neural network. Experimental results on Japanese and English datasets show that the proposed method achieves a reasonable trade-off between accuracy and RTF compared with baseline autoregressive Transformer and connectionist temporal classification models.
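The abstract's claim that an insertion-based model can decode an $N$-length token sequence in fewer than $N$ iterations can be illustrated with a toy simulation. The sketch below is not the paper's model: it replaces the insertion Transformer with a hypothetical oracle that, for every open gap in the hypothesis, emits the middle token of the target span that gap still covers. Because every gap is filled in parallel each iteration (balanced binary-tree insertion order), the sequence completes in roughly $\log_2 N$ iterations rather than $N$ autoregressive steps.

```python
import math

def insertion_decode(target):
    """Simulate parallel insertion-based decoding of `target`.

    Returns (decoded sequence, number of iterations). Each iteration,
    every unfilled gap receives the middle token of its remaining span,
    so gaps halve and decoding finishes in ceil(log2(N + 1)) iterations.
    """
    # Hypothesis items: a plain token is finished; a tuple (lo, hi)
    # is an unfilled gap covering target[lo:hi].
    hyp = [(0, len(target))]
    iters = 0
    while any(isinstance(item, tuple) for item in hyp):
        iters += 1
        new_hyp = []
        for item in hyp:
            if isinstance(item, tuple):
                lo, hi = item
                mid = (lo + hi) // 2
                # Oracle "insertion": emit the middle token and split
                # the remainder into (at most) two smaller gaps.
                if lo < mid:
                    new_hyp.append((lo, mid))
                new_hyp.append(target[mid])
                if mid + 1 < hi:
                    new_hyp.append((mid + 1, hi))
            else:
                new_hyp.append(item)
        hyp = new_hyp
    return hyp, iters

tokens = list("streaming")            # N = 9 tokens
decoded, n_iter = insertion_decode(tokens)
print(decoded == tokens, n_iter)      # in-order traversal recovers the
                                      # sequence in 4 iterations, not 9
```

In the real model the oracle is replaced by a Transformer that scores candidate tokens for every slot jointly, but the iteration count follows the same balanced-tree argument, which is what keeps the RTF low after segmentation.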
