Paper Title


Toward Streaming ASR with Non-Autoregressive Insertion-based Model

Authors

Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi

Abstract


Neural end-to-end (E2E) models have become a promising technique for realizing practical automatic speech recognition (ASR) systems. When building such a system, one important issue is the segmentation of audio to deal with streaming input or long recordings. After audio segmentation, an ASR model with a small real-time factor (RTF) is preferable because it lowers the latency of the system. Recently, E2E ASR based on non-autoregressive models has become a promising approach since it can decode an $N$-length token sequence in fewer than $N$ iterations. We propose a system that concatenates audio segmentation and non-autoregressive ASR to realize high-accuracy, low-RTF ASR. As the non-autoregressive ASR, an insertion-based model is used. In addition, instead of concatenating separate models for segmentation and ASR, we introduce a new architecture that realizes audio segmentation and non-autoregressive ASR with a single neural network. Experimental results on Japanese and English datasets show that the proposed method achieves a reasonable trade-off between accuracy and RTF compared with baseline autoregressive Transformer and connectionist temporal classification models.
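The abstract's claim that an insertion-based model can decode an $N$-length token sequence in fewer than $N$ iterations can be illustrated with a toy simulation. The sketch below is not the paper's model: it replaces the insertion Transformer with a hypothetical oracle that, for every open gap in the hypothesis, emits the middle token of the target span that gap still covers. Because every gap is filled in parallel each iteration (balanced binary-tree insertion order), the sequence completes in roughly $\log_2 N$ iterations rather than $N$ autoregressive steps.

```python
import math

def insertion_decode(target):
    """Simulate parallel insertion-based decoding of `target`.

    Returns (decoded sequence, number of iterations). Each iteration,
    every unfilled gap receives the middle token of its remaining span,
    so gaps halve and decoding finishes in ceil(log2(N + 1)) iterations.
    """
    # Hypothesis items: a plain token is finished; a tuple (lo, hi)
    # is an unfilled gap covering target[lo:hi].
    hyp = [(0, len(target))]
    iters = 0
    while any(isinstance(item, tuple) for item in hyp):
        iters += 1
        new_hyp = []
        for item in hyp:
            if isinstance(item, tuple):
                lo, hi = item
                mid = (lo + hi) // 2
                # Oracle "insertion": emit the middle token and split
                # the remainder into (at most) two smaller gaps.
                if lo < mid:
                    new_hyp.append((lo, mid))
                new_hyp.append(target[mid])
                if mid + 1 < hi:
                    new_hyp.append((mid + 1, hi))
            else:
                new_hyp.append(item)
        hyp = new_hyp
    return hyp, iters

tokens = list("streaming")            # N = 9 tokens
decoded, n_iter = insertion_decode(tokens)
print(decoded == tokens, n_iter)      # in-order traversal recovers the
                                      # sequence in 4 iterations, not 9
```

In the real model the oracle is replaced by a Transformer that scores candidate tokens for every slot jointly, but the iteration count follows the same balanced-tree argument, which is what keeps the RTF low after segmentation.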
