Paper Title
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
Paper Authors
Paper Abstract
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation during the training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and a large-scale dataset MultiDomain. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both emission latency and recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.
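The abstract describes two core ingredients: a single network whose shared weights can run in either a causal (streaming) or non-causal (full-context) mode, and inplace knowledge distillation, where the full-context mode's predictions teach the streaming mode within the same training step. The sketch below illustrates that training recipe under simplifying assumptions: the paper's ContextNet and Conformer models are RNN-T based, whereas this sketch uses a generic frame-level loss, and the `model(..., streaming=...)` interface, `dual_mode_step`, and the distillation weighting are hypothetical illustrations rather than the authors' implementation.

```python
# Minimal sketch of Dual-mode ASR joint training (illustrative only).
# Assumes a model whose convolution/attention layers accept a `streaming`
# flag toggling causal vs. non-causal behavior; this interface and the
# frame-level loss are assumptions, not the paper's exact RNN-T setup.
import torch
import torch.nn.functional as F

def dual_mode_step(model, batch, frame_loss_fn, distill_weight=1.0):
    """One training step: joint streaming + full-context losses, plus
    inplace knowledge distillation from full-context to streaming mode."""
    feats, feat_lens, targets, target_lens = batch

    # Full-context pass: non-causal mode acts as the in-model "teacher".
    full_logits = model(feats, feat_lens, streaming=False)
    loss_full = frame_loss_fn(full_logits, targets, feat_lens, target_lens)

    # Streaming pass: same weights, causal mode, the "student".
    stream_logits = model(feats, feat_lens, streaming=True)
    loss_stream = frame_loss_fn(stream_logits, targets, feat_lens, target_lens)

    # Inplace distillation: KL between per-frame posteriors. Detaching the
    # full-context logits makes the transfer one-directional, so gradients
    # from this term only shape the streaming mode.
    kd = F.kl_div(
        F.log_softmax(stream_logits, dim=-1),
        F.softmax(full_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return loss_full + loss_stream + distill_weight * kd
```

Because the teacher is the same model's full-context mode within the same step (hence "inplace"), no separate teacher checkpoint or extra pretraining stage is needed, which is what lets the framework simplify the training and deployment workflow described in the abstract.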