UFO2: A Unified Pre-training Framework for Online and Offline Speech Recognition

Paper Title

UFO2: A unified pre-training framework for online and offline speech recognition

Authors

Fu, Li; Li, Siqi; Li, Qingtao; Deng, Liping; Li, Fangzhu; Fan, Lu; Chen, Meng; He, Xiaodong

Abstract

In this paper, we propose a Unified pre-training Framework for Online and Offline (UFO2) Automatic Speech Recognition (ASR), which 1) simplifies the two separate training workflows for online and offline modes into a single process, and 2) improves Word Error Rate (WER) performance with limited utterance annotations. Specifically, we extend the conventional offline-mode Self-Supervised Learning (SSL)-based ASR approach to a unified one, in which model training is conditioned on both full-context and dynamic-chunked inputs. To enhance the pre-trained representation model, a stop-gradient operation is applied to decouple the online-mode objectives from the quantizer. Moreover, in both the pre-training and the downstream fine-tuning stages, joint losses are proposed to train the unified model with full weight sharing between the two modes. Experimental results on the LibriSpeech dataset show that UFO2 outperforms the SSL-based baseline method, with 29.7% and 18.2% relative WER reductions in the offline and online modes, respectively.
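The central idea of the abstract, one shared-weight encoder trained on both full-context and dynamic-chunked inputs under a joint loss, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the chunk mask follows the common dynamic-chunk convention (each frame attends to its own chunk and all earlier chunks), while the uniform chunk-size sampling, the masked-mean "encoder", the squared-error objective and the `online_weight` parameter are all hypothetical stand-ins. The stop-gradient on the quantizer is not modeled, since it requires an autodiff framework.

```python
import random


def chunk_attention_mask(num_frames: int, chunk_size: int) -> list[list[bool]]:
    """Boolean self-attention mask for chunked (online-mode) encoding.

    mask[i][j] is True when frame i may attend to frame j, i.e. when frame j
    lies in the same chunk as frame i or in an earlier chunk. Any
    chunk_size >= num_frames yields the full-context (offline-mode) mask.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [[j // chunk_size <= i // chunk_size for j in range(num_frames)]
            for i in range(num_frames)]


def sample_chunk_size(max_chunk: int, rng: random.Random) -> int:
    """Hypothetical dynamic-chunk schedule: a fresh chunk size per step,
    drawn uniformly from 1..max_chunk."""
    return rng.randint(1, max_chunk)


def masked_mean_encoder(features: list[float],
                        mask: list[list[bool]]) -> list[float]:
    """Toy stand-in for the shared encoder: each output frame is the mean of
    the input frames its mask row allows it to see (the diagonal is always
    visible, so no row is empty)."""
    outputs = []
    for row in mask:
        visible = [x for x, allowed in zip(features, row) if allowed]
        outputs.append(sum(visible) / len(visible))
    return outputs


def mean_squared_error(pred: list[float], targets: list[float]) -> float:
    """Placeholder objective standing in for the real SSL / ASR losses."""
    return sum((p - t) ** 2 for p, t in zip(pred, targets)) / len(targets)


def joint_loss(features: list[float], targets: list[float], max_chunk: int,
               rng: random.Random, online_weight: float = 1.0) -> float:
    """Offline-mode loss plus weighted online-mode loss from ONE encoder.

    Both passes call the same masked_mean_encoder (full weight sharing); only
    the attention mask differs between the two modes.
    """
    n = len(features)
    offline_out = masked_mean_encoder(features, chunk_attention_mask(n, n))
    chunk = sample_chunk_size(max_chunk, rng)
    online_out = masked_mean_encoder(features, chunk_attention_mask(n, chunk))
    return (mean_squared_error(offline_out, targets)
            + online_weight * mean_squared_error(online_out, targets))
```

In this sketch, `chunk_attention_mask(4, 2)` lets frames 0 and 1 see only each other while frames 2 and 3 see all four frames, so look-ahead is bounded by the chunk length; passing `chunk_size=num_frames` recovers the offline mode. Drawing a new chunk size at each step is what makes the chunking dynamic, so a single set of weights is exposed to many latency settings.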
