Paper Title
USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder
Paper Authors
Paper Abstract
Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training an ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. When we jointly train ASR and a masked language model on the 960-hour Librispeech and Opensubtitles data, respectively, we observe WER reductions of 16% and 20% on test-other and test-clean, respectively, over an ASR-only baseline without any extra cost at inference time, and reductions of 6% and 8% compared to a stronger MUTE-L baseline that trains the decoder with the same text data as our model. We achieve further improvements when we train the masked language model on Librispeech data or when we use machine translation as the auxiliary task, without significantly sacrificing performance on the auxiliary task itself.
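To make the setup concrete, below is a minimal PyTorch sketch of the joint-training idea the abstract describes: a speech branch and a text-to-text branch (a masked language model here) share the upper encoder layers and the decoder, and their losses are summed in each training step. All module sizes, layer counts, the dummy data, and the equal loss weighting are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class UnifiedSpeechTextModel(nn.Module):
    """Hypothetical sketch: ASR and a text-to-text auxiliary task share
    the upper encoder layers and the full decoder."""

    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        # Modality-specific lower layers.
        self.speech_frontend = nn.Linear(80, d_model)      # e.g. 80-dim filterbanks
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        # Shared upper encoder layers ("parts of the encoder").
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Fully shared decoder.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt_tokens, modality):
        if modality == "speech":
            enc_in = self.speech_frontend(src)             # (B, T, d_model)
        else:  # text auxiliary task, e.g. masked LM or MT
            enc_in = self.text_embedding(src)
        memory = self.shared_encoder(enc_in)
        dec_in = self.text_embedding(tgt_tokens)
        # A causal target mask is omitted for brevity; a real
        # implementation would pass tgt_mask to the decoder.
        dec_out = self.decoder(dec_in, memory)
        return self.output_proj(dec_out)                   # (B, U, vocab)

# One joint training step: sum the ASR loss and the auxiliary text-task loss.
model = UnifiedSpeechTextModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

speech = torch.randn(2, 100, 80)                # dummy filterbank features
asr_tgt = torch.randint(0, 10000, (2, 20))      # dummy transcript token IDs
masked_text = torch.randint(0, 10000, (2, 30))  # dummy masked input tokens
orig_text = torch.randint(0, 10000, (2, 30))    # dummy reconstruction targets

optimizer.zero_grad()
asr_logits = model(speech, asr_tgt[:, :-1], modality="speech")
aux_logits = model(masked_text, orig_text[:, :-1], modality="text")
loss = (criterion(asr_logits.reshape(-1, 10000), asr_tgt[:, 1:].reshape(-1))
        + criterion(aux_logits.reshape(-1, 10000), orig_text[:, 1:].reshape(-1)))
loss.backward()
optimizer.step()
```

Because the text task trains the same decoder (and shared encoder layers) used for recognition, the external text data shapes exactly the parameters evaluated at ASR inference, which is why the abstract can claim the gains come at no extra inference-time cost.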