Paper Title


A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition

Authors

Ye-Qian Du, Jie Zhang, Qiu-Shi Zhu, Li-Rong Dai, Ming-Hui Wu, Xin Fang, Zhou-Wang Yang

Abstract


Unpaired data has been shown to be beneficial for low-resource automatic speech recognition~(ASR), where it can be involved in the design of hybrid models with multi-task training or in language-model-dependent pre-training. In this work, we leverage unpaired data to train a general sequence-to-sequence model. Unpaired speech and text are used in the form of data pairs by generating the corresponding missing parts prior to model training. Inspired by the complementarity of the speech/pseudo-label pair and the synthesized-audio/text pair in both acoustic and linguistic features, we propose a complementary joint training~(CJT) method that trains a model alternately on the two data pairs. Furthermore, label masking for pseudo-labels and gradient restriction for synthesized audio are proposed to further cope with the deviations from real data, termed CJT++. Experimental results show that, compared to speech-only training, the proposed basic CJT achieves large performance improvements on the clean/other test sets, and the CJT++ re-training yields further gains. The proposed method also clearly outperforms the wav2vec2.0 model with the same model size and beam size, particularly in extreme low-resource cases.
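The alternating training scheme the abstract describes can be sketched in a few lines. This is a minimal illustration only: the function and variable names (`cjt_schedule`, `mask_labels`, the pair lists) are hypothetical, the actual model, losses, and gradient-restriction mechanics are not specified in the abstract, and label masking is shown as simple random token replacement.

```python
import random

def mask_labels(labels, mask_rate=0.3, mask_token="<mask>"):
    """Randomly mask pseudo-label tokens, a simplified stand-in for the
    CJT++ 'label masking' that reduces over-fitting to pseudo-label noise."""
    return [mask_token if random.random() < mask_rate else t for t in labels]

def cjt_schedule(speech_pseudo_pairs, synth_text_pairs, steps, use_cjt_pp=False):
    """Alternate between the two complementary data-pair types, one per step.

    speech_pseudo_pairs: (real_audio, pseudo_label_tokens) pairs
    synth_text_pairs:    (synthesized_audio, ground_truth_tokens) pairs
    Returns the sequence of (pair_type, labels) a training loop would consume.
    """
    schedule = []
    for step in range(steps):
        if step % 2 == 0:
            audio, labels = random.choice(speech_pseudo_pairs)
            if use_cjt_pp:
                labels = mask_labels(labels)  # CJT++: mask noisy pseudo-labels
            schedule.append(("speech+pseudo_label", labels))
        else:
            audio, labels = random.choice(synth_text_pairs)
            # CJT++ would additionally restrict gradients through the lower
            # acoustic layers here, since synthesized audio deviates from
            # real speech; omitted in this model-free sketch.
            schedule.append(("synth_audio+text", labels))
    return schedule
```

In an actual system each schedule entry would drive one optimization step of the shared sequence-to-sequence model, so both pair types update the same parameters.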
