Paper Title
Masked Pre-trained Encoder Based on Joint CTC-Transformer
Paper Authors
Paper Abstract
This study (the work was accomplished during an internship at Tencent AI Lab) addresses semi-supervised acoustic modeling, i.e., attaining high-level representations from unsupervised audio data and fine-tuning the parameters of the pre-trained model with supervised data. The proposed approach adopts a two-stage training framework consisting of a Masked Pre-trained Encoder (MPE) and a Joint CTC-Transformer (JCT). In the MPE stage, part of the input frames are masked and reconstructed after the encoder using massive amounts of unsupervised data. In the JCT stage, acoustic features are used as input instead of plain text, in contrast to the original Transformer. A CTC loss serves as the prediction target on top of the encoder, while the decoder blocks remain unchanged. This paper presents a comparison between the two-stage training method and the fully supervised JCT, and also investigates our approach's robustness against different volumes of training data. Experiments show that the two-stage training method delivers much better performance than the fully supervised model. With two-stage training exploiting only 30\% of the WSJ labeled data, the word error rate (WER) is 17\% lower than that of a model trained on 50\% of WSJ in a fully supervised way. Moreover, increasing the unlabeled data for MPE from WSJ (81h) to Librispeech (960h) attains about a 22\% WER reduction.
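The MPE stage described above (mask a portion of the input acoustic frames, then train the encoder to reconstruct them from context) can be sketched as follows. This is an illustrative sketch only: the masking ratio, the zero-fill masking policy, and the L1 reconstruction loss are assumptions for illustration and are not specified in the abstract.

```python
import numpy as np

def mask_frames(features, mask_ratio=0.15, rng=None):
    """Zero out a random fraction of time frames in (T, F) acoustic features.

    Returns the masked feature matrix and the indices of the masked frames.
    (Zero-filling and a 15% ratio are illustrative assumptions.)
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    num_frames = features.shape[0]
    n_mask = max(1, int(num_frames * mask_ratio))
    masked_idx = rng.choice(num_frames, size=n_mask, replace=False)
    masked = features.copy()
    masked[masked_idx] = 0.0  # replace the selected frames with zeros
    return masked, masked_idx

def reconstruction_loss(reconstructed, original, masked_idx):
    """Mean absolute error, computed only on the masked positions."""
    return float(np.abs(reconstructed[masked_idx] - original[masked_idx]).mean())

# Toy example: 100 frames of 40-dimensional filterbank-like features.
feats = np.random.default_rng(1).normal(size=(100, 40))
masked, idx = mask_frames(feats, mask_ratio=0.15)
# An encoder that perfectly reproduces the original frames would have zero loss.
assert reconstruction_loss(feats, feats, idx) == 0.0
```

In the JCT stage, joint CTC-attention models typically interpolate the encoder-side CTC loss with the decoder's attention loss, e.g. L = λ·L_CTC + (1−λ)·L_attention, though the abstract does not state the weighting used here.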