Paper Title
TESSP: Text-Enhanced Self-Supervised Speech Pre-training
Paper Authors
Paper Abstract
Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal, while self-supervised text pre-training empowers the model with linguistic information. Both are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize the speech and text representations in the same model. To solve this problem, we propose Text-Enhanced Self-Supervised Speech Pre-training (TESSP), which aims to incorporate linguistic information into speech pre-training. Our model consists of three parts, i.e., a speech encoder, a text encoder and a shared encoder. The model takes unsupervised speech and text data as input and optimizes the standard HuBERT and MLM losses, respectively. We also propose phoneme up-sampling and representation swapping to enable joint modeling of the speech and text information. Specifically, to fix the length mismatch between speech and text data, we phonemize the text sequence and up-sample the phonemes using alignment information extracted from a small set of supervised data. Moreover, to close the gap between the learned speech and text representations, we swap the text representations with the speech representations extracted by the respective private encoders according to the alignment information. Experiments on the LibriSpeech dataset show that the proposed TESSP model achieves more than 10% improvement over WavLM on the test-clean and test-other sets. We also evaluate our model on the SUPERB benchmark, showing that it outperforms WavLM on Phoneme Recognition, Automatic Speech Recognition and Speech Translation.
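To make the length-matching step concrete, below is a minimal sketch of the phoneme up-sampling described in the abstract: each phoneme is repeated for the number of speech frames it spans in the forced alignment, so the text stream reaches the same length as the speech feature stream. The use of PyTorch, the helper name `upsample_phonemes`, and the toy phoneme ids and durations are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def upsample_phonemes(phoneme_ids: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme id by its frame duration so the phoneme
    sequence matches the length of the speech feature sequence.

    phoneme_ids: (N,) integer phoneme indices for one utterance
    durations:   (N,) frames per phoneme, taken from a forced alignment
                 computed on a small supervised subset (per the paper)
    returns:     (durations.sum(),) frame-level phoneme sequence
    """
    return torch.repeat_interleave(phoneme_ids, durations)

# Example: 4 phonemes spanning 3, 2, 4 and 5 frames -> 14 frames total
ids = torch.tensor([7, 1, 13, 24])   # hypothetical phoneme indices
durs = torch.tensor([3, 2, 4, 5])    # hypothetical alignment durations
print(upsample_phonemes(ids, durs))
# tensor([ 7,  7,  7,  1,  1, 13, 13, 13, 13, 24, 24, 24, 24, 24])
```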
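A similarly hedged sketch of representation swapping: once the two streams are frame-aligned, vectors produced by the private speech and text encoders can be exchanged before entering the shared encoder. The per-frame Bernoulli mask and the `swap_prob` parameter here are assumptions made for illustration; the paper swaps representations according to the alignment information, and its exact swapping granularity and schedule may differ.

```python
import torch

def swap_representations(speech_h: torch.Tensor,
                         text_h: torch.Tensor,
                         swap_prob: float = 0.5):
    """Exchange frame-level vectors between the two modality streams.

    speech_h: (T, D) output of the private speech encoder
    text_h:   (T, D) output of the private text encoder, already
              frame-aligned with the speech via phoneme up-sampling
    For frames selected by a Bernoulli mask (an assumed swapping rule),
    the two streams trade vectors before the shared encoder.
    """
    assert speech_h.shape == text_h.shape
    mask = torch.rand(speech_h.size(0), 1) < swap_prob  # (T, 1), broadcast over D
    swapped_speech = torch.where(mask, text_h, speech_h)
    swapped_text = torch.where(mask, speech_h, text_h)
    return swapped_speech, swapped_text

# Example with toy 6-frame, 4-dimensional representations
speech = torch.randn(6, 4)
text = torch.randn(6, 4)
s_out, t_out = swap_representations(speech, text)
```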