Paper Title

Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation

Authors

Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu

Abstract

In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations by multimodal self-distillation. AV2vec has a student module and a teacher module; the student performs a masked latent feature regression task using multimodal target features generated online by the teacher, and the parameters of the teacher model are a momentum update of the student's. Since the target features are generated online, AV2vec requires no iterative training steps as AV-HuBERT does, and the total training time cost is reduced to less than one-fifth. We further propose AV2vec-MLM in this study, which augments AV2vec with a masked language model (MLM)-style loss using multitask learning. Our experimental results show that AV2vec achieved performance comparable to the AV-HuBERT baseline. When combined with the MLM-style loss, AV2vec-MLM outperformed the baselines and achieved the best performance on the downstream tasks.
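The abstract describes two core mechanics: a momentum (exponential moving average) update of the teacher's parameters from the student's, and a masked latent feature regression against targets the teacher generates online. The PyTorch sketch below illustrates both. It is a minimal illustration, not the authors' implementation: the function names, the momentum value of 0.999, and the choice of MSE as the regression objective are assumptions made for this example.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Momentum update of the teacher: theta_t <- m * theta_t + (1 - m) * theta_s.

    The momentum value 0.999 is a placeholder, not the paper's setting.
    """
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)


def masked_regression_loss(student_feats: torch.Tensor,
                           teacher_feats: torch.Tensor,
                           mask: torch.Tensor) -> torch.Tensor:
    """Regress student features onto online teacher targets at masked positions only.

    student_feats, teacher_feats: (batch, time, dim); mask: (batch, time) bool.
    MSE stands in here for whatever regression objective the paper actually uses.
    Teacher targets are detached so gradients flow through the student only.
    """
    return F.mse_loss(student_feats[mask], teacher_feats[mask].detach())
```

In a training loop, one would roughly run the student on the masked audio-visual input, run the teacher on the input under `torch.no_grad()` to produce the online targets, backpropagate the masked regression loss through the student only, and then call `ema_update` to refresh the teacher.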
