Fusion of Self-supervised Learned Models for MOS Prediction

Paper Title

Fusion of Self-supervised Learned Models for MOS Prediction

Authors

Zhengdong Yang, Wangjin Zhou, Chenhui Chu, Sheng Li, Raj Dabre, Raphael Rubino, Yi Zhao

Abstract

We participated in the 2022 mean opinion score (MOS) prediction challenge. The challenge aims to predict MOS scores of synthetic speech on two tracks: the main track and a more challenging out-of-domain (OOD) sub-track. To improve the accuracy of the predicted scores, we explored several model fusion strategies and propose a fusion framework that combines seven pretrained self-supervised learned (SSL) models. These pretrained SSL models are derived from three ASR frameworks: Wav2Vec, HuBERT, and WavLM. For the OOD track, we reused the seven SSL models selected on the main track and adopted a semi-supervised learning method to exploit the unlabeled data. According to the official analysis results, our system ranked first in 6 out of 16 metrics and is among the top 3 systems for 13 out of 16 metrics. Specifically, we achieved the highest LCC, SRCC, and KTAU scores at the system level on the main track, as well as the best LCC, SRCC, and KTAU scores at the utterance level on the OOD track. Compared with the basic SSL models, the prediction accuracy of the fused system improved substantially, especially on the OOD sub-track.
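
The abstract reports correlation metrics at two granularities, utterance level and system level. The following is a minimal sketch of how the two differ for LCC (the linear, i.e. Pearson, correlation coefficient), assuming that system-level scores are obtained by averaging the utterance scores of each synthesis system before correlating. The record layout, the function names, and the toy data are illustrative assumptions, not the challenge's official evaluation script.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Linear correlation coefficient (LCC) between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def utterance_level_lcc(records):
    """Correlate true and predicted MOS over all utterances directly."""
    true = [true_mos for _, true_mos, _ in records]
    pred = [pred_mos for _, _, pred_mos in records]
    return pearson(true, pred)


def system_level_lcc(records):
    """Average scores per system first (an assumption), then correlate the averages."""
    by_system = defaultdict(list)
    for system_id, true_mos, pred_mos in records:
        by_system[system_id].append((true_mos, pred_mos))
    sys_true, sys_pred = [], []
    for pairs in by_system.values():
        sys_true.append(sum(t for t, _ in pairs) / len(pairs))
        sys_pred.append(sum(p for _, p in pairs) / len(pairs))
    return pearson(sys_true, sys_pred)


# Hypothetical toy data: (system id, true MOS, predicted MOS) per utterance.
records = [
    ("sysA", 4.0, 3.8), ("sysA", 3.5, 3.9),
    ("sysB", 2.0, 2.4), ("sysB", 2.5, 2.2),
    ("sysC", 3.0, 3.1), ("sysC", 3.5, 3.0),
]

print("utterance-level LCC:", utterance_level_lcc(records))
print("system-level LCC:", system_level_lcc(records))
```

SRCC and KTAU follow the same two-level pattern with rank-based coefficients in place of `pearson`; in practice `scipy.stats.spearmanr` and `scipy.stats.kendalltau` can be substituted.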
