C3-DINO: Joint Contrastive and Non-contrastive Self-Supervised Learning for Speaker Verification

Paper Title

C3-DINO: Joint Contrastive and Non-contrastive Self-Supervised Learning for Speaker Verification

Authors

Chunlei Zhang, Dong Yu

Abstract

Self-supervised learning (SSL) has drawn increasing attention in the field of speech processing. Recent studies have demonstrated that contrastive learning can learn discriminative speaker embeddings in a self-supervised manner. However, basic contrastive self-supervised learning (CSSL) assumes that every pair formed from a view of the anchor instance and any view of another instance is negative, which introduces many false negative pairs when constructing the loss function. This problem, referred to as $class$-$collision$, remains a major issue that prevents CSSL-based speaker verification (SV) systems from achieving better performance. Meanwhile, studies show that negative-sample-free SSL frameworks perform well in learning speaker or image representations. In this study, we investigate SSL techniques that lead to improved SV performance. We first analyse the impact of false negative pairs in CSSL systems. We then propose a multi-stage Class-Collision Correction (C3) method, which leads to the state-of-the-art CSSL-based speaker embedding system. Building on the pretrained CSSL model, we further propose employing a negative-sample-free SSL objective (i.e., DINO) to fine-tune the speaker embedding network. The resulting speaker embedding system (C3-DINO) achieves 2.5% EER with simple Cosine Distance Scoring (CDS) on the Voxceleb1 test set, which outperforms the previous SOTA SSL system (4.86%) by a significant +45% relative improvement. With speaker clustering and pseudo labeling on the Voxceleb2 training set, an LDA/CDS back-end applied to the C3-DINO speaker embeddings further pushes the EER to 2.2%. Comprehensive experiments on the Voxceleb benchmarks and our internal dataset demonstrate the effectiveness of the proposed methods and further narrow the performance gap between SSL SV and its supervised counterpart.
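
The abstract reports its results as EER under cosine scoring. As a rough illustration of what those two terms mean, the sketch below scores pairs of embeddings by cosine similarity and estimates an equal error rate with a simple threshold sweep. It is a minimal sketch, not the authors' evaluation code: the function names `cosine_score` and `equal_error_rate`, the toy trial list, and the threshold-sweep approximation are all assumptions made for illustration.

```python
import math


def cosine_score(a, b):
    """Cosine similarity of two embedding vectors (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a threshold over the observed scores.

    labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest (hypothetical helper).
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is at or above the threshold.
        false_accepts = sum(
            1 for s, l in zip(scores, labels) if s >= threshold and l == 0
        )
        false_rejects = sum(
            1 for s, l in zip(scores, labels) if s < threshold and l == 1
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy trial list (made-up 2-D "embeddings", not real speaker data).
trials = [
    ([1.0, 0.0], [1.0, 0.1], 1),   # same speaker
    ([0.0, 1.0], [0.1, 1.0], 1),   # same speaker
    ([1.0, 0.0], [0.0, 1.0], 0),   # different speakers
    ([1.0, 0.0], [-1.0, 0.2], 0),  # different speakers
]
toy_scores = [cosine_score(a, b) for a, b, _ in trials]
toy_labels = [label for _, _, label in trials]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

On real trial lists the EER is usually obtained by interpolating along the ROC curve; the sweep above only picks the closest observed operating point, which is enough to show the idea.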
