Paper Title

On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification

Paper Authors

Sarkar, Achintya Kr.; Tan, Zheng-Hua

Paper Abstract

深度表示学习在推进依赖文本依赖的说话者验证(TD-SV)系统方面已获得了巨大的动力。当设计深神网络(DNN)以提取瓶颈功能时,关键考虑因素包括训练目标,激活功能和损失功能。在本文中,我们系统地研究了这些选择对TD-SV性能的影响。对于培训目标,我们考虑说话者的身份,时间对抗性学习(TCL)和自动回归预测编码,其中第一个被监督,最后两个是自我监督的。此外,当使用说话者身份作为训练目标时,我们研究了一系列损失功能。在激活功能方面,我们研究了广泛使用的Sigmoid函数,整流线性单元(Relu)和高斯误差线性单元(GELU)。我们从实验上表明,与乙状结肠相比,GELU能够显着降低TD-SV的错误率,而与训练目标无关。在三个训练目标中,TCL表现最好。在各种损失函数中,跨熵,联合柔软和局灶性损失函数的表现优于其他功能。最后,不同系统的得分级融合也能够降低错误率。实验是在Reddots 2016挑战数据库的TD-SV上使用简短话语进行的。对于扬声器分类,使用了众所周知的高斯混合模型 - 通用背景模型(GMM-UBM)和I-Vector技术。

Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNNs) for extracting bottleneck features, key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on TD-SV performance. For training targets, we consider speaker identity, time-contrastive learning (TCL), and auto-regressive prediction coding, with the first being supervised and the last two being self-supervised. Furthermore, we study a range of loss functions when speaker identity is used as the training target. With regard to activation functions, we study the widely used sigmoid function, the rectified linear unit (ReLU), and the Gaussian error linear unit (GELU). We show experimentally that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid, irrespective of the training target. Among the three training targets, TCL performs the best. Among the various loss functions, the cross-entropy, joint-softmax, and focal loss functions outperform the others. Finally, score-level fusion of different systems is also able to reduce the error rates. Experiments are conducted on the RedDots 2016 challenge database for TD-SV using short utterances. For speaker classification, the well-known Gaussian mixture model-universal background model (GMM-UBM) and i-vector techniques are used.
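
Since the abstract's central comparison is between the sigmoid, ReLU, and GELU activations, the following minimal NumPy sketch of the three functions may serve as a quick reference. It is an illustration of the standard definitions (GELU(x) = x·Φ(x), with Φ the standard normal CDF), not the authors' implementation:

```python
import numpy as np
from scipy.special import erf

def sigmoid(x):
    # Logistic sigmoid: maps inputs to (0, 1); saturates for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, x)

def gelu(x):
    # Exact Gaussian error linear unit: x * Phi(x), where Phi is the
    # standard normal CDF; unlike ReLU, it weights inputs smoothly.
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

if __name__ == "__main__":
    x = np.linspace(-3.0, 3.0, 7)
    for f in (sigmoid, relu, gelu):
        print(f.__name__, np.round(f(x), 3))
```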
